July 22, 2016

William Stein

DataDog's pricing: don't make the same mistake I made

I recently made a stupid mistake by choosing DataDog to monitor the infrastructure for my startup (SageMathCloud).

I got bit by a pricing UI that looks similar to many other sites', but is different in a way that caused me to spend far more money than I expected.

I'm writing this post so that you won't make the same mistake I did.  As a product, DataDog is of course a lot of hard work to create, and they can try to charge whatever they want. However, my problem is that what they were going to charge was confusing and misleading to me.

I wanted to see some nice web-based data about my new autoscaled Kubernetes cluster, so I looked around at options. DataDog looked like a new and awesomely-priced service for seeing live logging. And when I looked (not carefully enough) at the pricing, it looked like only $15/month to monitor a bunch of machines. I'm naive about the cost of cloud monitoring -- I've been using Stackdriver on Google cloud platform for years, which is completely free (for now, though that will change), and I've also used self hosted open solutions, and some quite nice solutions I've written myself. So my expectations were way out of whack.

Ever busy, I signed up for the "$15/month plan":


One of the people on my team spent a little time and installed DataDog on all the VMs in our cluster, and also made DataDog automatically start running on any node in our Kubernetes cluster. That's a lot of machines.

Today I got the first monthly bill, which covers the month that just ended. The cost was $639.19 USD, charged to my credit card. I was really confused for a while, wondering if I had bought a year's subscription.



After a while I realized that the cost is per host! When I looked at the pricing page the first time, I had just seen, in big letters, "$15", "$18 month-to-month" and "up to 500 hosts". I completely missed the "Per Host" line, because I was so naive that I didn't think the price could possibly be that high.

I tried immediately to delete my credit card and cancel my plan, but the "Remove Card" button is greyed out, and it says you can "modify your subscription by contacting us at success@datadoghq.com":



So I wrote to success@datadoghq.com:

Dear Datadog,

Everybody on my team was completely mislead by your
horrible pricing description.

Please cancel the subscription for wstein immediately
and remove my credit card from your system.

This is the first time I've wasted this much money
by being misled by a website in my life.

I'm also very unhappy that I can't delete my credit
card or cancel my subscription via your website. It's
like one more stripe API call to remove the credit card
(I know -- I implemented this same feature for my site).


And they responded:

Thanks for reaching out. If you'd like to cancel your
Datadog subscription, you're able to do so by going into
the platform under 'Plan and Usage' and choose the option
downgrade to 'Lite', that will insure your credit card
will not be charged in the future. Please be sure to
reduce your host count down to the (5) allowed under
the 'Lite' plan - those are the maximum allowed for
the free plan.

Also, please note you'll be charged for the hosts
monitored through this month. Please take a look at
our billing FAQ.


They were right -- I was able to uninstall the daemons, downgrade to Lite, remove my card, etc. all through the website without manual intervention.

When people have been confused with billing for my site, I have apologized, immediately refunded their money, and opened a ticket to make the UI clearer.  DataDog didn't do any of that.

I wish DataDog would at least clearly state that when you use their service you are potentially on the hook for an arbitrarily large charge for any month. Yes, if they had made that clear, they wouldn't have had me as a customer, so they are not incentivized to do so.

A fool and their money are soon parted. I hope this post reduces the chances you'll be a fool like me.  If you choose to use DataDog (and their monitoring tools are very impressive), I hope you'll be aware of the cost.


ADDED:

On Hacker News somebody asked: "How could their pricing page be clearer? It says per host in fairly large letters underneath it. I'm asking because I will be designing a similar page soon (that's also billed per host) and I'd like to avoid the same mistakes."  My answer:

[EDIT: This pricing page by the top poster in this thread is way better than I suggest below -- https://www.serverdensity.com/pricing/]

1. VERY clearly state that when you sign up for the service, you are on the hook for up to $18*500 = $9000 + tax in charges for any month. Even Google Compute Engine (and Amazon) don't create such a trap, and have a clear, explicit quota increase process.
2. Instead of "HUGE $15" newline "(small light) per host", put "HUGE $18 per host" all on the same line. It would easily fit. I don't even know how the $15/host datadog discount could ever really work, given that the number of hosts might constantly change and there is no prepayment.
3. Inform users clearly in the UI at any time how much they are going to owe for that month (so far), rather than surprising them at the end. Again, Google Cloud Platform has a very clear running total in their billing section, and any time you create a new VM it gives the exact amount that VM will cost per month.
4. If one works with a team, number 3 is especially important. The reason that I had monitors on 50+ machines is that another person working on the project, who never looked at the pricing or anything, just thought, "Hey, I'll just set this up everywhere." He had no idea there was a per-machine fee.

by William Stein (noreply@blogger.com) at July 22, 2016 02:17 PM

July 13, 2016

Continuum Analytics news

The Gordon and Betty Moore Foundation Grant for Numba and Dask

Posted Thursday, July 14, 2016

I am thrilled to announce that the Gordon and Betty Moore Foundation has provided a significant grant in order to help move Numba and Dask to version 1.0 and graduate them into robust community-supported projects. 

Numba and Dask are two projects that have grown out of our intense foundational desire at Continuum to improve the state of large-scale data analytics, quantitative computing, advanced analytics and machine learning. Our fundamental purpose at Continuum is to empower people to solve the world’s greatest challenges. We are on a mission to help people discover, analyze and collaborate by connecting their curiosity and experience with any data.    

One part of helping great people do even more with their computing power is to ensure that modern hardware is completely accessible and utilizable to those with deep knowledge in other areas besides programming. For many years, Python has been simplifying the connection between computers and the minds of those with deep knowledge in areas such as statistics, science, business, medicine, mathematics and engineering. Numba and Dask strengthen this connection even further so that modern hardware with multiple parallel computing units can be fully utilized with Python code. 

Numba enables scaling up on modern hardware, including computers with GPUs and extreme multi-core CPUs, by compiling a subset of Python syntax to machine code that can run in parallel. Dask enables Python code to take full advantage of both multi-core CPUs and data that does not fit in memory by defining a directed graph of tasks that work on blocks of data and using the wealth of libraries in the PyData stack. Dask also now works well on a cluster of machines with data stored in a distributed file-system, such as Hadoop's HDFS. Together, Numba and Dask can be used to more easily build solutions that take full advantage of modern hardware, such as machine-learning algorithms, image-processing on clusters of GPUs or automatic visualization of billions of data points with datashader.
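
To make the two ideas above concrete, here is a small illustrative sketch (not taken from the announcement; the function, array sizes and chunk sizes are arbitrary). Numba compiles a plain Python loop to machine code, while Dask evaluates a blocked task graph so the full array never has to fit in memory:

import numpy as np
from numba import jit
import dask.array as da

@jit(nopython=True)                  # compile this function to machine code with Numba
def mean_abs_diff(x, y):
    total = 0.0
    for i in range(x.shape[0]):      # an explicit loop, but it runs at native speed
        total += abs(x[i] - y[i])
    return total / x.shape[0]

x = np.random.rand(1000000)
y = np.random.rand(1000000)
print(mean_abs_diff(x, y))

# Dask: a large array split into 1000x1000 blocks; the mean is computed
# block-wise as a directed graph of tasks and then combined.
big = da.random.random((20000, 20000), chunks=(1000, 1000))
print(big.mean().compute())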

Peter Wang and I started Continuum with a desire to bring next-generation array-computing to PyData. We have broadened that initial desire to empowering entire data science teams with the Anaconda platform, while providing full application solutions to data-centric companies and institutions. It is extremely rewarding to see that Numba and Dask are now delivering on our initial dream to bring next-generation array-computing to the Python ecosystem in a way that takes full advantage of modern hardware.    

This award from the Moore Foundation will make it even easier for Numba and Dask to allow Python to be used for large scale computing. With Numba and Dask, users will be able to build high performance applications with large data sets. The grant will also enable our Community Innovation team at Continuum to ensure that these technologies can be used by other open source projects in the PyData ecosystem. This will help scientists, engineers and others interested in improving the world achieve their goals even faster.

Continuum has been an active contributor to the Python data science ecosystem since Peter and I founded the company in early 2012. Anaconda, the leading Open Data Science platform, is now the most popular Python distribution available. Continuum has also conceived and developed several new additions to this ecosystem, making them freely available to the open source community, while continuing to support the foundational projects that have made the ecosystem possible.

The Gordon and Betty Moore Foundation fosters pathbreaking scientific discovery,  environmental conservation, patient care improvements and preservation of the special character of the Bay Area. The Numba and Dask projects are funded by the Gordon and Betty Moore Foundation through Grant GBMF5423 to Continuum Analytics (Grant Agreement #5423).    

We are honored to receive this grant and look forward to working with The Moore Foundation. 

To hear more about Numba and Dask, check out our related SciPy sessions in Austin, TX this week:

  • Thursday, July 14th at 10:30am: “Dask: Parallel and Distributed Computing” by Matthew Rocklin & Jim Crist of Continuum Analytics
  • Friday, July 15th at 11:00am: “Scaling Up and Out:  Programming GPU Clusters with Numba and Dask” by Stan Seibert & Siu Kwan Lam of Continuum Analytics
  • Friday, July 15th at 2:30pm: “Datashader: Revealing the Structure of Genuinely Big Data” by James Bednar & Jim Crist of Continuum Analytics.

by swebster at July 13, 2016 04:49 PM

Automate your README: conda kapsel Beta 1

Posted Wednesday, July 13, 2016

TL;DR: New beta conda feature allows data scientists and others to describe project runtime requirements in a single file called kapsel.yml. Using kapsel.yml, conda will automatically reproduce prerequisites on any machine and then run the project. 

Data scientists working with Python often create a project directory containing related analysis, notebook files, data-cleaning scripts, Bokeh visualizations, and so on. For a colleague who wants to replicate your project, or even for the original creator a few months later, it can be tricky to run all this code exactly as it was run the first time. 

Most code relies on some specific setup before it’s run -- such as installing certain versions of packages, downloading data files, starting up database servers, configuring passwords, or configuring parameters to a model. 

You can write a long README file to manually record all these steps and hope that you got it right. Or, you could use conda kapsel. This new beta conda feature allows data scientists to list their setup and runtime requirements in a single file called kapsel.yml. Conda reads this file and performs all these steps automatically. With conda kapsel, your project just works for anyone you share it with.

Sharing your project with others

When you’ve shared your project directory (including a kapsel.yml) and a colleague types conda kapsel run in that directory, conda automatically creates a dedicated environment, puts the correct packages in it, downloads any needed data files, starts needed services, prompts the user for missing configuration values, and runs the right command from your project.

As with all things conda, there’s an emphasis on ease-of-use. It would be clunky to first manually set up a project, and then separately configure project requirements for automated setup. 

With the conda kapsel command, you set up and configure the project at the same time. For example, if you type conda kapsel add-packages bokeh=0.12, you’ll get Bokeh 0.12 in your project's environment, and automatically record a requirement for Bokeh 0.12 in your kapsel.yml. This means there’s no extra work to make your project reproducible. Conda keeps track of your project setup for you, automatically making any project directory into a runnable, reproducible “conda kapsel.”
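
As a purely illustrative sketch (this is not taken from the beta documentation, so the exact field names and layout may differ), a kapsel.yml for a small project might look roughly like this:

name: my_analysis

packages:            # requirements recorded by commands like `conda kapsel add-packages`
  - python=3.5
  - bokeh=0.12

downloads:           # data files fetched automatically before the project runs
  TRAINING_DATA:
    url: https://example.com/data/training.csv

variables:           # configuration values the user is prompted for if they are unset
  - DB_PASSWORD

commands:            # what `conda kapsel run` executes by default
  default:
    unix: python analyze.py

The tutorial mentioned below is the authoritative reference for the actual schema.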

There’s nothing data-science-specific about conda kapsel; it’s a general-purpose feature, just like conda’s core package management features. But we believe conda kapsel’s simple approach to reproducibility will appeal to data scientists.

Try out conda kapsel

To understand conda kapsel, we recommend going through the tutorial. It’s a quick way to see what it is and learn how to use it. The tutorial includes installation instructions.

Where to send feedback

If you want to talk interactively about conda kapsel, give us some quick feedback, or run into any questions, join our chat room on Gitter. We would love to hear from you!

If you find a bug or have a suggestion, filing a GitHub issue is another great way to let us know.

If you want to have a look at the code, conda kapsel is on GitHub.

Next steps for conda kapsel

This is a first beta, so we expect conda kapsel to continue to evolve. Future directions will depend on the feedback you give us, but some of the ideas we have in mind:

  • Support for automating additional setup steps: What’s in your README that could be automated? Let us know!
  • Extensibility: We’d like to support both third-party plugins, and custom setup scripts embedded in projects.
  • UX refinement: We believe the tool can be even more intuitive and we’re currently exploring some changes to eliminate points of confusion early users have encountered. (We’d love to hear your experiences with the tutorial, especially if you found anything clunky or confusing.)

For the time being, the conda kapsel API and command line syntax are subject to change in future releases. A project created with the current “beta” version of conda kapsel may always need to be run with that version of conda kapsel and not conda kapsel 1.0. When we think things are solid, we’ll switch from “beta” to “1.0” and you’ll be able to rely on long-term interface stability.

We hope you find conda kapsel useful!

by swebster at July 13, 2016 02:11 PM

July 12, 2016

Matthieu Brucher

Book review: Team Geek

Sometimes I forget that I have to work with teams, whether they are virtual teams or physical teams. And although I started working on understanding the culture map, I still have to understand how to work efficiently in a team. Enter this book.

Content and opinions

Divided into six chapters, the book moves from the individual point of view to the most general one, covering team members and users, around a principle summed up as HRT (humility, respect, trust). First of all, the book spends a chapter on geniuses. Actually, it's not really about geniuses, but about people who think they are geniuses, spend time in their cave working on something and, 10 weeks later, come out and share their wonderful (crappy) code. Here, the focus is visibility and communication: we all make mistakes (let's move on, as Leonard Hofstadter would say), so we need to confront our work with the rest of the team as early as possible.

To achieve this, you need a team culture: a place where people can communicate. There are several levels to this, different ways to achieve it, and probably a good balance to strike between all the elements, as explained in the second chapter of the book. And with this, you need a good team leader (chapter 3) who will nurture the team culture. Strangely, the book seems to advocate that technical people become team leaders, which is something I find difficult. The book actually helped me understand the good aspects of this, and from the different teams I have seen around me, it seems to be a pattern that has merit: with some help learning delegation, trust and the rest, it could be an interesting future for technical people (instead of having bad technical people taking management positions and fighting them because, let's face it, they don't understand a thing :p ).

The fourth chapter is about dealing with poisonous people. One of the poisons is… exactly what I showed in the last paragraph: resentment and bitterness! A team is a team with its captain; we are all in the same boat. We can't badmouth the people we work with (as hard as that is!). The fifth chapter is more about the maze above you (the fourth was more about the maze below): how to work with a good manager, and how to deal with a bad one. Sometimes it's just about communication; sometimes it's not, so what should you do?

Finally, the other member of the team is the end user. As the customer pays the bill in the end, they have to be on board and feel respected and trusted (as much as the team is; it's a balance!). There are not that many chapters about users in software engineering books; it's a hard topic. This final chapter gives good advice on the subject.

Conclusion

Lots of things in it are obvious, and some are explained in other books as well, but the fact that all the topics relevant to computer scientists are gathered in this book makes it an excellent introduction for people starting to work in a team.

by Matt at July 12, 2016 07:33 AM

July 11, 2016

Continuum Analytics news

Anaconda 4.1 Released

Posted Monday, July 11, 2016

We are happy to announce that Anaconda 4.1 has been released. Anaconda is the leading open data science platform powered by Python.

The highlights of this release are:

  • Addition of Jupyter Notebook Extensions

  • Windows installation - silent mode fixes & now compatible with SCCM (System Center Configuration Manager)

  • The conda recipes used to build (using conda-build) the vast majority of the packages in the Anaconda installer have been published at: https://github.com/ContinuumIO/anaconda-recipes

Updates:

  • Python 2.7.12, 3.4.5, 3.5.2

  • numpy 1.11.1

  • scipy 0.17.1

  • pandas 0.18.1

  • MKL 11.3.3

  • Navigator update from 1.1 to 1.2, in particular it no longer installs a desktop shortcut on MacOSX

  • Over 80 other packages, see changelog and package list

To update to Anaconda 4.1, use conda update conda followed by conda update anaconda

by swebster at July 11, 2016 05:42 PM

July 06, 2016

Continuum Analytics news

The Journey to Open Data Science Is Not as Hard as You Think

Posted Wednesday, July 6, 2016

Businesses are in a constant struggle to stay relevant in the market and change is rarely easy — especially when it involves technological overhaul.

Think about the world’s switch from the horse and buggy to automobiles: It revolutionized the world, but it was hardly a smooth transition. At the turn of the 20th century, North America was a loose web of muddy dirt roads trampled by 24 million horses. It took a half-century of slow progress before tires and brakes replaced hooves and reins.

Just as driverless cars hint at a new era of automobiles, the writing’s on the wall for modern analytics: Companies will need to embrace the world’s inevitable slide toward Open Data Science. Fortunately, just as headlights now illuminate our highways, there is a light to guide companies through the transformation.

The Muddy Road to New Technologies

No matter the company or the technological shift, transitions can be challenging for multiple reasons.

One reason is the inevitable skills gap in the labor market when new technology comes along. Particularly in highly specialized fields like data science, finding skilled employees with enterprise experience is difficult. The right hires can mean the difference between success and failure.

Another issue stems from company leaders’ insufficient understanding of existing technologies — both what they can and cannot do. Applications that use machine and deep learning require new software, but companies often mistakenly believe their existing systems are capable of handling the load. This issue is compounded by fragile, cryptic legacy code that can be a nightmare to repurpose.

Finally, these two problems combine to form a third: a lack of understanding about how to train people to implement and deploy new technology. Ultimately, this culminates in floundering and wasted resources across an entire organization.

Luckily, it does not have to be this way.

Open Data Science Paves a New Path

Fortunately, Open Data Science is the guiding light to help companies switch to modern data science easily. Here’s how such an initiative breaks down transitional barriers:

  • No skills gap: Open Data Science is founded on Python and R — both hot languages in universities and in the marketplace. This opens up a massive pool of available talent and a worldwide base of excited programmers and users.
  • No tech stagnation: Open Data Science applications connect via APIs to nearly any data source. In terms of programming, there's an open source alternative to nearly any proprietary software on the market. Open Data Science applications such as Anaconda allow for easy interoperability between systems, which is central to the movement.
  • No floundering: Open Data Science bridges old and new technologies to make training and deployment a breeze. One such example is Anaconda Fusion, which offers business analysts command of powerful Python Open Data Science libraries through a familiar Excel interface.

A Guided Pathway to Open Data Science

Of course, just knowing that Open Data Science speeds the transition isn’t enough. A well-trained guide is equally vital for leading companies down the best path to adoption.

The first step is a change management assessment. How will executive, operational and data science teams quickly get up to speed on why Open Data Science is critical to their business? What are the first steps? This process can seem daunting when attempted alone. But this is where consultants from the Open Data Science community can provide the focus and knowledge necessary to quickly convert a muddy back road into the Autobahn.

No matter the business and its existing technologies, any change management plan should include a few key points. First, there should be a method for integration of legacy code (which Anaconda makes easier with packages for melding Python with C or Fortran code). Intelligent migration of data is also important, as is training team members on new systems or recruiting new talent.

While the business world leaves little room for fumbles, like delayed adoption of new technologies or poor change management, Open Data Science can prevent your company’s data science initiatives from meeting a dead end. Open Data Science software provides more than low-cost applications, interoperability, transparency and access — it also brings community know-how to guide change management and facilitate an analytics overhaul in any company at any scale.

Thanks to Open Data Science, the road to data science superiority is now paved with gold.

 

by swebster at July 06, 2016 02:10 PM

July 05, 2016

Thomas Wiecki

Bayesian Deep Learning Part II: Bridging PyMC3 and Lasagne to build a Hierarchical Neural Network

(c) 2016 by Thomas Wiecki

Recently, I blogged about Bayesian Deep Learning with PyMC3, where I built a simple hand-coded Bayesian Neural Network and fit it on a toy data set. Today, we will build a more interesting model using Lasagne, a flexible Theano library for constructing various types of Neural Networks. Since PyMC3 also uses Theano, it should be possible to build the Artificial Neural Network (ANN) in Lasagne, place Bayesian priors on our parameters, and then use variational inference (ADVI) in PyMC3 to estimate the model. To my delight, it is not only possible but also very straightforward.

Below, I will first show how to bridge PyMC3 and Lasagne to build a dense 2-layer ANN. We'll then use mini-batch ADVI to fit the model on the MNIST handwritten digit data set. Then, we will follow up on another idea expressed in my last blog post -- hierarchical ANNs. Finally, due to the power of Lasagne, we can just as easily build a Hierarchical Bayesian Convolutional ANN with max-pooling layers to achieve 98% accuracy on MNIST.

Most of the code used here is borrowed from the Lasagne tutorial.

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('white')
sns.set_context('talk')

import pymc3 as pm
import theano.tensor as T
import theano

from scipy.stats import mode, chisquare

from sklearn.metrics import confusion_matrix, accuracy_score

import lasagne

Data set: MNIST

We will be using the classic MNIST data set of handwritten digits. Contrary to my previous blog post, which was limited to a toy data set, MNIST is an actually challenging ML task (of course not quite as challenging as e.g. ImageNet), with a reasonable number of dimensions and data points.

In [2]:
import sys, os

def load_dataset():
    # We first define a download function, supporting both Python 2 and 3.
    if sys.version_info[0] == 2:
        from urllib import urlretrieve
    else:
        from urllib.request import urlretrieve

    def download(filename, source='http://yann.lecun.com/exdb/mnist/'):
        print("Downloading %s" % filename)
        urlretrieve(source + filename, filename)

    # We then define functions for loading MNIST images and labels.
    # For convenience, they also download the requested files if needed.
    import gzip

    def load_mnist_images(filename):
        if not os.path.exists(filename):
            download(filename)
        # Read the inputs in Yann LeCun's binary format.
        with gzip.open(filename, 'rb') as f:
            data = np.frombuffer(f.read(), np.uint8, offset=16)
        # The inputs are vectors now, we reshape them to monochrome 2D images,
        # following the shape convention: (examples, channels, rows, columns)
        data = data.reshape(-1, 1, 28, 28)
        # The inputs come as bytes, we convert them to float32 in range [0,1].
        # (Actually to range [0, 255/256], for compatibility to the version
        # provided at http://deeplearning.net/data/mnist/mnist.pkl.gz.)
        return data / np.float32(256)

    def load_mnist_labels(filename):
        if not os.path.exists(filename):
            download(filename)
        # Read the labels in Yann LeCun's binary format.
        with gzip.open(filename, 'rb') as f:
            data = np.frombuffer(f.read(), np.uint8, offset=8)
        # The labels are vectors of integers now, that's exactly what we want.
        return data

    # We can now download and read the training and test set images and labels.
    X_train = load_mnist_images('train-images-idx3-ubyte.gz')
    y_train = load_mnist_labels('train-labels-idx1-ubyte.gz')
    X_test = load_mnist_images('t10k-images-idx3-ubyte.gz')
    y_test = load_mnist_labels('t10k-labels-idx1-ubyte.gz')

    # We reserve the last 10000 training examples for validation.
    X_train, X_val = X_train[:-10000], X_train[-10000:]
    y_train, y_val = y_train[:-10000], y_train[-10000:]

    # We just return all the arrays in order, as expected in main().
    # (It doesn't matter how we do this as long as we can read them again.)
    return X_train, y_train, X_val, y_val, X_test, y_test

print("Loading data...")
X_train, y_train, X_val, y_val, X_test, y_test = load_dataset()
Loading data...
In [3]:
# Building a theano.shared variable with a subset of the data to make construction of the model faster.
# We will later switch that out, this is just a placeholder to get the dimensionality right.
input_var = theano.shared(X_train[:500, ...].astype(np.float64))
target_var = theano.shared(y_train[:500, ...].astype(np.float64))

Model specification

I imagined that it should be possible to bridge Lasagne and PyMC3 simply because they both rely on Theano. However, it was unclear how difficult it would really be. Fortunately, a first experiment worked out very well, but there were some potential ways in which this could be made even easier. I opened a GitHub issue on Lasagne's repo, and a few days later PR695 was merged, which allowed for an even nicer integration of the two, as I show below. Long live OSS.

First, the Lasagne function to create an ANN with 2 fully connected hidden layers of 800 neurons each. This is pure Lasagne code, taken almost directly from the tutorial. The trick comes in when creating a layer with lasagne.layers.DenseLayer, where we can pass in a function init which has to return a Theano expression to be used as the weight and bias matrices. This is where we pass in our PyMC3-created priors, which are also just Theano expressions:

In [4]:
def build_ann(init):
    l_in = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                     input_var=input_var)

    # Add a fully-connected layer of 800 units, using the linear rectifier, and
    # initializing weights with Glorot's scheme (which is the default anyway):
    n_hid1 = 800
    l_hid1 = lasagne.layers.DenseLayer(
        l_in, num_units=n_hid1,
        nonlinearity=lasagne.nonlinearities.tanh,
        b=init,
        W=init
    )

    n_hid2 = 800
    # Another 800-unit layer:
    l_hid2 = lasagne.layers.DenseLayer(
        l_hid1, num_units=n_hid2,
        nonlinearity=lasagne.nonlinearities.tanh,
        b=init,
        W=init
    )

    # Finally, we'll add the fully-connected output layer, of 10 softmax units:
    l_out = lasagne.layers.DenseLayer(
        l_hid2, num_units=10,
        nonlinearity=lasagne.nonlinearities.softmax,
        b=init,
        W=init
    )
    
    prediction = lasagne.layers.get_output(l_out)
    
    # 10 discrete output classes -> pymc3 categorical distribution
    out = pm.Categorical('out', 
                         prediction,
                         observed=target_var)
    
    return out

Next, the function which creates the weights for the ANN. Because PyMC3 requires every random variable to have a different name, we're creating a class instead which returns uniquely named priors.

The priors act as regularizers here, trying to keep the weights of the ANN small. It's mathematically equivalent to putting an L2 loss term that penalizes large weights into the objective function, as is commonly done.
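
For reference (this is a standard identity, not something from the original post): the negative log density of a zero-mean Gaussian prior is

-\log p(w) = \frac{w^2}{2\sigma^2} + \mathrm{const}

so MAP estimation with this prior is the same as adding an L2 penalty \lambda w^2 with \lambda = 1/(2\sigma^2) to the loss; a smaller prior standard deviation corresponds to stronger regularization.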

In [5]:
class GaussWeights(object):
    def __init__(self):
        self.count = 0
    def __call__(self, shape):
        self.count += 1
        return pm.Normal('w%d' % self.count, mu=0, sd=.1, 
                         testval=np.random.normal(size=shape).astype(np.float64),
                         shape=shape)

If you compare what we have done so far to the previous blog post, it's apparent that using Lasagne is much more comfortable. We don't have to manually keep track of the shapes of the individual matrices, nor do we have to handle the underlying matrix math to make it all fit together.

Next are some functions to set up mini-batch ADVI; you can find more information in the prior blog post.

In [6]:
# Tensors and RV that will be using mini-batches
minibatch_tensors = [input_var, target_var]

# Generator that returns mini-batches in each iteration
def create_minibatch(data, batchsize=500):
    rng = np.random.RandomState(0)
    while True:
        # Return random data samples of set size batchsize each iteration
        ixs = rng.randint(data.shape[0], size=batchsize)
        yield data[ixs]

minibatches = zip(
    create_minibatch(X_train, 500),
    create_minibatch(y_train, 500),
)

total_size = len(y_train)

def run_advi(likelihood, advi_iters=50000):
    # Train on train data
    input_var.set_value(X_train[:500, ...])
    target_var.set_value(y_train[:500, ...])
    
    v_params = pm.variational.advi_minibatch(
        n=advi_iters, minibatch_tensors=minibatch_tensors, 
        minibatch_RVs=[likelihood], minibatches=minibatches, 
        total_size=total_size, learning_rate=1e-2, epsilon=1.0
    )
    trace = pm.variational.sample_vp(v_params, draws=500)
    
    # Predict on test data
    input_var.set_value(X_test)
    target_var.set_value(y_test)
    
    ppc = pm.sample_ppc(trace, samples=100)
    y_pred = mode(ppc['out'], axis=0).mode[0, :]
    
    return v_params, trace, ppc, y_pred

Putting it all together

Let's run our ANN with mini-batch ADVI:

In [14]:
with pm.Model() as neural_network:
    likelihood = build_ann(GaussWeights())
    v_params, trace, ppc, y_pred = run_advi(likelihood)
Iteration 0 [0%]: ELBO = -126739832.76
Iteration 5000 [10%]: Average ELBO = -17180177.41
Iteration 10000 [20%]: Average ELBO = -304464.44
Iteration 15000 [30%]: Average ELBO = -146289.0
Iteration 20000 [40%]: Average ELBO = -121571.36
Iteration 25000 [50%]: Average ELBO = -112382.38
Iteration 30000 [60%]: Average ELBO = -108283.73
Iteration 35000 [70%]: Average ELBO = -106113.66
Iteration 40000 [80%]: Average ELBO = -104810.85
Iteration 45000 [90%]: Average ELBO = -104743.76
Finished [100%]: Average ELBO = -104222.88

Make sure everything converged:

In [17]:
plt.plot(v_params.elbo_vals[10000:])
sns.despine()
In [18]:
sns.heatmap(confusion_matrix(y_test, y_pred))
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21764cf910>
In [20]:
print('Accuracy on test data = {}%'.format(accuracy_score(y_test, y_pred) * 100))
Accuracy on test data = 91.83%

The performance is not incredibly high but hey, it seems to actually work.

Hierarchical Neural Network: Learning Regularization from data

The connection between the standard deviation of the weight prior and the strength of the L2 penalization term leads to an interesting idea. Above we just fixed sd=0.1 for all layers, but maybe the first layer should have a different value than the second. And maybe 0.1 is too small or too large to begin with. In Bayesian modeling it is quite common to just place hyperpriors in cases like this and learn the optimal regularization to apply from the data. This saves us from tuning that parameter in a costly hyperparameter optimization. For more information on hierarchical modeling, see my other blog post.

In [20]:
class GaussWeightsHierarchicalRegularization(object):
    def __init__(self):
        self.count = 0
    def __call__(self, shape):
        self.count += 1
        
        regularization = pm.HalfNormal('reg_hyper%d' % self.count, sd=1)
        
        return pm.Normal('w%d' % self.count, mu=0, sd=regularization, 
                         testval=np.random.normal(size=shape),
                         shape=shape)
In [ ]:
with pm.Model() as neural_network_hier:
    likelihood = build_ann(GaussWeightsHierarchicalRegularization())
    v_params, trace, ppc, y_pred = run_advi(likelihood)
In [22]:
print('Accuracy on test data = {}%'.format(accuracy_score(y_test, y_pred) * 100))
Accuracy on test data = 92.13%

We get a small but nice boost in accuracy. Let's look at the posteriors of our hyperparameters:

In [23]:
pm.traceplot(trace, varnames=['reg_hyper1', 'reg_hyper2', 'reg_hyper3', 'reg_hyper4', 'reg_hyper5', 'reg_hyper6']);

Interestingly, they all are pretty different suggesting that it makes sense to change the amount of regularization that gets applied at each layer of the network.

Convolutional Neural Network

This is pretty nice, but everything so far would also have been pretty simple to implement directly in PyMC3, as I have shown in my previous post. Where things get really interesting is that we can now build much more complex ANNs, like Convolutional Neural Nets:

In [9]:
def build_ann_conv(init):
    network = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                        input_var=input_var)

    network = lasagne.layers.Conv2DLayer(
            network, num_filters=32, filter_size=(5, 5),
            nonlinearity=lasagne.nonlinearities.tanh,
            W=init)

    # Max-pooling layer of factor 2 in both dimensions:
    network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))

    # Another convolution with 32 5x5 kernels, and another 2x2 pooling:
    network = lasagne.layers.Conv2DLayer(
        network, num_filters=32, filter_size=(5, 5),
        nonlinearity=lasagne.nonlinearities.tanh,
        W=init)
    
    network = lasagne.layers.MaxPool2DLayer(network, 
                                            pool_size=(2, 2))
    
    n_hid2 = 256
    network = lasagne.layers.DenseLayer(
        network, num_units=n_hid2,
        nonlinearity=lasagne.nonlinearities.tanh,
        b=init,
        W=init
    )

    # Finally, we'll add the fully-connected output layer, of 10 softmax units:
    network = lasagne.layers.DenseLayer(
        network, num_units=10,
        nonlinearity=lasagne.nonlinearities.softmax,
        b=init,
        W=init
    )
    
    prediction = lasagne.layers.get_output(network)
    
    return pm.Categorical('out', 
                   prediction,
                   observed=target_var)
In [10]:
with pm.Model() as neural_network_conv:
    likelihood = build_ann_conv(GaussWeights())
    v_params, trace, ppc, y_pred = run_advi(likelihood, advi_iters=50000)
Iteration 0 [0%]: ELBO = -17290585.29
Iteration 5000 [10%]: Average ELBO = -3750399.99
Iteration 10000 [20%]: Average ELBO = -40713.52
Iteration 15000 [30%]: Average ELBO = -22157.01
Iteration 20000 [40%]: Average ELBO = -21183.64
Iteration 25000 [50%]: Average ELBO = -20868.2
Iteration 30000 [60%]: Average ELBO = -20693.18
Iteration 35000 [70%]: Average ELBO = -20483.22
Iteration 40000 [80%]: Average ELBO = -20366.34
Iteration 45000 [90%]: Average ELBO = -20290.1
Finished [100%]: Average ELBO = -20334.15
In [13]:
print('Accuracy on test data = {}%'.format(accuracy_score(y_test, y_pred) * 100))
Accuracy on test data = 98.21%

Much higher accuracy -- nice. I also tried this with the hierarchical model but it achieved lower accuracy (95%), I assume due to overfitting.

Let's make more use of the fact that we're in a Bayesian framework and explore uncertainty in our predictions. As our predictions are categories, we can't simply compute the posterior predictive standard deviation. Instead, we compute the chi-square statistic, which tells us how uniform a sample is. The more uniform, the higher our uncertainty. I'm not quite sure if this is the best way to do this; leave a comment if there's a more established method that I don't know about.

In [14]:
miss_class = np.where(y_test != y_pred)[0]
corr_class = np.where(y_test == y_pred)[0]
In [15]:
preds = pd.DataFrame(ppc['out']).T
In [16]:
chis = preds.apply(lambda x: chisquare(x).statistic, axis='columns')
In [18]:
sns.distplot(chis.loc[miss_class].dropna(), label='Error')
sns.distplot(chis.loc[corr_class].dropna(), label='Correct')
plt.legend()
sns.despine()
plt.xlabel('Chi-Square statistic');

As we can see, when the model makes an error, it is much more uncertain in the answer (i.e. the answers provided are more uniform). You might argue that you would get the same effect with a multinomial prediction from a regular ANN; however, this is not so.

Conclusions

By bridging Lasagne and PyMC3 and by using mini-batch ADVI to train a Bayesian Neural Network on a decently sized and complex data set (MNIST) we took a big step towards practical Bayesian Deep Learning on real-world problems.

Kudos to the Lasagne developers for designing their API to make it trivial to integrate for this unforeseen application. They were also very helpful and forthcoming in getting this to work.

Finally, I also think this shows the benefits of PyMC3. By relying on a commonly used language (Python) and abstracting the computational backend (Theano) we were able to quite easily leverage the power of that ecosystem and use PyMC3 in a manner that was never thought about when creating it. I look forward to extending it to new domains.

This blog post was written in a Jupyter Notebook. You can access the notebook here and follow me on Twitter to stay up to date.

by Thomas Wiecki at July 05, 2016 02:00 PM

Matthieu Brucher

Announcement: ATKTransientSplitter 1.0.0

I’m happy to announce the release of a mono transient splitter based on the Audio Toolkit. It is available on Windows and OS X (min. 10.11) in different formats.

ATK Transient Splitter

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKTransientSplitter.

The files, as well as the previous plugins and the source code, can be downloaded on SourceForge.


by Matt at July 05, 2016 07:18 AM

July 01, 2016

Continuum Analytics news

Verifying conda Recipes and Packages with anaconda-verify

Posted Friday, July 1, 2016

anaconda-verify is a tool for (passively) verifying conda recipes and conda packages. All Anaconda recipes, as well as the Anaconda packages, need to pass this tool before they are made publicly available. The purpose of this verification process is to ensure that recipes don't contain obvious bugs and that the conda packages we distribute to millions of users meet our high quality standards.

Historically, the conda packages which make up the Anaconda distribution were not created using conda-build, but with an internal build system. In fact, conda-build started as a public fork of this internal system 3 years ago. At that point, the Anaconda distribution had already been around for almost a year, and the only way to create conda packages was by using the internal system. While conda-build has made a lot of progress, the internal system basically stayed unchanged, because the needs of a system for building a distribution are quite different and not driven by the community using conda-build for continuous integration and support for other languages (e.g. Perl, Lua). On the other hand, the internal system has been developed to support Anaconda-distribution-specific needs, such as MKL-featured packages, source and license reference metadata, and interoperability between collections of packages.

In an effort to bridge the gap between our internal system and conda-build, we started using conda-build to create conda packages for the Anaconda distribution itself about a year ago. By now, more than 85% of the conda packages in the Anaconda distribution are created using conda-build. However, because of the different requirements mentioned above, we only allow certain features that conda-build offers. This helps to keep the Anaconda recipes simple, maintainable and functional with the rest of the internal system, which reads meta-data from the recipes. This is why we require conda recipes to be valid according to this tool.

Verifying recipes is easy. After installing the tool, using:

conda install anaconda-verify

you run:

anaconda-verify <path to conda recipe>

Another aspect of anaconda-verify is the ability to verify conda packages. Below are the most important checks anaconda-verify performs and, more importantly, why each of these checks is necessary or useful:

  • Ensure the content of info/files corresponds to the actual archived files in the tarball (except the ones in info/, obviously). This is important, because the files listed in info/files determine which files are linked into the conda environment. Any mismatch here would indicate that either (i) the tarball contains files which are not getting linked anywhere or (ii) files which do not exist are attempting to get linked (which would result in an error).
  • Check for disallowed directories in the tarball. A conda package should not contain files in the following directories: conda-meta/, conda-bld/, pkgs/, pkgs32/ and envs/, because this would, for example, allow a conda package to modify another existing environment.
  • Make sure the name, version and build values exist in info/index.json and that they correspond to the actual filename.
  • Ensure there are no files with both .bat and .exe extension. For example, if you had Scripts/foo.bat and Scripts/foo.exe one would shadow the other, and this would become confusing as to which one is actually executed when the user types foo. Although this check is always done, it is only relevant on Windows.
  • Ensure no easy-install.pth file exists. These files would cause problems, because two or more conda packages could each contain an easy-install.pth file, and they would overwrite each other when the packages are installed.
  • Ensure no "easy install scripts" exists. These are entry point scripts which setuptools creates which are extremely brittle, and should by replaced (overwritten) by the simple entry points scripts conda-build offers (use build/entry_points in your meta.yaml).
  • Ensure no .pyd or .so files have a .py file next to them. This is confusing, as it is not obvious which one the Python interpreter will import. Under certain circumstances, setuptools creates .py files next to shared object files for obscure reasons.
  • For packages (other than python), ensure that .pyc files are not in Python's standard library directory. This would happen when a .pyc file is missing from the standard library and then created during the build process of another package.
  • Check for missing .pyc files. Missing .pyc files cause two types of problems: (i) when building new packages, they might get included in the new package -- for example, when building scipy while numpy is missing .pyc files, these (numpy) .pyc files get included in the scipy package; (ii) there was a (buggy) Python release which would crash when .pyc files could not be written (due to file permissions).
  • Ensure Windows conda packages only contain object files which have the correct architecture. There was a bug in conda-build which would create 64-bit entry point executables when building 32-bit packages on a 64-bit system.
  • Ensure that site-packages does not contain certain directories when building packages. For example, when you build pandas, you don't want a numpy, scipy or setuptools directory to be contained in the pandas package. This would happen when the pandas build dependencies have missing .pyc files.
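
For the entry point replacement mentioned above, a conda-build recipe declares entry points in meta.yaml roughly like this (a minimal sketch; the tool, package and function names are placeholders):

build:
  entry_points:
    - mytool = mypackage.cli:main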

Here is an example of running the tool on conda packages:

$ anaconda-verify bitarray-0.8.1-py35_0.tar.bz2
==> /Users/ilan/aroot/tars64/bitarray-0.8.1-py35_0.tar.bz2 <==
    bitarray

In this case all is fine, and we see that only the bitarray directory is created in site-packages.

If you have questions about anaconda-verify, please feel free to reach out to our team.

by swebster at July 01, 2016 02:58 PM

June 28, 2016

Matthieu Brucher

Analog modeling: Triode circuit

When I started reviewing the diode clippers, the goal was to end up modeling a simple triode preamp. Thanks to Ivan Cohen from Musical Entropy, I’ve finally managed to derive the proper equation system to model this specific type of preamp.

Schematics

Let’s have a look at the circuit:

Triode simple model

There are several things to notice:

  • We need equations for the triode based on the voltages at its terminals
  • There is a non-null steady state, meaning there is current in the circuit when there is no input

For the equations, I’ve once again used Ivan Cohen’s work from his papers (the modified Koren equations), which is available in Audio Toolkit.

Steady state

So, for the second point, we need to compute the steady state of the circuit. This can be achieved by setting the input to the ground voltage and removing the capacitors. Once this is done, we can write the final equations of the system in y (the voltages of the plate, the grid and, finally, the cathode):

F = \begin{matrix} y(0) - V_{Bias} + I_p * R_p \\ I_g * R_g + y(1) \\ y(2) - (I_g + I_p) * R_k \end{matrix}

The Jacobian:

J= \begin{matrix} 1 + R_p * \frac{dI_p}{V_{pk}} && R_p * \frac{dI_p}{V_{gk}} && -R_p * (\frac{dI_p}{V_{pk}} + \frac{dI_p}{V_{gk}}) \\ \frac{dI_g}{V_{pk}} * R_g && 1 + R_g * \frac{dI_g}{V_{gk}} && -R_g * (\frac{dI_g}{V_{gk}} + \frac{dI_g}{V_{pk}}) \\ -(\frac{dI_p}{V_{pk}} + \frac{dI_g}{V_{pk}}) * R_k && -(\frac{dI_p}{V_{gk}} + \frac{dI_g}{V_{gk}}) * R_k && 1 + (\frac{dI_p}{V_{gk}} + \frac{dI_p}{V_{pk}} + \frac{dI_g}{V_{gk}} + \frac{dI_g}{V_{pk}}) * R_k \end{matrix}

With this system, we can run a Newton Raphson optimizer to find the proper stable state of the system. It may require lots of iterations, but this is not a problem: it’s done once at the beginning, and then we will use the next system for computing the new state when we input a signal.
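
To make this concrete, here is a small illustrative sketch of a vector Newton-Raphson solve in NumPy (this is not the Audio Toolkit implementation; in the real solver, F and J are the residual and Jacobian above with the modified Koren equations plugged in, and the toy 2x2 system below is only there so the example runs):

import numpy as np

def newton_raphson(F, J, y0, tol=1e-10, max_iter=200):
    # Generic vector Newton-Raphson: iterate y <- y - J(y)^-1 F(y) until F(y) ~ 0.
    y = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        residual = F(y)
        if np.max(np.abs(residual)) < tol:
            break
        y = y - np.linalg.solve(J(y), residual)
    return y

# Toy stand-in system: solve x0 + x1 - 3 = 0 and x0 * x1 - 2 = 0.
F = lambda x: np.array([x[0] + x[1] - 3.0, x[0] * x[1] - 2.0])
J = lambda x: np.array([[1.0, 1.0], [x[1], x[0]]])
print(newton_raphson(F, J, [3.0, 0.5]))   # converges to approximately [2., 1.]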

Transient state

As in the previous analog modeling posts, I’m using the SVF/DK method to simplify the ODE (removing the derivative dependency and turning the ODE into a nonlinear system). So there are two systems to solve. The first one is the ODE, solved with a traditional Newton-Raphson optimizer (from x, we want to compute y = \begin{matrix} V_k \\ V_{out} - V_p \\ V_p \\ V_g \end{matrix}):

F = \begin{matrix} I_g + I_p + i_{ckeq} - y(0) * (1/R_k + 2*C_k/dt) \\  i_{coeq} + (y(1) + y(2)) / R_o + y(1) * 2*C_o/dt \\ (y(2) - V_{Bias}) / R_p + (I_p + (y(1) + y(2)) / R_o) \\ (y(3) - x(i)) / R_g + I_g \end{matrix}

Which makes the Jacobian:

J= \begin{matrix} -(\frac{dI_g}{V_{gk}} + \frac{dI_p}{V_{gk}} + \frac{dI_g}{V_{pk}} + \frac{dI_p}{V_{pk}}) - (1/R_k + 2*C_k/dt) && 0 && (\frac{dI_g}{V_{pk}} + \frac{dI_p}{V_{pk}}) && (\frac{dI_g}{V_{gk}} + \frac{dI_p}{V_{gk}}) \\ 0 && 1/R_o + 2*C_o/dt && 1/R_o && 0 \\ -(\frac{dI_p}{V_{gk}} + \frac{dI_p}{V_{pk}}) && 1/R_o && 1/R_p + 1/R_o + \frac{dI_p}{V_{pk}} && \frac{dI_p}{V_{gk}} \\ -(\frac{dI_g}{V_{gk}} + \frac{dI_g}{V_{pk}}) && 0 && \frac{dI_g}{V_{pk}} && \frac{dI_g}{V_{gk}} + 1/R_g \end{matrix}

Once more, this system can be optimized with a classic NR optimizer, this time in a few iterations (3 to 5, depending on the oversampling, the input signal…)

The updates for i_{ckeq} and i_{coeq} are:

\begin{matrix} i_{ckeq} \\ i_{coeq} \end{matrix} = \begin{matrix} 4 * C_k / dt * y(1) - i_{ckeq} \\ -4 * C_o / dt * y(2) - i_{coeq} \end{matrix}

Of course, we need start values for these equivalent currents at the steady state. This is quite simple, as the previous state is the same as the new one, which gives:

\begin{matrix} i_{ckeq} \\ i_{coeq} \end{matrix} = \begin{matrix} 2 * C_k / dt * y(1) \\ -2 * C_o / dt * y(2) \end{matrix}

Result

Let’s start with the behavior for a simple sinusoid signal:

Tube behavior for a 100Hz signal (10V)

You can see that at the beginning of the sinusoid the output signal is not stable: the average moves down with time. This is to be expected, as the tube clearly compresses the negative side of the sinusoid, while it almost chops off the positive side above 30 to 40V. This asymmetric behavior is what gives the tube its warmth. The even harmonics are also clear with a sine sweep of the system:

Sine sweep on the triode circuit

Strangely enough, even though the signal seems quite distorted, the harmonics are really not strong (compared to the SD1 or TS9). I believe this is one reason users love tube preamps.

Conclusion

There are a few new developments here compared to my previous posts (SD1 and TS9). The first is the fact that we have a more complex system, with several capacitors that need to be simulated individually. This leads to a vector implementation of the NR algorithm.

The second one is the requirement to compute a steady state that is not zero everywhere. Without this, the first iterations of the system would be more than chaotic, not realistic and hard on the CPU. Definitely not what you would want!

Based on these two developments, it is now possible to easily develop more complex circuit modeling. Even if the cost here is high (due to the complex triode equations), the conjunction of the NR algorithm with the DK method makes it doable to have real-time simulation.


by Matt at June 28, 2016 07:21 AM

June 27, 2016

Continuum Analytics news

Continuum Analytics Unveils Anaconda Mosaic to Make Enterprise Data Transformations Portable for Heterogeneous Data

Posted Tuesday, June 28, 2016

Empowers data scientists and analysts to explore, visualize, and catalog data and transformations for disparate data sources enabling data portability and faster time-to-insight

AUSTIN, TX—June 28, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced the availability of Anaconda Mosaic. With the ability to easily create and catalog transformations against heterogeneous data stores, Anaconda Mosaic empowers data scientists, quants and business analysts to interactively explore, visualize, and transform larger-than-memory datasets to more quickly discover new insights.   

Enterprise data architecture is becoming increasingly complex. Data stores have a relatively short half life and data is being shifted to new data stores - NoSQL, SQL, flat files - at a higher frequency. In order for organizations to find insights from the data they must first find existing transformations and rewrite the transformations for the new data store. This creates delays in getting the insights from the data. Continuum Analytics’ Anaconda Mosaic enables organizations to quickly explore, visualize, and redeploy transformations based on pandas and SQL without rewriting the transformations while maintaining governance by tracking data lineage and provenance.

“Through the course of daily operations, businesses accumulate huge amounts of data that gets locked away in legacy databases and flat file repositories. The transformations that made the data usable for analysis gets lost, buried or simply forgotten,” said Michele Chambers, Executive Vice President Anaconda Business Unit & CMO at Continuum Analytics. “Our mission is for Anaconda Mosaic to unlock the mystery of this dark data, making it accessible for businesses to quickly redeploy to new data stores without any refactoring so enterprises can reap the analytic insight and value almost instantly. By eliminating refactoring of transformations, enterprises dramatically speed up their time-to-value, without having to to spend lengthy cycles on the refactoring process.”

Some of the key features of Anaconda Mosaic include: 

  • Visually explore your data. Mosaic provides built-in visualizations for large heterogeneous datasets that makes it easy for data scientists and business analysts to accurately understand data including anomalies. 
  • Instantly get portable transformations. Create transformations with the expression builder to catalog data sources and transformations. Execute the transformation against heterogeneous data stores while tracking data lineage and provenance. When data stores change, simply deploy the transformations and quickly get the data transformed and ready for analysis. 
  • Write once, compute anywhere. For maximum efficiency, Mosaic translates transformations and orchestrates computation execution on the data backend, minimizing the costly movement of data across the network and taking full advantage of the built-in highly optimized code featured in the data backend. Users can access data in multiple data stores with the same code without rewriting queries or analytic pipelines.
  • Harvest large flat file repositories in place: Mosaic combines flat files, adds derived data and filters for performance easily. This allows users to describe the structure of their data in large flat file repositories and uses that description in data discovery, visualization, and transformations, saving the user from writing tedious ETL code.  Mosaic ensures that the data loaded is only what is necessary to compute the transformation, which can lead to significant memory and performance gains.

Continuum Analytics is hosting a webinar on June 30 that will take attendees through how to use Mosaic to simplify transformations and get to faster insights. Please register here.

Mosaic is available to current Anaconda Enterprise subscribers; to find out more about Anaconda Mosaic, get in touch

About Continuum Analytics

Continuum Analytics is the creator and driving force behind Anaconda, the leading, Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team – data scientists, developers, DevOps, data engineers and business analysts – to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

To learn more about Continuum Analytics, visit www.continuum.io.

###

Media Contact:
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

by swebster at June 27, 2016 07:39 PM

Anaconda Fusion: A Portal to Open Data Science for Excel

Posted Monday, June 27, 2016

Excel has been business analysts’ go-to program for years. It works well, and its familiarity makes it the currency of the realm for many applications.

But, in a bold new world of predictive analytics and Big Data, Excel feels cut off from the latest technologies and limited in the scope of what it can actually take on.

Fortunately for analysts across the business world, a new tool has arrived to change the game — Anaconda Fusion.

A New Dimension of Analytics

The interdimensional portal has been a staple of classic science fiction for decades. Characters step into a hole in space and emerge instantly in an entirely different setting — one with exciting new opportunities and challenges.

Now, Data Science has a portal of its own. The latest version of Anaconda Fusion, an Open Data Science (ODS) integration for Microsoft Excel, links the familiar world of spreadsheets (and the business analysts that thrive there) to the “alternate dimension” of Open Data Science that is reinventing analytics.

With Anaconda Fusion and other tools from Anaconda, business analysts and data scientists can easily share work — like charts, tables, formulas and insights — across Excel and ODS languages such as Python, erasing the partition that once divided them.

Jupyter (formerly IPython) is a popular approach to sharing across the scientific computing community, with notebooks combining code, visualizations and comments all in one document. With Anaconda Enterprise Notebooks, this is now available under a governed environment, providing the collaborative locking, version control, notebook differencing and searching needed to operate in the enterprise. Since Anaconda Fusion, like the entire Anaconda ecosystem, integrates seamlessly with Anaconda Enterprise Notebooks, businesses can finally empower Excel gurus to collaborate effectively with the entire Data Science team.

Now, business analysts can exploit the ease and brilliance of Python libraries without having to write any code. Packages such as scikit-learn and pandas drive machine learning initiatives, enabling predictive analytics and data transformations, while plotting libraries, like Bokeh, provide rich interactive visualizations.

With Anaconda Fusion, these tools are available within the familiar Excel environment—without the need to know Python. Contextually-relevant visualizations generated from Python functions are easily embedded into spreadsheets, giving business analysts the ability to make sense of, manipulate and easily interpret data scientists’ work. 

A Meeting of Two Cultures

Anaconda Fusion is connecting two cultures from across the business spectrum, and the end result creates enormous benefits for everyone.

Business analysts can leverage the power, flexibility and transparency of Python for data science using the Excel they are already comfortable with. This enables functionality far beyond Excel, but can also teach business analysts to use Python in the most natural way: gradually, on the job, as needed and in a manner that is relevant to their context. Given that the world is moving more and more toward using Python as a lingua franca for analytics, this benefit is key.

On the other side of the spectrum, Python-using data scientists can now expose data models or interactive graphics in a well-managed way, sharing them effectively with Excel users. Previously, sharing meant sending static images or files, but with Anaconda Fusion, Excel workbooks can now include a user interface to models and interactive graphics, eliminating the clunky overhead of creating and sending files.

It’s hard to overstate how powerful this unification can be. When two cultures learn to communicate more effectively, it results in a cross-pollination of ideas. New insights are generated, and synergistic effects occur.

The Right Tools

The days of overloaded workarounds are over. With Anaconda Fusion, complex and opaque Excel macros can now be replaced with the transparent and powerful functions that Python users already know and love.

The Python programming community places a high premium on readability and clarity. Maybe that’s part of why it has emerged as the fourth most popular programming language used today. Those traits are now available within the familiar framework of Excel.

Because Python plays so well with web technologies, it’s also simple to transform pools of data into shareable interactive graphics — in fact, it's almost trivially easy. Simply email a web link to anyone, and they will have a beautiful graphics interface powered by live data. This is true even for the most computationally intense cases — Big Data, image recognition, automatic translation and other domains. This is transformative for the enterprise.
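As a small illustration of how little code such a shareable graphic takes, here is a toy Bokeh sketch (the data and file name are made up; this is not code from the Fusion product itself):

from bokeh.plotting import figure, output_file, show
import numpy as np

# toy data standing in for a "pool of data"
x = np.linspace(0, 10, 200)
y = np.sin(x) + np.random.normal(scale=0.1, size=x.size)

output_file("shareable_plot.html")  # a self-contained HTML file you can share as a link
p = figure(title="Interactive plot", tools="pan,wheel_zoom,box_zoom,reset")
p.line(x, y, line_width=2)
show(p)

Opening the resulting HTML file (or hosting it and emailing the link) gives anyone pan and zoom over the data without installing anything.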

Jump Into the Portal

The glowing interdimensional portal of Anaconda Fusion has arrived, and enterprises can jump in right away. It’s a great time to unite the experience and astuteness of business analysts with the power and flexibility of Python-powered analytics.

To learn more, you can watch our Anaconda Fusion webinar on-demand, or join our Anaconda Fusion Innovators Program to get early access to exclusive features -- free and open to anyone. You can also contact us with any questions about how Anaconda Fusion can help improve the way your business teams share data. 

by swebster at June 27, 2016 03:36 PM

June 23, 2016

Enthought

5 Simple Steps to Create a Real-Time Twitter Feed in Excel using Python and PyXLL

PyXLL 3.0 introduced a new, simpler, way of streaming real time data to Excel from Python. Excel has had support for real time data (RTD) for a long time, but it requires a certain knowledge of COM to get it to work. With the new RTD features in PyXLL 3.0 it is now a lot […]

by Isaac Franz at June 23, 2016 06:13 PM

June 21, 2016

Enthought

AAPG 2016 Conference Technical Presentation: Unlocking Whole Core CT Data for Advanced Description and Analysis

Microscale Imaging for Unconventional Plays Track Technical Presentation: Unlocking Whole Core CT Data for Advanced Description and Analysis American Association of Petroleum Geologists (AAPG) 2016 Annual Convention and Exposition Technical Presentation Tuesday June 21st at 4:15 PM, Hall B, Room 2, BMO Centre, Calgary Presented by: Brendon Hall, Geoscience Applications Engineer, Enthought, and Andrew Govert, Geologist, […]

by admin at June 21, 2016 03:15 PM

Matthieu Brucher

Audio Toolkit: Transient splitter

After my transient shaper, some people told me it would be nice to have a splitter: split the signal into two tracks, one with the transient, another with the sustain. For instance, it would be interesting to apply a different distortion to each signal.

So for instance this is what could happen for a simple signal. The sustain signal is not completely shut off, and there can be a smooth transition between the two signals (thanks to the smoothness parameter). Of course, the final signals have to sum back to the original signal.

How a transient splitter would work
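To make the idea concrete, here is a rough NumPy sketch of the principle, not the Audio Toolkit implementation; the envelope times and the way the smoothness parameter is applied are assumptions of mine:

import numpy as np

def split_transient(x, fs, fast_ms=1.0, slow_ms=50.0, smoothness=1.0):
    """Split x into transient and sustain parts that sum back to x."""
    def follower(sig, ms):
        # simple one-pole envelope follower on the rectified signal
        coef = np.exp(-1.0 / (fs * ms * 1e-3))
        env = np.empty(len(sig))
        acc = 0.0
        for i, v in enumerate(np.abs(sig)):
            acc = coef * acc + (1.0 - coef) * v
            env[i] = acc
        return env

    fast, slow = follower(x, fast_ms), follower(x, slow_ms)
    # gain tends towards 1 where the fast envelope exceeds the slow one (transients)
    gain = np.clip((fast - slow) / (slow + 1e-12), 0.0, 1.0) ** smoothness
    return gain * x, (1.0 - gain) * x  # transient, sustain; their sum is exactly x

Whatever the gain curve looks like, the two outputs always sum back to the input, which is the property described above.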

I may end up doing a stereo version (with M/S capabilities) for the splitter, but maybe also another one with some distortion algorithms before everything is summed up again.

Let me know what you think about these ideas.

by Matt at June 21, 2016 07:59 AM

June 19, 2016

Filipe Saraiva

My LaKademy 2016

LaKademy 2016 group photo

At the end of May, ~20 gearheads from different countries of Latin America got together in Rio de Janeiro to work on several fronts of KDE. This is our ‘multiple projects sprint’, named LaKademy!

Like all previous editions of LaKademy, this year I worked hard on Cantor; unlike all previous editions, this year I also did some work on new projects to be released at some point in the future. So, here is my report on LaKademy 2016.

Cantor

LaKademy is very important to Cantor development because during the sprint I can focus and work hard on implementing great features in the software. In past editions I started the Python 2 backend development, ported Cantor to Qt5/KF5, dropped kdelibs4support, and more.

This year was the first LaKademy since I became the maintainer of Cantor and, even more amazing, it was the first edition where I was not the only developer working on Cantor: we had a team working on different parts of the project.

My main work was to perform a heavy bug triage in Cantor, closing old bugs and confirming some of them. In addition, I was able to fix several bugs, such as the LaTeX rendering and the crash after closing the window for the Sage backend, and the plot commands for the Octave backend.

My second task was to help the other developers working on Cantor; I was very happy to work with different LaKademy attendees on the software. I helped Fernando Telles, my SoK 2015 student, fix the Sage backend for Sage versions > 7.2. Wagner Reck was working on a new backend for ROOT, the scientific programming framework developed by CERN. Rafael Gomes created a Docker image for Cantor in order to simplify environment configuration, builds, and code contributions for new developers. He wants to use it in other KDE software and I am really excited to see Cantor as the first software in this experiment.

Other relevant work was some discussions with other developers about selecting an “official” technology for creating Cantor backends. Currently Cantor has backends developed in several ways: some of them use C/C++ APIs, others use Q/KProcess, others use DBus… as you can imagine, maintaining all these backends is a job for crazy humans.

I have not selected the official technology yet. Both DBus and Q/KProcess have advantages and disadvantages (DBus is a more ‘elegant’ solution, but bringing Cantor to other operating systems may be easier if we use Q/KProcess)… well, I will wait for the new DBus-based Julia backend, in development by our GSoC 2016 student, before making a decision about which solution to use.

From left to right: Ronny, Fernando, Ícaro, and me ;)

New projects: Sprat and Leibniz (non-official names)

This year I was able to work on some new projects to be released in the future. Their provisional names are Sprat and Leibniz.

Sprat is a text editor for writing drafts of scientific papers. Scientific text follows certain patterns of sentences and rhetorical figures. Think about “An approach based on a genetic algorithm was applied to the traveling salesman problem”: it is easy to identify the pattern in that text. Linguists have worked on this theme, and it is possible to classify sentences based on the communication objective a sentence is meant to achieve. Sprat will allow the user to navigate a set of such sentences and select them to create drafts of scientific papers. I intend to release Sprat this year, so please wait for more news soon.

Leibniz is Cantor without worksheets. Sometimes you just want to run your mathematical method, your scientific script, and some related programs, without putting explanations, figures, or videos in the terminal. In the KDE world we have amazing technologies (KonsolePart, KTextEditor, QWidgets, and plugins) that allow us to develop a “Matlab-like” interface to all kinds of scientific programming languages like Octave, Python, Scilab, R… just by running these programs in KonsolePart we get access to syntax highlighting, tab completion… I would like to have software like this, so I started the development. I decided to develop a new application rather than a new view for Cantor because I think the source code of Leibniz will be small and easier to maintain.

So, if you are excited about either of them, let me know in the comments below and wait a few months for more news! 🙂

Community-related tasks

During LaKademy we had our promo meeting, an entire morning to discuss KDE promo actions in Latin America. KDE will have a day of activities at FISL, and we are excited to throw amazing KDE 20th birthday parties at the main free software events in Brazil. We also evaluated and discussed the continuation of some interesting activities like Engrenagem (our videocast series) and new projects like demo videos for KDE applications.

In that meeting we also chose the city to host LaKademy 2017: Belo Horizonte! We expect an incredible year of KDE activities in Latin America, to be evaluated at our next promo meeting.

Conclusion: “O KDE na América Latina continua lindo” (“KDE in Latin America is still beautiful”)

This edition of LaKademy saw strong and dedicated work by all attendees on several fronts of KDE, but we also had some moments to spend together and consolidate our community and friendship. Unfortunately we did not have time to explore Rio de Janeiro (it was my first time in the city), but I had a good impression of the city and its people. I intend to go back, maybe even this year.

The best part of being a member of a community like KDE is making friends for life, people with whom you like to share beer and food while chatting about anything. This is amazing for me and I found it in KDE. <3

Thank you KDE, and see you soon at the next LaKademy!

by Filipe Saraiva at June 19, 2016 10:52 PM

June 17, 2016

Continuum Analytics news

Anaconda and Docker - Better Together for Reproducible Data Science

Posted Monday, June 20, 2016

Anaconda integrates with many different providers and platforms to give you access to the data science libraries you love on the services you use, including Amazon Web Services, Microsoft Azure, and Cloudera CDH. Today we’re excited to announce our new partnership with Docker.

As part of the announcements at DockerCon this week, Anaconda images will be featured in the new Docker Store, including Anaconda and Miniconda images based on Python 2 and Python 3. These freely available Anaconda images for Docker are now verified, will be featured in the Docker Store when it launches, are being regularly scanned for security vulnerabilities and are available from the ContinuumIO organization on Docker Hub.

The Anaconda images for Docker make it easy to get started with Anaconda on any platform, and provide a flexible starting point for developing or deploying data science workflows with more than 100 of the most popular Open Data Science packages for Python and R, including data analysis, visualization, optimization, machine learning, text processing and more.

Whether you’re a developer, data scientist, or devops engineer, Anaconda and Docker can provide your entire data science team with a scalable, deployable and reproducible Open Data Science platform.

Use Cases with Anaconda and Docker

Anaconda and Docker are a great combination to empower your development, testing and deployment workflows with Open Data Science tools, including Python and R. Our users often ask whether they should be using Anaconda or Docker for data science development and deployment workflows. We suggest using both - they’re better together!

Anaconda’s sandboxed environments and Docker’s containerization complement each other to give you portable Open Data Science functionality when you need it - whether you’re working on a single machine, across a data science team or on a cluster.

Here are a few different ways that Anaconda and Docker make a great combination for data science development and deployment scenarios:

1) Quick and easy deployments with Anaconda

Anaconda and Docker can be used to quickly reproduce data science environments across different platforms. With a single command, you can quickly spin up a Docker container with Anaconda (and optionally with a Jupyter Notebook) and have access to 720+ of the most popular packages for Open Data Science, including Python and R.

2) Reproducible build and test environments with Anaconda

At Continuum, we’re using Docker to build packages and libraries for Anaconda. The build images are available from the ContinuumIO organization on Docker Hub (e.g., conda-builder-linux and centos5_gcc5_base). We also use Docker with continuous integration services, such as Travis CI, for automated testing of projects across different platforms and configurations (e.g., Dask.distributed and hdfs3).

Within the open-source Anaconda and conda community, Docker is also used for reproducible test and build environments. Conda-forge is a community-driven infrastructure for conda recipes that uses Docker with Travis CI and CircleCI to build, test and upload conda packages that include Python, R, C++ and Fortran libraries. The Docker images used in conda-forge are available from the conda-forge organization on Docker Hub.

3) Collaborative data science workflows with Anaconda

You can use Anaconda with Docker to build, containerize and share your data science applications with your team. Collaborative data science workflows with Anaconda and Docker make the transition from development to deployment as easy as sharing a Dockerfile and conda environment.

Once you’ve containerized your data science applications, you can use container clustering systems, such as Kubernetes or Docker Swarm, when you’re ready to productionize, deploy and scale out your data science applications for many users.

4) Endless combinations with Anaconda and Docker

The combined portability of Anaconda and flexibility of Docker enable a wide range of data science and analytics use cases.

A search for “Anaconda” on Docker Hub shows many different ways that users are leveraging libraries from Anaconda with Docker, including turnkey deployments of Anaconda with Jupyter Notebooks; reproducible scientific research environments; and machine learning and deep learning applications with Anaconda, TensorFlow, Caffe and GPUs.

Using Anaconda Images with Docker

There are many ways to get started using the Anaconda images with Docker. First, choose one of the Anaconda images for Docker based on your project requirements. The Anaconda images include the default packages listed here, and the Miniconda images include a minimal installation of Python and conda.

continuumio/anaconda (based on Python 2.7)
continuumio/anaconda3 (based on Python 3.5)
continuumio/miniconda (based on Python 2.7)
continuumio/miniconda3 (based on Python 3.5)

For example, we can use the continuumio/anaconda3 image, which can be pulled from the Docker repository:

$ docker pull continuumio/anaconda3

Next, we can run the Anaconda image with Docker and start an interactive shell:

$ docker run -i -t continuumio/anaconda3 /bin/bash

Once the Docker container is running, we can start an interactive Python shell, install additional conda packages or run Python applications.

Alternatively, we can start a Jupyter Notebook server with Anaconda from a Docker image:

$ docker run -i -t -p 8888:8888 continuumio/anaconda3 /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8888 --no-browser"

You can then view the Jupyter Notebook by opening http://localhost:8888 in your browser, or http://<DOCKER-MACHINE-IP>:8888 if you are using a Docker Machine VM.

Once you are inside of the running notebook, you can import libraries from Anaconda, perform interactive computations and visualize your data.
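For instance, a first notebook cell might look like the following (a toy example, not part of the original post), just to check that the Anaconda packages are importable out of the box:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# a small random walk, plotted inline in the notebook
walk = pd.Series(np.random.randn(1000)).cumsum()
walk.plot(title="Random walk")
plt.show()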

Additional Resources for Anaconda and Docker

Anaconda and Docker complement each other and make working with Open Data Science development and deployments easy and scalable. For collaborative workflows, Anaconda and Docker provide everyone on your data science team with access to scalable, deployable and reproducible Open Data Science.

Get started with Anaconda and Docker by visiting the ContinuumIO organization on Docker Hub. The Anaconda images will also be featured in the Docker Store when it launches.

Interested in using Anaconda and Docker in your organization for Open Data Science development, reproducibility and deployments? Get in touch with us if you’d like to learn more about how Anaconda can empower your enterprise with Open Data Science, including an on-premise package repository, collaborative notebooks, cluster deployments and custom consulting/training solutions.

by swebster at June 17, 2016 04:03 PM

June 14, 2016

Continuum Analytics news

Orange Part II: Monte Carlo Simulation

Posted Wednesday, June 22, 2016

For the blog post Orange Part I: Building Predictive Models, please click here.

In this blogpost, we will explore the versatility of Orange through a Monte Carlo simulation of Apple’s stock price. For an explanation of Monte Carlo simulation for stocks, visit Investopedia.

Let’s take a look at our schema:

We start off by grabbing AAPL stock data off of Yahoo! Finance and loading it into our canvas. This gives us all of AAPL’s data starting from 1980, but we only want to look at relatively recent data. Fortunately, Orange comes with a variety of data management and preprocessing techniques. Here, we can use the “Purge Domain” widget to simply remove the excess data.
 
After doing so, we can see what AAPL’s closing stock price is post-2008 through a scatter plot. 

In order to run our simulation, we need certain inputs, including daily returns. Our data does not come with AAPL’s daily returns, but, fortunately, daily returns can easily be calculated via pandas. We can save our current data, add on our daily returns and then load up the modified dataset back into our canvas. After saving the data to AAPL.tab, we run the following script on our data: https://anaconda.org/rahuljain/monte-carlo-with-orange/notebook. After doing so, we simply load up the new data. Here is what our daily returns look like: 
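The linked script essentially boils down to a percentage change on the closing price; a minimal pandas sketch (with hypothetical file names, since the post works with Orange's .tab format) might be:

import pandas as pd

# assumes a CSV export of the post-2008 AAPL data with Date and Close columns
aapl = pd.read_csv("AAPL.csv", parse_dates=["Date"], index_col="Date")
aapl["Daily Return"] = aapl["Close"].pct_change()  # (P_t - P_{t-1}) / P_{t-1}
aapl.to_csv("AAPL_with_returns.csv")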

Now, we need to use our daily returns to run the Monte Carlo simulation. We can again use a Python script for this task; this time let’s use the built-in Python script widget. Note that we could have used the built-in widget for the last script as well, but we wanted to see how we could save/load our data within the canvas. For our Monte Carlo simulation, we will need four parameters: starting stock price, number of days we want to simulate and the standard deviation and mean of AAPL’s daily returns. We can find these inputs with the following script: 

We go ahead and run our simulation 1000 times with the starting stock price of $125.04. The script takes in our stock data and outputs a dataset containing 1000 price points 365 days later. 
We can visualize these prices via a box plot and histograms: 

With this simulated data, we can make various calculations; a common one is Value at Risk (VaR). Here, we can say with 99% confidence that our stock’s price will be above $116.41 in 365 days, so with 99% confidence we are putting at most $8.63 (the starting price minus $116.41) at risk.
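For readers who prefer code to canvas, the simulation and the VaR figure can be reproduced with a short NumPy sketch along these lines (this is not the script from the post; the normally distributed returns and the seed are assumptions of mine):

import numpy as np
import pandas as pd

returns = pd.read_csv("AAPL_with_returns.csv")["Daily Return"].dropna()

start_price = 125.04    # starting stock price used in the post
days, runs = 365, 1000  # horizon and number of simulated paths
mu, sigma = returns.mean(), returns.std()

np.random.seed(0)  # arbitrary seed, for reproducibility
daily = np.random.normal(mu, sigma, size=(runs, days))  # simulated daily returns
final_prices = start_price * np.cumprod(1.0 + daily, axis=1)[:, -1]

# 1% quantile of the simulated prices gives the loss not exceeded with 99% confidence
var_99 = start_price - np.percentile(final_prices, 1)
print("99%% one-year VaR: $%.2f" % var_99)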

We have successfully built a Monte Carlo simulation via Orange; this task demonstrated how we can use Orange outside of its machine learning tools.

Summary

These three demos in this Orange blogpost series showed how Orange users can quickly and intuitively work with data sets. Because of its component-based design and integration with Python, Orange should appeal to machine learning researchers for its speed of execution and ease of prototyping new methods. A graphical user interface is provided through visual programming and a large toolbox of widgets that support interactive data exploration. Component-based design, at the level of both procedural and visual programming, flexibility in combining components to design new machine learning methods and data mining applications, and a user-friendly environment are the most significant attributes of Orange, and where Orange can make its biggest contribution to the community.

by swebster at June 14, 2016 08:26 PM

Orange Part I: Building Predictive Models

Posted Wednesday, June 15, 2016

In this blog series we will showcase Orange, an open source data visualization and data analysis tool, through two simple predictive models and a Monte Carlo Simulation. 

Introduction to Orange

Orange is a comprehensive, component-based framework for machine learning and data mining. It is intended for both experienced users and researchers in machine learning, who want to prototype new algorithms while reusing as much of the code as possible, and for those just entering the field who can either write short Python scripts for data analysis or enjoy the powerful, easy-to-use visual programming environment. Orange includes a range of techniques, such as data management and preprocessing, supervised and unsupervised learning, performance analysis and a range of data and model visualization techniques.

Orange has a visual programming front-end for explorative data analysis and visualization called Orange Canvas. Orange Canvas is a visual, component-based programming approach that allows us to quickly explore and analyze data sets. Orange’s GUI is composed of widgets that communicate through channels; a set of connected widgets is called a schema. The creation of schemas is quick and flexible, because widgets are added on through a drag-and-drop method.

Orange can also be used as a Python library. Using the Orange library, it is easy to prototype state-of-the-art machine learning algorithms.

Building a Simple Predictive Model in Orange

We start with two simple predictive models in the Orange canvas and their corresponding Jupyter notebooks. 

First let’s take a look at our Simple Predictive Model - Part 1 notebook. Now, let’s recreate the model in the Orange Canvas. Here is the schema for predicting the results of the Iris data set via a classification tree in Orange:

Notice the toolbar on the left of the canvas - this is where the 100+ widgets can be found and dragged onto the canvas. Now, let’s take a look at how this simple schema works. The schema reads from left to right, with information flowing from widget to widget through the pipelines. After the Iris data set is loaded in, it can be viewed through a variety of widgets. Here, we chose to see the data in a simple data table and a scatter plot. When we click on those two widgets, we see the following:

With just three widgets, we already get a sense of the data we are working with. The scatter plot has an option to “Rank Projections,” determining the best way to view our data. In this case, having the scatter plot as “Petal Width vs Petal Length” allows us to immediately see a potential pattern in the width of a flower’s petal and the type of iris the flower is. Beyond scatter plots, there are a variety of different widgets to help us visualize our data in Orange. 

Now, let’s look at how we built our predictive model. We simply connected the data to a Classification Tree widget and can view the tree through a Classification Tree Viewer widget. 

We can see exactly how our predictive model works. Now, we connect our model and our data to the “Test and Score” and “Predictions” widgets. The Test and Score widget is one way of seeing how well our Classification Tree performs: 

The Predictions widget predicts the type of iris flower given the input data. Instead of looking at a long list of these predictions, we can use a confusion matrix to see our predictions and their accuracy. 

Thus, we see our model misclassified 3/150 data instances. 

We have seen how quickly we can build and visualize a working predictive model in the Orange canvas. Now, let’s take a look at how the exact same model can once again be built via scripting in Orange, a Python 3 data mining library.
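For reference, a minimal scripted version of the same model might look roughly like this (API names assume a 2016-era Orange 3 release; the notebook linked above is the authoritative version):

import Orange

iris = Orange.data.Table("iris")               # built-in Iris data set
learner = Orange.classification.TreeLearner()
model = learner(iris)                          # train on the full data, as in the canvas
predictions = model(iris)                      # predict the class of every instance
misclassified = int((predictions != iris.Y).sum())
print("Misclassified: %d / %d" % (misclassified, len(iris)))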

Building a Predictive Model with a Hold Out Test Set in Orange

In our second example of a predictive model, we make the model slightly more complicated by holding out a test set. By doing so, we can use separate datasets to train and test our model, thus helping to avoid overfitting. Here is the original notebook. 

Now, let’s build the same predictive model in the Orange Canvas. The Orange Canvas will allow us to better visualize what we are building. 

Orange Schema:

As you can tell, the difference between Part 1 and Part 2 is the Data Sampler widget. This widget randomly separates 30% of the data into the testing data set. Thus, we can build the same model, but more accurately test it using data the model has never seen before. 

This example shows how easy it is to modify existing schemas. We simply introduced one new widget to vastly improve our model. 

Now let’s look at the same model built via the Orange Python 3 library.
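A rough scripted equivalent, again assuming a 2016-era Orange 3 API, could hold out 30% of the rows much like the Data Sampler widget does:

import numpy as np
import Orange

iris = Orange.data.Table("iris")
np.random.seed(0)
indices = np.random.permutation(len(iris))
n_test = int(0.3 * len(iris))  # hold out 30% of the data for testing
test = Orange.data.Table.from_table_rows(iris, indices[:n_test])
train = Orange.data.Table.from_table_rows(iris, indices[n_test:])

model = Orange.classification.TreeLearner()(train)  # train only on the training rows
accuracy = float((model(test) == test.Y).mean())    # evaluate on unseen rows
print("Hold-out accuracy: %.3f" % accuracy)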

Summary

In this blogpost, we have introduced Orange, an open source data visualization and data analysis tool, and presented two simple predictive models. In our next blogpost, we will show how to build a Monte Carlo simulation with Orange.

by swebster at June 14, 2016 08:06 PM

Matthieu Brucher

Analog modeling: SD1 vs TS9

There are so many different distortion/overdrive/fuzz guitar pedals, and some have a better reputation than others. Two of them have a reputation of being close (one copied from the other), and I already explained how one of these could be modeled (and I have a plugin with it!). So let’s work on comparing the SD1 and the TS9.

Global comparison

I won’t focus on the input and output stage, although they can play a role in the sound (especially since the output stage is the only difference between the TS9 and the TS808…).

Let’s have a look at the schematics:

SD1 annotated schematic

TS9 schematic

The global circuits seem similar, with similar functionalities. The overdrive sections are very close, and actually, without the asymmetry of the SD1, they would be identical (there are versions of the SD1 with the 51p capacitor or a close enough value). The tone circuits have more differences, with the TS9 having an additional 10k resistor and a “missing” capacitor around the op-amp. The values are also quite different, but based on a similar design.

Now, the input stages are different. The SD1 has a larger input capacitor, but removes around 10% of the input signal compared to 1% for the TS9 (not accounting for the guitar output impedance). Also, there are two high-pass filters on the SD1 with the same cutoff frequency at 50Hz, whereas the TS9 has “only” one at 100Hz. They more or less end up being similar. For the output, the SD1 ditches 33% of the final signal before the output stage, which also has a high-pass filter at 20Hz followed by another one at 10Hz. The TS9 also has a 20Hz high pass, but it is followed by another 1Hz high pass. All things considered, except for the overdrive and the tone circuits, there should be no audible difference on a guitar, but I wouldn’t advise either pedal for a bass guitar, as the input stages chop off too much.

Overdrive circuit

The overdrive circuits are almost a match. The only differences are that the potentiometer has double the resistance on the SD1 and that there are two diodes in one path (the capacitor has no impact according to the LTSpice simulation I ran). And this leads to exactly what I expected for similar drive values:

SD1 and TS9 behavior on a 100Hz signal

This is the behavior for all frequencies. The only difference is the slightly smaller voltage on the lower part of the curve. This shows up more clearly on the spectrum:

SD1 sine sweep with an oversampling x4

TS9 sine sweep with an oversampling x4

To limit the noise in this case, I ran the sine sweep again, with an oversampling of x8. The difference, with the additional even harmonics in the SD1, is obvious.

SD1 sine sweep with an oversampling x8

TS9 sine sweep with an oversampling x8

Tone

The tone circuit is a nightmare to compute by hand. The issues are with the simplification of the potentiometer in the equations. I did it for the SD1 tone circuit, and as the TS9 is a little bit different, I had to start over (several years after solving SD1 :/).

I won’t display the equations here; the coefficients can be found in the pedal tone stack filters in Audio Toolkit. Suffice it to say that the TS9 can act as a high-pass filter, whereas the SD1 is definitely an EQ. The different behavior is obvious in the following pictures:

SD1 tone transfer function

TS9 tone transfer function

The transfer functions are different even though the analog circuits are quite similar. This is definitely the difference that people hear between the SD1 and the TS9.

Conclusion

The two pedals are quite similar when checking the circuits, and even if the SD1 is labelled as an asymmetric overdrive, the actual sound difference between the two pedals may be more related to the tone circuit than to the overdrive.

Now that these filters are available in Audio Toolkit, it is easy to try different combinations!


by Matt at June 14, 2016 07:06 AM

June 13, 2016

Fernando Perez

In Memoriam, John D. Hunter III: 1968-2012

I just returned from the SciPy 2013 conference, whose organizers kindly invited me to deliver a keynote. For me this was a particularly difficult, yet meaningful edition of SciPy, my favorite conference. It was only a year ago that John Hunter, creator of matplotlib, had delivered his keynote shortly before being diagnosed with terminal colon cancer, from which he passed away on August 28, 2012 (if you haven't seen his talk, I strongly recommend it for its insights into scientific open source work).

On October 1st 2012, a memorial service was held at the University of Chicago's Rockefeller Chapel, the location of his PhD graduation. On that occasion I read a brief eulogy, but for obvious reasons only a few members from the SciPy community were able to attend. At this year's SciPy conference, Michael Droettboom (the new project leader for matplotlib) organized the first edition of the John Hunter Excellence in Plotting Contest, and before the awards ceremony I read a slightly edited version of the text I had delivered in Chicago (you can see the video here). I only made a few changes for brevity and to better suit the audience of the SciPy conference. I am reproducing it below.

I also went through my photo albums and found images I had of John. A memorial fund has been established in his honor to help with the education of his three daughters Clara, Ava and Rahel (Update: the fund was closed in late 2012 and its proceeds given to the family; moving forward, NumFOCUS sponsors the John Hunter Technology Fellowship, that anyone can make contributions to).


Dear friends and colleagues,

I used to tease John by telling him that he was the man I aspired to be when I grew up. I am not sure he knew how much I actually meant that. I first met him over email in 2002, when IPython was in its infancy and had rudimentary plotting support via Gnuplot. He sent me a patch to support a plotting syntax more akin to that of matlab, but I was buried in my effort to finish my PhD and couldn’t deal with his contribution for at least a few months. In the first example of what I later came to know as one of his signatures, he kindly replied and then simply routed around this blockage by single-handedly creating matplotlib. For him, building an entire new visualization library from scratch was the sensible solution: he was never one to be stopped by what many would consider an insurmountable obstacle.

Our first personal encounter was at SciPy 2004 at Caltech. I was immediately taken by his unique combination of generous spirit, sharp wit and technical prowess, and over the years I would grow to love him as a brother. John was a true scholar, equally at ease in a conversation about monetary policy, digital typography or the intricacies of C++ extensions in Python. But never once would you feel from him a hint of arrogance or condescension, something depressingly common in academia. John was driven only by the desire to work on interesting questions and to always engage others in a meaningful way, whether solving their problems, lifting their spirits or simply sharing a glass of wine. Beneath a surface of technical genius, there lay a kind, playful and fearless spirit, who was quietly comfortable in his own skin and let the power of his deeds speak for him.

Beyond the professional context, John had a rich world populated by the wonders of his family, his wife Miriam and his daughters Clara, Ava and Rahel. His love for his daughters knew no bounds, and yet I never once saw him clip their wings out of apprehension. They would be up on trees, dangling from monkeybars or riding their bikes, and he would always be watchful but encouraging of all their adventures. In doing so, he taught them to live like he did: without fear that anything could be too difficult or challenging to accomplish, and guided by the knowledge that small slips and failures were the natural price of being bold and never settling for the easy path.

A year ago in this same venue, John drew lessons from a decade’s worth of his own contributions to our community, from the vantage point of matplotlib. Ten years earlier at U. Chicago, his research on pediatric epilepsy required either expensive and proprietary tools or immature free ones. Along with a few similarly-minded folks, many of whom are in this room today, John believed in a future where science and education would be based on openly available software developed in a collaborative fashion. This could be seen as a fool’s errand, given that the competition consisted of products from companies with enormous budgets and well-entrenched positions in the marketplace. Yet a decade later, this vision is gradually becoming a reality. Today, the Scientific Python ecosystem powers everything from history-making astronomical discoveries to large financial modeling companies. Since all of this is freely available for anyone to use, it was possible for us to end up a few years ago in India, teaching students from distant rural colleges how to work with the same tools that NASA uses to analyze images from the Hubble Space Telescope. In recognition of the breadth and impact of his contributions, the Python Software Foundation awarded him posthumously the first installment of its highest distinction, the PSF Distinguished Service Award.

John’s legacy will be far-reaching. His work in scientific computing happened in a context of turmoil in how science and education are conducted, financed and made available to the public. I am absolutely convinced that in a few decades, historians of science will describe the period we are in right now as one of deep and significant transformations to the very structure of science. And in that process, the rise of free openly available tools plays a central role. John was on the front lines of this effort for a decade, and with his accomplishments he shone brighter than most.

John’s life was cut far, far too short. We will mourn him for time to come, and we will never stop missing him. But he set the bar high, and the best way in which we can honor his incredible legacy is by living up to his standards: uncompromising integrity, never-ending intellectual curiosity, and most importantly, unbounded generosity towards all who crossed his path. I know I will never grow up to be John Hunter, but I know I must never stop trying.

Fernando Pérez

June 27th 2013, SciPy Conference, Austin, Tx.

by Fernando Perez (noreply@blogger.com) at June 13, 2016 10:13 AM

June 09, 2016

Pierre de Buyl

ActivePapers: hello, world

License: CC-BY

ActivePapers is a technology developed by Konrad Hinsen to store code, data and documentation with several benefits: storage in a single HDF5 file, internal provenance tracking (what code created what data/figure, with a Make-like conditional execution) and a containerized execution environment.

Implementations for the JVM and for Python are provided by the author. In this article, I go over the first steps of creating an ActivePaper. Being a regular user of Python, I cover only this language.

An overview of ActivePapers

First, a "statement of fact": An ActivePaper is an HDF5 file. That is, it is a binary, self-describing, structured and portable file whose content can be explored with generic tools provided by the HDF Group.

The ActivePapers project is developed by Konrad Hinsen as a vehicle for the publication of computational work. This description is a bit short and does not convey the depth that has gone into the design of ActivePapers; the ActivePapers paper provides more information.

ActivePapers come, by design, with restrictions on the code that is executed. For instance, only Python code (in the Python implementation) can be used, with the scientific computing module NumPy. All data is accessed via the h5py module. The goals behind these design choices are related to security and to a good definition of the execution environment of the code.

Creating an ActivePaper

The tutorial on the ActivePapers website starts by looking at an existing ActivePaper. I'll go the other way around, as I find it more intuitive. Interactions with an ActivePaper are channeled by the aptool program (see the installation notes).

Currently, the ActivePapers project lacks a "hello, world" program, so here is mine. ActivePapers work best when you dedicate a directory to a single ActivePaper. You may enter the following in a terminal:

mkdir hello_world_ap                 # create a new directory
cd hello_world_ap                    # visit it
aptool -p hello_world.ap create      # this line creates a new file "hello_world.ap"
mkdir code                           # create the "code" directory where you can
                                     # write programs that will be stored in the AP
echo "print 'hello, world'" > code/hello.py # create a program
aptool checkin -t calclet code/hello.py     # store the program in the AP

That's it, you have created an ActivePaper!

You can observe its content by issuing

aptool ls                            # inspect the AP

And execute it

aptool run hello                     # run the program in "code/hello.py"

This command looks into the ActivePapers file and not into the directories visible in the filesystem. The filesystem acts more like a staging area.

A basic computation in ActivePapers

The "hello, world" program above did not perform a computation of any kind. An introductory example for science is the computation of the number $\pi$ by the Monte Carlo method.

I will now create a new ActivePaper (AP) but comment on the specific ways to define parameters, store data and create plots. The dependency on the plotting library matplotlib has to be given when creating the ActivePaper:

mkdir pi_ap
cd pi_ap
aptool -p pi.ap create -d matplotlib

To generate a repeatable result, I store the seed for the random number generator

aptool set seed 1780812262
aptool set N 10000

The lines above store data elements of type integer in the AP. The values of seed and N can be accessed from the Python code of the AP.

I will create several programs to mimic the workflow of more complex problems: one to generate the data, one to analyze the data and one for generating a figure.

The first program is generate_random_numbers.py

import numpy as np
from activepapers.contents import data

seed = data['seed'][()]   # read the scalar parameters stored with "aptool set"
N = data['N'][()]
np.random.seed(seed)
data['random_numbers'] = np.random.random(size=(N, 2))   # N points in the unit square

Apart from importing the NumPy module, I have also imported the ActivePapers data

from activepapers.contents import data

data is a dict-like interface to the content of the ActivePaper and so only works in code that is checked into the ActivePaper and executed with aptool. data can be used to read values, such as the seed and number of samples, and to store data, such as the samples here.

The [()] returns the value of scalar datasets in HDF5. To have more information on this, see the dataset documentation of h5py.

The second program is compute_pi.py

import numpy as np
from activepapers.contents import data

xy = data['random_numbers'][...]          # load the stored samples
radius_square = np.sum(xy**2, axis=1)     # squared distance to the origin
N = len(radius_square)
# running estimate of pi: fraction of points inside the quarter circle, times 4
data['estimator'] = np.cumsum(radius_square < 1) * 4 / np.linspace(1, N, N)

And the third is plot_pi.py

import numpy as np
import matplotlib
matplotlib.use('PDF')
import matplotlib.pyplot as plt
from activepapers.contents import data, open_documentation

estimator = data['estimator']
N = len(estimator)
plt.plot(estimator)
plt.xlabel('Number of samples')
plt.ylabel(r'Estimation of $\pi$')
plt.savefig(open_documentation('pi_figure.pdf', 'w'))

Notice:

  1. The setting of the PDF driver for matplotlib before importing matplotlib.pyplot.
  2. The use of open_documentation. This function provides file descriptors that can read and write binary blobs.

Now, you can check in and run the code

aptool checkin -t calclet code/*.py
aptool run generate_random_numbers
aptool run compute_pi
aptool run plot_pi

Concluding words

That's it, we have created an ActivePaper and run code with it.

For fun: issue the command

aptool set seed 1780812263

(or any number of your choosing that is different from the previous one) and then

aptool update

ActivePapers handles dependencies! That is, everything that depends on the seed will be updated. That includes the random numbers, the estimator for pi and the figure. To see the update, check the creation times in the ActivePaper

aptool ls -l

It is good to know that ActivePapers have been used as companions to research articles! See Protein secondary-structure description with a coarse-grained model: code and datasets in ActivePapers format for instance.

You can have a look at the resulting files that I uploaded to Zenodo: doi:10.5281/zenodo.55268

References

ActivePapers paper K. Hinsen, ActivePapers: a platform for publishing and archiving computer-aided research, F1000Research (2015), 3 289.

ActivePapers website The website for ActivePapers

by Pierre de Buyl at June 09, 2016 12:00 PM

June 07, 2016

Matthieu Brucher

Announcement: ATKTransientShaper 1.0.0

I’m happy to announce the release of a mono transient shaper based on the Audio Toolkit. It is available on Windows and OS X (min. 10.11) in different formats.

ATK Transient Shaper

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKTransientShaper.

The files, as well as the previous plugins and the source code, can be downloaded from SourceForge.


by Matt at June 07, 2016 07:32 AM

June 03, 2016

Continuum Analytics news

Anaconda Cloud Release v 2.18.0

Posted Friday, June 3, 2016

This is a quick note to let everyone know that we released a new version of Anaconda Cloud today - version 2.18.0 (and the underlying Anaconda Repository server software). It's a minor release, but has some useful new updates: 

  1. With the release of Pip 8.1.2, package downloads weren't working for some packages. This issue is now resolved. Additional details on this issue here.
  2. We've moved our docs from docs.anaconda.org to docs.continuum.io with a new information architecture and a new look & feel.
  3. The platform's API now has documentation - available here - more work to do to refine this feature, but the basics are present for an often-requested addition. 
  4. Of course, the laundry list of bug fixes... 

To read additional details, check out the Anaconda-Repository change-log.

If you run into issues, let us know. Here's the best starting point to help direct issues.

-Team Anaconda

by swebster at June 03, 2016 02:16 PM

June 02, 2016

Continuum Analytics news

NAG and Continuum Analytics Partner to Provide Readily Accessible Numerical Algorithms

Posted Thursday, June 2, 2016

Improved Accessibility for NAG’s Mathematical and Statistical Routines for Python Data Scientists

Numerical Algorithms Group (NAG) and Continuum have partnered together to provide conda packages for the NAG Library for Python (nag4py), the Python bindings for the NAG C Library. Users wishing to use the NAG Library with Anaconda can now install the bindings with a simple command (conda install -c nag nag4py) or the Anaconda Navigator GUI.

For those of us who use Anaconda, the leading Open Data Science platform, for package management and virtual environments, this enhancement provides immediate access to the 1,500+ numerical algorithms in the NAG Library. It also means that you can automatically download any future NAG Library updates as they are published on the NAG channel in Anaconda Cloud.

To illustrate how to use the NAG Library for Python, I have created an IPython Notebook1 that demonstrates the use of NAG’s implementation of the PELT algorithm to identify the changepoints of a stock whose price history has been stored in a MongoDB database. Using the example of Volkswagen (VOW), you can clearly see that a changepoint occurred when the news about the recent emissions scandal broke. This is an unsurprising result in this case, but in general, it will not always be as clear when and where a changepoint occurs.

So far, conda packages for the NAG Library for Python have been made available for 64-bit Linux, Mac and Windows platforms. On Linux and Mac, a conda package for the NAG C Library will automatically be installed alongside the Python bindings, so no further configuration is necessary. A Windows conda package for the NAG C Library is coming soon. Until then, a separate installation of the NAG C Library is required. In all cases, the Python bindings require NumPy, so that will also be installed by conda if necessary.

Use of the NAG C Library requires a valid licence key, which is available here: www.nag.com. The NAG Library is also available for a 30-day trial.

1The IPython notebook requires Mark 25, which is currently available on Windows and Linux. The Mac version will be released over the summer.

by pcudia at June 02, 2016 09:09 PM

Continuum Analytics Announces Inaugural AnacondaCON Conference

Posted Thursday, June 2, 2016

The brightest minds in Open Data Science will come together in Austin for two days of engaging sessions and panels from industry leaders and networking in February 2017

AUSTIN, Texas. – June 2, 2016 – Continuum Analytics, the creator and driving force behind Anaconda, the leading open source analytics platform powered by Python, today announced the inaugural Anaconda user conference, taking place from February 7-9, 2017 in Austin. AnacondaCON is a two-day event at the JW Marriott that brings together innovative enterprises that are on the journey to Open Data Science to capitalize on their growing treasure trove of data assets to create compelling business value for their enterprise.

From predictive analytics to deep learning, AnacondaCON will help attendees learn how to build data science applications to meet their needs. Attendees will be at varying stages from learning how to start their Open Data Science journey and accelerating it to sharing their experiences. The event will offer Open Data Science advocates an opportunity to engage in breakout sessions, hear from industry experts during keynote sessions, learn about case studies from subject matter experts and choose from specialized and focused sessions based on topic areas of interest.

“We connect regularly with Anaconda fans at many industry and community events worldwide. Now, we’re launching our first ever customer and user conference, AnacondaCON, for our growing and thriving enterprise community to have an informative gathering place to discover, share and engage with similar enterprises,” said Michele Chambers, VP of Products & CMO at Continuum Analytics. “The common thread that links these enterprises together is that they are all passionate about solving business and world problems and see Open Data Science as the answer. At AnacondaCON, they will connect and engage with the innovators and thought leaders behind the Open Data Science movement and learn more about industry trends, best practices and how to harness the power of Open Data Science to meet data-driven goals.”

Attend AnacondaCON
 

Registration will open soon, in the meantime visit: https://anacondacon17.io/ to receive regular updates about the conference.

Sponsorship Opportunities

There are select levels of sponsorship available, ranging from pre-set packages to a-la-carte options. To learn more about sponsorship, email us at sponsorship@continuum.io.

About Continuum Analytics

Continuum Analytics’ Anaconda is the leading open data science platform powered by Python. We put superpowers into the hands of people who are changing the world. Anaconda is trusted by leading businesses worldwide and across industries – financial services, government, health and life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda helps data science teams discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage open data science environments and harness the power of the latest open source analytic and technology innovations. Visit http://www.continuum.io.

by pcudia at June 02, 2016 02:28 PM

June 01, 2016

Thomas Wiecki

Bayesian Deep Learning

Neural Networks in PyMC3 estimated with Variational Inference

(c) 2016 by Thomas Wiecki

There are currently three big trends in machine learning: Probabilistic Programming, Deep Learning and "Big Data". Inside of PP, a lot of innovation is in making things scale using Variational Inference. In this blog post, I will show how to use Variational Inference in PyMC3 to fit a simple Bayesian Neural Network. I will also discuss how bridging Probabilistic Programming and Deep Learning can open up very interesting avenues to explore in future research.

Probabilistic Programming at scale

Probabilistic Programming allows very flexible creation of custom probabilistic models and is mainly concerned with insight and learning from your data. The approach is inherently Bayesian so we can specify priors to inform and constrain our models and get uncertainty estimation in form of a posterior distribution. Using MCMC sampling algorithms we can draw samples from this posterior to very flexibly estimate these models. PyMC3 and Stan are the current state-of-the-art tools to construct and estimate these models. One major drawback of sampling, however, is that it's often very slow, especially for high-dimensional models. That's why more recently, variational inference algorithms have been developed that are almost as flexible as MCMC but much faster. Instead of drawing samples from the posterior, these algorithms instead fit a distribution (e.g. normal) to the posterior, turning a sampling problem into an optimization problem. ADVI -- Automatic Differentiation Variational Inference -- is implemented in PyMC3 and Stan, as well as in a new package called Edward which is mainly concerned with Variational Inference.
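As a toy illustration of ADVI in PyMC3 (this is not the neural network from this post, and API details differ between PyMC3 releases; the sketch assumes a version where pm.fit is available):

import numpy as np
import pymc3 as pm

# toy data: a noisy linear relationship
np.random.seed(0)
x = np.random.randn(200)
y = 2.0 * x + np.random.normal(scale=0.5, size=200)

with pm.Model():
    w = pm.Normal("w", mu=0.0, sd=10.0)     # prior on the slope
    sigma = pm.HalfNormal("sigma", sd=1.0)  # prior on the noise
    pm.Normal("obs", mu=w * x, sd=sigma, observed=y)

    approx = pm.fit(n=20000, method="advi")  # fit a variational approximation (ADVI)
    trace = approx.sample(1000)              # draw samples from the approximation

print(trace["w"].mean(), trace["w"].std())   # posterior mean and uncertainty of the slope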

Unfortunately, when it comes to traditional ML problems like classification or (non-linear) regression, Probabilistic Programming often plays second fiddle (in terms of accuracy and scalability) to more algorithmic approaches like ensemble learning (e.g. random forests or gradient boosted regression trees).

Deep Learning

Now in its third renaissance, deep learning has been making headlines repeatedly by dominating almost any object recognition benchmark, kicking ass at Atari games, and beating the world-champion Lee Sedol at Go. From a statistical point of view, Neural Networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with AutoEncoders and in all sorts of other interesting ways (e.g. Recurrent Networks, or MDNs to estimate multimodal distributions). Why do they work so well? No one really knows as the statistical properties are still not fully understood.

A large part of the innovation in deep learning is the ability to train these extremely complex models. This rests on several pillars:

  • Speed: leveraging the GPU allowed for much faster processing.
  • Software: frameworks like Theano and TensorFlow allow flexible creation of abstract models that can then be optimized and compiled to CPU or GPU.
  • Learning algorithms: training on subsets of the data -- stochastic gradient descent -- allows us to train these models on massive amounts of data. Techniques like dropout avoid overfitting.
  • Architectural: A lot of innovation comes from changing the input layers, like for convolutional neural nets, or the output layers, like for MDNs.

Bridging Deep Learning and Probabilistic Programming

On one hand we have Probabilistic Programming, which allows us to build rather small and focused models in a very principled and well-understood way to gain insight into our data; on the other hand we have deep learning, which uses many heuristics to train huge and highly complex models that are amazing at prediction. Recent innovations in variational inference allow probabilistic programming to scale model complexity as well as data size. We are thus at the cusp of being able to combine these two approaches to hopefully unlock new innovations in Machine Learning. For more motivation, see also Dustin Tran's recent blog post.

While this would allow Probabilistic Programming to be applied to a much wider set of interesting problems, I believe this bridging also holds great promise for innovations in Deep Learning. Some ideas are:

  • Uncertainty in predictions: As we will see below, the Bayesian Neural Network informs us about the uncertainty in its predictions. I think uncertainty is an underappreciated concept in Machine Learning as it's clearly important for real-world applications. But it could also be useful in training. For example, we could train the model specifically on samples it is most uncertain about.
  • Uncertainty in representations: We also get uncertainty estimates of our weights which could inform us about the stability of the learned representations of the network.
  • Regularization with priors: Weights are often L2-regularized to avoid overfitting; this very naturally becomes a Gaussian prior for the weight coefficients. We could, however, imagine all kinds of other priors, like spike-and-slab to enforce sparsity (this would be more like using the L1-norm); see the short sketch after this list.
  • Transfer learning with informed priors: If we wanted to train a network on a new object recognition data set, we could bootstrap the learning by placing informed priors centered around weights retrieved from other pre-trained networks, like GoogLeNet.
  • Hierarchical Neural Networks: A very powerful approach in Probabilistic Programming is hierarchical modeling that allows pooling of things that were learned on sub-groups to the overall population (see my tutorial on Hierarchical Linear Regression in PyMC3). Applied to Neural Networks, in hierarchical data sets, we could train individual neural nets to specialize on sub-groups while still being informed about representations of the overall population. For example, imagine a network trained to classify car models from pictures of cars. We could train a hierarchical neural network where a sub-neural network is trained to tell apart models from only a single manufacturer. The intuition is that all cars from a certain manufacturer share certain similarities, so it would make sense to train individual networks that specialize on brands. However, because the individual networks are connected at a higher layer, they would still share information with the other specialized sub-networks about features that are useful to all brands. Interestingly, different layers of the network could be informed by various levels of the hierarchy -- e.g. early layers that extract visual lines could be identical in all sub-networks while the higher-order representations would be different. The hierarchical model would learn all that from the data.
  • Other hybrid architectures: We can more freely build all kinds of neural networks. For example, Bayesian non-parametrics could be used to flexibly adjust the size and shape of the hidden layers to optimally scale the network architecture to the problem at hand during training. Currently, this requires costly hyper-parameter optimization and a lot of tribal knowledge.
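
To make the point about priors as regularizers concrete, here is a minimal sketch (not from the original post; the model and variable names are purely illustrative): in PyMC3, a Normal prior on a weight matrix plays the role of an L2 (ridge) penalty, while a Laplace prior plays the role of an L1 (lasso) penalty and encourages sparsity.

import pymc3 as pm

with pm.Model() as regularized_weights:
    # Gaussian prior on the weights -- the Bayesian analogue of an L2 penalty
    w_l2 = pm.Normal('w_l2', mu=0, sd=1, shape=(5, 5))
    # Laplace prior on the weights -- the Bayesian analogue of an L1 penalty,
    # which encourages sparse (near-zero) weights
    w_l1 = pm.Laplace('w_l1', mu=0, b=1, shape=(5, 5))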

Bayesian Neural Networks in PyMC3

Generating data

First, let's generate some toy data -- a simple binary classification problem that's not linearly separable.

In [1]:
%matplotlib inline
import pymc3 as pm
import theano.tensor as T
import theano
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn.cross_validation import train_test_split
from sklearn.datasets import make_moons
In [2]:
X, Y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X = scale(X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.5)
In [3]:
fig, ax = plt.subplots()
ax.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0')
ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r', label='Class 1')
sns.despine(); ax.legend()
ax.set(xlabel='X', ylabel='Y', title='Toy binary classification data set');

Model specification

A neural network is quite simple. The basic unit is a perceptron, which is nothing more than logistic regression. We use many of these in parallel and then stack them up to get hidden layers. Here we will use 2 hidden layers with 5 neurons each, which is sufficient for such a simple problem.

In [17]:
# Trick: Turn inputs and outputs into shared variables. 
# It's still the same thing, but we can later change the values of the shared variable 
# (to switch in the test-data later) and pymc3 will just use the new data. 
# Kind-of like a pointer we can redirect.
# For more info, see: http://deeplearning.net/software/theano/library/compile/shared.html
ann_input = theano.shared(X_train)
ann_output = theano.shared(Y_train)

n_hidden = 5

# Initialize random weights between each layer
init_1 = np.random.randn(X.shape[1], n_hidden)
init_2 = np.random.randn(n_hidden, n_hidden)
init_out = np.random.randn(n_hidden)
    
with pm.Model() as neural_network:
    # Weights from input to hidden layer
    weights_in_1 = pm.Normal('w_in_1', 0, sd=1, 
                             shape=(X.shape[1], n_hidden), 
                             testval=init_1)
    
    # Weights from 1st to 2nd layer
    weights_1_2 = pm.Normal('w_1_2', 0, sd=1, 
                            shape=(n_hidden, n_hidden), 
                            testval=init_2)
    
    # Weights from hidden layer to output
    weights_2_out = pm.Normal('w_2_out', 0, sd=1, 
                              shape=(n_hidden,), 
                              testval=init_out)
    
    # Build neural-network using tanh activation function
    act_1 = T.tanh(T.dot(ann_input, 
                         weights_in_1))
    act_2 = T.tanh(T.dot(act_1, 
                         weights_1_2))
    act_out = T.nnet.sigmoid(T.dot(act_2, 
                                   weights_2_out))
    
    # Binary classification -> Bernoulli likelihood
    out = pm.Bernoulli('out', 
                       act_out,
                       observed=ann_output)

That's not so bad. The Normal priors help regularize the weights. Usually we would also add a constant bias term b to the inputs of each layer, but I omitted it here to keep the code cleaner; a sketch of what that would look like follows below.
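
For completeness, here is a sketch of how the bias term could be added to the first layer; the names (b_1, neural_network_with_bias) are hypothetical and this model is not used in the rest of the post:

with pm.Model() as neural_network_with_bias:
    # Weights and bias from input to first hidden layer
    weights_in_1 = pm.Normal('w_in_1', 0, sd=1,
                             shape=(X.shape[1], n_hidden),
                             testval=init_1)
    b_1 = pm.Normal('b_1', 0, sd=1, shape=(n_hidden,))

    # The bias vector is broadcast across all rows (data points)
    act_1 = T.tanh(T.dot(ann_input, weights_in_1) + b_1)
    # ... the remaining layers would get their own bias terms analogously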

Variational Inference: Scaling model complexity

We could now just run an MCMC sampler like NUTS, which works pretty well in this case, but as I already mentioned, this will become very slow as we scale our model up to deeper architectures with more layers.

Instead, we will use the brand-new ADVI variational inference algorithm which was recently added to PyMC3. This is much faster and will scale better. Note that this is a mean-field approximation, so we ignore correlations in the posterior.

In [34]:
%%time

with neural_network:
    # Run ADVI which returns posterior means, standard deviations, and the evidence lower bound (ELBO)
    v_params = pm.variational.advi(n=50000)
Iteration 0 [0%]: ELBO = -368.86
Iteration 5000 [10%]: ELBO = -185.65
Iteration 10000 [20%]: ELBO = -197.23
Iteration 15000 [30%]: ELBO = -203.2
Iteration 20000 [40%]: ELBO = -192.46
Iteration 25000 [50%]: ELBO = -198.8
Iteration 30000 [60%]: ELBO = -183.39
Iteration 35000 [70%]: ELBO = -185.04
Iteration 40000 [80%]: ELBO = -187.56
Iteration 45000 [90%]: ELBO = -192.32
Finished [100%]: ELBO = -225.56
CPU times: user 36.3 s, sys: 60 ms, total: 36.4 s
Wall time: 37.2 s

Less than 40 seconds on my older laptop. That's pretty good considering that NUTS has a really hard time with this model. Further below we make this even faster. To make it really fly, we probably want to run the Neural Network on the GPU.

As samples are more convenient to work with, we can very quickly draw samples from the variational posterior using sample_vp() (this is just sampling from Normal distributions, so it is not at all the same as MCMC):

In [35]:
with neural_network:
    trace = pm.variational.sample_vp(v_params, draws=5000)

Plotting the objective function (ELBO) we can see that the optimization slowly improves the fit over time.

In [36]:
plt.plot(v_params.elbo_vals)
plt.ylabel('ELBO')
plt.xlabel('iteration')
Out[36]:
<matplotlib.text.Text at 0x7fa5dae039b0>

Now that we have trained our model, let's predict on the hold-out set using a posterior predictive check (PPC). We use sample_ppc() to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation).

In [7]:
# Replace shared variables with testing set
ann_input.set_value(X_test)
ann_output.set_value(Y_test)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)

# Use a predicted probability > 0.5 to assign class 1
pred = ppc['out'].mean(axis=0) > 0.5
In [8]:
fig, ax = plt.subplots()
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
sns.despine()
ax.set(title='Predicted labels in testing set', xlabel='X', ylabel='Y');
In [9]:
print('Accuracy = {}%'.format((Y_test == pred).mean() * 100))
Accuracy = 94.19999999999999%

Hey, our neural network did all right!

Let's look at what the classifier has learned

For this, we evaluate the class probability predictions on a grid over the whole input space.

In [10]:
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
dummy_out = np.ones(grid.shape[1], dtype=np.int8)
In [11]:
ann_input.set_value(grid_2d)
ann_output.set_value(dummy_out)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)

Probability surface

In [26]:
cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(*grid, ppc['out'].mean(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Posterior predictive mean probability of class label = 0');

Uncertainty in predicted value

So far, everything I showed we could have done with a non-Bayesian Neural Network. The mean of the posterior predictive for each class-label should be identical to maximum likelihood predicted values. However, we can also look at the standard deviation of the posterior predictive to get a sense for the uncertainty in our predictions. Here is what that looks like:

In [27]:
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(*grid, ppc['out'].std(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Uncertainty (posterior predictive standard deviation)');

We can see that very close to the decision boundary, our uncertainty as to which label to predict is highest. You can imagine that associating predictions with uncertainty is a critical property for many applications like health care. To further maximize accuracy, we might want to train the model primarily on samples from that high-uncertainty region.
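
As a minimal sketch of that idea (not part of the original analysis; ppc_test, uncertainty and most_uncertain are illustrative names), we could rank the test points by their posterior predictive standard deviation and focus on the most ambiguous ones:

# Posterior predictive samples for the test set (as in the earlier prediction step)
ann_input.set_value(X_test)
ann_output.set_value(Y_test)
ppc_test = pm.sample_ppc(trace, model=neural_network, samples=500)

# One uncertainty value per test point; pick the 50 most uncertain
uncertainty = ppc_test['out'].std(axis=0)
most_uncertain = np.argsort(uncertainty)[::-1][:50]
X_focus = X_test[most_uncertain]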

Mini-batch ADVI: Scaling data size

So far, we have trained our model on all data at once. Obviously this won't scale to something like ImageNet. Moreover, training on mini-batches of data (stochastic gradient descent) avoids local minima and can lead to faster convergence.

Fortunately, ADVI can be run on mini-batches as well. It just requires some setting up:

In [43]:
# Set back to original data to retrain
ann_input.set_value(X_train)
ann_output.set_value(Y_train)

# Tensors and RV that will be using mini-batches
minibatch_tensors = [ann_input, ann_output]
minibatch_RVs = [out]

# Generator that returns mini-batches in each iteration
def create_minibatch(data):
    rng = np.random.RandomState(0)
    
    while True:
        # Return random data samples of size 50 in each iteration
        ixs = rng.randint(len(data), size=50)
        yield data[ixs]

minibatches = zip(
    create_minibatch(X_train), 
    create_minibatch(Y_train),
)

total_size = len(Y_train)

While the above might look a bit daunting, I really like the design. The fact that you define a generator allows for great flexibility. In principle, we could just pull from a database there and not have to keep all the data in RAM.
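
As a hypothetical sketch of that idea, the generator could wrap any data-access function instead of indexing into in-memory arrays; fetch_batch below is a stand-in for whatever database or disk query you would actually use:

def create_minibatch_from_store(fetch_batch, batch_size=50):
    # fetch_batch(batch_size) is assumed to return a numpy array
    # with batch_size rows, e.g. the result of a database query.
    while True:
        yield fetch_batch(batch_size)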

Let's pass those to advi_minibatch():

In [48]:
%%time

with neural_network:
    # Run advi_minibatch
    v_params = pm.variational.advi_minibatch(
        n=50000, minibatch_tensors=minibatch_tensors, 
        minibatch_RVs=minibatch_RVs, minibatches=minibatches, 
        total_size=total_size, learning_rate=1e-2, epsilon=1.0
    )
Iteration 0 [0%]: ELBO = -311.63
Iteration 5000 [10%]: ELBO = -162.34
Iteration 10000 [20%]: ELBO = -70.49
Iteration 15000 [30%]: ELBO = -153.64
Iteration 20000 [40%]: ELBO = -164.07
Iteration 25000 [50%]: ELBO = -135.05
Iteration 30000 [60%]: ELBO = -240.99
Iteration 35000 [70%]: ELBO = -111.71
Iteration 40000 [80%]: ELBO = -87.55
Iteration 45000 [90%]: ELBO = -97.5
Finished [100%]: ELBO = -75.31
CPU times: user 17.4 s, sys: 56 ms, total: 17.5 s
Wall time: 17.5 s
In [49]:
with neural_network:    
    trace = pm.variational.sample_vp(v_params, draws=5000)
In [50]:
plt.plot(v_params.elbo_vals)
plt.ylabel('ELBO')
plt.xlabel('iteration')
sns.despine()

As you can see, mini-batch ADVI's running time is much lower. It also seems to converge faster.

For fun, we can also look at the trace. The point is that we also get uncertainty estimates for our Neural Network weights.

In [51]:
pm.traceplot(trace);

Summary

Hopefully this blog post demonstrated a very powerful new inference algorithm available in PyMC3: ADVI. I also think bridging the gap between Probabilistic Programming and Deep Learning can open up many new avenues for innovation in this space, as discussed above. Specifically, a hierarchical neural network sounds pretty bad-ass. These are really exciting times.

Next steps

Theano, which is used by PyMC3 as its computational backend, was mainly developed for estimating neural networks and there are great libraries like Lasagne that build on top of Theano to make construction of the most common neural network architectures easy. Ideally, we wouldn't have to build the models by hand as I did above, but use the convenient syntax of Lasagne to construct the architecture, define our priors, and run ADVI.

While we haven't successfully run PyMC3 on the GPU yet, it should be fairly straightforward (this is what Theano does, after all) and would further reduce the running time significantly. If you know some Theano, this would be a great area for contributions!

You might also argue that the above network isn't really deep, but note that we could easily extend it to have more layers, including convolutional ones to train on more challenging data sets.
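
As a rough sketch of adding a layer (assuming the training data is loaded back into the shared variables and the priors, initial values and imports from above; init_3 and deeper_network are illustrative names), one more layer is just one more weight matrix and one more tanh step:

init_3 = np.random.randn(n_hidden, n_hidden)

with pm.Model() as deeper_network:
    weights_in_1 = pm.Normal('w_in_1', 0, sd=1,
                             shape=(X.shape[1], n_hidden), testval=init_1)
    weights_1_2 = pm.Normal('w_1_2', 0, sd=1,
                            shape=(n_hidden, n_hidden), testval=init_2)
    weights_2_3 = pm.Normal('w_2_3', 0, sd=1,
                            shape=(n_hidden, n_hidden), testval=init_3)
    weights_3_out = pm.Normal('w_3_out', 0, sd=1,
                              shape=(n_hidden,), testval=init_out)

    # Three tanh hidden layers followed by a sigmoid output
    act_1 = T.tanh(T.dot(ann_input, weights_in_1))
    act_2 = T.tanh(T.dot(act_1, weights_1_2))
    act_3 = T.tanh(T.dot(act_2, weights_2_3))
    act_out = T.nnet.sigmoid(T.dot(act_3, weights_3_out))

    out = pm.Bernoulli('out', act_out, observed=ann_output)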

I also presented some of this work at PyData London, view the video below:

Finally, you can download this NB here. Leave a comment below, and follow me on twitter.

Acknowledgements

Taku Yoshioka did a lot of work on ADVI in PyMC3, including the mini-batch implementation as well as the sampling from the variational posterior. I'd also like to thank the Stan guys (specifically Alp Kucukelbir and Daniel Lee) for deriving ADVI and teaching us about it. Thanks also to Chris Fonnesbeck, Andrew Campbell, Taku Yoshioka, and Peadar Coyle for useful comments on an earlier draft.

by Thomas Wiecki at June 01, 2016 02:00 PM

May 31, 2016

Continuum Analytics news

TECHNICAL COLLABORATION EXPANDING ANACONDA ECOSYSTEM

Posted Tuesday, May 31, 2016

Intel and Continuum Analytics Work Together to Extend the Power of Python-based Analytics Across the Enterprise

PYCON 2016, PORTLAND, Ore—May 31, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading open data science platform powered by Python, welcomes Intel into the Anaconda ecosystem. Intel has adopted the Anaconda packaging and distribution and is working with Continuum to provide interoperability.

By offering Anaconda as the foundational high-performance Python distribution, Intel is empowering enterprises to more quickly build open analytics applications that drive immediate business value. Organizations can now combine the power of the Intel® Math Kernel Library (MKL) and Anaconda’s Python-based data science to build the high performance analytic modeling and visualization applications required to compete in today’s data-driven economies.  

“We have been working closely with Continuum Analytics to bring the capabilities of Anaconda to the Intel Distribution for Python. We include conda, making it easier to install conda packages and create conda environments. You now have easy access to the large and growing set of packages available on Anaconda Cloud,” said Robert Cohn, Engineering Director for Intel’s Scripting and Analysis Tools in his recently posted blog.

“We are in the midst of a computing revolution where intelligent data-driven decisions will drive our every move––in business and at home. To unleash the floodgates to value, we need to make data science fast, accessible and open to everyone,” said Michele Chambers, VP of Products & CMO at Continuum Analytics. “Python is the de facto data science language that everyone from elementary to graduate school is using because it’s so easy to get started and powerful enough to drive highly complex analytics. Anaconda turbo boosts analytics without adding any complexity.”

Without optimization, high-level languages like Python lack the performance needed to analyze increasingly large data sets. The platform includes packages and technology that are accessible to beginner Python developers and powerful enough to tackle data science projects for Big Data. Anaconda offers support for advanced analytics, numerical computing, just-in-time compilation, profiling, parallelism, interactive visualization, collaboration and other analytic needs. Customers have experienced up to 100X performance increases with Anaconda.

Anaconda Cloud is a package management service that makes it easy to find, access, store and share public and private notebooks, environments, conda and PyPI packages. The Anaconda Cloud also keeps up with updates made to the packages and environments being used. Users are able to build packages using the Anaconda client command line interface (CLI), then manually or automatically upload the packages to Anaconda Cloud to quickly share with others or access from anywhere. The Intel channel on Anaconda Cloud is where users can go to get optimized packages that Intel is providing.

“Companies like Intel, Microsoft and Cloudera are making Open Data Science more accessible to enterprises. We are mutually committed to ensuring customers get access to open and transparent technology advances,” said Travis Oliphant, CEO and co-founder at Continuum Analytics. “Our technical collaborations with Intel and Open Data Science members are expanding and fueling the next generation of high performance computing for data science. Customers can now leverage their Intel-powered computing clusters––with or without Hadoop––along with a supercharged Python distribution to propel their organizations forward and capitalize on their ever growing data assets.”

Anaconda also powers Python for Microsoft’s Azure ML platform and Continuum recently partnered with Cloudera on a certified Cloudera parcel.

About Continuum Analytics

Continuum Analytics is the creator and driving force behind Anaconda, the leading, open data science platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team – data scientists, developers, DevOps, data engineers and business analysts – to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

To learn more about Continuum Analytics, visit www.continuum.io.


by pcudia at May 31, 2016 01:06 PM

May 29, 2016

Matthieu Brucher

On modeling posts

I’m currently considering whether I should do more posts on preamp modeling or just keep implementing filters/plugins. Of course, it’s not one or the other; there are different options in this poll:

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.

So the idea is to ask my readers what they actually want. I can explain how the new triode filters are implemented and how they behave, but I can also add new filters to Audio Toolkit (based on different preamp and amp stages, dedicated to guitars, bass, and other instruments), try to optimize them, and finally include them in new plugins for users. Or I can do something completely different.

So if you have any ideas, feel free to say so!

by Matt at May 29, 2016 10:11 AM

May 27, 2016

Continuum Analytics news

Taking the Wheel: How Open Source is Driving Data Science

Posted Friday, May 27, 2016

The world is a big, exciting place—and thanks to cutting-edge technology, we now have amazing ways to explore its many facets. Today, self-driving cars, bullet trains and even private rocket ships allow humans to travel anywhere faster, more safely and more efficiently than ever before. 

But technology's impact on our exploratory abilities isn't just limited to transportation: it's also revolutionizing how we navigate the Data Science landscape. More companies are moving toward Open Data Science and the open source technology that underlies it. As a result, we now have an amazing new fleet of vehicles for our data-related excursions. 

We're no longer constrained to the single railroad track or state highway of a proprietary analytics product. We can use hundreds of freely available open source libraries for any need: web scraping, ingesting and cleaning data, visualization, predictive analytics, report generation, online integration and more. With these tools, any corner of the Data Science map—astrophysics, financial services, public policy, you name it—can be reached nimbly and efficiently. 

But even in this climate of innovation, nobody can afford to completely abandon previous solutions, and traditional approaches still remain viable. Fortunately, graceful interoperability is one of the hallmarks of Open Data Science. In appropriate scenarios, it accommodates the blending of legacy code or proprietary products with open source solutions. After all, sometimes taking the train is necessary and even preferable.

Regardless of which technology teams use, the open nature of Open Data Science allows