April 25, 2017

Matthieu Brucher

Announcement: ATKStereoUniversalDelay 1.0.0

I’m happy to announce the release of a stereo delay plugin, based on the Audio Toolkit, that allows ping-pong-like effects. It is available on Windows and OS X (min. 10.11) in different formats.

ATKStereoUniversalDelay

The supported formats are:

  • VST2 (32-bit/64-bit on Windows, 64-bit on OS X)
  • VST3 (32-bit/64-bit on Windows, 64-bit on OS X)
  • Audio Unit (64-bit, OS X)

Direct link for ATKStereoUniversalDelay.

The files, as well as the previous plugins and the source code, can be downloaded from SourceForge.


by Matt at April 25, 2017 07:40 AM

April 24, 2017

numfocus

Moore Foundation gives grant to support NumFOCUS Diversity & Inclusion in Scientific Computing initiatives

As part of our mission to support and promote better science through support of the open source scientific software community, NumFOCUS champions technical progress through diversity. NumFOCUS recognizes that the open source data science community is currently highly homogenous. We believe that diverse contributors and community members produce better science and better projects. NumFOCUS strives […]

by Gina Helfrich at April 24, 2017 07:23 PM

Anyone Can Do Astronomy with Python and Open Data

Ole Moeller-Nilsson, CTO at Pivigo, was kind enough to share his insights on how a beginner can easily get started exploring astronomy using Python. This blog post grew out of a presentation he gave at PyData London meetup on March 7th. Python is a great language for science, and specifically for astronomy. The various packages […]

by Gina Helfrich at April 24, 2017 03:00 PM

April 23, 2017

Titus Brown

A (revised and updated) shotgun metagenome workshop at UC Santa Cruz

We just finished teaching a second version of our two-day shotgun metagenome analysis workshop, this time at UC Santa Cruz (the first one was in October 2016, at Scripps Institute of Oceanography). Harriet Alexander led the workshop and Phillip Brooks and I co-taught; Luiz Irber, Shannon Joslin, and Taylor Reiter TAed. The workshop was hosted by Professor Marilou Sison-Mangus at the Earth and Marine Sciences Building.

(Note that Harriet will be running an expanded version of this workshop at our summer institute, July 17-21. Registration is still open!)

About 30-35 people came the first day, and about 30 were there on the second.

Some good - new lessons!

In addition to our old lessons on Illumina read QC, assembly with MEGAHIT, annotation with Prokka, and quantification with Salmon, we introduced two new lessons --

For all of this we used subset data from Hu et al. (the Banfield Lab), 2016, which is a great low-complexity metagenome.

More good - using XSEDE Jetstream instead of Amazon Web Services!

This was the first genomics workshop in many years where we didn't use Amazon Web Services - we used XSEDE Jetstream instead. See our login instructions here.

Why are we abandoning Amazon? Two reasons --

  • while we've been teaching it for almost 8 years now, the conversion rate seems to be very low: AFAICT our students aren't using it, because it costs money and their advisors don't want to pay for AWS when they can use institutional resources. (This is anecdotal.)
  • since sometime before October 2016, Amazon changed their registration system so that newly registered people cannot start up instances for a few hours after their first try. This is death on half-day and two-day workshops. (You can read a bit more about it here.) There seems to be nothing that AWS folk can do to help us, so we are giving up.

I am happy to report that Jetstream went more smoothly than AWS in almost every way and seems to perfectly meet our needs for training! We may have more to say about it after our summer institute's use.

I also suspect that people will be more inclined to use Jetstream if they can get allocations on it for free; there was significant interest in this during the workshop.

Other good --

  • As always, the people that attended the workshop were fantastic, and dealt with our occasional hiccups pretty well!
  • We managed to pretty smoothly move between the command line and the Jupyter Notebook for two of the lessons, which was pretty cool.
  • We managed to implement a simple demo of a tetramer nucleotide frequency clustering system using sourmash and t-SNE - see the notebook on github (which should be run after the initial steps in the binning lesson); a standalone sketch of the underlying idea follows this list.
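
As a rough illustration of that idea (not the workshop notebook, which uses sourmash signatures): compute a tetranucleotide (4-mer) frequency vector per contig, then embed the vectors in 2D with t-SNE. The contigs below are made-up toy sequences.

from itertools import product
import numpy as np
from sklearn.manifold import TSNE

KMERS = [''.join(k) for k in product('ACGT', repeat=4)]
INDEX = {k: i for i, k in enumerate(KMERS)}

def tetramer_freqs(seq):
    """Normalized 4-mer frequency vector of a DNA sequence."""
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in INDEX:
            counts[INDEX[kmer]] += 1
    return counts / max(counts.sum(), 1.0)

contigs = ['ACGTTGCA' * 150, 'AATTCCGG' * 150, 'GGCCGGTA' * 150,
           'ACGTACGT' * 150, 'AATTAATT' * 150, 'GGCCATAT' * 150]  # toy "contigs"
X = np.array([tetramer_freqs(c) for c in contigs])
coords = TSNE(n_components=2, perplexity=2).fit_transform(X)  # 2D embedding to plot/cluster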

There is no bad or ugly

Nothing went wrong! Which I guess is a 'good' all on its own!

There were a few minor issues with the Jetstream desktop, some problems with starting up Jetstream instances every now and then, and the guest network at UCSC blocked port 8000 (which we used for Jupyter), but most of the time we could work around these issues.

Feedback from participants

The in-person feedback (which is admittedly always kinder than the anonymous feedback :) was excellent - students really liked the hands-on teaching style (Carpentry-style, but with copy/paste), and the slow teaching pace with lots of time for questions was well received.

Misc notes

As always, our materials are available under CC0 on github - the URL is https://github.com/ngs-docs/2017-ucsc-metagenomics.

--titus

by C. Titus Brown at April 23, 2017 10:00 PM

April 21, 2017

numfocus

NumFOCUS Welcomes SunPy, Our Newest Fiscally Sponsored Project

​NumFOCUS is pleased to announce the addition of SunPy to our fiscally sponsored projects. SunPy is a community-developed, free and open-source software library for solar physics based on Python. The aim of the SunPy project is to provide the software tools necessary so that anyone can analyze solar data. SunPy is written using the Python programming language […]

by Gina Helfrich at April 21, 2017 05:04 PM

April 20, 2017

Continuum Analytics news

Two Peas in a Pod: Anaconda + IBM Cognitive Systems

Thursday, April 20, 2017
Travis Oliphant
President, Chief Data Scientist & Co-Founder

There is no question that deep learning has come out to play across a wide range of sectors—finance, marketing, pharma, legal...the list goes on. What’s more, from now until 2022, the deep learning market is expected to grow more than 65 percent. Clearly, companies are increasingly looking deeply at this popular machine learning approach to help fulfill business needs. Deep learning makes it possible to process giant datasets with billions of elements and extract useful predictive models. Deep learning is transforming the businesses of leading consumer Web and mobile app companies and is also being adopted by more traditional business enterprises. 

That’s why this week we are pleased to announce the availability of Anaconda on IBM’s Cognitive Systems, the company’s high performance deep learning platform, highlighting the fact that Anaconda is regarded as an important capability for developers building cognitive solutions. The platform empowers these developers and data scientists to build and deploy deep learning applications that are ready to scale. Anaconda is also integrating with the IBM PowerAI software distribution that makes it simpler for companies to take advantage of Power performance and GPU optimization for data intensive cognitive workloads. 

At Anaconda, we’re helping leading businesses across the world, like IBM, solve the world’s most challenging problems—from improving medical treatments to discovering planets to predicting effects of public policy—by handing them tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence. This news reiterates the importance of Open Data Science in all factors of business. 

Want to learn more about this news? Read the press release here.

by swebster at April 20, 2017 03:05 PM

NeuralEnsemble

PyNN 0.9.0 released

I'm happy to announce the release of PyNN 0.9.0!

This version of PyNN adopts the new, simplified Neo object model, first released as Neo 0.5.0, for the data structures returned by Population.get_data(). For more information on the new Neo API, see the Neo release notes.

The main difference for a PyNN user is that the AnalogSignalArray class has been renamed to AnalogSignal, and similarly the Segment.analogsignalarrays attribute is now called Segment.analogsignals.
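
In practice, updating existing analysis code is mostly a rename. A sketch (here population stands for any recorded Population; only the class/attribute names mentioned above change):

block = population.get_data()
segment = block.segments[0]

# Before (PyNN 0.8 / Neo < 0.5):
# signals = segment.analogsignalarrays   # list of AnalogSignalArray objects

# Now (PyNN 0.9 / Neo 0.5):
signals = segment.analogsignals          # list of AnalogSignal objects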

What is PyNN?

PyNN (pronounced 'pine') is a simulator-independent language for building neuronal network models.

In other words, you can write the code for a model once, using the PyNN API and the Python programming language, and then run it without modification on any simulator that PyNN supports (currently NEURON, NEST and Brian as well as the SpiNNaker and BrainScaleS neuromorphic hardware systems).

Even if you don't wish to run simulations on multiple simulators, you may benefit from writing your simulation code using PyNN's powerful, high-level interface. In this case, you can use any neuron or synapse model supported by your simulator, and are not restricted to the standard models.

The code is released under the CeCILL licence (GPL-compatible).

by Andrew Davison (noreply@blogger.com) at April 20, 2017 11:11 AM

April 19, 2017

Matthew Rocklin

Asynchronous Optimization Algorithms with Dask

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

In a previous post we built convex optimization algorithms with Dask that ran efficiently on a distributed cluster and were important for a broad class of statistical and machine learning algorithms.

We now extend that work by looking at asynchronous algorithms. We show the following:

  1. APIs within Dask to build asynchronous computations generally, not just for machine learning and optimization
  2. Reasons why asynchronous algorithms are valuable in machine learning
  3. A concrete asynchronous algorithm (Async ADMM) and its performance on a toy dataset

This blogpost is co-authored by Chris White (Capital One) who knows optimization and Matthew Rocklin (Continuum Analytics) who knows distributed computing.

Reproducible notebook available here

Asynchronous vs Blocking Algorithms

When we say asynchronous we contrast it against synchronous or blocking.

In a blocking algorithm you send out a bunch of work and then wait for the result. Dask’s normal .compute() interface is blocking. Consider the following computation where we score a bunch of inputs in parallel and then find the best:

import dask

scores = [dask.delayed(score)(x) for x in L]  # many lazy calls to the score function
best = dask.delayed(max)(scores)
best = best.compute()  # Trigger all computation and wait until complete

This blocks. We can’t do anything while it runs. If we’re in a Jupyter notebook we’ll see a little asterisk telling us that we have to wait.

A Jupyter notebook cell blocking on a dask computation

In a non-blocking or asynchronous algorithm we send out work and track results as they come in. We are still able to run commands locally while our computations run in the background (or on other computers in the cluster). Dask has a variety of asynchronous APIs, but the simplest is probably the concurrent.futures API where we submit functions and then can wait and act on their return.

from dask.distributed import Client, as_completed
client = Client('scheduler-address:8786')

# Send out several computations
futures = [client.submit(score, x) for x in L]

# Find max as results arrive
best = 0
for future in as_completed(futures):
    score = future.result()
    if score > best:
        best = score

These two solutions are computationally equivalent. They do the same work and run in the same amount of time. The blocking dask.delayed solution is probably simpler to write down but the non-blocking futures + as_completed solution lets us be more flexible.

For example, if we get a score that is good enough then we might stop early. If we find that certain kinds of values are giving better scores than others then we might submit more computations around those values while cancelling others, changing our computation during execution.
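For example, a minimal sketch of early stopping on top of the futures example above (GOOD_ENOUGH is a made-up threshold; cancelling a future tells the scheduler to abandon that work):

best = 0
for future in as_completed(futures):
    score = future.result()
    if score > best:
        best = score
    if best >= GOOD_ENOUGH:       # stop as soon as a result is good enough
        for f in futures:
            if not f.done():
                f.cancel()        # abandon computations we no longer need
        break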

This ability to monitor and adapt a computation during execution is one reason why people choose asynchronous algorithms. In the case of optimization algorithms we are doing a search process and frequently updating parameters. If we are able to update those parameters more frequently then we may be able to slightly improve every subsequently launched computation. Asynchronous algorithms enable increased flow of information around the cluster in comparison to more lock-step batch-iterative algorithms.

Asynchronous ADMM

In our last blogpost we showed a simplified implementation of Alternating Direction Method of Multipliers (ADMM) with dask.delayed. We saw that in a distributed context it performed well when compared to a more traditional distributed gradient descent. This algorithm works by solving a small optimization problem on every chunk of our data using our current parameter estimates, bringing these back to the local process, combining them, and then sending out new computation on updated parameters.

Now we alter this algorithm to update asynchronously, so that our parameters change continuously as partial results come in in real-time. Instead of sending out and waiting on batches of results, we now consume and emit a constant stream of tasks with slightly improved parameter estimates.

We show three algorithms in sequence:

  1. Synchronous: The original synchronous algorithm
  2. Asynchronous-single: updates parameters with every new result
  3. Asynchronous-batched: updates with all results that have come in since we last updated.

Setup

We create fake data

import numpy as np
import dask.array as da
from dask import persist  # or client.persist, depending on your setup

n, k, chunksize = 50000000, 100, 50000

beta = np.random.random(k)  # random beta coefficients, no intercept
zero_idx = np.random.choice(len(beta), size=10)
beta[zero_idx] = 0  # set some parameters to 0
X = da.random.normal(0, 1, size=(n, k), chunks=(chunksize, k))
y = X.dot(beta) + da.random.normal(0, 2, size=n, chunks=(chunksize,))  # add noise

X, y = persist(X, y)  # trigger computation in the background

We define local functions for ADMM. These correspond to solving an l1-regularized Linear regression problem:

def local_f(beta, X, y, z, u, rho):
    return ((y - X.dot(beta)) **2).sum() + (rho / 2) * np.dot(beta - z + u,
                                                              beta - z + u)

def local_grad(beta, X, y, z, u, rho):
    return 2 * X.T.dot(X.dot(beta) - y) + rho * (beta - z + u)


def shrinkage(beta, t):
    return np.maximum(0, beta - t) - np.maximum(0, -beta - t)

from functools import partial

# local_update is the black-box per-chunk solver from the previous blogpost
local_update2 = partial(local_update, f=local_f, fprime=local_grad)

lamduh = 7.2 # regularization parameter

# algorithm parameters
rho = 1.2
abstol = 1e-4
reltol = 1e-2

nchunks = n // chunksize  # number of chunks of data
p = k                     # number of parameters (one per feature)

z = np.zeros(p)  # the initial consensus estimate

# an array of the individual "dual variables" and parameter estimates,
# one for each chunk of data
u = np.array([np.zeros(p) for i in range(nchunks)])
betas = np.array([np.zeros(p) for i in range(nchunks)])

Finally, because ADMM doesn’t want to work on distributed arrays, but instead on lists of remote numpy arrays (one numpy array per chunk of the dask.array), we convert each of our dask arrays into a list of dask.delayed objects:

XD = X.to_delayed().flatten().tolist() # a list of numpy arrays, one for each chunk
yD = y.to_delayed().flatten().tolist()

Synchronous ADMM

In this algorithm we send out many tasks to run, collect their results, update parameters, and repeat. In this simple implementation we continue for a fixed amount of time but in practice we would want to check some convergence criterion.

start = time.time()

while time.time() - start < MAX_TIME:
    # process each chunk in parallel, using the black-box 'local_update' function
    betas = [delayed(local_update2)(xx, yy, bb, z, uu, rho)
             for xx, yy, bb, uu in zip(XD, yD, betas, u)]
    betas = np.array(da.compute(*betas))  # collect results back

    # Update Parameters
    ztilde = np.mean(betas + np.array(u), axis=0)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    u += betas - z  # update dual variables

    # track convergence metrics
    update_metrics()

Asynchronous ADMM

In the asynchronous version we send out only enough tasks to occupy all of our workers. We collect results one by one as they finish, update parameters, and then send out a new task.

# Submit enough tasks to occupy our current workers
starting_indices = np.random.choice(nchunks, size=ncores*2, replace=True)  # ncores: total worker cores (assumed defined)
futures = [client.submit(local_update, XD[i], yD[i], betas[i], z, u[i],
                           rho, f=local_f, fprime=local_grad)
           for i in starting_indices]
index = dict(zip(futures, starting_indices))

# An iterator that returns results as they come in
pool = as_completed(futures, with_results=True)

start = time.time()
count = 0

while time.time() - start < MAX_TIME:
    # Get next completed result
    future, local_beta = next(pool)
    i = index.pop(future)
    betas[i] = local_beta
    count += 1

    # Update parameters (this could be made more efficient)
    ztilde = np.mean(betas + np.array(u), axis=0)

    if count < nchunks:  # artificially inflate beta in the beginning
        ztilde *= nchunks / (count + 1)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    update_metrics()

    # Submit new task to the cluster
    i = random.randint(0, nchunks - 1)
    u[i] += betas[i] - z
    new_future = client.submit(local_update2, XD[i], yD[i], betas[i], z, u[i], rho)
    index[new_future] = i
    pool.add(new_future)

Batched Asynchronous ADMM

With enough distributed workers we find that our parameter-updating loop on the client can be the limiting factor. After profiling it seems that our client was bound not by updating parameters, but rather by computing the performance metrics that we are going to use for the convergence plots below (so not actually a limitation in practice). However we decided to leave this in because it is good practice for what is likely to occur in larger clusters, where the single machine that updates parameters is possibly overwhelmed by a high volume of updates from the workers. To resolve this, we build in batching.

Rather than update our parameters one by one, we update them with however many results have come in so far. This provides a natural defense against a slow client. This approach smoothly shifts our algorithm back over to the synchronous solution when the client becomes overwhelmed. (though again, at this scale we’re fine).

Conveniently, the as_completed iterator has a .batches() method that iterates over all of the results that have come in so far.

# ... same setup as before

pool = as_completed(new_betas, with_results=True)

batches = pool.batches()            # <<<--- this is new

while time.time() - start < MAX_TIME:

    # Get all tasks that have come in since we checked last time
    batch = next(batches)           # <<<--- this is new
    for future, result in batch:
        i = index.pop(future)
        betas[i] = result
        count += 1

    ztilde = np.mean(betas + np.array(u), axis=0)
    if count < nchunks:
        ztilde *= nchunks / (count + 1)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    update_metrics()

    # Submit as many new tasks as we collected
    for _ in batch:                 # <<<--- this is new
        i = random.randint(0, nchunks - 1)
        u[i] += betas[i] - z
        new_fut = client.submit(local_update2, XD[i], yD[i], betas[i], z, u[i], rho)
        index[new_fut] = i
        pool.add(new_fut)

Visual Comparison of Algorithms

To show the qualitative difference between the algorithms we include profile plots of each. Note the following:

  1. Synchronous has blocks of full CPU use followed by blocks of no use
  2. The Asynchronous methods are smoother
  3. The Asynchronous single-update method has a lot of whitespace / time when CPUs are idling. This is artificial, because our code that tracks convergence diagnostics for the plots below is wasteful and sits inside the client inner loop
  4. We intentionally leave in this wasteful code so that we can reduce it by batching in the third plot, which is more saturated.

You can zoom in using the tools to the upper right of each plot. You can view the full profile in a full window by clicking on the “View full page” link.

Synchronous

View full page

Asynchronous single-update

View full page

Asynchronous batched-update

View full page

Plot Convergence Criteria

Primal residual for async-admm

Analysis

To get a better sense of what these plots convey, recall that optimization problems always come in pairs: the primal problem is typically the main problem of interest, and the dual problem is a closely related problem that provides information about the constraints in the primal problem. Perhaps the most famous example of duality is the Max-flow-min-cut Theorem from graph theory. In many cases, solving both of these problems simultaneously leads to gains in performance, which is what ADMM seeks to do.

In our case, the constraint in the primal problem is that all workers must agree on the optimum parameter estimate. Consequently, we can think of the dual variables (one for each chunk of data) as measuring the “cost” of agreement for their respective chunks. Intuitively, they will start out small and grow incrementally to find the right “cost” for each worker to have consensus. Eventually, they will level out at an optimum cost.

So:

  • the primal residual plot measures the amount of disagreement; “small” values imply agreement
  • the dual residual plot measures the total “cost” of agreement; this increases until the correct cost is found

The plots then tell us the following:

  • the cost of agreement is higher for asynchronous algorithms, which makes sense because each worker is always working with a slightly out-of-date global parameter estimate, making consensus harder
  • blocked ADMM doesn’t update at all until shortly after 5 seconds have passed, whereas async has already had time to converge. (In practice with real data, we would probably specify that all workers need to report in every K updates).
  • asynchronous algorithms take a little while for the information to properly diffuse, but once that happens they converge quickly.
  • both asynchronous and synchronous converge almost immediately; this is most likely due to a high degree of homogeneity in the data (which was generated to fit the model well). Our next experiment should involve real world data.

What we could have done better

Analysis-wise, we expect richer results from performing this same experiment on a real-world dataset that isn’t as homogeneous as the current toy dataset.

Performance-wise, we can get much better CPU saturation by doing two things:

  1. Not running our convergence diagnostics, or making them much faster
  2. Not running full np.mean computations over all of beta when we’ve only updated a few elements. Instead we should maintain a running aggregation of these results (a rough sketch of such an incremental update follows this list).
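
For the second point, a minimal sketch of the idea (not code from the experiment; it assumes the nchunks, betas and u arrays defined earlier):

# Keep the mean of (betas + u) up to date incrementally instead of recomputing it.
running_mean = np.mean(betas + np.array(u), axis=0)  # computed once at start-up

def apply_chunk_update(i, new_beta, new_u):
    """Swap in chunk i's new contribution and adjust the running mean."""
    global running_mean
    old = (betas[i] + u[i]) / nchunks
    new = (new_beta + new_u) / nchunks
    running_mean += new - old
    betas[i], u[i] = new_beta, new_u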

With these two changes (each of which is easy) we’re fairly confident that we can scale out to decently large clusters while still saturating hardware.

April 19, 2017 12:00 AM

April 18, 2017

numfocus

NumFOCUS is Now Hiring: Events Coordinator

NumFOCUS seeks applicants for the position of Events Coordinator. Join our small but mighty team and help support data science communities all over the world! This is a junior-level position with the opportunity to make a big impact. Why Work at NumFOCUS? We support the scientific software that makes reproducible science possible! Come work with […]

by Gina Helfrich at April 18, 2017 03:56 PM

Enthought

Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way

This is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

No dataset is perfect and most datasets that we have to deal with on a day-to-day basis have values missing, often represented by “NA” or “NaN”. One of the reasons why the Pandas library is as popular as it is in the data science community is because of its capabilities in handling data that contains NaN values.

But spending time looking up the relevant Pandas commands might be cumbersome when you are exploring raw data or prototyping your data analysis pipeline. This is one of the places where the Canopy Data Import Tool helps make data munging faster and easier, by simplifying the task of identifying missing values in your raw data and removing/replacing them.

Why are missing values a problem you ask? We can answer that question in the context of machine learning. scikit-learn and TensorFlow are popular and widely used libraries for machine learning in Python. Both of them caution the user about missing values in their datasets. Various machine learning algorithms expect all the input values to be numerical and to hold meaning. Both of the libraries suggest removing rows and/or columns that contain missing values.

If removing the missing values is not an option, given the size of your dataset, then they suggest replacing the missing values. The scikit-learn library provides an Imputer class, which can be used to replace missing values. See the scikit-learn documentation for an example of how the Imputer class is used. Similarly, the decode_csv function in the TensorFlow library can be passed a record_defaults argument, which will replace missing values in the dataset. See the TensorFlow documentation for specifics.
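
As a quick illustration, mean-imputation with scikit-learn’s Imputer looks roughly like this (a minimal sketch on made-up data; in later scikit-learn versions the class was replaced by SimpleImputer):

import numpy as np
from sklearn.preprocessing import Imputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
X_clean = imputer.fit_transform(X)  # the NaN in column 0 becomes (1 + 7) / 2 = 4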

The Data Import Tool provides capabilities to handle missing values in your dataset because we strongly believe that discovering and handling missing values in your dataset is a part of the data import and cleaning phase and not the analysis phase of the data science process.

Digging into the specifics, here we’ll compare how you can go about handling missing values with three typical scenarios, first using the Pandas library, then contrasting with the Data Import Tool:

  1. Identifying missing values in data
  2. Replacing missing values in data, and
  3. Removing missing values from data.

Note: Pandas’ internal representation of your data is called a DataFrame. A DataFrame is simply a tabular data structure, similar to a spreadsheet or a SQL table.


Identifying Missing Values – The Hard Way: Using Pandas

If you are interested in identifying missing values in a row/column of a DataFrame, you need to understand the isnull, any, all methods on a DataFrame.

Taking a detour, we have so far described missing values as being represented by NA or NaN. Instead, what if missing values in a column are values that aren’t of the same type as the rest of the cells in the column, say for example a string in a column containing integers? Detecting those in Pandas is not trivial.
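
A minimal sketch of the Pandas route for both cases (the DataFrame below is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': ['x', 'y', 5]})

df.isnull()               # boolean DataFrame, True where a cell is NaN
df.isnull().any(axis=0)   # per column: does the column contain any NaN?
df.isnull().all(axis=1)   # per row: is the entire row missing?

# Spotting "wrong type" values (e.g. a string in a numeric column) takes extra work:
pd.to_numeric(df['b'], errors='coerce').isnull()  # True where 'b' is not numeric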

Identifying Missing Values – The Easy Way: Using the Data Import Tool

Highlighting null values using the Data Import Tool

Instead of giving you the column names and index values of the cells containing missing values, the Data Import Tool shows them to you. Simply checking the `Highlight Missing Values` checkbox in the bottom-left corner of the Data Import Tool will paint the DataFrame to show you the cells that contain missing values. Further, the Data Import Tool understands that your data file might have errors, like having a string value in a column otherwise containing integers. The Data Import Tool highlights the cell and displays the underlying content too.

The Data Import Tool can highlight missing value cells, helping you easily identify columns or rows containing NaN values


Replacing Missing Values – The Hard Way: Using Pandas

While Pandas does a great job at handling column operations even if the columns contain NaN values, our data analysis workflow might need us to replace the missing values in our data.

After spending a little time browsing through the Pandas documentation, you will come across the `fillna` method on a DataFrame, which can be used to replace missing values. The arguments you pass to the fillna method will determine what value the missing values in your DataFrame are replaced with and how the underlying column dtypes change after replacing the missing values.

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
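
For example, a few common calls (a sketch on a made-up DataFrame df):

df_zero   = df.fillna(0)                          # replace every NaN with 0
df_values = df.fillna({'a': 0, 'b': 'missing'})   # per-column replacement values
df_ffill  = df.fillna(method='ffill')             # propagate the last valid value forward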

Replacing Missing Values – The Easy Way: Using the Data Import Tool

With the Data Import Tool, you can replace missing values by right-clicking on the column containing missing values and selecting the appropriate Fill Missing Values item. Opting to replace missing values in the column with a specific value will open an additional dialog, prompting you to enter the value.

Fill missing values

Replace missing values in your DataFrame using the Canopy Data Import Tool


Removing Missing Values – The Hard Way: Using Pandas

While removing columns or rows containing missing values might be a little extreme, it might be necessary. Pandas suggests that you use the dropna method on the DataFrame to drop columns or rows that contain missing values. The arguments you pass to the dropna method will determine what rows/columns are removed from the DataFrame.

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
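
For example (again on a made-up DataFrame df):

df_rows = df.dropna(axis=0, how='any')   # drop rows containing any NaN
df_cols = df.dropna(axis=1, how='all')   # drop columns that are entirely NaN
df_min  = df.dropna(thresh=2)            # keep rows with at least 2 non-NaN values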

Removing Missing Values – The Easy Way: Using the Data Import Tool

With the Data Import Tool on the other hand, you can remove rows/columns containing missing values by selecting the “Delete Empty Columns” or “Delete Empty Rows” item from the “Transform” menu. An additional dialog will pop up asking you how lenient you want to be in removing rows/columns containing missing values – if you choose ‘any’, the Data Import Tool will remove rows/columns that contain any missing values; if you choose ‘all’, the Data Import Tool will only remove those rows/columns which contain only missing values.

Delete Empty Rows & Columns

Delete empty cells in rows/columns using the Canopy Data Import Tool

Delete Empty Columns

Choose to delete columns containing any null value or columns full of null values using the Canopy Data Import Tool

At this point, we have data that contains no missing values. So far, we’ve used the Data Import Tool to easily discover the missing values in our dataset and to remove/replace them. Finally, by clicking on ‘Use DataFrame’, you can import the dataset as a pandas DataFrame into the IPython workspace of the Canopy Editor. If you’re a data scientist, your data is now free of missing values and can be converted to arrays or variables and passed on to scikit-learn, TensorFlow or any other machine learning library of your choice.

Ready to try the Canopy Data Import Tool?

Download Canopy (free) and click on the icon to start a free trial of the Data Import Tool today

This is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way


Additional resources:

Watch a 2-minute demo video to see how the Canopy Data Import Tool works:

See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging:

The post Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way appeared first on Enthought Blog.

by Rahul Poruri at April 18, 2017 02:21 PM

Pierre de Buyl

A concise derivation of the Wiener-Khinchin theorem

Introduction

While teaching a class on statistical physics, I found myself unhappy with textbook derivations of the Wiener-Khinchin theorem. I worked my way to a very short derivation that is free of integral bounds manipulations and of holes, or so I believe.

In the books by Balakrishnan (Elements of Nonequilibrium Statistical Mechanics, Ane Books, 2008), Risken (The Fokker-Planck Equation, 2nd edition, Springer-Verlag, 1989) and Coffey-Kalmykov-Waldron (The Langevin Equation, 2nd edition, World Scientific, 2004), or on MathWorld, I could not find a short derivation that would cleanly take into account the averaging procedure or that would not resort to splittings of the domain of integration. I present here one derivation and a numerical illustration with Python.

Definitions

  • We are interested in the real-valued signal or process $\xi(t)$
  • $\tilde\xi(\omega) = \int_{-\infty}^\infty dt e^{-i\omega t} \xi(t)$ is the Fourier transform of $\xi(t)$.
  • Complex conjugates are denoted by a star $\tilde\xi^\ast(\omega)$
  • $\xi(t) = (2\pi)^{-1} \int_{-\infty}^\infty d\omega e^{i\omega t} \tilde\xi(\omega)$ defines the inverse Fourier transform.
  • The autocorrelation of $\xi(t)$ is $C(\tau) = \lim_{T\to\infty} T^{-1} \int_0^T dt \xi(t) \xi(t+\tau)$
  • The spectral power density is $S(\omega) = |\tilde\xi(\omega)|^2 / (2\pi)$

Derivation

We start by the definition of the autocorrelation

$$C(\tau) = \lim_{T\to\infty} T^{-1} \int_0^T dt \xi(t) \xi(t+\tau)$$

and replace $\xi(t)$ by the inverse transform of $\tilde\xi(\omega)$

$$C(\tau) = \lim_{T\to\infty} T^{-1} \int_0^T dt\int_{-\infty}^\infty \frac{d\omega}{2\pi} e^{i\omega t} \tilde\xi(\omega) \int_{-\infty}^\infty \frac{d\omega'}{2\pi} e^{i\omega' (t+\tau)} \tilde\xi(\omega')~,$$

change the order of the integrals $$C(\tau) = \lim_{T\to\infty} \int_{-\infty}^\infty \frac{d\omega}{2\pi} \int_{-\infty}^\infty \frac{d\omega'}{2\pi} T^{-1} \int_0^T dt e^{i\omega t} e^{i\omega' (t+\tau)} \tilde\xi(\omega) \tilde\xi(\omega')~,$$ $$C(\tau) = \lim_{T\to\infty} \int_{-\infty}^\infty \frac{d\omega}{2\pi} \int_{-\infty}^\infty \frac{d\omega'}{2\pi} T^{-1} \int_0^T dt e^{i(\omega+\omega') t} e^{i\omega' \tau} \tilde\xi(\omega) \tilde\xi(\omega')~.$$

The integral over $t$ will depend on the value of $\omega+\omega'$. Explicitly, the subcases are

  1. $\omega+\omega'\neq 0$ Here, writing $T= n 2 \pi / (\omega+\omega') + T'$, where $0\leq T' < 2 \pi / (\omega+\omega')$, we find that the integral is zero to order $\approx 1/T$ (see the explicit bound below).
  2. $\omega+\omega'=0$ Here the integral is equal to $T$
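
To make the first case explicit (an added intermediate step, using only the definitions above), the time integral is bounded uniformly in $T$:

$$\left|\frac{1}{T}\int_0^T dt\, e^{i(\omega+\omega') t}\right| = \frac{\left|e^{i(\omega+\omega') T}-1\right|}{|\omega+\omega'|\,T} \leq \frac{2}{|\omega+\omega'|\,T}~,$$

which vanishes as $T\to\infty$.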

By only keeping the non-zero contribution, we can use $\omega+\omega'=0$ and obtain $$C(\tau) = \frac{1}{2\pi} \int_{-\infty}^\infty \frac{d\omega}{2\pi} e^{-i\omega \tau} \tilde\xi(\omega) \tilde\xi(-\omega)$$ and by using the fact that for a real-valued $\xi(t)$, $\tilde\xi(-\omega)=\tilde\xi^\ast(\omega)$ and changing the variable $\omega$ to $-\omega$ in the integral $$C(\tau) = \frac{1}{2\pi} \int_{-\infty}^\infty \frac{d\omega}{2\pi} e^{i\omega \tau} \tilde\xi(\omega) \tilde\xi^\ast(\omega) = \frac{1}{2\pi} \int_{-\infty}^\infty \frac{d\omega}{2\pi} e^{i\omega \tau} |\tilde\xi(\omega)|^2$$

We have thus obtained the Wiener-Khinchin theorem that states that the autocorrelation of a signal is the inverse Fourier transform of its spectral power density divided by $2\pi$.

Illustration

For illustration, I consider a periodic signal and an Ornstein–Uhlenbeck process. There are small differences with respect to the analytical results because the signals are of finite length, though! For the numerical evaluation of the autocorrelation using FFTs, see the dedicated blog post: http://pdebuyl.be/blog/2016/correlators.html

The "code cell" below loads the math, random, NumPy, matplotlib and SciPy libraries.

In [1]:
%matplotlib inline
import math
import random
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal
plt.rcParams['font.size'] = 18
plt.rcParams['figure.subplot.wspace'] = 0.25

Periodic signal

Let us consider a sinusoid with angular frequency $\omega = 2.7$. The power spectrum will have peaks at $\pm\omega$ that would converge to Dirac deltas for an infinite time series.

The autocorrelation can be computed analytically and is a cosine. It is plotted for reference.

In [2]:
# Define the signal

N = 1024
omega = 2.7
dt = 2*np.pi/omega/128
time = np.arange(N)*dt
xi = np.sin(omega*time)

plt.plot(time, xi)
plt.xlabel(r'$t$')
plt.ylabel(r'$\xi(t)$');
In [3]:
# Analytical value of the autocorrelation

plt.plot(time, 0.5*np.cos(omega*time))

# Compute numerically the autocorrelation via a Fourier transform

fft_cor = scipy.signal.fftconvolve(xi, xi[::-1])[N-1:]
fft_cor /= (N - np.arange(N))
plt.plot(time, fft_cor, 'k-', lw=2)

# Compute the autocorrelation via the Wiener-Khinchin theorem
# The NumPy fft routines include the 2 pi factors

psd = np.fft.fft(xi)*np.conj(np.fft.fft(xi))/N
C = np.fft.ifft(psd).real

plt.plot(time, C)


plt.xlim(0, 5*2*np.pi/omega)

plt.xlabel(r'$\tau$')
plt.ylabel(r'$C(\tau)$');
In [4]:
# Plot the spectral power density

plt.plot(np.fft.fftfreq(N, dt), psd.real)
plt.axvline(-omega/(2*np.pi), ls='--', c='k')
plt.axvline(omega/(2*np.pi), ls='--', c='k')

plt.xlim(-2, 2)
plt.xlabel(r'$\nu$')
plt.ylabel(r'$|\tilde\xi(\nu)|^2$');

Ornstein–Uhlenbeck process

The Ornstein–Uhlenbeck (OU) process is defined by the following Langevin equation $$\dot v = -\gamma v + \eta ~,$$ where $\gamma$ is the friction and $\eta$ is the white-noise term, obeying $\langle \eta(t) \eta(t+\tau) \rangle = 2\gamma\,\delta(\tau)$.

The dynamics is solved with the first order Euler scheme $v_{i+1}=v_i - \gamma v_i \Delta t + \sqrt{2\gamma\Delta t} \chi$ where $\chi$ is a normally sampled number with zero mean and unit variance. $v(t=0)=0$ and 1024 loops of thermalization are performed.

For the OU process, the autocorrelation decays exponentially. The power spectrum is a Lorentzian $$S(\nu) = \frac{2\gamma}{\gamma^2 + (2\pi \nu)^2}$$ and must be scaled by the sampling time of the signal.

In [5]:
# First order Euler integration of the Langevin equation

N = 8192
v = 0
dt = 0.03
T = N*dt
time = np.arange(N)*dt
gamma = 2.5
v_factor = math.sqrt(2*gamma*dt)
v_data = []
for t in range(1024):
    F = random.gauss(0,1)
    v = v - gamma*v*dt + v_factor*F
for t in range(N):
    F = random.gauss(0,1)
    v = v - gamma*v*dt + v_factor*F
    v_data.append(v)
v_data = np.array(v_data)
In [6]:
# Plot the time series for v(t)
plt.plot(time, v_data)
plt.xlabel(r'$t$')
plt.ylabel(r'$v(t)$')
plt.xlim(0, 30/gamma);
In [7]:
# Analytical value of the autocorrelation

plt.plot(time, np.exp(-gamma*time))

# Compute numerically the autocorrelation via a Fourier transform

fft_cor = scipy.signal.fftconvolve(v_data, v_data[::-1])[N-1:]
fft_cor /= (N - np.arange(N))
plt.plot(time, fft_cor, 'k-', lw=2)

# Compute the autocorrelation via the Wiener-Khinchin theorem

psd = np.fft.fft(v_data)*np.conj(np.fft.fft(v_data))/N
C = np.fft.ifft(psd).real

plt.plot(time, C)

plt.xlim(0, 30/gamma)
plt.ylim(-0.05, 1.1)

plt.xlabel(r'$\tau$')
plt.ylabel(r'$C(\tau)$');
In [8]:
# Plot the spectral power density

# Compute the FFT with proper units
t_v = np.fft.fft(v_data)*dt
psd = (t_v*t_v.conjugate()).real
psd[N//2] = np.nan # to avoid the crossing from -infinity to infinity

# The psd is defined per unit time, so 1/T normalizes the result
plt.plot(np.fft.fftfreq(N, dt), psd/T)
# Analytical value
freqs = np.linspace(-2, 2, 100)
plt.plot(freqs, 2*gamma/((2*np.pi*freqs)**2+gamma**2))

plt.xlim(-1, 1)
plt.xlabel(r'$\nu$')
plt.ylabel(r'$S_v(\nu)$');

Ending

In this post, I presented a short derivation of the Wiener-Khinchin theorem and its numerical application to a periodic signal and to a stochastic process using the scientific Python tools.

Comments are welcome!

by Pierre de Buyl at April 18, 2017 08:00 AM

Matthieu Brucher

Announcement: Audio TK 2.0.0

ATK is updated to 2.0.0 with a major refactoring to ensure signed/unsigned consistency, a new Adaptive module and EQ design. Complex-valued filters are also now available, allowing simultaneous dual-channel processing and advanced filters like complex LMS filters.

Thanks to Travis CI and AppVeyor, binaries for the releases are now uploaded to GitHub. On all platforms we compile static and shared libraries. On Linux, builds for gcc 5, gcc 6, clang 3.8 and clang 3.9 are generated; on OS X, XCode 7 and XCode 8 builds are available as universal binaries; and on Windows, 32-bit and 64-bit builds with dynamic or static runtime (no shared libraries in the static case) are also generated.

Download link: ATK 2.0.0

Changelog:
2.0.0
* Refactored fixed line delays (performance improvement)
* Allow new filters to have unconnected inputs (can only be changed inside a filter)
* Refactored the stereo universal delay line to allow more simultaneous channels (renamed to MultipleUniversalDelayLineFilter)
* ATK now allows complex-valued filters, with filters to convert between real and complex data
* Added a BlockLMSFilter with Python wrappers
* Added a LMSFilter with Python wrappers
* Added a RemezBasedCoefficients with Python wrappers to be used with FIRFilter to generate a FIR filter from a template
* Added a RLSFilter with Python wrappers
* Support for IPP as a FFT backend
* Refactored the API for global unsigned consistency


by Matt at April 18, 2017 07:10 AM

April 17, 2017

Titus Brown

Workshops posted for DIBSI - July 10-15, July 17-21

As part of our Summer Institute in Data Intensive Biology, we will be running nine week-long computational workshops from July 10 to July 21 at the University of California, Davis.

Week 1: July 10-15

Week 2: July 17-21

All workshops will take place at UC Davis; please see the venue information for details.

Workshops may extend into the evening hours; please plan on devoting the entire time to the workshop. Workshops are $350/wk.

On-campus housing information is available for approximately $400/wk, which includes breakfast and dinner. Housing registration currently closes April 26th.

Registration links for each workshop are under the workshop description; housing is linked there as well, and must be booked separately. Attendees of both weeks of workshops may book housing for both weeks, and attendees of the two-week introductory bioinformatics workshop, ANGUS, may book a full four weeks of housing.

For questions about registration, travel, invitation letters, or other general topics, please contact dibsi.training@gmail.com. For workshop specific questions, contact the instructors (e-mail links are under each workshop).

--titus

by C. Titus Brown at April 17, 2017 10:00 PM

April 13, 2017

Matthew Rocklin

Streaming Python Prototype

This work is supported by Continuum Analytics, and the Data Driven Discovery Initiative from the Moore Foundation.

This blogpost is about experimental software. The project may change or be abandoned without warning. You should not depend on anything within this blogpost.

This week I built a small streaming library for Python. This was originally an exercise to help me understand streaming systems like Storm, Flink, Spark-Streaming, and Beam, but the end result of this experiment is not entirely useless, so I thought I’d share it. This blogpost will talk about my experience building such a system and what I valued when using it. Hopefully it elevates interest in streaming systems among the Python community.

Background with Iterators

Python has sequences and iterators. We’re used to mapping, filtering and aggregating over lists and generators happily.

def inc(x): return x + 1          # small helpers used throughout this post
def iseven(x): return x % 2 == 0

seq = [1, 2, 3, 4, 5]
seq = map(inc, seq)
seq = filter(iseven, seq)

>>> sum(seq) # 2 + 4 + 6
12

If these iterators are infinite, for example if they are coming from some infinite data feed like a hardware sensor or stock market signal, then most of these pieces still work, except for the final aggregation, which we replace with an accumulating aggregation.

from toolz import accumulate  # accumulate(binop, seq) matches the call below

def get_data():
    i = 0
    while True:
        i += 1
        yield i

seq = get_data()
seq = map(inc, seq)
seq = filter(iseven, seq)
seq = accumulate(lambda total, x: total + x, seq)

>>> next(seq)  # 2
2
>>> next(seq)  # 2 + 4
6
>>> next(seq)  # 2 + 4 + 6
12

This is usually a fine way to handle infinite data streams. However this approach becomes awkward if you don’t want to block on calling next(seq) and have your program hang until new data comes in. This approach also becomes awkward when you want to branch off your sequence to multiple outputs and consume from multiple inputs. Additionally there are operations like rate limiting, time windowing, etc. that occur frequently but are tricky to implement if you are not comfortable using threads and queues. These complications often push people to a computation model that goes by the name streaming.

To introduce streaming systems in this blogpost I’ll use my new tiny library, currently called streams (better name to come in the future). However if you decide to use streaming systems in your workplace then you should probably use some other more mature library instead. Common recommendations include the following:

  • ReactiveX (RxPy)
  • Flink
  • Storm (Streamparse)
  • Beam
  • Spark Streaming

Streams

We make a stream, which is an infinite sequence of data into which we can emit values and from which we can subscribe to make new streams.

from streams import Stream
source = Stream()

From here we replicate our example above. This follows the standard map/filter/reduce chaining API.

s = (source.map(inc)
           .filter(iseven)
           .accumulate(lambda total, x: total + x))

Note that we haven’t pushed any data into this stream yet, nor have we said what should happen when data leaves. So that we can look at results, lets make a list and push data into it when data leaves the stream.

results = []
s.sink(results.append)  # call the append method on every element leaving the stream

And now lets push some data in at the source and see it arrive at the sink:

>>> for x in [1, 2, 3, 4, 5]:
...     source.emit(x)

>>> results
[2, 6, 12]

We’ve accomplished the same result as our infinite iterator, except that rather than pulling data with next, we push data through with source.emit. And we’ve done all of this at only a 10x slowdown over normal Python iterators :) (this library takes a few microseconds per element rather than CPython’s normal 100ns overhead).

This will get more interesting in the next few sections.

Branching

This approach becomes more interesting if we add multiple inputs and outputs.

source = Stream()
s = source.map(inc)
evens = s.filter(iseven)
evens.accumulate(add)

odds = s.filter(isodd)
odds.accumulate(sub)

Or we can combine streams together

second_source = Stream()
s = combine_latest(second_source, odds).map(sum)

So you may have multiple different input sources updating at different rates and you may have multiple outputs, perhaps some going to a diagnostics dashboard, others going to long-term storage, others going to a database, etc.. A streaming library makes it relatively easy to set up infrastructure and pipe everything to the right locations.

Time and Back Pressure

When dealing with systems that produce and consume data continuously you often want to control the flow so that the rates of production are not greater than the rates of consumption. For example if you can only write data to a database at 10MB/s or if you can only make 5000 web requests an hour then you want to make sure that the other parts of the pipeline don’t feed you too much data, too quickly, which would eventually lead to a buildup in one place.

To deal with this, as our operations push data forward they also accept Tornado Futures as a receipt.

Upstream: Hey Downstream! Here is some data for you
Downstream: Thanks Upstream!  Let me give you a Tornado future in return.
            Make sure you don't send me any more data until that future
            finishes.
Upstream: Got it, Thanks!  I will pass this to the person who gave me the
          data that I just gave to you.

Under normal operation you don’t need to think about Tornado futures at all (many Python users aren’t familiar with asynchronous programming) but it’s nice to know that the library will keep track of balancing out flow. The code below uses @gen.coroutine and yield common for Tornado coroutines. This is similar to the async/await syntax in Python 3. Again, you can safely ignore it if you’re not familiar with asynchronous programming.

@gen.coroutine
def write_to_database(data):
    with connect('my-database:1234/table') as db:
        yield db.write(data)

source = Stream()
(source.map(...)
       .accumulate(...)
       .sink(write_to_database))  # <- sink produces a Tornado future

for data in infinite_feed:
    yield source.emit(data)       # <- that future passes through everything
                                  #    and ends up here to be waited on

There are also a number of operations to help you buffer flow in the right spots, control rate limiting, etc..

source = Stream()
(source.timed_window(interval=0.050)  # Capture all records of the last 50ms into batches
       .filter(len)                   # Remove empty batches
       .map(...)                      # Do work on each batch
       .buffer(10)                    # Allow ten batches to pile up here
       .sink(write_to_database))      # Potentially rate-limiting stage

I’ve written enough little utilities like timed_window and buffer to discover both that in a full system you would want more of these, and that they are easy to write. Here is the definition of timed_window

class timed_window(Stream):
    def __init__(self, interval, child, loop=None):
        self.interval = interval
        self.buffer = []
        self.last = gen.moment

        Stream.__init__(self, child, loop=loop)
        self.loop.add_callback(self.cb)

    def update(self, x, who=None):
        self.buffer.append(x)
        return self.last

    @gen.coroutine
    def cb(self):
        while True:
            L, self.buffer = self.buffer, []
            self.last = self.emit(L)
            yield self.last
            yield gen.sleep(self.interval)

If you are comfortable with Tornado coroutines or asyncio then my hope is that this should feel natural.

Recursion and Feedback

By connecting the sink of one stream to the emit function of another we can create feedback loops. Here is a stream that produces the Fibonacci sequence. To stop it from overwhelming our local process we added in a rate limiting step:

from streams import Stream
source = Stream()
s = source.sliding_window(2).map(sum)
L = s.sink_to_list()  # store result in a list

s.rate_limit(0.5).sink(source.emit)  # pipe output back to input

source.emit(0)  # seed with initial values
source.emit(1)
>>> L
[1, 2, 3, 5]

>>> L  # wait a couple seconds, then check again
[1, 2, 3, 5, 8, 13, 21, 34]

>>> L  # wait a couple seconds, then check again
[1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

Note: due to the time rate-limiting functionality this example relied on an event loop running somewhere in another thread. This is the case for example in a Jupyter notebook, or if you have a Dask Client running.

Things that this doesn’t do

If you are familiar with streaming systems then you may say the following:

Let’s not get ahead of ourselves; there’s way more to a good streaming system than what is presented here. You need to handle parallelism, fault tolerance, out-of-order elements, event/processing times, etc..

… and you would be entirely correct. What is presented here is not in any way a competitor to existing systems like Flink for production-level data engineering problems. There is a lot of logic that hasn’t been built here (and it’s good to remember that this project was built at night over a week).

Some of those things, though, and in particular the distributed computing bits, we may get for free.

Distributed computing

So, during the day I work on Dask, a Python library for parallel and distributed computing. The core task schedulers within Dask are more than capable of running these kinds of real-time computations. They handle far more complex real-time systems every day including few-millisecond latencies, node failures, asynchronous computation, etc.. People use these features today inside companies, but they tend to roll their own system rather than use a high-level API (indeed, they chose Dask because their system was complex enough or private enough that rolling their own was a necessity). Dask lacks any kind of high-level streaming API today.

Fortunately, the system we described above can be modified fairly easily to use a Dask Client to submit functions rather than run them locally.

from dask.distributed import Client
client = Client()       # start Dask in the background

(source.to_dask()
       .scatter()        # send data to a cluster
       .map(...)         # this happens on the cluster
       .accumulate(...)  # this happens on the cluster
       .gather()         # gather results back to local machine
       .sink(...))       # This happens locally

Other things that this doesn’t do, but could with modest effort

There are a variety of ways that we could improve this with modest cost:

  1. Streams of sequences: We can be more efficient if we pass not individual elements through a Stream, but rather lists of elements. This will let us lose the microseconds of overhead that we have now per element and let us operate at pure Python (100ns) speeds.
  2. Streams of NumPy arrays / Pandas dataframes: Rather than pass individual records we might pass bits of Pandas dataframes through the stream. So for example rather than filtering elements we would filter out rows of the dataframe. Rather than compute at Python speeds we can compute at C speeds. We’ve built a lot of this logic before for dask.dataframe. Doing this again is straightforward but somewhat time consuming.
  3. Annotate elements: we want to pass through event time, processing time, and presumably other metadata
  4. Convenient Data IO utilities: We would need some convenient way to move data in and out of Kafka and other common continuous data streams.

None of these things are hard. Many of them are afternoon or weekend projects if anyone wants to pitch in.

Reasons I like this project

This was originally built strictly for educational purposes. I (and hopefully you) now know a bit more about streaming systems, so I’m calling it a success. It wasn’t designed to compete with existing streaming systems, but still there are some aspects of it that I like quite a bit and want to highlight.

  1. Lightweight setup: You can import it and go without setting up any infrastructure. It can run (in a limited way) on a Dask cluster or on an event loop, but it’s also fully operational in your local Python thread. There is no magic in the common case. Everything up until time-handling runs with tools that you learn in an introductory programming class.
  2. Small and maintainable: The codebase is currently a few hundred lines. It is also, I claim, easy for other people to understand. Here is the code for filter:

    class filter(Stream):
        def __init__(self, predicate, child):
            self.predicate = predicate
            Stream.__init__(self, child)
    
        def update(self, x, who=None):
            if self.predicate(x):
                return self.emit(x)
    
  3. Composable with Dask: Handling distributed computing is tricky to do well. Fortunately this project can offload much of that worry to Dask. The dividing line between the two systems is pretty clear and, I think, could lead to a decently powerful and maintainable system if we spend time here.
  4. Low performance overhead: Because this project is so simple it has overheads in the few-microseconds range when in a single process.
  5. Pythonic: All other streaming systems were originally designed for Java/Scala engineers. While they have APIs that are clearly well thought through they are sometimes not ideal for Python users or common Python applications.

Future Work

This project needs both users and developers.

I find it fun and satisfying to work on and so encourage others to play around. The codebase is short and, I think, easily digestible in an hour or two.

This project was built without a real use case (see the project’s examples directory for a basic Daskified web crawler). It could use patient users with real-world use cases to test-drive things and hopefully provide PRs adding necessary features.

I genuinely don’t know if this project is worth pursuing. This blogpost is a test to see if people have sufficient interest to use and contribute to such a library or if the best solution is to carry on with any of the fine solutions that already exist.

pip install git+https://github.com/mrocklin/streams

April 13, 2017 12:00 AM

April 11, 2017

Enthought

Webinar- Get More From Your Core: Applying Artificial Intelligence to CT, Photo, and Well Log Analysis with Virtual Core


When: Tues, April 25, 2017, 11-11:45 AM CT
Where: Live webcast (or register for a recording)
What: Presentation, demo, and Q&A with Brendon Hall, Enthought Geosciences Product Manager and Application Engineer

Who should attend:

  • Oil and gas industry professionals who are looking for ways to extract more value from expensive science wells
  • Those interested in learning how artificial intelligence and machine learning techniques can be applied to core analysis

Register  If you can’t attend, register and we’ll send you a recording after the session


Geoscientists and petroleum engineers rely on accurate core measurements to characterize reservoirs, develop drilling plans and de-risk play assessments. Whole-core CT scans are now routinely performed on extracted well cores; however, the data produced from these scans are difficult to visualize and integrate with other measurements.

Virtual Core automates aspects of core description for geologists, drastically reducing the time and effort required for core description, and its unified visualization interface displays cleansed whole-core CT data alongside core photographs and well logs. It provides tools for geoscientists to analyze core data and extract features from sub-millimeter scale to the entire core.

In this webinar and demo, we’ll start by introducing the Clear Core processing pipeline, which automatically removes unwanted artifacts (such as tubing) from the CT image. We’ll then show how the machine learning capabilities in Virtual Core can be used to describe the core, extracting features such as bedding planes and dip angle. Finally, we’ll show how the data can be viewed and analyzed alongside other core data, such as photographs, wellbore images, well logs, plug measurements, and more.

What You’ll Learn:

  • How core CT data, photographs, well logs, borehole images, and more can be integrated into a digital core workflow
  • How digital core data can shorten core description timelines and deliver business results faster
  • How new features can be extracted from digital core data using artificial intelligence
  • Novel workflows that leverage these features, such as identifying parasequences and strategies for determining net pay

Register  If you can’t attend, register and we’ll send you a recording after the session

Presenter:

Brendon Hall, Enthought
Geoscience Product Manager and Application Engineer


The post Webinar- Get More From Your Core: Applying Artificial Intelligence to CT, Photo, and Well Log Analysis with Virtual Core appeared first on Enthought Blog.

by Brendon Hall at April 11, 2017 01:00 PM

Matthieu Brucher

Audio Toolkit: Create a FIR Filter from a Template (EQ module)

Last week, I published a post on adaptive filtering. It was long overdue, but I actually had one other project on hold for even longer: allowing a user to specify a filter template and let Audio Toolkit figure out a FIR filter from this template.

Remez/Parks & McClellan algorithm

The most famous algorithm is the Remez/Parks & McClellan algorithm. In Matlab, it’s called remez, but Remez is actually a more generic algorithm than just FIR determination.

The algorithm starts by selecting a few random points on the template where the user set non-zero weights. The zero-weight sections are usually the transition zones, which means the filter can roam free there; you usually don’t want them to be too wide, especially in bandpass filters. As the resulting filter has ripples, you can select a weight for each band in the template: where the ripples should be small, use a big weight; where they don’t matter, use a small one.

Then, the Remez algorithm is all about moving these points to the maximum of the difference between the template and the actual filter. At the end, the result is an optimal filter around the given template, for a given order.

The quality of the result often rests on the selection of the starting points. If all the starting points fall in only one band, then the determination of the filter is wrong. As such, Audio Toolkit selects equidistant points so that all bands are covered. Of course, if one band is too small, the determination will still fail.
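
If you want to experiment with the same template-plus-weights idea outside of Audio Toolkit, SciPy ships a Parks & McClellan implementation. Here is a minimal sketch; the band edges, weights and number of taps are arbitrary examples, not the values used in the demo below:

import numpy as np
from scipy import signal

numtaps = 101   # filter length (order + 1)

# Template, with frequencies normalized to a sampling rate of 1:
# a passband up to 0.125, an unweighted transition zone up to 0.2,
# and a stopband up to Nyquist (0.5).
bands = [0, 0.125, 0.2, 0.5]
desired = [1, 0]    # desired gain in each weighted band
weight = [10, 1]    # big weight where the ripples must stay small

fir = signal.remez(numtaps, bands, desired, weight=weight)

# Compare the achieved transfer function against the template
w, h = signal.freqz(fir, worN=2048)
passband = w <= 2 * np.pi * 0.125
print(np.abs(np.abs(h[passband]) - 1).max())   # passband ripple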

Demo

There are many good papers on the Remez algorithm for FIR determination so I won’t take the time to rehash something that lots of people did far better than I could. But I’ll try to explain how it goes on a simple example, with the Python script that was used as the reference test case for the development of the plugin.

Instead of using the equidistant start, I used the set of starting indices from the reference paper (and the same template). As such, the indices are:

[51 101 341 361 531 671 701 851]

After the optimization, we get the following error function:
Remez Iteration 1

The maximum error is 0.0325 in that case. The algorithm then selects new indices for the next iteration, at the minima and maxima of the current error function:

[  0 151 296 409 512 595 744 877]

From these indices, we compute the optimal parameters again and then get a new error function (notice that the highlighted points correspond to the previous min/max)

Remez Iteration 2

The maximum error is now 0.162. And we start the selection process again:

[  0 163 320 409 512 579 718 874]

Once again, we get a new error function:

Remez Iteration 3

The max error is a little bit bigger and is now 0.169. We select new indices:

[  0 163 320 409 512 579 718 874]

The indices are identical, so at the next iteration the search stops.

The resulting filter has the following transfer function (the template is in red)

Estimated filter against template

Conclusion

There is finally a way of designing filters in Audio Toolkit that doesn’t require you to go to Matlab or Python. This can be quite handy for designing a linear-phase filter on the fly in a plugin. There is probably more work to be done in terms of optimization, but the processing part itself is already fully optimized.

Buy Me a Coffee!

by Matt at April 11, 2017 07:51 AM

April 10, 2017

Continuum Analytics news

Open Data Science is a Team Sport

Monday, April 10, 2017
Stephen Kearns
Continuum Analytics

As every March Madness fan knows, athletic talent and coaching are key, but it’s how they come together as a unit that determines a team’s success. Known for its drama-ridden storylines and endless buzzer beaters, the NCAA’s college basketball championship tournaments (both men’s and women’s) showcase the power of teamwork and dedication. Basketball is a team sport, where the interrelationships of complementary player skills often dictate the game’s winner. Everyone must focus and work together for the common good of the team. These same principles hold true for data science teams.

Much like basketball, data science requires a team of players in different positions, including business analysts, data scientists, data engineers, DevOps engineers, and more. However, too many data scientists still function in silos, each working with his/her own tools to manage data sets. Working individually doesn’t work on the court and it won’t work in data science. Data scientists, and their data science equipment, must function together to work as a team. That’s where Open Data Science comes in.

With Open Data Science, team members have their positions, but are able to move around the court and play with flexibility, like a basketball team. Everyone can also score in basketball, and similarly with Open Data Science, all team members, from data engineers to domain experts, are encouraged to contribute wherever their skills intersect with the goals of the project. In fact, in our recent survey of company decision leaders and data scientists, 69 percent of respondents associated “Open Data Science” with collaboration. No longer just a one-person job, data science is a team sport.

Open Data Science is an inclusive movement that not only encourages data scientists to function as a cohesive unit, but also embraces open source tools, so they can work together more easily in a connected ecosystem. Instead of pigeonholing data scientists into using a single language or set of tools, Open Data Science facilitates collaboration and enables data science teams to reap the benefits of all available technologies. Open Data Science brings innovation from every community together, making the latest information readily available to all. Collaboration helps enterprises harness their data faster and extract more value—so, don’t drop the ball with your organization’s data science strategy. Make it a true team effort.

Learn more about the Five Dysfunctions of a Data Science team in slides from my latest webinar below, or download the slides here.

 

The Five Dysfunctions of a Data Science Team from Continuum Analytics

by swebster at April 10, 2017 03:35 PM

April 07, 2017

Paul Ivanov

March 29th, 2017

What's missing -- feels like there's something missing --
The capacity is there -- the job's not stressful but
I somehow fail at the ignition stage - all this
fuel just sitting around -- un-utilized potential
How do I light that fire? Set it ablaze
in a daze caught up in the haze of comfort
I need to challenge myself, raising tides lift
all boats, but they also drown 
livestock
cows, horses, and goats, seeking refuge in hills
that once covered in grass now fill up like
lifeboats. 
Doctors in white coats 
say "Keep your spirits up" -- hope floats.

by Paul Ivanov at April 07, 2017 07:00 AM

April 04, 2017

Enthought

Enthought Presents the Canopy Platform at the 2017 American Institute of Chemical Engineers (AIChE) Spring Meeting

by: Tim Diller, Product Manager and Scientific Software Developer, Enthought

Last week I attended the AIChE (American Institute of Chemical Engineers) Spring Meeting in San Antonio, Texas. It was a great time of year to visit this cultural gem deep in the heart of Texas (and just down the road from our Austin offices), with plenty of good food, sights and sounds to take in on top of the conference and its sessions.

The AIChE Spring Meeting focuses on applications of chemical engineering in industry, and Enthought was invited to present a poster and deliver a “vendor perspective” talk on the Canopy Platform for Process Monitoring and Optimization as part of the “Big Data Analytics” track. This was my first time at AIChE, so some of the names were new, but in a lot of ways it felt very similar to many other engineering conferences I have participated in over the years (for instance, ASME (American Society of Mechanical Engineers), SAE (Society of Automotive Engineers), etc.).

This event underscored that regardless of industry, engineers are bringing the same kinds of practical ingenuity to bear on similar kinds of problems, and with the cost of data acquisition and storage plummeting in the last decade, many engineers are now sitting on more data than they know how to effectively handle.

What exactly is “big data”? Does it really matter for solving hard engineering problems?

One theme that came up time and again in the “Big Data Analytics” sessions Enthought participated in was what exactly “big data” is. In many circles, a good working definition of what makes data “big” is that it exceeds the size of the physical RAM on the machine doing the computation, so that something other than simply loading the data into memory has to be done to make meaningful computations, and thus a working definition of some tens of GB delimits “big” data from “small”.

For others, and many at the conference indeed, a more mundane definition of “big” means that the data set doesn’t fit within the row or column limits of a Microsoft Excel Worksheet.

But the question of whether your data is “big” is really a moot one as far as we at Enthought are concerned; really, being “big” just adds complexity to an already hard problem, and the kind of complexity is an implementation detail dependent on the details of the problem at hand.

And that relates to the central message of my talk, which was that an analytics platform (in this case I was talking about our Canopy Platform) should abstract away the tedious complexities, and help an expert get to the heart of the hard problem at hand.

At AIChE, the “hard problems” at hand seemed invariably to involve one or both of two things: (1) increasing safety/reliability, and (2) increasing plant output.

To solve these problems, two general kinds of activity were on display: different pattern recognition algorithms and tools, and modeling, typically through some kind of regression-based approach. Both of these things are straightforward in the Canopy Platform.

The Canopy Platform is a collection of related technologies that work together in an integrated way to support the scientist/analyst/engineer.

What is the Canopy Platform?

If you’re using Python for science or engineering, you have probably used or heard of Canopy, Enthought’s Python-based data analytics application offering an integrated code editor and interactive command prompt, package manager, documentation browser, debugger, variable browser, data import tool, and lots of hidden features like support for many kinds of proxy systems that work behind the scenes to make a seamless work environment in enterprise settings.

However, this is just one part of the Canopy Platform. Over the years, Enthought has been building other components and related technologies that work together in an integrated way to support the engineer/analyst/scientist solving hard problems.

At the center of this is the Enthought Python Distribution, with runtime interpreters for Python 2.7 and 3.x and over 450 pre-built Python packages for scientific computing, including tools for machine learning and the kind of regression modeling that was shown in some of the other presentations in the Big Data sessions. Other components of the Canopy Platform include interface modules for Excel (PyXLL) and for National Instruments’ LabView software (Python Integration Toolkit for LabVIEW), among others.

A key component of our Canopy Platform is our Deployment Server, which simplifies the tricky tasks of deploying proprietary applications and packages or creating customized, reproducible Python environments inside an organization, especially behind a firewall or an air-gapped network.

Finally, (and this is what we were really showing off at the AIChE Big Data Analytics session) there are the Data Catalog and the Cloud Compute layers within the Canopy Platform.

The Data Catalog provides an indexed interface to potentially heterogeneous data sources, making them available for search and query based on various kinds of metadata.

The Data Catalog provides an indexed interface to potentially heterogeneous data sources. These can range from a simple network directory with a collection of HDF5 files to a server hosting files with the Byzantine complexity of the IRIG 106 Ch. 10 Digital Recorder Standard used by US military test flight ranges. The nice thing about the Data Catalog is that it lets you query and select data based on computed metadata, for example “factory A, on Tuesdays when Ethylene output was below 10kg/hr”, or in a test flight data example “test flights involving a T-38 that exceeded 10,000 ft but stayed subsonic.”

With the Cloud Compute layer, an expert user can write code and test it locally on some subset of data from the Data Catalog. Then, when it is working to satisfaction, he or she can publish the code as a computational kernel to run on some other, larger subset of the data in the Data Catalog, using remote compute resources, which might be an HPC cluster or an Apache Spark server. That kernel is then available to other users in the organization, who do not have to understand the algorithm to run it on other data queries.

In the demo below, I showed hooking up the Data Catalog to some historical factory data stored on a remote machine.

Data Catalog View: The Data Catalog allows selection of subsets of the data set for inspection and ad hoc analysis. Here, three channels are compared using a time window set on the time series data shown on the top plot.

Then using a locally tested and developed compute kernel, I did a principal component analysis on the frequencies of the channel data for a subset of the data in the Data Catalog. Then I published the kernel and ran it on the entire data set using the remote compute resource.

After the compute kernel has been published and run on the entire data set, then the result explorer tool enables further interactions.

Ultimately, the Canopy Platform is for building and distributing applications that solve hard problems.  Some of the products we have built on the platform are available today (for instance, Canopy Geoscience and Virtual Core), others are in prototype stage or have been developed for other companies with proprietary components and are not publicly available.

It was exciting to participate in the Big Data Analytics track this year, to see what others are doing in this area, and to be a part of many interesting and fruitful discussions. Thanks to Ivan Castillo and Chris Reed at Dow for arranging our participation.

The post Enthought Presents the Canopy Platform at the 2017 American Institute of Chemical Engineers (AIChE) Spring Meeting appeared first on Enthought Blog.

by Tim Diller at April 04, 2017 08:13 PM

Matthieu Brucher

Announcement: ATKChorus 1.1.0 and ATKUniversalVariableDelay 1.1.0

I’m happy to announce the update of the chorus and the universal variable delay based on the Audio Toolkit. They are available on Windows and OS X (min. 10.11) in different formats.

This release fixes the noises that can arise in some configurations.

ATKChorus
ATKUniversalVariableDelay

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKChorus.
Direct link for ATKUniversalVariableDelay.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

Buy Me a Coffee!

by Matt at April 04, 2017 07:17 AM

April 03, 2017

numfocus

IBM Brings Jupyter and Spark to the Mainframe

NumFOCUS Platinum Sponsor IBM has been doing wonderful work to support one of our fiscally sponsored projects, Project Jupyter. Brian Granger over at the Jupyter Blog has the details… “For the past few years, Project Jupyter has been collaborating with IBM on a number of initiatives. Much of this work has happened in the Jupyter Incubation Program, […]

by NumFOCUS Staff at April 03, 2017 12:00 AM

March 30, 2017

Continuum Analytics news

Anaconda Leader to Speak at TDWI Accelerate Boston

Thursday, March 30, 2017

Chief Data Scientist and Co-Founder Travis Oliphant to Discuss the Power of the Python Ecosystem and Open Data Science

BOSTON, Mass.—March 30, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that Travis Oliphant, chief data scientist and co-founder, will be speaking at TDWI Accelerate Boston on April 4 at 1:30pm EST. As one of the leading conferences on Big Data and data science, Accelerate brings together the brightest and best data minds in the industry to discuss the future of data science and analytics.

In his session, titled “How to Leverage Python, the Fastest-Growing Open Source Tool for Data Scientists,” Oliphant will highlight the power of the Python ecosystem and Open Data Science. With thirteen million downloads and counting, Anaconda remains the most popular Python distribution. Oliphant will specifically address the burgeoning ecosystem forming around the Python and Anaconda combination, including new offerings such as Anaconda Cloud and conda-forge.

WHO: Travis Oliphant, chief data scientist and co-founder, Anaconda Powered By Continuum Analytics
WHAT: How to Leverage Python, the Fastest-Growing Open Source Tool for Data Scientists
WHEN: April 4, 1:30-2:10 p.m. EST
WHERE: The Boston Marriott, Copley Place, 110 Huntington Ave, Boston, MA 02116
REGISTER: HERE

###

About Anaconda Powered by Continuum Analytics

Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io.

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

by swebster at March 30, 2017 01:06 PM

March 28, 2017

Mark Fenner

DC SVD II: From Values to Vectors

In our last installment, we discussed solutions to the secular equation. These solutions are the eigenvalues (and/or) singular values of matrices with a particular form. Since this post is otherwise light on technical content, I’ll dive into those matrix forms now. Setting up the Secular Equation to Solve the Eigen and Singular Problems In dividing-and-conquering […]

by Mark Fenner at March 28, 2017 02:11 PM

Matthieu Brucher

Audio Toolkit: Recursive Least Square Filter (Adaptive module)

I started working on adaptive filtering a long time ago, but I could never figure out why my simple implementation of the RLS algorithm failed. Well, there was a typo in the reference book!

Now that this is fixed, let’s see what this guy does.

Algorithm

The RLS algorithm learns an input signal based on its past and predicts new values from it. As such, it can be used to learn periodic signals, but also noise. The basis is to predict a new value from the past samples, compare it to the actual value and update the set of coefficients. The update itself is governed by a memory time constant: the higher the value, the slower the update.

Once the filter has learned enough, the learning stage can be shut off, and the filter can be used to select frequencies.
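
To make the update rule concrete, here is a minimal NumPy sketch of RLS prediction on a 1-D signal. This illustrates the textbook algorithm, not Audio Toolkit's implementation; the variable names, order and initialization constant are arbitrary examples:

import numpy as np

def rls_predict(x, order=10, memory=0.99, delta=1e3):
    """Predict each sample of x from its `order` previous samples with RLS."""
    w = np.zeros(order)            # adaptive coefficients
    P = delta * np.eye(order)      # estimate of the inverse correlation matrix
    y = np.zeros_like(x)
    for n in range(order, len(x)):
        u = x[n - order:n][::-1]   # most recent samples first
        y[n] = w @ u               # prediction of x[n]
        e = x[n] - y[n]            # prediction error
        k = P @ u / (memory + u @ P @ u)   # gain vector
        w = w + k * e              # coefficient update
        P = (P - np.outer(k, u @ P)) / memory
    return y

# Learn a simple sinusoid, as in the first figure below
t = np.arange(2000)
x = np.sin(2 * np.pi * 0.01 * t)
y = rls_predict(x)
print(np.abs(x[-100:] - y[-100:]).max())   # small once the filter has converged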

Results

Let’s start with a simple sinusoidal signal, and see if an order 10 can be used to learn it:

Sinusoidal signal learnt with RLS

As can be seen, at the beginning the filter is still learning, so it doesn’t match the input. After a short time, it does match (zooming in on the signal shows that there is a latency and that the amplitudes do not match exactly).

Let’s see how it does for more complex signals by adding two additional, slightly out-of-tune sinusoids:

Three out-of-tune sinusoids learnt with RLS

Once again, after a short time, the learning phase is stable, and we can switch it off and the signal is estimated properly.

Let’s try now something a little bit more complex, and try to denoise an input signal.
Filtered noise

The original noise in blue is estimated in green, and the remaining noise is in red. Obviously, we don’t do a great job here, but let’s see what is actually attenuated:
Filtered noise in the spectral domain

So the middle of the bandwidth is better attenuated than the sides, which is expected in a way.

Now, what does that do to a signal we try to denoise?
Denoised signal

Obviously, the signal is denoised, but also increased! And the same happens in the spectral domain.
Denoised signal in the spectral domain

When looking at the estimated function, the picture is a little bit clearer:
Estimated spectral transfer function

Our noise is actually between 0.6 and 1.2 rad/s (from sampling frequency/10 to sampling frequency/5), and the RLS filter underestimates these a little bit but doesn’t cut the high frequencies, which can lead to ringing…

Also, learning the noise is quite costly:
Learning cost

Learning was only activated during half the total processing time…

Conclusion

RLS filters are interesting for following a signal. Obviously this filter is just the start of this new module, and I hope I’ll have real denoising filters at some point.

This filter will be available in ATK 2.0.0 and is already in the develop branch with the Python example scripts.

Buy Me a Coffee!

by Matt at March 28, 2017 07:34 AM

Matthew Rocklin

Dask and Pandas and XGBoost

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

Summary

This post talks about distributing Pandas Dataframes with Dask and then handing them over to distributed XGBoost for training.

More generally it discusses the value of launching multiple distributed systems in the same shared-memory processes and smoothly handing data back and forth between them.

Introduction

XGBoost is a well-loved library for a popular class of machine learning algorithms, gradient boosted trees. It is used widely in business and is one of the most popular solutions in Kaggle competitions. For larger datasets or faster training, XGBoost also comes with its own distributed computing system that lets it scale to multiple machines on a cluster. Fantastic. Distributed gradient boosted trees are in high demand.

However before we can use distributed XGBoost we need to do three things:

  1. Prepare and clean our possibly large data, probably with a lot of Pandas wrangling
  2. Set up XGBoost master and workers
  3. Hand our cleaned data from a bunch of distributed Pandas dataframes to XGBoost workers across our cluster

This ends up being surprisingly easy. This blogpost gives a quick example using Dask.dataframe to do distributed Pandas data wrangling, then using a new dask-xgboost package to setup an XGBoost cluster inside the Dask cluster and perform the handoff.

After this example we’ll talk about the general design and what this means for other distributed systems.

Example

We have a ten-node cluster with eight cores each (m4.2xlarges on EC2)

import dask
from dask.distributed import Client, progress

>>> client = Client('172.31.33.0:8786')
>>> client.restart()
<Client: scheduler='tcp://172.31.33.0:8786' processes=10 cores=80>

We load the Airlines dataset using dask.dataframe (just a bunch of Pandas dataframes spread across a cluster) and do a bit of preprocessing:

import dask.dataframe as dd

# Subset of the columns to use
cols = ['Year', 'Month', 'DayOfWeek', 'Distance',
        'DepDelay', 'CRSDepTime', 'UniqueCarrier', 'Origin', 'Dest']

# Create the dataframe
df = dd.read_csv('s3://dask-data/airline-data/20*.csv', usecols=cols,
                  storage_options={'anon': True})

df = df.sample(frac=0.2) # XGBoost requires a bit of RAM, we need a larger cluster

is_delayed = (df.DepDelay.fillna(16) > 15)  # column of labels
del df['DepDelay']  # Remove delay information from training dataframe

df['CRSDepTime'] = df['CRSDepTime'].clip(upper=2399)

df, is_delayed = dask.persist(df, is_delayed)  # start work in the background

This loaded a few hundred pandas dataframes from CSV data on S3. We then had to downsample because the way we are going to use XGBoost below seems to require a lot of RAM. I am not an XGBoost expert. Please forgive my ignorance here. At the end we have two dataframes:

  • df: Data from which we will learn if flights are delayed
  • is_delayed: Whether or not those flights were delayed.

Data scientists familiar with Pandas will probably be familiar with the code above. Dask.dataframe is very similar to Pandas, but operates on a cluster.

>>> df.head()
Year Month DayOfWeek CRSDepTime UniqueCarrier Origin Dest Distance
182193 2000 1 2 800 WN LAX OAK 337
83424 2000 1 6 1650 DL SJC SLC 585
346781 2000 1 5 1140 AA ORD LAX 1745
375935 2000 1 2 1940 DL PHL ATL 665
309373 2000 1 4 1028 CO MCI IAH 643
>>> is_delayed.head()
182193    False
83424     False
346781    False
375935    False
309373    False
Name: DepDelay, dtype: bool

Categorize and One Hot Encode

XGBoost doesn’t want to work with text data like destination=”LAX”. Instead we create new indicator columns for each of the known airports and carriers. This expands our data into many boolean columns. Fortunately Dask.dataframe has convenience functions for all of this baked in (thank you Pandas!)

>>> df2 = dd.get_dummies(df.categorize()).persist()

This expands our data out considerably, but makes it easier to train on.

>>> len(df2.columns)
685

Split and Train

Great, now we’re ready to split our distributed dataframes

data_train, data_test = df2.random_split([0.9, 0.1],
                                         random_state=1234)
labels_train, labels_test = is_delayed.random_split([0.9, 0.1],
                                                    random_state=1234)

Start up a distributed XGBoost instance, and train on this data

%%time
import dask_xgboost as dxgb

params = {'objective': 'binary:logistic', 'nround': 1000,
          'max_depth': 16, 'eta': 0.01, 'subsample': 0.5,
          'min_child_weight': 1, 'tree_method': 'hist',
          'grow_policy': 'lossguide'}

bst = dxgb.train(client, params, data_train, labels_train)

CPU times: user 355 ms, sys: 29.7 ms, total: 385 ms
Wall time: 54.5 s

Great, so we were able to train an XGBoost model on this data in about a minute using our ten machines. What we get back is just a plain XGBoost Booster object.

>>> bst
<xgboost.core.Booster at 0x7fa1c18c4c18>

We could use this on normal Pandas data locally

import xgboost as xgb
pandas_df = data_test.head()
dtest = xgb.DMatrix(pandas_df)

>>> bst.predict(dtest)
array([ 0.464578  ,  0.46631625,  0.47434333,  0.47245741,  0.46194169], dtype=float32)

Or we can use dask-xgboost again to predict on our distributed holdout data, getting back another Dask series.

>>> predictions = dxgb.predict(client, bst, data_test).persist()
>>> predictions
Dask Series Structure:
npartitions=93
None    float32
None        ...
         ...
None        ...
None        ...
Name: predictions, dtype: float32
Dask Name: _predict_part, 93 tasks

Evaluate

We can bring these predictions to the local process and use normal Scikit-learn operations to evaluate the results.

>>> from sklearn.metrics import roc_auc_score, roc_curve
>>> print(roc_auc_score(labels_test.compute(),
...                     predictions.compute()))
0.654800768411
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(labels_test.compute(), predictions.compute())
# Taken from http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
plt.figure(figsize=(8, 8))
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

We might want to play with our parameters above or try different data to improve our solution. The point here isn’t that we predicted airline delays well, it was that if you are a data scientist who knows Pandas and XGBoost, everything we did above seemed pretty familiar. There wasn’t a whole lot of new material in the example above. We’re using the same tools as before, just at a larger scale.

Analysis

OK, now that we’ve demonstrated that this works, let’s talk a bit about what just happened and what that means generally for cooperation between distributed services.

What dask-xgboost does

The dask-xgboost project is pretty small and pretty simple (200 TLOC). Given a Dask cluster of one central scheduler and several distributed workers it starts up an XGBoost scheduler in the same process running the Dask scheduler and starts up an XGBoost worker within each of the Dask workers. They share the same physical processes and memory spaces. Dask was built to support this kind of situation, so this is relatively easy.

Then we ask the Dask.dataframe to fully materialize in RAM and we ask where all of the constituent Pandas dataframes live. We tell each Dask worker to give all of the Pandas dataframes that it has to its local XGBoost worker and then just let XGBoost do its thing. Dask doesn’t power XGBoost, it just sets it up, gives it data, and lets it do its work in the background.

People often ask what machine learning capabilities Dask provides, how they compare with other distributed machine learning libraries like H2O or Spark’s MLLib. For gradient boosted trees the 200-line dask-xgboost package is the answer. Dask has no need to make such an algorithm because XGBoost already exists, works well and provides Dask users with a fully featured and efficient solution.

Because both Dask and XGBoost can live in the same Python process they can share bytes between each other without cost, can monitor each other, etc. These two distributed systems co-exist together in multiple processes in the same way that NumPy and Pandas operate together within a single process. Sharing distributed processes with multiple systems can be really beneficial if you want to use multiple specialized services easily and avoid large monolithic frameworks.

Connecting to Other distributed systems

A while ago I wrote a similar blogpost about hosting TensorFlow from Dask in exactly the same way that we’ve done here. It was similarly easy to set up TensorFlow alongside Dask, feed it data, and let TensorFlow do its thing.

Generally speaking this “serve other libraries” approach is how Dask operates when possible. We’re only able to cover the breadth of functionality that we do today because we lean heavily on the existing open source ecosystem. Dask.arrays use Numpy arrays, Dask.dataframes use Pandas, and now the answer to gradient boosted trees with Dask is just to make it really really easy to use distributed XGBoost. Ta da! We get a fully featured solution that is maintained by other devoted developers, and the entire connection process was done over a weekend (see dmlc/xgboost #2032 for details).

Since this has come out we’ve had requests to support other distributed systems like Elemental and to do general hand-offs to MPI computations. If we’re able to start both systems with the same set of processes then all of this is pretty doable. Many of the challenges of inter-system collaboration go away when you can hand numpy arrays from the workers of one system to the workers of the other system within the same processes.

Acknowledgements

Thanks to Tianqi Chen and Olivier Grisel for their help when building and testing dask-xgboost. Thanks to Will Warner for his help in editing this post.

March 28, 2017 12:00 AM

March 23, 2017

Continuum Analytics news

The Conda Configuration Engine for Power Users

Tuesday, April 4, 2017
Kale Franz
Continuum Analytics

Released last fall, conda 4.2 brought with it configuration superpowers. The capabilities are extensive, and they're designed with conda power users, devops engineers, and sysadmins in mind.

Configuration information comes from four basic sources:

  1. hard-coded defaults,
  2. configuration files,
  3. environment variables, and
  4. command-line arguments.

Each time a conda process initializes, an operating context is built that in a cascading fashion merges configuration sources. Command-line arguments hold the highest precedence, and hard-coded defaults the lowest.

The configuration file search path has been dramatically expanded. In order from lowest to highest priority, and directly from the conda code,

SEARCH_PATH = ( 
    '/etc/conda/.condarc', 
    '/etc/conda/condarc', 
    '/etc/conda/condarc.d/', 
    '/var/lib/conda/.condarc', 
    '/var/lib/conda/condarc', 
    '/var/lib/conda/condarc.d/', 
    '$CONDA_ROOT/.condarc', 
    '$CONDA_ROOT/condarc', 
    '$CONDA_ROOT/condarc.d/', 
    '~/.conda/.condarc', 
    '~/.conda/condarc', 
    '~/.conda/condarc.d/', 
    '~/.condarc', 
    '$CONDA_PREFIX/.condarc', 
    '$CONDA_PREFIX/condarc', 
    '$CONDA_PREFIX/condarc.d/', 
    '$CONDARC', 
)

where environment variables and user home directory are expanded on first use. $CONDA_ROOT is automatically set to the root environment prefix (and shouldn't be set by users), and $CONDA_PREFIX is automatically set for activated conda environments. Thus, conda environments can have their own individualized and customized configurations. For the ".d" directories in the search path, conda will read in sorted order any (and only) files ending with .yml or .yaml extensions. The $CONDARC environment variable can be any path to a file having a .yml or .yaml extension, or containing "condarc" in the file name; it can also be a directory.

Environment variables hold second-highest precedence, and all configuration parameters are able to be specified as environment variables. To convert from the condarc file-based configuration parameter name to the environment variable parameter name, make the name all uppercase and prepend CONDA_. For example, conda's always_yes configuration parameter can be specified using a CONDA_ALWAYS_YES environment variable.

Configuration parameters in some cases have aliases. For example, setting always_yes: true or yes: true in a configuration file is equivalent to the command-line flag --yes. They're all also equivalent to both CONDA_ALWAYS_YES=true and CONDA_YES=true environment variables. A validation error is thrown if multiple parameters aliased to each other are specified within a single configuration source.

There are three basic configuration parameter types: primitive, map, and sequence. Each follow a slightly different set of merge rules.

The primitive configuration parameter is the easiest to merge. Within the linearized chain of information sources, the last source that sets the parameter wins. There is one caveat: if the parameter is trailed by a #!final flag, the merge cascade stops for that parameter. (Indeed, the markup concept is borrowed from the !important rule in CSS.) While still giving end-users extreme flexibility in most cases, we also give sysadmins the ability to lock down as much configuration as needed by making files read-only.
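
For example, a sysadmin could pin a primitive parameter like always_yes in a system-level file so that user-level files can no longer override it (the paths and values here are just an illustration):

# file: /etc/conda/condarc
always_yes: true #!final

# file: ~/.condarc
always_yes: false

The merged value of always_yes stays true; the user-level setting is ignored because the system-level file marked the parameter as final.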

Map configuration parameters have elements that are key-value pairs. Merges are at the per-key level. Given two files with the contents

# file: /etc/conda/condarc.d/proxies.yml
proxy_servers: 
  https: http://prod-proxy

# file: ~/.conda/condarc.d/proxies.yml
proxy_servers: 
  http: http://dev-proxy:1080 
  https: http://dev-proxy:1081

the merged proxy_servers configuration will be

proxy_servers: 
  http: http://dev-proxy:1080 
  https: http://dev-proxy:1081

However, by modifying the contents of the first file to be

# file: /etc/conda/condarc.d/proxies.yml
proxy_servers:
  https: http://prod-proxy #!final

the merged settings will be

proxy_servers: 
  http: http://dev-proxy:1080 
  https: http://prod-proxy

Note that the !final flag here acts at the per-key level. A !final flag can also be set for the parameter as a whole. With the first file again changed to

# file: /etc/conda/condarc.d/proxies.yml
proxy_servers: #!final 
  https: http://prod-proxy

the merged settings will be

proxy_servers: 
  https: http://prod-proxy

with no http key defined.

The sequence parameter merges are the most involved. Consider contents of the three files

# file: /etc/conda/condarc
channels: 
  - one 
  - two

# file: ~/.condarc
channels: 
  - three 
  - four

# file: $CONDA_PREFIX/.condarc
channels: 
  - five 
  - six

the final merged configuration will be

channels: 
  - five 
  - six 
  - three 
  - four 
  - one 
  - two

Sequence order within each individual configuration source is preserved, while still respecting sources' overall precedence. Just like map parameters, a !final flag can be used for a sequence parameter as a whole. However, the !final flag does not apply to individual elements of sequence parameters, and instead !top and !bottom flags are available. Modifying the sequence example to the following

# file: /etc/conda/condarc
channels: 
  - one #!top 
  - two

# file: ~/.condarc
channels: #!final 
  - three 
  - four #!bottom

# file: $CONDA_PREFIX/.condarc
channels: 
  - five 
  - six

will yield a final merged configuration

channels: 
  - one 
  - three 
  - two 
  - four

Managing all of these new sources of configuration could become difficult without some new tools. The most basic is conda config --validate, which simply exits 0 if conda's configured state passes all validation tests. The command conda config --describe (recently added in 4.3.16) gives a detailed description of available configuration parameters.

We've also added the commands conda config --show-sources and conda config --show. The first displays all of the configuration information conda recognizes--in its non-merged form broken out per source. The second gives the final, merged values for all configuration parameters.

Conda's configuration engine gives power users tools for ultimate control. If you've read to this point, that's probably you. And as a conda power user, please consider participating in the conda canary program. Be on the cutting edge, and also help influence new conda features and behaviors before they're solidified in general availability releases.

by swebster at March 23, 2017 02:13 PM

Matthew Rocklin

Dask Release 0.14.1

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.14.1. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on February 27th.

As always you can conda install from conda-forge

conda install -c conda-forge dask distributed

or you can pip install from PyPI

pip install dask[complete] --upgrade

Arrays

Recent work in distributed computing and machine learning have motivated new performance-oriented and usability changes to how we handle arrays.

Automatic chunking and operation on NumPy arrays

Many interactions between Dask arrays and NumPy arrays work smoothly. NumPy arrays are made lazy and are appropriately chunked to match the operation and the Dask array.

>>> x = np.ones(10)                 # a numpy array
>>> y = da.arange(10, chunks=(5,))  # a dask array
>>> z = x + y                       # combined become a dask.array
>>> z
dask.array<add, shape=(10,), dtype=float64, chunksize=(5,)>

>>> z.compute()
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

Reshape

Reshaping distributed arrays is simple in simple cases, and can be quite complex in complex cases. Reshape now supports a much broader set of shape transformations where any dimension can be collapsed or merged into other dimensions.

>>> x = da.ones((2, 3, 4, 5, 6), chunks=(2, 2, 2, 2, 2))
>>> x.reshape((6, 2, 2, 30, 1))
dask.array<reshape, shape=(6, 2, 2, 30, 1), dtype=float64, chunksize=(3, 1, 2, 6, 1)>

This operation ends up being quite useful in a number of distributed array cases.

Optimize Slicing to Minimize Communication

Dask.array slicing optimizations are now careful to produce graphs that avoid situations that could cause excess inter-worker communication. The details of how they do this are a bit out of scope for a short blogpost, but the history here is interesting.

Historically dask.arrays were used almost exclusively by researchers with large on-disk arrays stored as HDF5 or NetCDF files. These users primarily used the single machine multi-threaded scheduler. We heavily tailored Dask array optimizations to this situation and made that community pretty happy. Now as some of that community switches to cluster computing on larger datasets the optimization goals shift a bit. We have tons of distributed disk bandwidth but really want to avoid communicating large results between workers. Supporting both use cases is possible and I think that we’ve achieved that in this release so far, but it’s starting to require increasing levels of care.

Micro-optimizations

With distributed computing also comes larger graphs and a growing importance of graph-creation overhead. This has been optimized somewhat in this release. We expect this to be a focus going forward.

DataFrames

Set_index

Set_index is smarter in two ways:

  1. If you set_index on a column that happens to be sorted then we’ll identify that and avoid a costly shuffle. This was always possible with the sorted= keyword but users rarely used this feature. Now this is automatic (a small sketch follows this list).
  2. Similarly when setting the index we can look at the size of the data and determine if there are too many or too few partitions and rechunk the data while shuffling. This can significantly improve performance if there are too many partitions (a common case).
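
To illustrate the first point, here is a minimal sketch (the column names and sizes are made up): because the index column below is already sorted, Dask detects that and skips the shuffle.

import pandas as pd
import dask.dataframe as dd

# A frame whose 'ts' column is already sorted (a common case for time series)
pdf = pd.DataFrame({'ts': range(10000), 'value': range(10000)})
df = dd.from_pandas(pdf, npartitions=8)

df2 = df.set_index('ts')   # sortedness is detected, so no costly shuffle is needed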

Shuffle performance

We’ve micro-optimized some parts of dataframe shuffles. Big thanks to the Pandas developers for the help here. This accelerates set_index, joins, groupby-applies, and so on.

Fastparquet

The fastparquet library has seen a lot of use lately and has undergone a number of community bugfixes.

Importantly, Fastparquet now supports Python 2.

We strongly recommend Parquet as the standard data storage format for Dask dataframes (and Pandas DataFrames).

dask/fastparquet #87

Distributed Scheduler

Replay remote exceptions

Debugging is hard in part because exceptions happen on remote machines where normal debugging tools like pdb can’t reach. Previously we were able to bring back the traceback and exception, but you couldn’t dive into the stack trace to investigate what went wrong:

def div(x, y):
    return x / y

>>> future = client.submit(div, 1, 0)
>>> future
<Future: status: error, key: div-4a34907f5384bcf9161498a635311aeb>

>>> future.result()  # getting result re-raises exception locally
<ipython-input-3-398a43a7781e> in div()
      1 def div(x, y):
----> 2     return x / y

ZeroDivisionError: division by zero

Now Dask can bring a failing task and all necessary data back to the local machine and rerun it so that users can leverage the normal Python debugging toolchain.

>>> client.recreate_error_locally(future)
<ipython-input-3-398a43a7781e> in div(x, y)
      1 def div(x, y):
----> 2     return x / y
ZeroDivisionError: division by zero

Now if you’re in IPython or a Jupyter notebook you can use the %debug magic to jump into the stacktrace, investigate local variables, and so on.

In [8]: %debug
> <ipython-input-3-398a43a7781e>(2)div()
      1 def div(x, y):
----> 2     return x / y

ipdb> pp x
1
ipdb> pp y
0

dask/distributed #894

Async/await syntax

Dask.distributed uses Tornado for network communication and Tornado coroutines for concurrency. Normal users rarely interact with Tornado coroutines; they aren’t familiar to most people so we opted instead to copy the concurrent.futures API. However some complex situations are much easier to solve if you know a little bit of async programming.

Fortunately, the Python ecosystem seems to be embracing this change towards native async code with the async/await syntax in Python 3. In an effort to motivate people to learn async programming and to gently nudge them towards Python 3, Dask.distributed now supports async/await in a few cases.

You can wait on a dask Future

async def f():
    future = client.submit(func, *args, **kwargs)
    result = await future

You can put the as_completed iterator into an async for loop

async for future in as_completed(futures):
    result = await future
    ... do stuff with result ...

And, because Tornado supports the await protocols you can also use the existing shadow concurrency API (everything prepended with an underscore) with await. (This was doable before.)

results = client.gather(futures)         # synchronous
...
results = await client._gather(futures)  # asynchronous

If you’re in Python 2 you can always do this with normal yield and the tornado.gen.coroutine decorator.

dask/distributed #952

Inproc transport

In the last release we enabled Dask to communicate over more things than just TCP. In practice this doesn’t come up (TCP is pretty useful). However in this release we now support single-machine “clusters” where the clients, scheduler, and workers are all in the same process and transfer data cost-free over in-memory queues.

This allows the in-memory user community to use some of the more advanced features (asynchronous computation, spill-to-disk support, web-diagnostics) that are only available in the distributed scheduler.

This is on by default if you create a cluster with LocalCluster without using Nanny processes.

>>> from dask.distributed import LocalCluster, Client

>>> cluster = LocalCluster(nanny=False)

>>> client = Client(cluster)

>>> client
<Client: scheduler='inproc://192.168.1.115/8437/1' processes=1 cores=4>

>>> from threading import Lock         # Not serializable
>>> lock = Lock()                      # Won't survive going over a socket
>>> [future] = client.scatter([lock])  # Yet we can send to a worker
>>> future.result()                    # ... and back
<unlocked _thread.lock object at 0x7fb7f12d08a0>

dask/distributed #919

Connection pooling for inter-worker communications

Workers now maintain a pool of sustained connections between each other. This pool is of a fixed size and removes connections with a least-recently-used policy. It avoids re-connection delays when transferring data between workers. In practice this shaves off a millisecond or two from every communication.

This is actually a revival of an old feature that we had turned off last year when it became clear that the performance here wasn’t a problem.

Along with other enhancements, this takes our round-trip latency down to 11ms on my laptop.

In [10]: %%time
    ...: for i in range(1000):
    ...:     future = client.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 4.96 s, sys: 348 ms, total: 5.31 s
Wall time: 11.1 s

There may be room for improvement here though. For comparison here is the same test with the concurrent.futures.ProcessPoolExecutor.

In [14]: e = ProcessPoolExecutor(8)

In [15]: %%time
    ...: for i in range(1000):
    ...:     future = e.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 320 ms, sys: 56 ms, total: 376 ms
Wall time: 442 ms

Also, just to be clear, this measures total roundtrip latency, not overhead. Dask’s distributed scheduler overhead remains in the low hundreds of microseconds.

dask/distributed #935

There has been activity around Dask and machine learning:

  • dask-learn is undergoing some performance enhancements. It turns out that when you offer distributed grid search people quickly want to scale up their computations to hundreds of thousands of trials.
  • dask-glm now has a few decent algorithms for convex optimization. The authors of this wrote a blogpost very recently if you’re interested: Developing Convex Optimization Algorithms in Dask
  • dask-xgboost lets you hand off distributed data in Dask dataframes or arrays and hand it directly to a distributed XGBoost system (that Dask will nicely set up and tear down for you). This was a nice example of easy hand-off between two distributed services running in the same processes.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.0 release on February 27th

  • Antoine Pitrou
  • Brian Martin
  • Elliott Sales de Andrade
  • Erik Welch
  • Francisco de la Peña
  • jakirkham
  • Jim Crist
  • Jitesh Kumar Jha
  • Julien Lhermitte
  • Martin Durant
  • Matthew Rocklin
  • Markus Gonser
  • Talmaj

The following people contributed to the dask/distributed repository since the 1.16.0 release on February 27th

  • Antoine Pitrou
  • Ben Schreck
  • Elliott Sales de Andrade
  • Martin Durant
  • Matthew Rocklin
  • Phil Elson

March 23, 2017 12:00 AM

numfocus

PyData Atlanta Meetup Celebrates 1 Year and over 1,000 members

PyData Atlanta holds a meetup at MailChimp, where Jim Crozier spoke about analyzing NFL data with PySpark.

Atlanta tells a new story about data

by Rob Clewley

In late 2015, the three of us (Tony Fast, Neel Shivdasani, and myself) had been regularly nerding out about data over beers and becoming fast friends. We were […]

by NumFOCUS Staff at March 23, 2017 12:00 AM

March 22, 2017

Matthew Rocklin

Developing Convex Optimization Algorithms in Dask

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

We build distributed optimization algorithms with Dask.  We show both simple examples and also benchmarks from a nascent dask-glm library for generalized linear models.  We also talk about the experience of learning Dask to do this kind of work.

This blogpost is co-authored by Chris White (Capital One) who knows optimization and Matthew Rocklin (Continuum Analytics) who knows distributed computing.

Introduction

Many machine learning and statistics models (such as logistic regression) depend on convex optimization algorithms like Newton’s method, stochastic gradient descent, and others.  These optimization algorithms are both pragmatic (they’re used in many applications) and mathematically interesting.  As a result these algorithms have been the subject of study by researchers and graduate students around the world for years both in academia and in industry.

Things got interesting about five or ten years ago when datasets grew beyond the size of working memory and “Big Data” became a buzzword.  Parallel and distributed solutions for these algorithms have become the norm, and a researcher’s skillset now has to extend beyond linear algebra and optimization theory to include parallel algorithms and possibly even network programming, especially if you want to explore and create more interesting algorithms.

However, relatively few people understand both mathematical optimization theory and the details of distributed systems. Typically algorithmic researchers depend on the APIs of distributed computing libraries like Spark or Flink to implement their algorithms. In this blogpost we explore the extent to which Dask can be helpful in these applications. We approach this from two perspectives:

  1. Algorithmic researcher (Chris): someone who knows optimization and iterative algorithms like Conjugate Gradient, Dual Ascent, or GMRES but isn’t so hot on distributed computing topics like sockets, MPI, load balancing, and so on
  2. Distributed systems developer (Matt): someone who knows how to move bytes around and keep machines busy but doesn’t know the right way to do a line search or handle a poorly conditioned matrix

Prototyping Algorithms in Dask

Given knowledge of algorithms and of NumPy array computing it is easy to write parallel algorithms with Dask. For a range of complicated algorithmic structures we have two straightforward choices:

  1. Use parallel multi-dimensional arrays to construct algorithms from common operations like matrix multiplication, SVD, and so on. This mirrors mathematical algorithms well but lacks some flexibility.
  2. Create algorithms by hand that track operations on individual chunks of in-memory data and dependencies between them. This is very flexible but requires a bit more care.

Coding up either of these options from scratch can be a daunting task, but with Dask it can be as simple as writing NumPy code.

Let’s build up an example of fitting a large linear regression model using both built-in array parallelism and fancier, more customized parallelization features that Dask offers. The dask.array module helps us to easily parallelize standard NumPy functionality using the same syntax – we’ll start there.

Data Creation

Dask has many ways to create dask arrays; to get us started quickly prototyping let’s create some random data in a way that should look familiar to NumPy users.

import dask
import dask.array as da
import numpy as np

from dask.distributed import Client

client = Client()

## create inputs with a bunch of independent normals
beta = np.random.random(100)  # random beta coefficients, no intercept
X = da.random.normal(0, 1, size=(1000000, 100), chunks=(100000, 100))
y = X.dot(beta) + da.random.normal(0, 1, size=1000000, chunks=(100000,))

## make sure all chunks are ~equally sized
X, y = dask.persist(X, y)
client.rebalance([X, y])

Observe that X is a dask array stored in 10 chunks, each of size (100000, 100). Also note that X.dot(beta) runs smoothly for both numpy and dask arrays, so we can write code that basically works in either world.

Caveat: If X is a numpy array and beta is a dask array, X.dot(beta) will output an in-memory numpy array. This is usually not desirable as you want to carefully choose when to load something into memory. One fix is to use multipledispatch to handle odd edge cases; for a starting example, check out the dot code here.

Dask also has convenient visualization features built in that we will leverage; below we visualize our data in its 10 independent chunks:

Create data for dask-glm computations

Array Programming

If you can write iterative array-based algorithms in NumPy, then you can write iterative parallel algorithms in Dask

As we’ve already seen, Dask inherits much of the NumPy API that we are familiar with, so we can write simple NumPy-style iterative optimization algorithms that will leverage the parallelism dask.array has built-in already. For example, if we want to naively fit a linear regression model on the data above, we are trying to solve the following convex optimization problem:

$$\min_{\beta} \; \|y - X\beta\|_2^2$$

Recall that in non-degenerate situations this problem has a closed-form solution that is given by:

$$\beta^* = (X^T X)^{-1} X^T y$$

We can compute $\beta^*$ using the above formula with Dask:

## naive solution
beta_star = da.linalg.solve(X.T.dot(X), X.T.dot(y))

>>> abs(beta_star.compute() - beta).max()
0.0024817567237768179

Sometimes a direct solve is too costly, and we want to solve the above problem using only simple matrix-vector multiplications. To this end, let’s take this one step further and actually implement a gradient descent algorithm which exploits parallel matrix operations. Recall that gradient descent iteratively refines an initial estimate of beta via the update:

$$\beta^{(k+1)} = \beta^{(k)} - \alpha \nabla f\left(\beta^{(k)}\right) = \beta^{(k)} - 2\alpha\, X^T\!\left(X\beta^{(k)} - y\right)$$

where $\alpha$ can be chosen based on a number of different “step-size” rules; for the purposes of exposition, we will stick with a constant step-size:

## quick step-size calculation to guarantee convergence:
## the largest singular value of 2 * X^T X is the Lipschitz constant of the
## gradient, so any constant step-size below its inverse converges
## (X^T X is only 100 x 100, so computing its SVD in memory is cheap)
_, s, _ = np.linalg.svd(2 * X.T.dot(X))
step_size = 1 / s.max() - 1e-8

## define some parameters
max_steps = 100
tol = 1e-8
beta_hat = np.zeros(100) # initial guess

for k in range(max_steps):
    Xbeta = X.dot(beta_hat)
    func = ((y - Xbeta)**2).sum()
    gradient = 2 * X.T.dot(Xbeta - y)

    ## Update
    obeta = beta_hat
    beta_hat = beta_hat - step_size * gradient
    new_func = ((y - X.dot(beta_hat))**2).sum()
    beta_hat, func, new_func = dask.compute(beta_hat, func, new_func)  # <--- Dask code

    ## Check for convergence
    change = np.absolute(beta_hat - obeta).max()

    if change < tol:
        break

>>> abs(beta_hat - beta).max()
0.0024817567259038942

It’s worth noting that almost all of this code is exactly the same as the equivalent NumPy code. Because Dask.array and NumPy share the same API it’s pretty easy for people who are already comfortable with NumPy to get started with distributed algorithms right away. The only thing we had to change was how we produce our original data (da.random.normal instead of np.random.normal) and the call to dask.compute at the end of the update step. The dask.compute call tells Dask to go ahead and actually evaluate everything we’ve told it to do so far (Dask is lazy by default). Otherwise, all of the mathematical operations, matrix multiplies, slicing, and so on are exactly the same as with NumPy, except that Dask.array builds up a chunk-wise parallel computation for us and Dask.distributed can execute that computation in parallel.
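As a small illustration of that laziness (reusing X, y and beta_hat from the loop above):

lazy_loss = ((y - X.dot(beta_hat)) ** 2).sum()  # builds a task graph instantly; no chunks are touched yet
loss = lazy_loss.compute()                      # now the scheduler actually runs the chunked tasks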

To better appreciate all the scheduling that is happening in one update step of the above algorithm, here is a visualization of the computation necessary to compute beta_hat and the new function value new_func:

Gradient descent step Dask graph

Each rectangle is an in-memory chunk of our distributed array and every circle is a numpy function call on those in-memory chunks. The Dask scheduler determines where and when to run all of these computations on our cluster of machines (or just on the cores of our laptop).

Array Programming + dask.delayed

Now that we’ve seen how to use the built-in parallel algorithms offered by Dask.array, let’s go one step further and talk about writing more customized parallel algorithms. Many distributed “consensus”-based algorithms in machine learning are based on the idea that each chunk of data can be processed independently in parallel, with each worker sending its guess for the optimal parameter value to some master node. The master then computes a consensus estimate for the optimal parameters and reports it back to all of the workers. Each worker then processes their chunk of data given this new information, and the process continues until convergence.

From a parallel computing perspective this is a pretty simple map-reduce procedure. Any distributed computing framework should be able to handle this easily. We’ll use this as a very simple example for how to use Dask’s more customizable parallel options.

One such algorithm is the Alternating Direction Method of Multipliers, or ADMM for short. For the sake of this post, we will consider the work done by each worker to be a black box.

We will also be considering a regularized version of the problem above, namely the $l_1$-regularized (lasso) problem:

$$\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
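For reference, here are the textbook consensus-ADMM updates for this problem (this is the standard formulation rather than dask-glm’s exact internals; the loop below mirrors it, with a small extra relaxation step that mixes the new local estimates with the previous consensus):

$$
\begin{aligned}
\beta_i^{k+1} &= \operatorname{arg\,min}_{\beta}\; \|y_i - X_i\beta\|_2^2 + \tfrac{\rho}{2}\,\|\beta - z^k + u_i^k\|_2^2 \\
z^{k+1} &= S_{\lambda/(\rho n)}\!\left(\tfrac{1}{n}\sum_{i=1}^{n}\bigl(\beta_i^{k+1} + u_i^k\bigr)\right) \\
u_i^{k+1} &= u_i^k + \beta_i^{k+1} - z^{k+1}
\end{aligned}
$$

where $n$ is the number of chunks and $S_t$ is the soft-thresholding (shrinkage) operator defined in the code below.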

At the end of the day, all we will do is:

  • create NumPy functions which define how each chunk updates its parameter estimates
  • wrap those functions in dask.delayed
  • call dask.compute and process the individual estimates, again using NumPy (a tiny, generic sketch of this pattern follows this list)
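To make those three steps concrete, here is a tiny, generic dask.delayed sketch (the function local_work and the toy inputs are made up for illustration; they are not part of dask-glm):

import dask
import numpy as np

def local_work(chunk):
    return chunk.sum()            # stand-in for a per-chunk parameter update

chunks = [np.arange(3), np.arange(5)]
lazy = [dask.delayed(local_work)(c) for c in chunks]  # nothing has run yet
results = dask.compute(*lazy)                         # all chunks processed in parallel
print(results)                                        # (3, 10)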

First we need to define some local functions that the chunks will use to update their individual parameter estimates, and import the black box local_update step from dask_glm; also, we will need the so-called shrinkage operator (which is the proximal operator for the $l1$-norm in our problem):

from dask_glm.algorithms import local_update

def local_f(beta, X, y, z, u, rho):
    return ((y - X.dot(beta)) **2).sum() + (rho / 2) * np.dot(beta - z + u,
                                                                  beta - z + u)

def local_grad(beta, X, y, z, u, rho):
    return 2 * X.T.dot(X.dot(beta) - y) + rho * (beta - z + u)


def shrinkage(beta, t):
    return np.maximum(0, beta - t) - np.maximum(0, -beta - t)

## set some algorithm parameters
max_steps = 10
lamduh = 7.2
rho = 1.0

(n, p) = X.shape
nchunks = X.npartitions

XD = X.to_delayed().flatten().tolist()  # A list of pointers to remote numpy arrays
yD = y.to_delayed().flatten().tolist()  # ... one for each chunk

# the initial consensus estimate
z = np.zeros(p)

# an array of the individual "dual variables" and parameter estimates,
# one for each chunk of data
u = np.array([np.zeros(p) for i in range(nchunks)])
betas = np.array([np.zeros(p) for i in range(nchunks)])

for k in range(max_steps):

    # process each chunk in parallel, using the black-box 'local_update' magic
    new_betas = [dask.delayed(local_update)(xx, yy, bb, z, uu, rho,
                                            f=local_f,
                                            fprime=local_grad)
                 for xx, yy, bb, uu in zip(XD, yD, betas, u)]
    new_betas = np.array(dask.compute(*new_betas))

    # everything else is NumPy code occurring at "master"
    beta_hat = 0.9 * new_betas + 0.1 * z

    # create consensus estimate
    zold = z.copy()
    ztilde = np.mean(beta_hat + np.array(u), axis=0)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))

    # update dual variables
    u += beta_hat - z

>>> # Number of coefficients zeroed out due to L1 regularization
>>> print((z == 0).sum())
12

There is of course a little bit more work occurring in the above algorithm, but it should be clear that the distributed operations are not one of the difficult pieces. Using dask.delayed we were able to express a simple map-reduce algorithm like ADMM with similarly simple Python for loops and delayed function calls. Dask.delayed is keeping track of all of the function calls we wanted to make and what other function calls they depend on. For example all of the local_update calls can happen independently of each other, but the consensus computation blocks on all of them.

We hope that both parallel algorithms shown above (gradient descent, ADMM) were straightforward to someone reading with an optimization background. These implementations run well on a laptop, a single multi-core workstation, or a thousand-node cluster if necessary. We’ve been building somewhat more sophisticated implementations of these algorithms (and others) in dask-glm. They are more sophisticated from an optimization perspective (stopping criteria, step size, asynchronicity, and so on) but remain as simple from a distributed computing perspective.

Experiment

We compare dask-glm implementations against Scikit-learn on a laptop, and then show them running on a cluster.

Reproducible notebook is available here

We’re building more sophisticated versions of the algorithms above in dask-glm.  This project has convex optimization algorithms for gradient descent, proximal gradient descent, Newton’s method, and ADMM.  These implementations extend the implementations above by also thinking about stopping criteria, step sizes, and other niceties that we avoided above for simplicity.

In this section we show off these algorithms by performing a simple numerical experiment that compares the numerical performance of proximal gradient descent and ADMM alongside Scikit-Learn’s LogisticRegression and SGD implementations on a single machine (a personal laptop) and then follows up by scaling the dask-glm options to a moderate cluster.  

Disclaimer: These experiments are crude. We’re using artificial data, and we’re not tuning parameters or even finding parameters at which these algorithms are producing results of the same accuracy. The goal of this section is just to give a general feeling of how things compare.

We create data

## size of problem (no. observations)
N = int(8e6)
chunks = int(1e6)
seed = 20009
beta = (np.random.random(15) - 0.5) * 3

X = da.random.random((N, len(beta)), chunks=chunks)
y = make_y(X, beta=np.array(beta), chunks=chunks)  # make_y comes from dask-glm (a helper that generates labels from X and beta)

X, y = dask.persist(X, y)
client.rebalance([X, y])

And run each of our algorithms as follows:

# assumed imports: proximal_grad and admm come from dask_glm.algorithms, the
# estimators from sklearn.linear_model; alpha is the regularization strength
# used throughout (its value isn't shown in the post)
from dask_glm.algorithms import proximal_grad, admm
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Dask-GLM Proximal Gradient
result = proximal_grad(X, y, lamduh=alpha)

# Dask-GLM ADMM
X2 = X.rechunk((1e5, None)).persist()  # ADMM prefers smaller chunks
y2 = y.rechunk(1e5).persist()
result = admm(X2, y2, lamduh=alpha)

# Scikit-Learn LogisticRegression
nX, ny = dask.compute(X, y)  # sklearn wants numpy arrays
result = LogisticRegression(penalty='l1', C=1).fit(nX, ny).coef_

# Scikit-Learn Stochastic Gradient Descent
result = SGDClassifier(loss='log',
                       penalty='l1',
                       l1_ratio=1,
                       n_iter=10,
                       fit_intercept=False).fit(nX, ny).coef_

We then compare with the $L_{\infty}$ norm (the largest absolute difference between coefficients).

abs(result - beta).max()

Times and $L_\infty$ distance from the true “generative beta” for these parameters are shown in the table below:

Algorithm             Error     Duration (s)
Proximal Gradient     0.0227    128
ADMM                  0.0125    34.7
LogisticRegression    0.0132    79
SGDClassifier         0.0456    29.4

Again, please don’t take these numbers too seriously: these algorithms all solve regularized problems, so we don’t expect the results to necessarily be close to the underlying generative beta (even asymptotically). The numbers above are meant to demonstrate that they all return results which were roughly the same distance from the beta above. Also, Dask-glm is using a full four-core laptop while SKLearn is restricted to use a single core.

In the sections below we include profile plots for proximal gradient and ADMM. These show the operations that each of eight threads was doing over time. You can mouse-over rectangles/tasks and zoom in using the zoom tools in the upper right. You can see the difference in complexity of the algorithms. ADMM is much simpler from Dask’s perspective but also saturates hardware better for this chunksize.

Profile Plot for Proximal Gradient Descent

Profile Plot for ADMM

The general takeaway here is that dask-glm performs comparably to Scikit-Learn on a single machine. If your problem fits in memory on a single machine you should continue to use Scikit-Learn and Statsmodels. The real benefit to the dask-glm algorithms is that they scale and can run efficiently on data that is larger-than-memory by operating from disk on a single computer or on a cluster of computers working together.

Cluster Computing

As a demonstration, we run a larger version of the data above on a cluster of eight m4.2xlarges on EC2 (8 cores and 30GB of RAM each).

We create a larger dataset with 800,000,000 rows and 15 columns across eight processes.

N = int(8e8)
chunks = int(1e7)
seed = 20009
beta = (np.random.random(15) - 0.5) * 3

X = da.random.random((N,len(beta)), chunks=chunks)
y = make_y(X, beta=np.array(beta), chunks=chunks)

X, y = dask.persist(X, y)

We then run the same proximal_grad and admm operations from before:

# Dask-GLM Proximal Gradient
result = proximal_grad(X, y, lamduh=alpha)

# Dask-GLM ADMM
X2 = X.rechunk((1e6, None)).persist()  # ADMM prefers smaller chunks
y2 = y.rechunk(1e6).persist()
result = admm(X2, y2, lamduh=alpha)

Proximal grad completes in around seventeen minutes while ADMM completes in around four minutes. Profiles for the two computations are included below:

Profile Plot for Proximal Gradient Descent

We include only the first few iterations here. Otherwise this plot is several megabytes.

Link to fullscreen plot

Profile Plot for ADMM

Link to fullscreen plot

These both obtained similar $L_{\infty}$ errors to what we observed before.

Algorithm             Error      Duration (s)
Proximal Gradient     0.0306     1020
ADMM                  0.00159    270

This time we had to be careful about a couple of things:

  1. We explicitly deleted the old data after rechunking (ADMM prefers different chunksizes than proximal_gradient) because our full dataset, 100GB, is close enough to our total distributed RAM (240GB) that it’s a good idea to avoid keeping replicas around needlessly. Things would have run fine, but spilling excess data to disk would have negatively affected performance. (A short sketch of this bookkeeping follows the list.)
  2. We set the OMP_NUM_THREADS=1 environment variable to avoid over-subscribing our CPUs. Surprisingly, not doing so led both to worse performance and to non-deterministic results, an issue that we’re still tracking down.
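Here is a rough sketch of the bookkeeping in point 1 (X2 and y2 match the earlier snippets; the explicit del is my illustration of “deleting the old data”, and in practice OMP_NUM_THREADS is exported in the workers’ environment before they start):

import os
os.environ['OMP_NUM_THREADS'] = '1'    # point 2: one BLAS/OpenMP thread per process

# point 1: rechunk for ADMM, persist the new layout, then drop references to the
# old chunking so the scheduler can free that memory instead of spilling to disk
X2 = X.rechunk((1e6, None)).persist()
y2 = y.rechunk(1e6).persist()
del X, y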

Analysis

The algorithms in Dask-GLM are new and need development, but are in a usable state by people comfortable operating at this technical level. Additionally, we would like to attract other mathematical and algorithmic developers to this work. We’ve found that Dask provides a nice balance between being flexible enough to support interesting algorithms, while being managed enough to be usable by researchers without a strong background in distributed systems. In this section we’re going to discuss the things that we learned from both Chris’ (mathematical algorithms) and Matt’s (distributed systems) perspective and then talk about possible future work. We encourage people to pay attention to future work; we’re open to collaboration and think that this is a good opportunity for new researchers to meaningfully engage.

Chris’s perspective

  1. Creating distributed algorithms with Dask was surprisingly easy; there is still a small learning curve around when to call things like persist, compute, rebalance, and so on, but that can’t be avoided. Using Dask for algorithm development has been a great learning environment for understanding the unique challenges associated with distributed algorithms (including communication costs, among others).
  2. Getting the particulars of algorithms correct is non-trivial; there is still work to be done in better understanding the tolerance settings vs. accuracy tradeoffs that are occurring in many of these algorithms, as well as fine-tuning the convergence criteria for increased precision.
  3. On the software development side, reliably testing optimization algorithms is hard. Finding provably correct optimality conditions that should be satisfied which are also numerically stable has been a challenge for me.
  4. Working on algorithms in isolation is not nearly as fun as collaborating on them; please join the conversation and contribute!
  5. Most importantly from my perspective, I’ve found there is a surprisingly large amount of misunderstanding in “the community” surrounding what optimization algorithms do in the world of predictive modeling, what problems they each individually solve, and whether or not they are interchangeable for a given problem. For example, Newton’s method can’t be used to optimize an l1-regularized problem, and the coefficient estimates from an l1-regularized problem are fundamentally (and numerically) different from those of an l2-regularized problem (and from those of an unregularized problem). My own personal goal is that the API for dask-glm exposes these subtle distinctions more transparently and leads to more thoughtful modeling decisions “in the wild”.

Matt’s perspective

This work triggered a number of concrete changes within the Dask library:

  1. We can convert Dask.dataframes to Dask.arrays. This is particularly important because people want to do pre-processing with dataframes but then switch to efficient multi-dimensional arrays for algorithms.
  2. We had to unify the single-machine scheduler and distributed scheduler APIs a bit, notably adding a persist function to the single machine scheduler. This was particularly important because Chris generally prototyped on his laptop but we wanted to write code that was effective on clusters.
  3. Scheduler overhead can be a problem for the iterative dask-array algorithms (gradient descent, proximal gradient descent, BFGS). This is particularly a problem because NumPy is very fast. Often our tasks take only a few milliseconds, which makes Dask’s overhead of 200us per task become very relevant (this is why you see whitespace in the profile plots above). We’ve started resolving this problem in a few ways like more aggressive task fusion and lower overheads generally, but this will be a medium-term challenge. In practice for dask-glm we’ve started handling this just by choosing chunksizes well. I suspect that for the dask-glm in particular we’ll just develop auto-chunksize heuristics that will mostly solve this problem. However we expect this problem to recur in other work with scientists on HPC systems who have similar situations.
  4. A couple of things can be tricky for algorithmic users:
    1. Placing the calls to asynchronously start computation (persist, compute). In practice Chris did a good job here and then I came through and tweaked things afterwards. The web diagnostics ended up being crucial to identify issues.
    2. Avoiding accidentally calling NumPy functions on dask.arrays and vice versa. We’ve improved this on the dask.array side, and they now operate intelligently when given numpy arrays. Changing this on the NumPy side is harder until NumPy protocols change (which is planned).

Future work

There are a number of things we would like to do, both in terms of measurement and for the dask-glm project itself. We welcome people to voice their opinions (and join development) on the following issues:

  1. Asynchronous Algorithms
  2. User APIs
  3. Extend GLM families
  4. Write more extensive rigorous algorithm testing - for satisfying provable optimality criteria, and for robustness to various input data
  5. Begin work on smart initialization routines

What is your perspective here, gentle reader? Both Matt and Chris can use help on this project. We hope that some of the issues above provide seeds for community engagement. We welcome other questions, comments, and contributions either as github issues or comments below.

Acknowledgements

Thanks also go to Hussain Sultan (Capital One) and Tom Augspurger for collaboration on Dask-GLM and to Will Warner (Continuum) for reviewing and editing this post.

March 22, 2017 12:00 AM

numfocus

nteract: Building on top of Jupyter (from a rich REPL toolkit to interactive notebooks)

This post originally appeared on the nteract blog. nteract builds upon the very successful foundations of Jupyter. I think of Jupyter as a brilliantly rich REPL toolkit. A typical REPL (Read-Eval-Print-Loop) is an interpreter that takes input from the user and prints results (on stdout and stderr). Here’s the standard Python interpreter; a REPL many […]

by NumFOCUS Staff at March 22, 2017 12:00 AM

March 21, 2017

Matthieu Brucher

Announcement: Audio TK 1.5.0

ATK is updated to 1.5.0 with new features oriented around preamplifiers and optimizations. It is also now compiled on Appveyor: https://ci.appveyor.com/project/mbrucher/audiotk.

Thanks to Travis and Appveyor, binaries for the releases are now uploaded to GitHub. On all platforms we compile static and shared libraries. On Linux, builds for gcc 5, gcc 6, clang 3.8 and clang 3.9 are generated; on OS X, XCode 7 and XCode 8 builds are available as universal binaries; and on Windows, 32-bit and 64-bit builds, with dynamic or static runtime (no shared libraries in the static case), are also generated.

Download link: ATK 1.5.0

Changelog:
1.5.0
* Adding a follower class solid state preamplifier with Python wrappers
* Adding a Dempwolf model for tube filters with Python wrappers
* Adding a Munro-Piazza model for tube filters with Python wrappers
* Optimized distortion and preamplifier filters by using fmath exp calls

1.4.1
* Vectorized x4 the IIR part of the IIR filter
* Vectorized delay filters
* Fixed bug in gain filters

Buy Me a Coffee!

by Matt at March 21, 2017 08:24 AM

March 20, 2017

Continuum Analytics news

​Announcing Anaconda Project: Data Science Project Encapsulation and Deployment, the Easy Way!

Monday, March 20, 2017
Christine Doig
Sr. Data Scientist, Product Manager

Kristopher Overholt
Product Manager

One year ago, we presented Anaconda and Docker: Better Together for Reproducible Data Science. In that blog post, we described our vision and a foundational approach to portable and reproducible data science using Anaconda and Docker.

This approach embraced the philosophy of Open Data Science in which data scientists can connect the powerful data science experience of Anaconda with the tools that they know and love, which today includes Jupyter notebooks, machine learning frameworks, data analysis libraries, big data computations and connectivity, visualization toolkits, high-performance numerical libraries and more.

We also discussed how data scientists could use Anaconda to develop data science analyses on their local machine, then use Docker to deploy those same data science analyses into production. This was the state of data science encapsulation and deployment that we presented last year:

In this blog post, we’ll be diving deeper into how we’ve created a standard data science project encapsulation approach that helps data scientists deploy secure, scalable and reproducible projects across an entire team with Anaconda.

This blog post also provides more details about how we’re using Anaconda and Docker for encapsulation and containerization of data science projects to power the data science deployment functionality in the next generation of Anaconda Enterprise, which augments our truly end-to-end data science platform.

Supercharge Your Data Science with More Than Just Dockerfiles!

The reality is, as much as Docker is loved and used by the DevOps community, it is not the preferred tool or entrypoint for data scientists looking to deploy their applications. Using Docker alone as a data science encapsulation strategy still requires coordination with their IT and DevOps teams to write their Dockerfiles, install the required system libraries in their containers, and orchestrate and deploy their Docker containers into production.

Having data scientists worry about infrastructure details and DevOps tooling takes away time from their most valuable skills: finding insights in data, modeling and running experiments, and delivering consumable data-driven applications to their team and end-users.

Data scientists enjoy using the packages they know and love with Anaconda along with conda environments, and wish it was as easy to deploy data science projects as it is to get Anaconda running in their laptop.

By working directly with our amazing customers and users and listening to the needs of their data science teams over the last five years, we have clearly identified how Anaconda and Docker can be used together for data science project encapsulation and as a more useful abstraction layer for data scientists: Anaconda Projects.

The Next Generation of Portable and Reproducible Data Science with Anaconda

As part of the next generation of data science encapsulation, reproducibility and deployment, we are happy to announce the release of Anaconda Project with the latest release of Anaconda! Download the latest version of Anaconda 4.3.1 to get started with Anaconda Project today.

Or, if you already have Anaconda, you can install Anaconda Project using the following command:

conda install anaconda-project

Anaconda Project makes it easy to encapsulate data science projects and makes them fully portable and deployment-ready. It automates the configuration and setup of data science projects, such as installing the necessary packages and dependencies, downloading data sets and required files, setting environment variables for credentials or runtime configuration, and running commands.

Anaconda Project is an open source tool created by Continuum Analytics that delivers light-weight, efficient encapsulation and portability of data science projects. Learn more by checking out the Anaconda Project documentation.

Anaconda Project makes it easy to reproduce your data science analyses, share data science projects with others, run projects across different platforms, or deploy data science applications with a single-click in Anaconda Enterprise.

Whether you’re running a project locally or deploying a project with Anaconda Enterprise, you are using the same project encapsulation standard: an Anaconda Project. We’re bringing you the next generation of true Open Data Science deployment in 2017 with Anaconda:

New Release of Anaconda Navigator with Support for Anaconda Projects

As part of this release of Anaconda Project, we’ve integrated easy data science project creation and encapsulation to the familiar Anaconda Navigator experience, which is a graphical interface for your Anaconda environments and data science tools. You can easily create, edit, and upload Anaconda Projects to Anaconda Cloud through a graphical interface:

Download the latest version of Anaconda 4.3.1 to get started with Anaconda Navigator and Anaconda Project today.

Or, if you already have Anaconda, you can install the latest version of Anaconda Navigator using the following command:

conda install anaconda-navigator

When you’re using Anaconda Project with Navigator, you can create a new project and specify its dependencies, or you can import an existing conda environment file (environment.yaml) or pip requirements file (requirements.txt).

Anaconda Project examples:

  • Image classifier web application using Tensorflow and Flask
  • Live Python and R notebooks that retrieve the latest stock market data
  • Interactive Bokeh and Shiny applications for data clustering, cross filtering, and data exploration
  • Interactive visualizations of data sets with Bokeh, including streaming data
  • Machine learning models with REST APIs

To get started even quicker with portable data science projects, refer to the example Anaconda Projects on Anaconda Cloud.

Deploying Secure and Scalable Data Science Projects with Anaconda Enterprise

The new data science deployment and collaboration functionality in Anaconda Enterprise leverages Anaconda Project plus industry-standard containerization with Docker and enterprise-ready container orchestration technology with Kubernetes.

This productionization and deployment strategy makes it easy to create and deploy data science projects with a single-click for projects that use Python 2, Python 3, R, (including their dependencies in C++, Fortran, Java, etc.) or anything else you can build with the 730+ packages in Anaconda.

From Data Science Development to Deployment with Anaconda Projects and Anaconda Enterprise

All of this is possible without having to edit Dockerfiles directly, install system packages in your Docker containers, or manually deploy Docker containers into production. Anaconda Enterprise handles all of that for you, so you can get back to doing data science analysis.

The result is that any project that a data scientist can create on their machine with Anaconda can be deployed to an Anaconda Enterprise cluster in a secure, scalable, and highly-available manner with just a single click, including live notebooks, interactive applications, machine learning models with REST APIs, or any other projects that leverage the 730+ packages in Anaconda.

Anaconda is such a foundational and ubiquitous data science platform that other lightweight data science workspaces and workbenches are using Anaconda as a necessary core component for their portable and reproducible data science. Anaconda is the leading Open Data Science platform powered by Python and empowers data scientists with a truly integrated experience and support for end-to-end workflows. Why would you want your data science team using Anaconda in production with anything other than Anaconda Enterprise?

Anaconda Enterprise is a true end-to-end data science platform that integrates with all of the most popular tools and platforms and provides your data science team with an on-premises package repository, secure enterprise notebook collaboration, data science and analytics on Hadoop/Spark, and secure and scalable data science deployment.

Anaconda Enterprise also includes support for all of the 730+ Open Data Science packages in Anaconda. Finally, Anaconda Scale is the only recommended and certified method for deploying Anaconda to a Hadoop cluster for PySpark or SparkR jobs.

Getting Started with Anaconda Enterprise and Anaconda Projects

Anaconda Enterprise uses Anaconda Project and Docker as its standard project encapsulation and deployment format to enable simple one-click deployments of secure and scalable data science applications for your entire data science team.

Are you interested in using Anaconda Enterprise in your organization to deploy data science projects, including live notebooks, machine learning models, dashboards, and interactive applications?

Access to the next generation of Anaconda Enterprise v5, which features one-click secure and scalable data science deployments, is now available as a technical preview as part of the Anaconda Enterprise Innovator Program.

Join the Anaconda Enterprise v5 Innovator Program today to discover the powerful data science deployment capabilities for yourself. Anaconda Enterprise handles your secure and scalable data science project encapsulation and deployment requirements so that your data science team can focus on data exploration and analysis workflows and spend less time worrying about infrastructure and DevOps tooling.

by swebster at March 20, 2017 05:30 PM

March 16, 2017

Titus Brown

Registration reminder for our two-week summer workshop on high-throughput sequencing data analysis!

Our two-week summer workshop (announcement, direct link) is shaping up quite well, but the application deadline is today! So if you're interested, you should apply sometime before the end of the day. (We'll leave applications open as long as it's March 17th somewhere in the world.)

Some updates and expansions on the original announcement --

  • we'll be training attendees in high-performance computing, in the service of doing bioinformatics analyses. To that end, we've received a large grant from NSF XSEDE, and we'll be using JetStream for our analyses.
  • we have limited financial support that will be awarded after acceptances are issued in a week.

Here's the original announcement below:

ANGUS: Analyzing High Throughput Sequencing Data

June 26-July 8, 2017

University of California, Davis

  • Zero-entry - no experience required or expected!
  • Hands-on training in using the UNIX command line to analyze your sequencing data.
  • Friendly, helpful instructors and TAs!
  • Summer sequencing camp - meet and talk science with great people!
  • Now in its eighth year!

The workshop fee will be $500 for the two weeks, and on-campus room and board is available for $500/week. Applications will close March 17th. International and industry applicants are more than welcome!

Please see http://ivory.idyll.org/dibsi/ANGUS.html for more information, and contact dibsi.training@gmail.com if you have questions or suggestions.


--titus

by C. Titus Brown at March 16, 2017 11:00 PM

numfocus

Facebook Makes Sophisticated Forecasting Techniques Available to Non-Experts Thanks to Stan, a NumFOCUS Sponsored Project

Facebook recently announced that they have made their forecasting tool, Prophet, open source. This is great news for data scientists and business analysts alike: forecasting is an important but tricky process that is critical to many organizations, both for-profit and non-profit. The Prophet forecasting tool is able […]

by NumFOCUS Staff at March 16, 2017 12:00 AM

March 14, 2017

Thomas Wiecki

Random-Walk Bayesian Deep Networks: Dealing with Non-Stationary Data

Download the NB: https://github.com/twiecki/WhileMyMCMCGentlySamples/blob/master/content/downloads/notebooks/random_walk_deep_net.ipynb

(c) 2017 by Thomas Wiecki -- Quantopian Inc.

Most problems solved by Deep Learning are stationary. A cat is always a cat. The rules of Go have remained stable for 2,500 years, and will likely stay that way. However, what if the world around you is changing? This is common, for example when applying Machine Learning in Quantitative Finance. Markets are constantly evolving, so features that are predictive in some time-period might lose their edge while other patterns emerge. Usually, quants would just retrain their classifiers every once in a while. This approach of just re-estimating the same model on more recent data is very common. I find that to be a pretty unsatisfying way of modeling, as it has certain shortcomings:

  • The estimation window should be long so as to incorporate as much training data as possible.
  • The estimation window should be short so as to incorporate only the most recent data, as old data might be obsolete.
  • When you have no estimate of how fast the world around you is changing, there is no principled way of setting the window length to balance these two objectives.

Certainly there is something to be learned even from past data, we just need to instill our models with a sense of time and recency.

Enter random-walk processes. Ever since I learned about them in the stochastic volatility model they have become one of my favorite modeling tricks. Basically, it allows you to turn every static model into a time-sensitive one.

You can read more about the details of random-walk priors here, but the central idea is that, in any time-series model, rather than assuming a parameter to be constant over time, we allow it to change gradually, following a random walk. For example, take a logistic regression:

$$ Y_i = f(\beta X_i) $$

where $f$ is the logistic function and $\beta$ is our learnable parameter. Now assume that our data is not iid and that $\beta$ is changing over time; we then need a different $\beta$ for every $i$:

$$ Y_i = f(\beta_i X_i) $$

Of course, this will just overfit, so we need to constrain our $\beta_i$ somehow. We will assume that while $\beta_i$ is changing over time, it will do so rather gradually by placing a random-walk prior on it:

$$ \beta_t \sim \mathcal{N}(\beta_{t-1}, s^2) $$

So $\beta_t$ is allowed to only deviate a little bit (determined by the step-width $s$) from its previous value $\beta_{t-1}$. $s$ can be thought of as a stability parameter -- how fast is the world around you changing.

Let's first generate some toy data and then implement this model in PyMC3. We will then use this same trick in a Neural Network with hidden layers.

If you would like a more complete introduction to Bayesian Deep Learning, see my recent ODSC London talk. This blog post takes things one step further so definitely read further below.

In [1]:
%matplotlib inline
import pymc3 as pm
import theano.tensor as T
import theano
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
from sklearn import datasets
from sklearn.preprocessing import scale


import warnings
# VisibleDeprecationWarning lives in numpy
warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)

sns.set_context('notebook')

Generating data

First, lets generate some toy data -- a simple binary classification problem that's linearly separable. To introduce the non-stationarity, we will rotate this data along the center across time. Safely skip over the next few code cells.

In [2]:
X, Y = sklearn.datasets.make_blobs(n_samples=1000, centers=2, random_state=1)
X = scale(X)
colors = Y.astype(str)
colors[Y == 0] = 'r'
colors[Y == 1] = 'b'

interval = 20
subsample = X.shape[0] // interval
chunk = np.arange(0, X.shape[0]+1, subsample)
degs = np.linspace(0, 360, len(chunk))

sep_lines = []

for ii, (i, j, deg) in enumerate(list(zip(np.roll(chunk, 1), chunk, degs))[1:]):
    theta = np.radians(deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.matrix([[c, -s], [s, c]])

    X[i:j, :] = X[i:j, :].dot(R)
In [4]:
import base64
from tempfile import NamedTemporaryFile

VIDEO_TAG = """<video controls>
 <source src="data:video/x-m4v;base64,{0}" type="video/mp4">
 Your browser does not support the video tag.
</video>"""


def anim_to_html(anim):
    if not hasattr(anim, '_encoded_video'):
        anim.save("test.mp4", fps=20, extra_args=['-vcodec', 'libx264'])
        video = open("test.mp4", "rb").read()
        anim._encoded_video = base64.b64encode(video).decode('utf-8')
    return VIDEO_TAG.format(anim._encoded_video)

from IPython.display import HTML

def display_animation(anim):
    plt.close(anim._fig)
    return HTML(anim_to_html(anim))
from matplotlib import animation

# First set up the figure, the axis, and the plot element we want to animate
fig, ax = plt.subplots()
ims = [] #l, = plt.plot([], [], 'r-')
for i in np.arange(0, len(X), 10):
    ims.append([(ax.scatter(X[:i, 0], X[:i, 1], color=colors[:i]))])

ax.set(xlabel='X1', ylabel='X2')
# call the animator.  blit=True means only re-draw the parts that have changed.
anim = animation.ArtistAnimation(fig, ims,
                                 interval=500, 
                                 blit=True);

display_animation(anim)
Out[4]:
[embedded animation]

The last frame of the video, where all the data is plotted, is what a classifier with no sense of time would see. Thus, the problem we set up is impossible to solve when ignoring time, but trivial once you take time into account.

How would we classically solve this? You could just train a different classifier on each subset. But as I wrote above, you need to get the frequency right and you use less data overall.

Random-Walk Logistic Regression in PyMC3

In [5]:
from pymc3 import HalfNormal, GaussianRandomWalk, Bernoulli
from pymc3.math import sigmoid
import theano.tensor as tt


X_shared = theano.shared(X)
Y_shared = theano.shared(Y)

n_dim = X.shape[1] # 2

with pm.Model() as random_walk_perceptron:
    step_size = pm.HalfNormal('step_size', sd=np.ones(n_dim), 
                              shape=n_dim)
    
    # This is the central trick, PyMC3 already comes with this distribution
    w = pm.GaussianRandomWalk('w', sd=step_size, 
                              shape=(interval, 2))
    
    weights = tt.repeat(w, X_shared.shape[0] // interval, axis=0)
    
    class_prob = sigmoid(tt.batched_dot(X_shared, weights))
    
    # Binary classification -> Bernoulli likelihood
    pm.Bernoulli('out', class_prob, observed=Y_shared)

OK, if you understand the stochastic volatility model, the first two lines should look fairly familiar. We are creating 2 random-walk processes. As allowing the weights to change on every new data point is overkill, we subsample. The repeat turns the vector [t, t+1, t+2] into [t, t, t, t+1, t+1, ...] so that it matches the number of data points.

Next, we would usually just apply a single dot-product, but here we have many weights we’re applying to the input data, so we need to call dot in a loop. That is what tt.batched_dot does. In the end, we just get probabilities (predictions) for our Bernoulli likelihood.
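To make the shapes concrete, here are rough NumPy equivalents of those two ops (toy sizes, purely illustrative):

import numpy as np

w = np.arange(4.).reshape(2, 2)            # (interval=2, n_dim=2): one weight row per time chunk
w_rep = np.repeat(w, 3, axis=0)            # [t, t, t, t+1, t+1, t+1] -> shape (6, 2), like tt.repeat

X_toy = np.random.randn(6, 2)              # six data points, one matching weight row each
logits = np.einsum('nd,nd->n', X_toy, w_rep)   # one dot product per row, like tt.batched_dot here
probs = 1 / (1 + np.exp(-logits))          # shape (6,): per-observation class probabilities
print(w_rep.shape, probs.shape)            # (6, 2) (6,)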

On to the inference. In PyMC3 we recently improved NUTS in many different places. One of those is automatic initialization. If you just call pm.sample(n_iter), we will first run ADVI to estimate the diagonal mass matrix and find a starting point. This usually makes NUTS run quite robustly.

In [6]:
with random_walk_perceptron:
    trace_perceptron = pm.sample(2000)
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -90.867: 100%|██████████| 200000/200000 [01:13<00:00, 2739.70it/s]
Finished [100%]: Average ELBO = -90.869
100%|██████████| 2000/2000 [00:39<00:00, 50.58it/s]

Let's look at the learned weights over time:

In [7]:
plt.plot(trace_perceptron['w'][:, :, 0].T, alpha=.05, color='r');
plt.plot(trace_perceptron['w'][:, :, 1].T, alpha=.05, color='b');
plt.xlabel('time'); plt.ylabel('weights'); plt.title('Optimal weights change over time'); sns.despine();

As you can see, the weights are slowly changing over time. What does the learned hyperplane look like? In the plot below, the points are still the training data but the background color codes the class probability learned by the model.

In [8]:
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
grid_2d = np.tile(grid_2d, (interval, 1))
dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)

X_shared.set_value(grid_2d)
Y_shared.set_value(dummy_out)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace_perceptron, model=random_walk_perceptron, samples=500)

def create_surface(X, Y, grid, ppc, fig=None, ax=None):
    artists = []
    cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
    contour = ax.contourf(*grid, ppc, cmap=cmap)
    artists.extend(contour.collections)
    artists.append(ax.scatter(X[Y==0, 0], X[Y==0, 1], color='b'))
    artists.append(ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r'))
    _ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X1', ylabel='X2');
    return artists

fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip((list(zip(np.roll(chunk, 1), chunk))[1:]), (list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:])):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].mean(axis=0).reshape(100, 100), fig=fig, ax=ax)
    axs.append(a)
    
anim2 = animation.ArtistAnimation(fig, axs,
                                 interval=1000);
display_animation(anim2)
100%|██████████| 500/500 [00:23<00:00, 24.47it/s]
Out[8]:
[embedded animation]

Nice, we can see that the random-walk logistic regression adapts its weights to perfectly separate the two point clouds.

Random-Walk Neural Network

In the previous example, we had a very simple linearly classifiable problem. Can we extend this same idea to non-linear problems and build a Bayesian Neural Network with weights adapting over time?

If you haven't, I recommend you read my original post on Bayesian Deep Learning where I more thoroughly explain how a Neural Network can be implemented and fit in PyMC3.

Lets generate some toy data that is not linearly separable and again rotate it around its center.

In [9]:
from sklearn.datasets import make_moons
X, Y = make_moons(noise=0.2, random_state=0, n_samples=5000)
X = scale(X)

colors = Y.astype(str)
colors[Y == 0] = 'r'
colors[Y == 1] = 'b'

interval = 20
subsample = X.shape[0] // interval
chunk = np.arange(0, X.shape[0]+1, subsample)
degs = np.linspace(0, 360, len(chunk))

sep_lines = []

for ii, (i, j, deg) in enumerate(list(zip(np.roll(chunk, 1), chunk, degs))[1:]):
    theta = np.radians(deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.matrix([[c, -s], [s, c]])

    X[i:j, :] = X[i:j, :].dot(R)
In [28]:
fig, ax = plt.subplots()
ims = []
for i in np.arange(0, len(X), 10):
    ims.append((ax.scatter(X[:i, 0], X[:i, 1], color=colors[:i]),))

ax.set(xlabel='X1', ylabel='X2')
anim = animation.ArtistAnimation(fig, ims,
                                 interval=500, 
                                 blit=True);

display_animation(anim)
Out[28]:
[embedded animation]

Looks a bit like Yin and Yang; who knew we’d be creating art in the process.

On to the model. Rather than have all the weights in the network follow random-walks, we will just have the first hidden layer change its weights. The idea is that the higher layers learn stable higher-order representations while the first layer is transforming the raw data so that it appears stationary to the higher layers. We can of course also place random-walk priors on all weights, or only on those of higher layers, whatever assumptions you want to build into the model.

In [11]:
np.random.seed(123)

ann_input = theano.shared(X)
ann_output = theano.shared(Y)

n_hidden = [2, 5]

# Initialize random weights between each layer
init_1 = np.random.randn(X.shape[1], n_hidden[0]).astype(theano.config.floatX)
init_2 = np.random.randn(n_hidden[0], n_hidden[1]).astype(theano.config.floatX)
init_out = np.random.randn(n_hidden[1]).astype(theano.config.floatX)
    
with pm.Model() as neural_network:
    # Weights from input to hidden layer
    step_size = pm.HalfNormal('step_size', sd=np.ones(n_hidden[0]), 
                              shape=n_hidden[0])
    
    weights_in_1 = pm.GaussianRandomWalk('w1', sd=step_size, 
                                         shape=(interval, X.shape[1], n_hidden[0]),
                                         testval=np.tile(init_1, (interval, 1, 1))
                                        )
    
    weights_in_1_rep = tt.repeat(weights_in_1, 
                                 ann_input.shape[0] // interval, axis=0)
    
    weights_1_2 = pm.Normal('w2', mu=0, sd=1., 
                            shape=(1, n_hidden[0], n_hidden[1]),
                            testval=init_2)
    
    weights_1_2_rep = tt.repeat(weights_1_2, 
                                ann_input.shape[0], axis=0)
    
    weights_2_out = pm.Normal('w3', mu=0, sd=1.,
                              shape=(1, n_hidden[1]),
                              testval=init_out)
    
    weights_2_out_rep = tt.repeat(weights_2_out, 
                                  ann_input.shape[0], axis=0)
      

    # Build neural-network using tanh activation function
    act_1 = tt.tanh(tt.batched_dot(ann_input, 
                         weights_in_1_rep))
    act_2 = tt.tanh(tt.batched_dot(act_1, 
                         weights_1_2_rep))
    act_out = tt.nnet.sigmoid(tt.batched_dot(act_2, 
                                             weights_2_out_rep))
        
    # Binary classification -> Bernoulli likelihood
    out = pm.Bernoulli('out', 
                       act_out,
                       observed=ann_output)

Hopefully that's not too incomprehensible. It is basically applying the principles from the random-walk logistic regression but adding another hidden layer.

I also want to take the opportunity to look at what the Bayesian approach to Deep Learning offers. Usually, we fit these models using point-estimates like the MLE or the MAP. Let’s see how well that works on a structurally more complex model like this one:

In [12]:
import scipy.optimize
with neural_network:
    map_est = pm.find_MAP(fmin=scipy.optimize.fmin_l_bfgs_b)
In [13]:
plt.plot(map_est['w1'].reshape(20, 4));

Some of the weights are changing, maybe it worked? How well does it fit the training data:

In [14]:
ppc = pm.sample_ppc([map_est], model=neural_network, samples=1)
print('Accuracy on train data = {:.2f}%'.format((ppc['out'] == Y).mean() * 100))
100%|██████████| 1/1 [00:00<00:00,  6.32it/s]
Accuracy on train data = 76.64%

Now on to estimating the full posterior, as a proper Bayesian would:

In [15]:
with neural_network:
    trace = pm.sample(1000, tune=200)
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -538.86: 100%|██████████| 200000/200000 [13:06<00:00, 254.43it/s]
Finished [100%]: Average ELBO = -538.69
100%|██████████| 1000/1000 [1:22:05<00:00,  4.97s/it]
In [16]:
plt.plot(trace['w1'][200:, :, 0, 0].T, alpha=.05, color='r');
plt.plot(trace['w1'][200:, :, 0, 1].T, alpha=.05, color='b');
plt.plot(trace['w1'][200:, :, 1, 0].T, alpha=.05, color='g');
plt.plot(trace['w1'][200:, :, 1, 1].T, alpha=.05, color='c');

plt.xlabel('time'); plt.ylabel('weights'); plt.title('Optimal weights change over time'); sns.despine();

That already looks quite different. What about the accuracy:

In [17]:
ppc = pm.sample_ppc(trace, model=neural_network, samples=100)
print('Accuracy on train data = {:.2f}%'.format(((ppc['out'].mean(axis=0) > .5) == Y).mean() * 100))
100%|██████████| 100/100 [00:00<00:00, 112.04it/s]
Accuracy on train data = 96.72%

I think this is worth highlighting. The point-estimate did not do well at all, but by estimating the whole posterior we were able to model the data much more accurately. I’m not quite sure why that is the case. It’s possible that we either did not find the true MAP because the optimizer can’t deal with the correlations in the posterior as well as NUTS can, or the MAP is just not a good point. See my other blog post on hierarchical models for why the MAP is a terrible choice for some models.

On to the fireworks. What does this actually look like:

In [18]:
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
grid_2d = np.tile(grid_2d, (interval, 1))
dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)

ann_input.set_value(grid_2d)
ann_output.set_value(dummy_out)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)

fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip((list(zip(np.roll(chunk, 1), chunk))[1:]), (list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:])):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].mean(axis=0).reshape(100, 100), fig=fig, ax=ax)
    axs.append(a)
    
anim2 = animation.ArtistAnimation(fig, axs,
                                  interval=1000);
display_animation(anim2)
100%|██████████| 500/500 [00:58<00:00,  7.82it/s]
Out[18]:
[embedded animation]

Holy shit! I can't believe that actually worked. Just for fun, let's also make use of the fact that we have the full posterior and plot our uncertainty of our prediction (the background now encodes posterior standard-deviation where red means high uncertainty).

In [19]:
fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip((list(zip(np.roll(chunk, 1), chunk))[1:]), (list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:])):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].std(axis=0).reshape(100, 100), 
                       fig=fig, ax=ax)
    axs.append(a)

anim2 = animation.ArtistAnimation(fig, axs,
                                  interval=1000);
display_animation(anim2)
Out[19]:
[embedded animation]

Conclusions

In this blog post I explored the possibility of extending Neural Networks in new ways (to my knowledge), enabled by expressing them in a Probabilistic Programming framework. Using a classic point-estimate did not provide a good fit for the data; only full posterior inference using MCMC allowed us to fit this model adequately. What is quite nice is that we did not have to do anything special for the inference in PyMC3: just calling pymc3.sample() gave stable results on this complex model.

Initially I built the model allowing all parameters to change, but realizing that we can selectively choose which layers to change felt like a profound insight. If you expect the raw data to change, but the higher-level representations to remain stable, as was the case here, we allow the bottom hidden layers to change. If we instead imagine e.g. handwriting recognition, where your handwriting might change over time, we would expect lower level features (lines, curves) to remain stable but allow changes in how we combine them. Finally, if the world remains stable but the labels change, we would place a random-walk process on the output layer. Of course, if you don't know, you can have every layer change its weights over time and give each one a separate step-size parameter which would allow the model to figure out which layers change (high step-size), and which remain stable (low step-size).

In terms of quantitative finance, this type of model allows us to train on much larger data sets ranging back a long time. A lot of that data is still useful to build up stable hidden representations, even if for prediction you still want your model to use its most up-to-date state of the world. No need to define a window-length or discard valuable training data.

In [24]:
%load_ext watermark
%watermark -v -m -p numpy,scipy,sklearn,theano,pymc3,matplotlib
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
CPython 3.6.0
IPython 5.1.0

numpy 1.11.3
scipy 0.18.1
sklearn 0.18.1
theano 0.9.0beta1.dev-9f1aaacb6e884ebcff9e249f19848db8aa6cb1b2
pymc3 3.0
matplotlib 2.0.0

compiler   : GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)
system     : Darwin
release    : 16.4.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit

by Thomas Wiecki at March 14, 2017 02:00 PM

Matthieu Brucher

AudioToolkit: creating a simple plugin with WDL-OL

Audio Toolkit was started several years ago now; there are more than a dozen plugins based on the platform and applications using it, but I never wrote a tutorial explaining how to use it. Users had to find out for themselves. This changes today.

Building Audio Toolkit

Let’s start with building Audio ToolKit. It uses CMake to ease the pain of supporting several platforms, although you can build it yourself if you generate config.h.

You will require Boost, Eigen and FFTW if you want to test the library and ensure that everything is all right.

Windows

Windows may be the most complicated platform. This stems from the fact that the runtime is different for each version of the Microsoft compiler (except after 2015), and usually that’s not the one you have with your DAW (and thus probably not the one you have with your users’ DAW).

So the first question is which kind of build you need. For a plugin, I think it is clearly a static runtime that you require; for an app, I would suggest the dynamic runtime. For this, in the CMake GUI, set MSVC_RUNTIME to Static or Dynamic. Enable the matching output: static libraries for a plugin, shared libraries for an application.

Note that tests require the shared libraries.

macOS Sierra/OS X

On OS X, just create the default Xcode project. You may also want to generate ATK with CMAKE_OSX_ARCHITECTURES set to i386 to get a 32-bit version (I’ll use i386 in this tutorial), or to both i386 and x86_64 for a universal binary.

The same rules for static/shared apply here.

Linux

For Linux, I don’t have plugin support in WDL-OL, but suffice it to say that the ideas in the next section are the ones that are actually relevant.

Building a plugin with WDL-OL

I’ll use the same simple code to generate a plugin that does more or less nothing except copy data from its input to its output.

Common code

Start by using the duplicate.py script to create your own plugin. Use a “1-1” PLUG_CHANNEL_IO value to create a mono plugin (this is in resource.h). More advanced configurations can be seen on the ATK plugin repository.

Now, we need an input and an output filter for our pipeline. Let’s add them to our plugin class:

#include <ATK/Core/InPointerFilter.h>
#include <ATK/Core/OutPointerFilter.h>

and new members:

  ATK::InPointerFilter<double> inFilter;
  ATK::OutPointerFilter<double> outFilter;

Now, in the constructor’s initialization list, add the following:

inFilter(nullptr, 1, 0, false), outFilter(nullptr, 1, 0, false)

and then, in the constructor body:

  outFilter.set_input_port(0, &inFilter, 0);
  Reset();

This is required to set up the pipeline and initialize the internal variables.
In Reset(), put the following:

  int sampling_rate = GetSampleRate();
 
  if(sampling_rate != outFilter.get_output_sampling_rate())
  {
    inFilter.set_input_sampling_rate(sampling_rate);
    inFilter.set_output_sampling_rate(sampling_rate);
    outFilter.set_input_sampling_rate(sampling_rate);
    outFilter.set_output_sampling_rate(sampling_rate);
  }

This ensures that all the sampling rates are consistent. While this is not strictly required for a simple copy pipeline, it is mandatory for EQs and modeling filters. ATK also requires the pipeline to be consistent, so you can’t connect filters whose input and output sampling rates don’t match. Some filters can change rates, like the oversampling and undersampling ones, but they are the exception, not the rule.

And now, the only thing that remains is to actually trigger the pipeline:

  inFilter.set_pointer(inputs[0], nFrames);   // expose the host input buffer to the pipeline
  outFilter.set_pointer(outputs[0], nFrames); // write results directly into the host output buffer
  outFilter.process(nFrames);                 // pull nFrames samples through the pipeline

Now, the WDL-OL projects must be adapted. On both platforms, this is quite straightforward: set the include paths and add the ATK libraries for the link stage.

Windows

For Windows, you need a matching ATK build for each of Debug and Release. In the project properties, add the ATK include folder in Project->Properties->C++->Preprocessor->AdditionalIncludeDirectories.

Then, for each configuration, add the ATK libraries you require on the linker page (Project->Properties->Link->AdditionalDependencies).

macOS Sierra/OS X

On OS X, it is easier: add the include and library folders to ADDITIONAL_INCLUDES and ADDITIONAL_LIBRARY_PATHS.

The second step is to add the libraries to the project by adding them to the Link Binary With Libraries list for each target you want to build.

Conclusion

That’s it!

In the end, I hope I have shown that it is easy to build something with Audio ToolKit.

Buy Me a Coffee!

by Matt at March 14, 2017 08:49 AM

Enthought

Webinar: Using Python and LabVIEW Together to Rapidly Solve Engineering Problems


When: On-Demand (Live webcast took place March 28, 2017)
What: Presentation, demo, and Q&A with Collin Draughon, Software Product Manager, National Instruments, and Andrew Collette, Scientific Software Developer, Enthought

View Now  If you missed the live session, fill out the form to view the recording!


Engineers and scientists all over the world are using Python and LabVIEW to solve hard problems in manufacturing and test automation, by taking advantage of the vast ecosystem of Python software.  But going from an engineer’s proof-of-concept to a stable, production-ready version of Python, smoothly integrated with LabVIEW, has long been elusive.

In this on-demand webinar and demo, we take a LabVIEW data acquisition app and extend it with Python’s machine learning capabilities, to automatically detect and classify equipment vibration.  Using a modern Python platform and the Python Integration Toolkit for LabVIEW, we show how easy and fast it is to install heavy-hitting Python analysis libraries, take advantage of them from live LabVIEW code, and finally deploy the entire solution, Python included, using LabVIEW Application Builder.



In this webinar, you’ll see how easy it is to solve an engineering problem by using LabVIEW and Python together.

What You’ll Learn:

  • How Python’s machine learning libraries can simplify a hard engineering problem
  • How to extend an existing LabVIEW VI using Python analysis libraries
  • How to quickly bundle Python and LabVIEW code into an installable app

Who Should Watch:

  • Engineers and managers interested in extending LabVIEW with Python’s ecosystem
  • People who need to easily share and deploy software within their organization
  • Current LabVIEW users who are curious what Python brings to the table
  • Current Python users in organizations where LabVIEW is used

How LabVIEW users can benefit from Python:

  • High-level, general purpose programming language ideally suited to the needs of engineers, scientists, and analysts
  • Huge, international user base representing industries such as aerospace, automotive, manufacturing, military and defense, research and development, biotechnology, geoscience, electronics, and many more
  • Tens of thousands of available packages, ranging from advanced 3D visualization frameworks to nonlinear equation solvers
  • Simple, beginner-friendly syntax and fast learning curve

View Now  If you missed the live webcast, fill out the form to view the recording

Presenters:

Collin Draughon, Software Product Manager, National Instruments
Andrew Collette, Scientific Software Developer, Enthought
Python Integration Toolkit for LabVIEW core developer

FAQs and Additional Resources

Python Integration Toolkit for LabVIEW

Quickly and efficiently access scientific and engineering tools for signal processing, machine learning, image and array processing, web and cloud connectivity, and much more. With only minimal coding on the Python side, this extraordinarily simple interface provides access to all of Python’s capabilities.

  • What is the Python Integration Toolkit for LabVIEW?

The Python Integration Toolkit for LabVIEW provides a seamless bridge between Python and LabVIEW. With fast two-way communication between environments, your LabVIEW project can benefit from thousands of mature, well-tested software packages in the Python ecosystem.

Run Python and LabVIEW side by side, and exchange data live. Call Python functions directly from LabVIEW, and pass arrays and other numerical data natively. Automatic type conversion virtually eliminates the “boilerplate” code usually needed to communicate with non-LabVIEW components.

Develop and test your code quickly with Enthought Canopy, a complete integrated development environment and supported Python distribution included with the Toolkit.

  • What is LabVIEW?

LabVIEW is a software platform made by National Instruments, used widely in industries such as semiconductors, telecommunications, aerospace, manufacturing, electronics, and automotive for test and measurement applications. In August 2016, Enthought released the Python Integration Toolkit for LabVIEW, which is a “bridge” between the LabVIEW and Python environments.

  • Who is Enthought?

Enthought is a global leader in software, training, and consulting solutions using the Python programming language.

The post Webinar: Using Python and LabVIEW Together to Rapidly Solve Engineering Problems appeared first on Enthought Blog.

by admin at March 14, 2017 05:00 AM

numfocus

Technical preview: Native GPU programming with CUDAnative.jl (Julia)

This post originally appeared on the Julialang.org blog. 14 Mar 2017 | Tim Besard. After 2 years of slow but steady development, we would like to announce the first preview release of native GPU programming capabilities for Julia. You can now write your CUDA kernels in Julia, albeit with some restrictions, making it possible to use […]

by NumFOCUS Staff at March 14, 2017 12:00 AM

Some fun with π in Julia

This post originally appeared on the Julialang.org blog. Some fun with π in Julia. 14 Mar 2017 | Simon Byrne, Luis Benet and David Sanders. This post is available as a Jupyter notebook here. π in Julia (Simon Byrne): Like most technical languages, Julia provides a variable constant for π. However Julia’s handling is a […]

by NumFOCUS Staff at March 14, 2017 12:00 AM

March 13, 2017

Continuum Analytics news

Pi Day 2017: Why Celebrating Science & Mathematics is More Critical Than Ever

Tuesday, March 14, 2017
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

While Pi Day is typically about cheap pizza and other retail stunts, this year the day is being used by the tech community to influence industry leaders to stand up against new policies that affect the future. In an era of "alternative facts" and "fake news," it's more important than ever that data-driven projects are a priority for businesses and government bodies. Without it, the tech community risks losing decades of data-backed progress across the board.

The proof is undeniable - collecting, organizing and learning from data generated in today’s world improves problem solving for everyone (and I mean everyone). More and more people each day are pushing for an increasingly data-driven society, and at Continuum Analytics, we believe that data and Open Data Science empower people with the tools to solve the world’s greatest challenges—boosting tech diversity, treating rare diseases, eradicating human trafficking, predicting the effects of public policy and even improving city infrastructure.

This year, we’ve seen the technology industry stand together on issues it couldn’t have anticipated. We’ve heard tech leaders share their opinions on changing policies. So, on this very Pi Day, let’s celebrate those who are driven by science, mathematics and data, to make the world a better place. Oh, and let’s eat some pie.

Happy Pi Day, all!

by swebster at March 13, 2017 06:03 PM

March 09, 2017

Titus Brown

A draft bit of text on open science communities

This is early draft text that Anita and I put together from a bunch of brainstorming done at the Imagining Tomorrow's University workshop. Comments welcome!

Communities are the fabric of open research, and serve as the basis for development and sharing of best practices, building effective open source tools, and engaging with researchers newly interested in practicing open research. Effective communities often emerge from bottom up interactions, and can serve as a support network for individual open researchers. A few points:

  • These communities can consist of virtual clusters of likeminded individuals; they can include scholars, librarians, developers and tech staff or open research advocates at all levels of experience and with different backgrounds; the communities themselves can be short-lived and focused on a specific issue, tool, or approach, or they can have more long-term goals and aspirations.
  • A key defining feature of these groups is that the principles of open science permeate their practice, meaning they are inherently inclusive, and aim to open up the process of scholarly exploration to the widest possible audience.
  • We recommend that all stakeholders take steps to create an ecosystem that encourages these communities to develop. This means supporting common standards, funding "connective tissue" between different efforts, and sharing practices, tools, and people between communities

After collecting a series of narratives on effective and intentional approaches to creating, growing, and nurturing such communities, we recommended the following actions for different stakeholders to support the formation of adaptive and organic, bottom-up, distributed and open research communities:

Institutions:

  • Provide physical space and/or admin support for community interactions.
  • Recognize the need for explicit training in principles and practice of open research.
  • Explore what "design by a community" looks like in areas where it’s not traditional, e.g. (mechanical) engineering, to change views of what constitutes excellence in a discipline.
  • Reward incremental steps: provide incentives for aspects of open science (e.g. only share code, not data, or vv) then make it really easy to continue down a "sharing trajectory".

Funders:

  • Recognize how "disciplinary shackles" can hinder adoption of Open Science practices (e.g. development of common software/workflows and other community resources may not be respected as part of disciplinary work).
  • Award interdisciplinary and team efforts alongside or instead of individual competition. Inclusivity is a defining feature of Open Science, as are extensibility and reproducibility; the goal is not solely to further individual rewards but to facilitate the involvement of others: avoid lock-in economics and explore other reward methodologies.
  • Reward incremental steps: provide incentives for aspects of open science (e.g. only share code, not data, or vv) then make it really easy to continue down a "sharing trajectory"

Platforms and publishers:

  • Integrate training materials into platforms.
  • Support development of platform specialists inside institutions.
  • Start "pop-up" open science communities around e.g. datatype manipulation.
  • Build support for openness into tools.
  • Create communities around specific tools and practices; build norms and codes of conduct into these platforms endemically
  • Lower barriers of entry to sharing practices; tools can support the "automatic" creation of communities (cf. social media, where platforms help define communities, e.g. "My Facebook friends", "My Jupyter friends").

Community organizers:

  • Build openness into governance
  • Recognize the value of simple narratives for roping people into community participation.
  • Fund culture changers: hire people who are tasked with changing e.g. data dissemination processes and practices.
  • Praise incremental steps towards openness by community members.
  • Establish a code of conduct and community interaction expectations.

by Anita De Waard and C. Titus Brown at March 09, 2017 11:00 PM

Matthew Rocklin

Biased Benchmarks

This work is supported by Continuum Analytics the XDATA Program and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

Performing benchmarks to compare software is surprisingly difficult to do fairly, even assuming the best intentions of the author. Technical developers can fall victim to a few natural human failings:

  1. We judge other projects by our own objectives rather than the objectives under which that project was developed
  2. We fail to use other projects with the same expertise that we have for our own
  3. We naturally gravitate towards cases at which our project excels
  4. We improve our software during the benchmarking process
  5. We don’t release negative results

We discuss each of these failings in the context of current benchmarks I’m working on comparing Dask and Spark Dataframes.

Introduction

Last week I started comparing performance between Dask Dataframes (a project that I maintain) and Spark Dataframes (the current standard). My initial results showed that Dask.dataframes were overall much faster, something like 5x.

These results were wrong. They weren’t wrong in a factual sense, and the experiments that I ran were clear and reproducible, but there was so much bias in how I selected, set up, and ran those experiments that the end result was misleading. After checking results myself and then having other experts come in and check my results I now see much more sensible numbers. At the moment both projects are within a factor of two most of the time, with some interesting exceptions either way.

This blogpost outlines the ways in which library authors can fool themselves when performing benchmarks, using my recent experience as an anecdote. I hope that this encourages authors to check themselves, and encourages readers to be more critical of numbers that they see in the future.

This problem exists as well in academic research. For a pop-science rendition I recommend “The Experiment Experiment” on the Planet Money Podcast.

Skewed Objectives

Feature X is so important. I wonder how the competition fares?

Every project is developed with different applications in mind and so has different strengths and weaknesses. If we approach a benchmark caring only about our original objectives and dismissing the objectives of the other projects then we’ll very likely trick ourselves.

For example consider reading CSV files. Dask’s CSV reader is based off of Pandas’ CSV reader, which was the target of great effort and love; this is because CSV was so important to the finance community where Pandas grew up. Spark’s CSV solution is less awesome, but that’s less about the quality of Spark and more a statement about how Spark users tend not to use CSV. When they use text-based formats they’re much more likely to use line-delimited JSON, which is typical in Spark’s common use cases (web diagnostics, click logs, and so on). Pandas/Dask came from the scientific and finance worlds where CSV is king while Spark came from the web world where JSON reigns.

Conversely, Dask.dataframe hasn’t bothered to hook up the pandas.read_json function yet. Surprisingly, it rarely comes up. Both projects can correctly say that the other project’s solution to what they consider the standard text-based file format is less than awesome. Comparing performance here either way will likely lead to misguided conclusions.

So when benchmarking data ingestion maybe we look around a bit, see that both claim to support Parquet well, and use that as the basis for comparison.

Skewed Experience

Whoa, this other project has a lot of configuration parameters! Let’s just use the defaults.

Software is often easy to set up, but often requires experience to set up optimally. Authors are naturally more adept at setting up their own software than the software of their competition.

My original (and flawed) solution to this was to “just use the defaults” on both projects. Given my inability to tune Spark (there are several dozen parameters) I decided to also not tune Dask and run under default settings. I figured that this would be a good benchmark not only of the software, but also on choices for sane defaults, which is a good design principle in itself.

This failed spectacularly because I was making unconscious decisions like the size of machines that I was using for the experiment, CPU/memory ratios, and so on. It turns out that Spark’s defaults are optimized for very small machines (or more likely, small YARN containers) and use only 1GB of memory per executor by default while Dask is typically run on larger boxes or has the full use of a single machine in a single shared-memory process. My standard cluster configurations were biased towards Dask before I even considered running a benchmark.

Similarly, the APIs of software projects are complex, and for any given problem there is often both a fast way and a general-but-slow way. Authors naturally choose the fast way on their own system but inadvertently choose the general way that comes up first when reading documentation for the other project. It often takes months of hands-on experience to understand a project well enough to say definitively that you’re not doing things in a dumb way.

In both cases I think the only solution is to collaborate with someone that primarily uses the other system.

Preference towards strengths

Oh hey, we’re doing really well here. This is great! Let’s dive into this a bit more.

It feels great to see your project doing well. This emotional pleasure response is powerful. It’s only natural that we pursue that feeling more, exploring different aspects of it. This can skew our writing as well. We’ll find that we’ve decided to devote 80% of the text to what originally seemed like a small set of features, but which now seems like the main point.

It’s important that we define a set of things we’re going to study ahead of time and then stick to those things. When we run into cases where our project fails we should take that as an opportunity to raise an issue for future (though not current) development.

Tuning during experimentation

Oh, I know why this is slow. One sec, let me change something in the code.

I’m doing this right now. Dask dataframe shuffles are generally slower than Spark dataframe shuffles. On numeric data this used to be around a 2x difference, now it’s more like a 1.2x difference (at least on my current problem and machine). Overall this is great, seeing that another project was beating Dask motivated me to dive in (see dask/distributed #932) and this will result in a better experience for users in the future. As a developer this is also how I operate. I define a benchmark, profile my code, identify bottlenecks, and optimize. Business as usual.

However as an author of a comparative benchmark this is also somewhat dishonest; I’m not giving the Spark developers the same opportunity to find and fix similar performance issues in their software before I publish my results. I’m also giving a biased picture to my readers. I’ve made all of the pieces that I’m going to show off fast while neglecting the others. Picking benchmarks, optimizing the project to make them fast, and then publishing those results gives the incorrect impression that the entire project has been optimized to that level.

Omission

So, this didn’t go as planned. Let’s wait a few months until the next release.

There is no motivation to publish negative results. Unless of course you’ve just written a blogpost announcing that you plan to release benchmarks in the near future. Then you’re really forced to release numbers, even if they’re mixed.

That’s ok. Mixed numbers can be informative. They build trust and community. And we all talk about open source community driven software, so these should be welcome.

Straight up bias

Look, we’re not in grad-school any more. We’ve got to convince companies to actually use this stuff.

Everything we’ve discussed so far assumes best intentions, and that the author is acting in good faith, but falling victim to basic human failings.

However many developers today (including myself) are paid and work for for-profit companies that need to make money. To an increasing extent making this money depends on community mindshare, which means publishing benchmarks that sway users to our software. Authors have bosses that they’re trying to impress or the content and tone of an article may be influenced by people within the company other than the stated author.

I’ve been pretty lucky working with Continuum Analytics (my employer) in that they’ve been pretty hands-off with technical writing. For other employers that may be reading: in some cases we’ve actually had an easier time getting business because of the honest tone in these blogposts. Potential clients generally have the sense that we’re trustworthy.

Technical honesty goes a surprisingly long way towards implying technical proficiency.

March 09, 2017 12:00 AM

March 07, 2017

Continuum Analytics news

Self-Service Open Data Science: Custom Anaconda Management Packs for Hortonworks HDP and Apache Ambari

Monday, March 6, 2017
Kristopher Overholt
Product Manager

Daniel Rodriguez
Continuum Analytics

As part of our partnership with Hortonworks, we’re excited to announce a new self-service feature of the Anaconda platform that can be used to generate custom Anaconda management packs for the Hortonworks Data Platform (HDP) and Apache Ambari. This functionality is now available in the Anaconda platform as part of the Anaconda Scale and Anaconda Repository platform components.

The ability to generate custom Anaconda management packs makes it easy for system administrators to provide data scientists and analysts with the data science libraries from Anaconda that they already know and love. The custom management packs allow Anaconda to integrate with a Hortonworks HDP cluster along with Hadoop, Spark, Jupyter Notebooks, and Apache Zeppelin.

Data scientists working with big data workloads want to use different versions of Anaconda, Python, R, and custom conda packages on their Hortonworks HDP clusters. Using custom management packs to manage and distribute multiple Anaconda installations across a Hortonworks HDP cluster is convenient because they work natively with Hortonworks HDP 2.3, 2.4, and 2.5+ and Ambari 2.2 and 2.4+ without the need to install additional software or services on the HDP cluster nodes.

Deploying multiple custom versions of Anaconda on a Hortonworks HDP cluster with Hadoop and Spark has never been easier! In this blog post, we’ll take a closer look at how we can create and install a custom Anaconda management pack using Anaconda Repository and Ambari, and then configure and run PySpark jobs in notebooks, including Jupyter and Zeppelin.

Generating Custom Anaconda Management Packs for Hortonworks HDP

For this example, we’ve installed Anaconda Repository (which is part of the Anaconda Enterprise subscription) and created an on-premises mirror of more than 730 conda packages that are available in the Anaconda distribution and repository. We’ve also installed Hortonworks HDP 2.5.3 along with Ambari 2.4.2, Spark 1.6.2, Zeppelin 0.6.0, and Jupyter 4.3.1 on a cluster.

In Anaconda Repository, we can see a feature for Installers, which can be used to generate custom Anaconda management packs for Hortonworks HDP.

 

 

The Installers page describes how we can create custom Anaconda management packs for Hortonworks HDP that are served directly by Anaconda Repository from a URL.

 

 

After selecting the Create New Installer button, we can then specify the packages that we want to include in our custom Anaconda management pack, which we’ll name anaconda_hdp.

Then, we specify the latest version of Anaconda (4.3.0) and Python 2.7. We’ve added the anaconda package to include all of the conda packages that are included by default in the Anaconda installer. Specifying the anaconda package is optional, but it’s a great way to kickstart your custom Anaconda management pack with more than 200 of the most popular Open Data Science packages, including NumPy, Pandas, SciPy, matplotlib, scikit-learn and more.

 

 

In addition to the packages available in Anaconda, additional Python and R conda packages can be included in the custom management pack, including libraries for natural language processing, visualization, data I/O and other data analytics libraries such as azure, bcolz, boto3, datashader, distributed, gensim, hdfs3, holoviews, impyla, seaborn, spacy, tensorflow or xarray.

We could have also included conda packages from other channels in our on-premises installation of Anaconda Repository, including community-built packages from conda-forge or other custom-built conda packages from different users within our organization.

When you’re ready to generate the custom Anaconda management pack, press the Create Management Pack button.

After creating the custom Anaconda management pack, we’ll see a list of files that were generated, including the management pack file that can be used to install Anaconda with Hortonworks HDP and Ambari.

 

 

You can install the custom management pack directly from the HDP node running the Ambari server using a URL provided by Anaconda Repository. Alternatively, the anaconda_hdp-mpack-1.0.0.tar.gz file can be manually downloaded and transferred to the Hortonworks HDP cluster for installation.

Now we’re ready to install the newly created custom Anaconda management pack using Ambari.

Installing Custom Anaconda Management Packs Using Ambari

Now that we’ve generated a custom Anaconda management pack, we can install it on our Hortonworks HDP cluster and make it available to all of the HDP cluster users for PySpark and SparkR jobs.

The management pack can be installed into Ambari by using the following command on the machine running the Ambari server.

# ambari-server install-mpack --mpack=http://54.211.228.253:8080/anaconda/installers/anaconda/download/1.0.0/anaconda-mpack-1.0.0.tar.gz
Using python  /usr/bin/python
Installing management pack
Ambari Server 'install-mpack' completed successfully.

After installing a management pack, the Ambari server must be restarted:

# ambari-server restart

After the Ambari server restarts, navigate to the Ambari Cluster Dashboard UI in a browser:

 

 

Scroll down to the bottom of the list of services on the left sidebar, then click on the Actions > Add Services button:

 

 

This will open the Add Service Wizard:

 

 

In the Add Service Wizard, you can scroll down in the list of services until you see the name of the custom Anaconda management pack that you installed. Select the custom Anaconda management pack and click the Next button:

 

 

On the Assign Slaves and Clients screen, select the Client checkbox for each HDP node that you want to install the custom Anaconda management pack onto, then click the Next button:

 

 

On the Review screen, review the proposed configuration changes, then click the Deploy button:

 

 

Over the next few minutes, the custom Anaconda management pack will be distributed and installed across the HDP cluster:

 

 

And you’re done! The custom Anaconda management pack has installed Anaconda in /opt/continuum/anaconda on each HDP node that you selected, and Anaconda is active and ready to be used by Spark or other distributed frameworks across your Hortonworks HDP cluster.

Refer to the Ambari documentation for more information about using Ambari server with management packs, and refer to the HDP documentation for more information about using and administering your Hortonworks HDP cluster with Ambari.

Using the Custom Anaconda Management Pack with spark-submit

Now that we’ve generated and installed the custom Anaconda management pack, we can use libraries from Anaconda with Spark, PySpark, SparkR or other distributed frameworks.

You can use the spark-submit command along with the PYSPARK_PYTHON environment variable to run Spark jobs that use libraries from Anaconda across the HDP cluster, for example:

$ PYSPARK_PYTHON=/opt/continuum/anaconda/bin/python spark-submit pyspark_script.py
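
As a concrete illustration, here is a minimal sketch of what such a pyspark_script.py might contain; the script name, the application name and the use of NumPy are purely illustrative and simply exercise a library shipped in the custom Anaconda installation:

# pyspark_script.py (illustrative sketch, not part of the management pack)
from pyspark import SparkConf, SparkContext
import numpy as np

conf = SparkConf().setAppName("anaconda-mpack-test")
sc = SparkContext(conf=conf)

# The lambda runs on the executors, which use the Anaconda-provided NumPy
# because PYSPARK_PYTHON points at /opt/continuum/anaconda/bin/python.
total = sc.parallelize(range(1000)).map(lambda x: float(np.sqrt(x))).sum()
print(total)

sc.stop()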

 

Using the Custom Anaconda Management Pack with Jupyter

To work with Spark jobs interactively on the Hortonworks HDP cluster, you can use Jupyter Notebooks via Anaconda Enterprise Notebooks, which is a multi-user notebook server with collaborative features for your data science team and integration with enterprise authentication. Refer to our previous blog post on Using Anaconda with PySpark for Distributed Language Processing on a Hadoop Cluster for more information about configuring Jupyter with PySpark.
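
As a rough sketch (not taken from that post), a notebook cell can point PySpark at the Anaconda installation deployed by the management pack before the SparkContext is created; the application name and master setting below are illustrative:

import os

# Make the executors use the Anaconda deployed by the management pack
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"

from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("anaconda-notebook-test").setMaster("yarn-client")
sc = SparkContext(conf=conf)

# Any Anaconda library is now available inside the Spark tasks
import numpy as np
print(sc.parallelize(range(100)).map(lambda x: float(np.sqrt(x))).sum())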

 

 

Using the Custom Anaconda Management Pack with Zeppelin

You can also use Anaconda with Zeppelin on your HDP cluster. In HDP 2.5 and Zeppelin 0.6, you’ll need to configure Zeppelin to point to the custom version of Anaconda installed on the HDP cluster by navigating to Zeppelin Notebook > Configs > Advanced zeppelin-env in the Ambari Cluster Dashboard UI in your browser:

 

 

Scroll down to the zeppelin_env_content property, then uncomment and set the following line to match the location of Anaconda on your HDP cluster nodes:

export PYSPARK_PYTHON="/opt/continuum/anaconda/bin/python"

 

 

Then restart the Zeppelin service when prompted.

You should also configure the zeppelin.pyspark.python property in the Zeppelin PySpark interpreter to point to Anaconda (/opt/continuum/anaconda/bin/python):

 

 

Then restart the Zeppelin interpreter when prompted. Note that the PySpark interpreter configuration process will be improved and centralized in Zeppelin in a future version.

Once you’ve configured Zeppelin to point to the location of Anaconda on your HDP cluster, data scientists can run interactive Zeppelin notebooks with Anaconda and use all of the data science libraries they know and love in Anaconda with their PySpark and SparkR jobs:

 

 

Get Started with Custom Anaconda Management Packs for Hortonworks in Your Enterprise

If you’re interested in generating custom Anaconda management packs for Hortonworks HDP and Ambari to empower your data science team, we can help! Get in touch with us by using our contact us page for more information about this functionality and our enterprise Anaconda platform subscriptions.

If you’d like to test-drive the enterprise features of Anaconda on a bare-metal, on-premises or cloud-based cluster, please contact us at sales@continuum.io.

by ryanwh at March 07, 2017 08:12 PM

Matthieu Brucher

Announcement: Audio Unit updates

I’m happy to announce the updates of all OS X plugins based on the Audio Toolkit. They are available on OS X (min. 10.11) in AU, VST2 and VST3 formats.

This update follows several reports of these plugins failing in Logic Pro. This should now be fixed.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

Buy Me a Coffee!

by Matt at March 07, 2017 08:17 AM

March 06, 2017

Fernando Perez

"Literate computing" and computational reproducibility: IPython in the age of data-driven journalism

As "software eats the world" and we become awash in the flood of quantitative information denoted by the "Big Data" buzzword, it's clear that informed debate in society will increasingly depend on our ability to communicate information that is based on data. And for this communication to be a truly effective dialog, it is necessary that the arguments made based on data can be deconstructed, analyzed, rebutted or expanded by others. Since these arguments in practice often rely critically on the execution of code (whether an Excel spreadsheet or a proper program), it means that we really need tools to effectively communicate narratives that combine code, data and the interpretation of the results.

I will point out here two recent examples, taken from events in the news this week, where IPython has helped this kind of discussion, in the hopes that it can motivate a more informed style of debate where all the moving parts of a quantitative argument are available to all participants.

Insight, not numbers: from literate programming to literate computing

The computing community has for decades known about the "literate programming" paradigm introduced by Don Knuth in the 70's and fully formalized in his famous 1992 book. Briefly, Knuth's approach proposes writing computer programs in a format that mixes the code and a textual narrative together, and from this format generating separate files that will contain either an actual code that can be compiled/executed by the computer, or a narrative document that explains the program and is meant for human consumption. The idea is that by allowing the authors to maintain a close connection between code and narrative, a number of benefits will ensue (clearer code, less programming errors, more meaningful descriptions than mere comments embedded in the code, etc).

I don't take any issue with this approach per se, but I don't personally use it because it's not very well suited to the kinds of workflows that I need in practice. These require the frequent execution of small fragments of code, in an iterative cycle where code is run to obtain partial results that inform the next bit of code to be written. Such is the nature of interactive exploratory computing, which is the bread and butter of many practicing scientists. This is the kind of workflow that led me to creating IPython over a decade ago, and it continues to inform basically every decision we make in the project today.

As Hamming famously said in 1962, "The purpose of computing is insight, not numbers.". IPython tries to help precisely in this kind of usage pattern of the computer, in contexts where there is no clear notion in advance of what needs to be done, so the user is the one driving the computation. However, IPython also tries to provide a way to capture this process, and this is where we join back with the discussion above: while LP focuses on providing a narrative description of the structure of an algorithm, our working paradigm is one where the act of computing occupies the center stage.

From this perspective, we therefore refer to the workflow exposed by these kinds of computational notebooks (not just IPython, but also Sage, Mathematica and others), as "literate computing": it is the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components. For the goals of communicating results in scientific computing and data analysis, I think this model is a better fit than the literate programming one, which is rather aimed at developing software in tight concert with its design and explanatory documentation. I should note that we have some ideas on how to make IPython stronger as a tool for "traditional" literate programming, but it's a bit early for us to focus on that, as we first want to solidify the computational workflows possible with IPython.

As I mentioned in a previous blog post about the history of the IPython notebook, the idea of a computational notebook is not new nor ours. Several IPython developers used extensively other similar systems from a long time and we took lots of inspiration from them. What we have tried to do, however, is to take a fresh look at these ideas, so that we can build a computational notebook that provides the best possible experience for computational work today. That means taking the existence of the Internet as a given in terms of using web technologies, an architecture based on well-specified protocols and reusable low-level formats (JSON), a language-agnostic view of the problem and a concern about the entire cycle of computing from the beginning. We want to build a tool that is just as good for individual experimentation as it is for collaboration, communication, publication and education.

Government debt, economic growth and a buggy Excel spreadsheet: the code behind the politics of fiscal austerity

In the last few years, extraordinarily contentious debates have raged in the circles of political power and fiscal decision making around the world, regarding the relation between government debt and economic growth. One of the centerpieces of this debate was a paper from Harvard economists C. Reinhart and K. Rogoff, later turned into a best-selling book, that argued that beyond 90% debt ratios, economic growth would plummet precipitously.

This argument was used (amongst others) by politicians to justify some of the extreme austerity policies that have been foisted upon many countries in the last few years. On April 15, a team of researchers from U. Massachusetts published a re-analysis of the original data where they showed how Reinhart and Rogoff had made both fairly obvious coding errors in their original Excel spreadsheets as well as some statistically questionable manipulations of the data. Herndon, Ash and Pollin (the U. Mass authors) published all their scripts in R so that others could inspect their calculations.

Two posts from the Economist and the Roosevelt Institute nicely summarize the story with a more informed policy and economics discussion than I can make. James Kwak has a series of posts that dive into technical detail and question the horrible choice of using Excel, a tool that should for all intents and purposes be banned from serious research as it entangles code and data in ways that more or less guarantee serious errors in anything but trivial scenarios. Victoria Stodden just wrote an excellent new post with specific guidance on practices for better reproducibility; here I want to take a narrow view of these same questions focusing strictly on the tools.

As reported in Mike Konczal's piece at the Roosevelt Institute, Herndon et al. had to reach out to Reinhart and Rogoff for the original code, which hadn't been made available before (apparently causing much frustration in economics circles). It's absolutely unacceptable that major policy decisions that impact millions worldwide had until now hinged effectively on the unverified word of two scientists: no matter how competent or honorable they may be, we know everybody makes mistakes, and in this case there were both egregious errors and debatable assumptions. As Konczal says, "all I can hope is that future historians note that one of the core empirical points providing the intellectual foundation for the global move to austerity in the early 2010s was based on someone accidentally not updating a row formula in Excel." To that I would add the obvious: this should never have happened in the first place, as we should have been able to inspect that code and data from the start.

Now, moving over to IPython, something interesting happened: when I saw the report about the Herndon et al. paper and realized they had published their R scripts for all to see, I posted this request on Twitter:

It seemed to me that the obvious thing to do would be to create a document that explained together the analysis and a bit of narrative using IPython, hopefully more easily used as a starting point for further discussion. What I didn't really expect is that it would take less than three hours for Vincent Arel-Bundock, a PhD Student in Political Science at U. Michigan, to come through with a solution:

I suggested that he turn this example into a proper repository on github with the code and data, which he quickly did:

So now we have a full IPython notebook, kept in a proper github repository. This repository can enable an informed debate about the statistical methodologies used for the analysis, and now anyone who simply installs the SciPy stack can not only run the code as-is, but explore new directions and contribute to the debate in a properly informed way.

On to the heavens: the New York Times' infographic on NASA's Kepler mission

As I was discussing the above with Vincent on Twitter, I came across this post by Jonathan Corum, an information designer who works as NY Times science graphics editor:

The post links to a gorgeous, animated infographic that summarizes the results that NASA's Kepler spacecraft has obtained so far, and which accompanies a full article at the NYT on Kepler's most recent results: a pair of planets that seem to have just the right features to possibly support life, a quick 1200 light-years hop from us.

Jonathan indicated that he converted his notebook to a Python script later on for version control and automation, though I explained to him that he could have continued using the notebook, since the --script flag would give him a .py file if needed, and it's also possible to execute a notebook just like a script, with a bit of additional support code:
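
The support code he mentions isn't reproduced here, but to illustrate the idea, something along these lines (a rough sketch, not the actual code) is enough to run a notebook's code cells as if it were a script:

import json, sys

def run_notebook(path):
    # Minimal sketch: load the notebook's JSON and exec its code cells
    # in a shared namespace, roughly like running a script.
    with open(path) as f:
        nb = json.load(f)
    cells = nb.get("cells") or nb["worksheets"][0]["cells"]  # newer vs older notebook format
    ns = {}
    for cell in cells:
        if cell["cell_type"] == "code":
            source = "".join(cell.get("source") or cell.get("input", ""))
            exec(source, ns)
    return ns

if __name__ == "__main__":
    run_notebook(sys.argv[1])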

In this case Jonathan's code isn't publicly available, but I am still very happy to see this kind of usage: it's a step in the right direction already and as more of this analysis is done with open-source tools, we move further towards the possibility of an informed discussion around data-driven journalism.

I also hope he'll release perhaps some of the code later on, so that others can build upon it for similar analyses. I'm sure lots of people would be interested and it wouldn't detract in any way from the interest in his own work which is strongly tied to the rest of the NYT editorial resources and strengths.

Looking ahead from IPython's perspective

Our job with IPython is to think deeply about questions regarding the intersection of computing, data and science, but it's clear to me at this point that we can contribute in contexts beyond pure scientific research. I hope we'll be able to provide folks who have a direct intersection with the public, such as journalists, with tools that help a more informed and productive debate.

Coincidentally, UC Berkeley will be hosting on May 4 a symposium on data and journalism, and in recent days I've had very productive interactions with folks in this space on campus. Cathryn Carson currently directs the newly formed D-Lab, whose focus is precisely the use of quantitative and data methods in the social sciences, and her team has recently been teaching workshops on using Python and R for social scientists. And just last week I lectured in Raymond Yee's course (from the School of Information) where they are using the notebook extensively, following Wes McKinney's excellent Python for Data Analysis as the class textbook. Given all this, I'm fairly optimistic about the future of a productive dialog and collaborations on campus, given that we have a lot of the IPython team working full-time here.

Note: as usual, this post is available as an IPython notebook in my blog repo.

by Fernando Perez (noreply@blogger.com) at March 06, 2017 09:19 PM

The IPython notebook: a historical retrospective

On December 21 2011, we released IPython 0.12 after an intense 4 1/2 months of development.  Along with a number of new features and bug fixes, the main highlight of this release is our new browser-based interactive notebook: an environment that retains all the features of the familiar console-based IPython but provides a cell-based execution workflow and can contain not only code but any element a modern browser can display.  This means you can create interactive computational documents that contain explanatory text (including LaTeX equations rendered in-browser via MathJax), results of computations, figures, video and more.  These documents are stored in a version-control-friendly JSON format that is easy to export as a pure Python script, reStructuredText, LaTeX or HTML.

For the IPython project this was a major milestone, as we had wanted for years to have such a system, and it has generated a fair amount of interest online. In particular, on our mailing list a user asked us about the relationship between this effort and the well-known and highly capable Sage Notebook.  In responding to the question, I ended up writing up a fairly detailed retrospective of our path to get to the IPython notebook, and it seemed like a good idea to put this up as a blog post to encourage discussion beyond the space of a mailing list, so here it goes (the original email that formed the base of this post, in case anyone is curious about the context).

The question that was originally posed by Oleg Mikulchenklo was: What is the relation and comparison between the IPython notebook and the Sage notebook? Can someone provide motivation and roadmap for the IPython notebook as an alternative to the Sage notebook?  I'll try to answer that now...

Early efforts: 2001-2005

Let me provide some perspective on this, since it's a valid question that is probably in the minds of others as well.  This is a long post, but I'm trying to do justice to over 10 years of development, multiple interactions between the two projects and the contributions of many people.  I apologize in advance to anyone I've forgotten, and please do correct me in the comments, as I want to have a full record that's reasonably trustworthy.

Let's go back to the beginning: when I started IPython in late 2001, I was a graduate student in physics at CU Boulder, and had used extensively first Maple, then Mathematica, both of which have notebook environments.  I also used Pascal (earlier) then C/C++, but those two (plus IDL for numerics) were the interactive environments that I knew well, and my experience with them shaped my views on what a good system for everyday scientific computing should look like.  In particular, I was a heavy user of the Mathematica notebooks and liked them a lot.

I started using Python in 2001 and liked the language, but its interactive prompt felt like a crippled toy compared to the systems mentioned above or to a Unix shell.  When I found out about sys.displayhook, I realized that by putting in a callable object, I would be able to hold state and capture previous results for reuse.  I then wrote a python startup file to provide these features and some other niceties such as loading Numeric and Gnuplot, giving me a 'mini-mathematica' in Python (femto- might be a better description, in fairness).  Thus was my 'ipython-0.0.1' born, a mere 259 lines to be loaded as $PYTHONSTARTUP.
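
For readers curious what that hack looked like in spirit, here is a tiny reconstruction of the idea (not the original code): a callable object installed as sys.displayhook that keeps state and caches previous results.

import sys

class CachingDisplayHook(object):
    """Sketch of a displayhook that numbers and caches results, like Out[n]."""
    def __init__(self):
        self.outputs = {}
        self.counter = 0

    def __call__(self, value):
        if value is None:
            return
        self.counter += 1
        self.outputs[self.counter] = value  # keep the result around for reuse
        print("Out[%d]: %r" % (self.counter, value))

sys.displayhook = CachingDisplayHook()  # only triggered at an interactive prompt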

I also read an article that mentioned two good interactive systems for Python, LazyPython and IPP, not surprisingly also created by scientists.  I say this because the natural flow of scientific computing pretty much mandates a solid interactive environment, so while other Python users and developers may like having occasional access to interactive facilities, scientists more or less demand them.  I contacted their authors,  Nathan Gray and Janko Hauser, seeking to join forces to create IPython;  they were both very gracious and let me use their code, but didn't have the time to participate in the effort.  As any self-respecting graduate student with a dissertation deadline looming would do, I threw myself full-time into building the first 'real' IPython by merging my code with both of theirs (eventually I did graduate, by the way).

The point of this little trip down memory lane is to show how from the very beginning, Mathematica and its notebooks (and the Maple worksheets before) were in my mind as the ideal environment for daily scientific work. In 2005 we had two Google SoC students and we took a stab at building, using Wx, a notebook system.  Robert Kern then put some more work into the problem, but unfortunately that prototype never really became fully usable.

Sage bursts into the scene

In early 2006, William Stein organized the first Sage Days at UCSD and invited me; William and I had been in touch since 2005 as he was using IPython for the Sage terminal interface.  I  suggested Robert Kern come as well, and he demoed the notebook prototype he had at that point. It was very clear that the system wasn't production ready, and William was already starting to think about a notebook-like system for Sage as well. Eventually he started working on a browser-based system, and by Sage Days 2 in October 2006, as shown by the coding sprint topics, the Sage notebook was already usable.

For Sage, going at it separately was completely reasonable and justified: we were moving slowly and by that point we weren't even convinced the Wx approach would go anywhere. William is a force of nature and was trying to get Sage to be very usable very fast, so building something integrated for his needs was certainly the right choice.

We continued slowly working on IPython, and actually had another attempt at a notebook-type system in 2006-2007. By that point Brian Granger and Min Ragan-Kelley had come on board and we had built the Twisted-based parallel tools. Using this, Min got a notebook prototype working using an SQL/SQLAlchemy backend.  We had the opportunity to work on many of these ideas during a workshop on Interactive Parallel Computation that William and I co-organized (along with others).  Like Sage, this prototype used a browser for the client but it tried to retain the 'IPython experience', something the Sage notebook didn't provide.

Keeping the IPython experience in the notebook

This is a key difference between our approach and the Sage notebook, so it's worth clarifying what I mean, the key point being the execution model and its relation to the filesystem.  The Sage notebook took the route of using the filesystem for notebook operations, so you can't meaningfully use 'ls' in it or move around the filesystem yourself with 'cd', because Sage will always execute your code in hidden directories with each cell actually being a separate subdirectory.  This is a perfectly valid approach and has a number of very good consequences for the Sage notebook, but it is also very different from the IPython model where we always keep the user very close to the filesystem and OS.  For us, it's really important that you can access local scripts, use %run, see arbitrary files conveniently, etc., as these are routine needs in data analysis and numerical simulation.

Furthermore, we wanted a notebook that would provide the entire IPython experience, meaning that magics, aliases, syntax extensions and all other special IPython features worked the same in the notebook and terminal.  The Sage notebook reimplemented some of these things in its own way: they reused the % syntax but it has a different meaning, they took some of the IPython introspection code and built their own x?/?? object introspection system, etc. In some cases it's almost like IPython but in others the behavior is fairly different; this is fine for Sage but doesn't work for us.

So we continued with our own efforts, even though by then the Sage notebook was fairly mature.  For a number of reasons (I honestly don't recall all the details), Min's browser-based notebook prototype also never reached production quality.

Breaking through our bottleneck and ZeroMQ

Eventually, in the summer of 2009 we were able to fund Brian to work full-time on IPython, thanks to Matthew Brett and Jarrod Millman, with resources from the NiPy project.  Brian could then dig into the heart of the beast, and attack the fundamental problem that made IPython development so slow and hard: the fact that the main codebase was an outgrowth of that original merge from 2001 of my hack, IPP and LazyPython, by now having become an incomprehensible and terribly interconnected mess with barely any test suite.  Brian was able to devote a summer full-time to dismantling these pieces and reassembling them so that they would continue to work as before (with only minimal regressions), but now in a vastly more approachable and cleanly modularized codebase.

This is where early 2010 found us, and then serendipity struck: while on a month-long teaching trip to Colombia I read an article about ZeroMQ and talked to Brian about it, as it seemed to provide the right abstractions for us with a simpler model than Twisted.  Brian then blew me away, coming back in two days with a new set of clean Cython-based bindings: we now had pyzmq! It became clear that we had the right tools to build a two-process implementation of IPython that could give us the 'real IPython' but communicating with a different frontend, and this is precisely what we wanted for cleaner parallel computing, multiprocess clients and a notebook.

When I returned from Colombia I had a free weekend and drove down from Berkeley to San Luis Obispo.  Upon arriving at Brian's place I didn't even have zeromq installed nor had I read any docs about it.  I installed it, and Brian simply told me what to type in IPython to import the library and open a socket, while he had another one open on his laptop.  We then started exchanging messages from our IPython sessions.  The fact that we could be up and running this fast was a good sign that the library was exactly what we wanted.  We coded frantically in parallel: one of us wrote the kernel and the other the client, and we'd debug one of them while leaving the other running in the meantime.  It was the perfect blend of pair programming and simultaneous development, and in just two days we had a prototype of a python shell over zmq working, proving that we could indeed build everything we needed.  Incidentally, that code may still be useful to someone wanting to understand our basic ideas or how to build an interactive client over ZeroMQ, so I've posted it for reference as a standalone github repository.
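
For the curious, the kind of exchange described above takes only a few lines of pyzmq; here is a rough reconstruction (not the actual prototype) of one side of that session, with the other laptop connecting a REQ socket to the same port and sending messages:

import zmq

# One side of the session: bind a REP socket and echo whatever arrives.
context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555")

while True:
    message = socket.recv()            # blocks until the other side sends
    print("received: %r" % message)
    socket.send(b"echo: " + message)   # reply so the REQ side can continue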

Shortly thereafter, we had discussions with Eric Jones and Travis Oliphant at Enthought, who offered to support Brian and I to work in collaboration with Evan Patterson, and build a Qt console for IPython using this new design. Our little weekend prototype had been just a proof of concept, but their support allowed us to spend the time necessary to apply the same ideas to the real IPython. Brian and I would build a zeromq kernel with all the IPython functionality, while Evan built a Qt console that would drive it using our communications protocol.  This worked extremely well, and by late 2010 we had a more or less complete Qt console working:



Over the summer of 2010, Omar Zapata and Gerardo Gutierrez worked as part of the Google Summer of Code project and started building both terminal- and Qt-based clients for IPython on top of ZeroMQ.  Their task was made much harder because we hadn't yet refactored all of IPython to use zmq, but the work they did provided critical understanding of the problem at this point, and eventually by 0.12 much of it has been finally merged.

The value and correctness of this architecture became clear when Brian, Min and I met with the Enthought folks and Shahrokh Mortazavi and Dino Viehland from Microsoft.  After a single session explaining to Dino and Shahrokh our design and pointing them to our github repository, they were able to build support for IPython into the new Python Tools for Visual Studio, without ever asking us a single question:


In October 2010 James Gao (a Berkeley neuroscience graduate student) wrote up a quick prototype of a web notebook, demonstrating again that this design really worked well and could be easily used by a completely different client:


And finally, in the summer of 2011 Brian took James' prototype and built up a fully working system, this time using websockets, the Tornado web server, JQuery for Javascript, CodeMirror for code editing, and MathJax for LaTeX rendering.  Ironically, we had looked at Tornado in early 2010 along with ZeroMQ as a candidate for our communications, but dismissed it as it wasn't really the tool for that job; it now turned out to be the perfect fit for an asynchronous http server with Websockets support.

We merged Brian's work in late August while working on IRC from a boarding area at the San Francisco airport, just in time for me to present it at the EuroSciPy 2011 conference.  We then polished it over the next few months to finally release it as part of IPython 0.12:



Other differences with the Sage notebook

We deliberately wrote the IPython notebook to be a lightweight, single-user program that feels like any other local application.  The Sage notebook draws many parallels with the Google Docs model, by default requiring a login and showing all of your notebooks together, kept in a location separate from the rest of your files.  In contrast, we want the notebook to just start like any other program and for the ipynb files to be part of your normal workflow, ready to be version-controlled just like any other file, stored in your normal folders and easy to manage on their own. Update: as noted by Jason Grout, the Sage notebook was designed from the start to scale to big centralized multi-user servers (sagenb.org, with about 76,000 accounts, is a good example).  The notebook that runs in the local user's computer is the same as the one in these large public servers.

There are other deliberate differences of interface and workflow:

  • We keep our In/Out prompts explicit because we have an entire system of caching variables that uses those numbers, and because those numbers give the user a visual clue of the execution order of cells, which may differ from the document's order.
  • We deliberately chose a structured JSON format for our documents. It's clear enough for human reading while allowing easy and powerful machine manipulation without having to write our own parsing.  So writing utilities like a reStructuredText or LaTeX converter is very easy, as we recently showed (there is a short sketch of this kind of manipulation right after this list).
  • Our move to zmq allowed us (thanks to Thomas Kluyver's tireless work) to ship the notebook working both on Python2 and Python3 out of the box.  The current version of the  Sage notebook only works on Python2, in part due to its use of Twisted.  Update: William pointed out to me that the upcoming 5.0 version of the notebook will have a vastly reduced dependency on Twisted, so this will soon be less of an issue for Sage.
  • Because our notebook works in the normal filesystem, and lets you create .py files right next to the .ipynb just by passing --script at startup, you can reuse your notebooks like normal scripts, import one notebook from another or a normal python script, etc.  I'm not sure how to import a Sage notebook from a normal python file, or if it's even possible.
  • We have a long list of plans for the document format: multi-sheet capabilities, LaTeX-style preamble, per-cell metadata, structural cells to allow outline-level navigation and manipulation such as in LyX, improved literate programming and validation/reproducibility support, ... For that, we need to control the document format ourselves so we can evolve it according to our needs and ideas.
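
As a small illustration of the "easy machine manipulation" point from the list above, here is a sketch that pulls all code cells out of a notebook using only the standard library; the file name is hypothetical and the exact field names depend on the nbformat version, so treat it as a sketch rather than a reference:

# Extract all code from a notebook with nothing but the standard library.
# "example.ipynb" is a hypothetical file; field names vary across nbformat
# versions, so this handles two common layouts.
import json

with open("example.ipynb") as f:
    nb = json.load(f)

# Newer notebooks keep cells at the top level; older ones nest them in worksheets.
cells = nb.get("cells") or nb["worksheets"][0]["cells"]

for cell in cells:
    if cell["cell_type"] == "code":
        source = cell.get("source", cell.get("input", ""))
        print("".join(source))   # the source may be a list of lines or a string
        print("# ---")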

As you see, there are indeed a number of key differences between our notebook and the sage one, but there are very good technical reasons for this.  The notebook integrates with our architecture and leverages it; you can for example use the interactive debugger via a console or qtconsole against a notebook kernel, something not possible with the sage notebook.

In addition, Sage is GPL-licensed while IPython is BSD-licensed.  This means we cannot directly reuse their code, though when we have asked them to relicense specific pieces of code to us, they have always agreed to do so. But large-scale reuse of Sage code in IPython is not really viable.

The value of being the slowest in the race

As this long story shows, it has taken us a very long time to get here. But what we have now makes a lot of sense for us, even considering the existence of the Sage notebook and how good it is for many use cases. Our notebook is just one particular aspect of a large and rich architecture built around the concept of a Python interpreter abstracted over a JSON-based, explicitly defined communications protocol.  Even considering purely http clients, the notebook is still just one of many possible: you can easily build an interface that only evaluates a single cell with a tiny bit of javascript like the Sage single cell server, for example.
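
To make the "explicitly defined communications protocol" point a bit more concrete, here is roughly what an execution request looks like on the wire; the field names below are paraphrased and the protocol has evolved over time, so consult the official messaging documentation rather than this sketch for the exact format:

# Approximate shape of an execute_request message in the IPython messaging
# protocol (illustrative only; see the messaging docs for the real spec).
execute_request = {
    "header": {
        "msg_id": "a1b2c3...",        # unique id for this message (hypothetical value)
        "session": "d4e5f6...",       # id of the client session (hypothetical value)
        "username": "fperez",
        "msg_type": "execute_request",
    },
    "parent_header": {},              # on replies, this links back to the request
    "content": {
        "code": "print('hello from any frontend')",
        "silent": False,
    },
}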

Furthermore, since Min also reimplemented our parallel machinery completely with pyzmq, now we have one truly common codebase for all of IPython. We still need to finish up a bit of integration between the interactive kernels and the parallel ones, but we plan to finish that soon.

In many ways, our slow pace of development paid off:
  • We had multiple false starts that helped us better understand the hard parts of the problem and where the dead ends lay.
  • We were still thinking about this all the time: even when we couldn't spare the time to actively work on it, we had no end of discussions on these things over the years (esp. Brian, Min and I, but also with others at meetings and conferences).
  • The Sage notebook was a great trailblazer showing both what could be done, and also how there were certain decisions that we wanted to make differently.
  • The technology of some critical third-party tools caught up in an amazing way: ZeroMQ, Tornado, WebSockets, MathJax, and the fast and capable Javascript engines in modern browsers along with good JS libraries. Without these tools we couldn't possibly have implemented what we have now.
As much as we would have loved to have a solid notebook years ago in IPython, I'm actually happy with how things turned out.  We now have a very nice mix of our own implementation for the things that are really within our scope, while leveraging third-party tools for critical parts that we wouldn't want to implement ourselves.

What next?

We have a lot of ideas for the notebook, as we want it to be the best possible environment for modern computational work (scientific work is our focus, but not its only use), including research, education and publication, with consistent support for clean and reproducible practices throughout.  We are fairly confident that the core design and architecture are extremely solid, and we already have a long list of ideas and improvements we want to make.  We are limited only by manpower and time, so please join us on github and pitch in!

Since this post was motivated by questions about Sage, I'd like to emphasize that we have had multiple, productive collaborations with William and other Sage developers in the past, and I expect that to continue to be the case.  On certain points that collaboration has already led to convergence; e.g. the new Sage single cell server uses the IPython messaging protocol, after we worked closely with Jason Grout during Sage Days 29 in March 2011 thanks to William's invitation.  Furthermore, William's invitations to several Sage Days events, as well as the workshops we have organized together over the years, offered multiple opportunities for collaboration and discussion that proved critical on the way to today's results.

In the future we may find other areas where we can reuse tools or approaches common to Sage and IPython.  It is clear to us that the Sage notebook is a fantastic system; it just wasn't the right fit for IPython. I hope this very long post illustrates why, as well as providing some insights into our vision for scientific computing.

Last, but not least

From this post it should be obvious that today's IPython is the result of the work of many talented people over the years, and I would like to thank all the developers and users who contribute to the project.  But it's especially important to recognize the stunning quality and quantity of work that Brian Granger and Min Ragan-Kelley have done for this to be possible.  Brian and I did our PhDs together at CU and we have been close friends since then. Min was an undergraduate student of Brian's while Brian was a professor at U. Santa Clara, and the first IPython parallel implementation using Twisted was Min's senior thesis project; he is now a PhD student at Berkeley (where I work), so we continue to be able to easily collaborate.  Building a project like IPython with partners of such talent, dedication, tenacity and generous spirit is a wonderful experience. Thanks, guys!

Please notify me in the comments of any inaccuracies in the above, especially if I failed to credit someone.

by Fernando Perez (noreply@blogger.com) at March 06, 2017 09:19 PM

Python goes to Reno: SIAM CSE 2011

In what's becoming a bit of a tradition, Simula's Hans-Petter Langtangen, U. Washington's Randy LeVeque and I co-organized yet another minisymposium on Python for Scientific computing at a SIAM conference.

At the Computational Science and Engineering 2011 meeting, held in Reno February 28-March 4, we had 2 sessions with 4 talks each (part I and II).  I have put together a page with all the slides I got from the various speakers, which also includes slides from Python-related talks in other minisymposia.  I have also posted some pictures from our sessions and from the minisymposium on reproducible research that my friend and colleague Jarrod Millman organized during the same conference.

We had great attendance, with a standing-room-only crowd for the first session, something rather unusual during the parallel sessions of a SIAM conference.  But more importantly, this year there were three other sessions entirely devoted to Python in scientific computing at the conference, organized completely independently from ours.  They focused on PDEs and on optimization.  Furthermore, there were scattered talks at several other sessions where Python was explicitly discussed in the title or abstract.  For all of these, I have collected the slides I was able to get; if you have slides for one such talk I failed to include, please contact me and I'll be happy to post them there.

Unfortunately for our audience, we had last-minute logistical complications that prevented Robert Bradshaw and John Hunter from attending, so I had to deliver the Cython and matplotlib talks (in addition to my IPython one).  Having a speaker give three back-to-back talks isn't ideal, but both of them kindly prepared all the materials and "delivered" them to me over Skype the day before, so hopefully the audience got a reasonable facsimile of their original intent. It's a shame, since I know first-hand how good both of them are as speakers, but canceling talks on these two key tools would really have been a disservice to everyone; my thanks go to the SIAM organizers who were flexible enough to allow for this to happen.  Given how packed the room was, I'm sure we made the right choice.

It's now abundantly clear from this level of interest that Python is being very successful in solving real problems in scientific computing.  We've come a long way from the days when some of us (I have painful memories of this) had to justify to our colleagues/advisors why we wanted to 'play' with this newfangled 'toy' instead of just getting on with our job using the existing tools (in my case it was IDL, a hodgepodge of homegrown shell/awk/sed/perl scripting, custom C and some Gnuplot thrown in the mix for good measure).  Things are by no means perfect, and there are plenty of problems to solve, but we have a great foundation, a number of good quality tools that continue to improve, as well as our most important asset: a rapidly growing community that is solving new problems, creating new libraries and coming up with innovative approaches to computational and mathematical questions, often facilitated by Python's tremendous flexibility. It's been a fun ride so far, but I suspect the next decade is going to be even more interesting.  If you missed this, try to make it to SciPy 2011 or EuroSciPy 2011!


by Fernando Perez (noreply@blogger.com) at March 06, 2017 09:19 PM

Blogging with the IPython notebook

Update (May 2014): Please note that these instructions are outdated. While it is still possible (and in fact easier) to blog with the Notebook, the exact process has changed now that IPython has an official conversion framework. However, Blogger isn't the ideal platform for that (though it can be made to work). If you are interested in using the Notebook as a tool for technical blogging, I recommend looking at Jake VanderPlas' Pelican support or Damián Avila's support in Nikola.

Update: made full github repo for blog-as-notebooks, and updated instructions on how to more easily configure everything and use the newest nbconvert for a more streamlined workflow.

Since the notebook was introduced with IPython 0.12, it has proved to be very popular, and we are seeing great adoption of the tool and the underlying file format in research and education. One persistent question we've had since the beginning (even prior to its official release) was whether it would be possible to easily write blog posts using the notebook. The combination of easy editing in markdown with the notebook's ability to contain code, figures and results, makes it an ideal platform for quick authoring of technical documents, so being able to post to a blog is a natural request.

Today, in answering a query about this from a colleague, I decided to check again on the status of our conversion pipeline, and I'm happy to report that with a bit of elbow grease, at least on Blogger things work pretty well!

This post was entirely written as a notebook, and in fact I have now created a github repo, which means that you can see it directly rendered in IPython's nbviewer app.

The purpose of this post is to quickly provide a set of instructions on how I got it to work, and to test things out. Please note: this requires code that isn't quite ready for prime-time and is still under heavy development, so expect some assembly.

Converting your notebook to html with nbconvert

The first thing you will need is our nbconvert tool that converts notebooks across formats. The README file in the repo contains the requirements for nbconvert (basically python-markdown, pandoc, docutils from SVN and pygments).

Once you have nbconvert installed, you can convert your notebook to Blogger-friendly html with:

nbconvert -f blogger-html your_notebook.ipynb

This will leave two files on your computer, one named your_notebook.html and one named your_notebook_header.html; it might also create a directory called your_notebook_files if needed for ancillary files. The first file will contain the body of your post and can be pasted wholesale into the Blogger editing area. The second file contains the CSS and Javascript material needed for the notebook to display correctly; you should only need to use it once to configure your Blogger setup (see below):

# Only one notebook so far
(master)longs[blog]> ls
120907-Blogging with the IPython Notebook.ipynb  fig/  old/

# Now run the conversion:
(master)longs[blog]> nbconvert.py -f blogger-html 120907-Blogging\ with\ the\ IPython\ Notebook.ipynb

# This creates the header and html body files
(master)longs[blog]> ls
120907-Blogging with the IPython Notebook_header.html  fig/
120907-Blogging with the IPython Notebook.html         old/
120907-Blogging with the IPython Notebook.ipynb

Configuring your Blogger blog to accept notebooks

The notebook uses a lot of custom CSS for formatting input and output, as well as Javascript from MathJax to display mathematical notation. You will need all this CSS and the Javascript calls in your blog's configuration for your notebook-based posts to display correctly:

  1. Once authenticated, go to your blog's overview page by clicking on its title.
  2. Click on templates (left column) and customize using the Advanced options.
  3. Scroll down the middle column until you see an "Add CSS" option.
  4. Copy the entire contents of the _header file into the CSS box.

That's it, and you shouldn't need to do anything else as long as the CSS we use in the notebooks doesn't drastically change. This customization of your blog needs to be done only once.

While you are at it, I recommend you change the width of your blog so that cells have enough space for clean display; in experimenting I found out that the default template was too narrow to properly display code cells, producing a lot of text wrapping that impaired readability. I ended up using a layout with a single column for all blog contents, putting the blog archive at the bottom. Otherwise, if I kept the right sidebar, code cells got too squished in the post area.

I also had problems using some of the fancier templates available from 'Dynamic Views', in that I could never get inline math to render. But sticking to those from the Simple or 'Picture Window' categories worked fine and they still allow for a lot of customization.

Note: if you change blog templates, Blogger does destroy your custom CSS, so you may need to repeat the above steps in that case.

Adding the actual posts

Now, whenever you want to write a new post as a notebook, simply convert the .ipynb file to blogger-html and copy its entire contents to the clipboard. Then go to the 'raw html' view of the post, remove anything Blogger may have put there by default, and paste. You should also click on the 'options' tab (right hand side) and select both Show HTML literally and Use <br> tag, else your paragraph breaks will look all wrong.

That's it!

What can you put in?

I will now add a few bits of code, plots, math, etc, to show which kinds of content can be put in and work out of the box. These are mostly bits copied from our example notebooks, so the actual content doesn't matter; I'm just illustrating the kind of content that works.

In [1]:
# Let's initialize pylab so we can plot later
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

With pylab loaded, the usual matplotlib operations work

In [2]:
x = linspace(0, 2*pi)
plot(x, sin(x), label=r'$\sin(x)$')
plot(x, cos(x), 'ro', label=r'$\cos(x)$')
title(r'Two familiar functions')
legend()
Out [2]:
<matplotlib.legend.Legend at 0x3128610>

The notebook, thanks to MathJax, has great LaTeX support, so that you can type inline math $(1,\gamma,\ldots, \infty)$ as well as displayed equations:

$$ e^{i \pi}+1=0 $$

but by loading the sympy extension, it's easy to showcase math output from Python computations, where we don't type the math expressions in text, and instead the results of code execution are displayed in mathematical format:

In [3]:
%load_ext sympyprinting
import sympy as sym
from sympy import *
x, y, z = sym.symbols("x y z")

From simple algebraic expressions

In [4]:
Rational(3,2)*pi + exp(I*x) / (x**2 + y)
Out [4]:
$$\frac{3}{2} \pi + \frac{e^{\mathbf{\imath} x}}{x^{2} + y}$$
In [5]:
eq = ((x+y)**2 * (x+1))
eq
Out [5]:
$$\left(x + 1\right) \left(x + y\right)^{2}$$
In [6]:
expand(eq)
Out [6]:
$$x^{3} + 2 x^{2} y + x^{2} + x y^{2} + 2 x y + y^{2}$$

To calculus

In [7]:
diff(cos(x**2)**2 / (1+x), x)
Out [7]:
$$- 4 \frac{x \operatorname{sin}\left(x^{2}\right) \operatorname{cos}\left(x^{2}\right)}{x + 1} - \frac{\operatorname{cos}^{2}\left(x^{2}\right)}{\left(x + 1\right)^{2}}$$

For more examples of how to use sympy in the notebook, you can see our example sympy notebook or go to the sympy website for much more documentation.

You can easily include formatted text and code with markdown

You can italicize, boldface

  • build
  • lists

and embed code meant for illustration instead of execution in Python:

def f(x):
    """a docstring"""
    return x**2

or other languages:

for (i=0; i<n; i++) {
  printf("hello %d\n", i);
  x += 4;
}

And since the notebook can store displayed images in the file itself, you can show images which will be embedded in your post:

In [8]:
from IPython.display import Image
Image(filename='fig/img_4926.jpg')
Out [8]: