## March 28, 2017

### Mark Fenner

#### DC SVD II: From Values to Vectors

In our last installment, we discussed solutions to the secular equation. These solutions are the eigenvalues (and/or) singular values of matrices with a particular form. Since this post is otherwise light on technical content, I’ll dive into those matrix forms now. Setting up the Secular Equation to Solve the Eigen and Singular Problems In dividing-and-conquering […]

### Matthieu Brucher

#### Audio Toolkit: Recursive Least Square Filter (Adaptive module)

I started working on adaptive filtering a long time ago, but could never figure out why my simple implementation of the RLS algorithm failed. Well, there was a typo in the reference book!

Now that this is fixed, let’s see what this guy does.

#### Algorithm

The RLS algorithm learns an input signal based on its past and predicts new values from it. As such, it can be used to learn periodic signals, but also noise. The principle is to predict a new value based on the past, compare it to the actual value and update the set of coefficients. The update itself is controlled by a memory time constant: the higher the value, the slower the update.

Once the filter has learned enough, the learning stage can be shut off, and the filter can be used to select frequencies.
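To make the predict, compare and update loop concrete, here is a minimal NumPy sketch of a textbook RLS one-step predictor. It is not the Audio Toolkit implementation: the function name `rls_predict` and the parameters `memory` (the forgetting factor), `delta` (the initial scale of the inverse correlation estimate) and `stop_learning_at` are assumptions made for this illustration.

```python
import numpy as np

def rls_predict(signal, order=10, memory=0.999, delta=100.0, stop_learning_at=None):
    """Predict each sample from the `order` previous ones with textbook RLS."""
    weights = np.zeros(order)
    P = np.eye(order) * delta              # running inverse-correlation estimate
    predictions = np.zeros(len(signal))
    for n in range(order, len(signal)):
        past = signal[n - order:n][::-1]   # most recent sample first
        predictions[n] = weights @ past    # predict from the past
        if stop_learning_at is not None and n >= stop_learning_at:
            continue                       # learning switched off: predict only
        error = signal[n] - predictions[n]  # compare with the actual value
        P_past = P @ past
        gain = P_past / (memory + past @ P_past)
        weights = weights + gain * error    # update the coefficients
        P = (P - np.outer(gain, P_past)) / memory
    return predictions, weights

# A pure sinusoid, learned for the first 500 samples and then only predicted.
t = np.arange(1000)
signal = np.sin(2 * np.pi * t / 50)
predictions, weights = rls_predict(signal, stop_learning_at=500)
early_error = np.abs(signal[10:20] - predictions[10:20]).max()
late_error = np.abs(signal[-200:] - predictions[-200:]).max()
```

In this sketch a `memory` value closer to 1 forgets old samples more slowly, which matches the "higher value, slower update" behaviour described above.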

#### Results

Let’s start with a simple sinusoidal signal, and see if an order 10 can be used to learn it:

Sinusoidal signal learnt with RLS

As can be seen, at the beginning the filter is still learning, so its output doesn’t match the input. After a short time it does match (zooming in on the signal shows that there is some latency and that the amplitudes do not match exactly).

Let’s see how it does for more complex signals. Let’s add two additional, slightly out-of-tune sinusoids:

Three out-of-tune sinusoids learnt with RLS

Once again, after a short time, the learning phase is stable, and we can switch it off and the signal is estimated properly.

Let’s now try something a little more complex and denoise an input signal.
Filtered noise

The original noise in blue is estimated in green, and the remainder noise is in red. Obviously, we don’t do a great job here, but let’s see what is actually attenuated:
Filtered noise in the spectral domain

So the middle of the bandwidth is better attenuated than the sides, which is expected in a way.

Now, what does that do to a signal we try to denoise?
Denoised signal

Obviously, the signal is denoised, but also increased! And the same happens in the spectral domain.
Denoised signal in the spectral domain

When looking at the estimated function, the picture is a little bit clearer:
Estimated spectral transfer function

Our noise is actually between 0.6 and 1.2 rad/s (from sampling frequency/10 to sampling frequency/5), and the RLS filter underestimates these a little bit but doesn’t cut the high frequencies, which can lead to ringing…

Also, learning the noise is quite costly:
Learning cost

Learning was only activated during half the total processing time…

#### Conclusion

RLS filters are interesting for following a signal. Obviously this filter is just the start of this new module, and I hope I’ll have real denoising filters at some point.

This filter will be available in ATK 2.0.0 and is already in the develop branch with the Python example scripts.

### Matthew Rocklin

#### Dask and Pandas and XGBoost

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

## Summary

This post talks about distributing Pandas Dataframes with Dask and then handing them over to distributed XGBoost for training.

More generally it discusses the value of launching multiple distributed systems in the same shared-memory processes and smoothly handing data back and forth between them.

## Introduction

XGBoost is a well-loved library for a popular class of machine learning algorithms, gradient boosted trees. It is used widely in business and is one of the most popular solutions in Kaggle competitions. For larger datasets or faster training, XGBoost also comes with its own distributed computing system that lets it scale to multiple machines on a cluster. Fantastic. Distributed gradient boosted trees are in high demand.

However before we can use distributed XGBoost we need to do three things:

1. Prepare and clean our possibly large data, probably with a lot of Pandas wrangling
2. Set up XGBoost master and workers
3. Hand our cleaned data from a bunch of distributed Pandas dataframes to XGBoost workers across our cluster

This ends up being surprisingly easy. This blogpost gives a quick example using Dask.dataframe to do distributed Pandas data wrangling, then using a new dask-xgboost package to set up an XGBoost cluster inside the Dask cluster and perform the handoff.

After this example we’ll talk about the general design and what this means for other distributed systems.

## Example

We have a ten-node cluster with eight cores each (m4.2xlarges on EC2)

import dask
from dask.distributed import Client, progress

>>> client = Client('172.31.33.0:8786')
>>> client.restart()
<Client: scheduler='tcp://172.31.33.0:8786' processes=10 cores=80>


We load the Airlines dataset using dask.dataframe (just a bunch of Pandas dataframes spread across a cluster) and do a bit of preprocessing:

import dask.dataframe as dd

# Subset of the columns to use
cols = ['Year', 'Month', 'DayOfWeek', 'Distance',
'DepDelay', 'CRSDepTime', 'UniqueCarrier', 'Origin', 'Dest']

# Create the dataframe
df = dd.read_csv('s3://dask-data/airline-data/20*.csv', usecols=cols,
storage_options={'anon': True})

df = df.sample(frac=0.2) # XGBoost requires a bit of RAM, we need a larger cluster

is_delayed = (df.DepDelay.fillna(16) > 15)  # column of labels
del df['DepDelay']  # Remove delay information from training dataframe

df['CRSDepTime'] = df['CRSDepTime'].clip(upper=2399)

df, is_delayed = dask.persist(df, is_delayed)  # start work in the background


This loaded a few hundred pandas dataframes from CSV data on S3. We then had to downsample because the way we are going to use XGBoost below seems to require a lot of RAM. I am not an XGBoost expert. Please forgive my ignorance here. At the end we have two dataframes:

• df: Data from which we will learn if flights are delayed
• is_delayed: Whether or not those flights were delayed.

Data scientists familiar with Pandas will probably be familiar with the code above. Dask.dataframe is very similar to Pandas, but operates on a cluster.

>>> df.head()

        Year  Month  DayOfWeek  CRSDepTime UniqueCarrier Origin Dest  Distance
182193  2000      1          2         800            WN    LAX  OAK       337
83424   2000      1          6        1650            DL    SJC  SLC       585
346781  2000      1          5        1140            AA    ORD  LAX      1745
375935  2000      1          2        1940            DL    PHL  ATL       665
309373  2000      1          4        1028            CO    MCI  IAH       643
>>> is_delayed.head()
182193    False
83424     False
346781    False
375935    False
309373    False
Name: DepDelay, dtype: bool
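The labelling and clipping steps above can be tried on a tiny in-memory Pandas frame. This is only a sketch with made-up rows (the values in `flights` are invented for illustration); it shows that `fillna(16) > 15` counts flights with a missing delay as delayed, and that a delay of exactly 15 minutes is not counted.

```python
import numpy as np
import pandas as pd

# Invented sample rows, not taken from the airlines dataset.
flights = pd.DataFrame({
    "DepDelay": [3.0, 42.0, np.nan, 15.0],
    "CRSDepTime": [800, 2400, 1650, 2359],
})

# Missing delays are filled with 16, so they end up labelled as delayed.
is_delayed = flights["DepDelay"].fillna(16) > 15

# Drop the label source from the features and cap the departure time at 2399.
features = flights.drop(columns="DepDelay")
features["CRSDepTime"] = features["CRSDepTime"].clip(upper=2399)
```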


### Categorize and One Hot Encode

XGBoost doesn’t want to work with text data like destination="LAX". Instead we create new indicator columns for each of the known airports and carriers. This expands our data into many boolean columns. Fortunately Dask.dataframe has convenience functions for all of this baked in (thank you Pandas!).

>>> df2 = dd.get_dummies(df.categorize()).persist()


This expands our data out considerably, but makes it easier to train on.

>>> len(df2.columns)
685
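To see where the extra columns come from, here is the same idea in plain Pandas on three invented rows. The frame `small` is an assumption for illustration, and `astype(... "category")` stands in for the `categorize()` call used on the Dask dataframe above.

```python
import pandas as pd

# Invented sample rows: one numeric column and two text columns.
small = pd.DataFrame({
    "Distance": [337, 585, 1745],
    "Origin": ["LAX", "SJC", "ORD"],
    "Dest": ["OAK", "SLC", "LAX"],
})

# Mark the text columns as categorical, then expand each known category
# into its own indicator column.
categorized = small.astype({"Origin": "category", "Dest": "category"})
encoded = pd.get_dummies(categorized)
```

Each distinct airport becomes one indicator column per text column, so three origins and three destinations turn two text columns into six, which is why the full dataset grows to several hundred columns.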


### Split and Train

Great, now we’re ready to split our distributed dataframes

data_train, data_test = df2.random_split([0.9, 0.1],
random_state=1234)
labels_train, labels_test = is_delayed.random_split([0.9, 0.1],
random_state=1234)


Start up a distributed XGBoost instance, and train on this data

%%time
import dask_xgboost as dxgb

params = {'objective': 'binary:logistic', 'nround': 1000,
'max_depth': 16, 'eta': 0.01, 'subsample': 0.5,
'min_child_weight': 1, 'tree_method': 'hist',
'grow_policy': 'lossguide'}

bst = dxgb.train(client, params, data_train, labels_train)

CPU times: user 355 ms, sys: 29.7 ms, total: 385 ms
Wall time: 54.5 s


Great, so we were able to train an XGBoost model on this data in about a minute using our ten machines. What we get back is just a plain XGBoost Booster object.

>>> bst
<xgboost.core.Booster at 0x7fa1c18c4c18>


We could use this on normal Pandas data locally

import xgboost as xgb
pandas_df = data_test.head()
dtest = xgb.DMatrix(pandas_df)

>>> bst.predict(dtest)
array([ 0.464578  ,  0.46631625,  0.47434333,  0.47245741,  0.46194169], dtype=float32)


Or we can use dask-xgboost again to predict on our distributed holdout data, getting back another Dask series.

>>> predictions = dxgb.predict(client, bst, data_test).persist()
>>> predictions
Dask Series Structure:
npartitions=93
None    float32
None        ...
...
None        ...
None        ...
Name: predictions, dtype: float32
Dask Name: _predict_part, 93 tasks


### Evaluate

We can bring these predictions to the local process and use normal Scikit-learn operations to evaluate the results.

>>> from sklearn.metrics import roc_auc_score, roc_curve
>>> print(roc_auc_score(labels_test.compute(),
...                     predictions.compute()))
0.654800768411

import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(labels_test.compute(), predictions.compute())
# Taken from http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
plt.figure(figsize=(8, 8))
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
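For intuition about the ROC AUC score reported above: it equals the probability that a randomly chosen delayed flight gets a higher predicted score than a randomly chosen on-time flight, with ties counted as half. The helper `pairwise_auc` below is a hypothetical NumPy sketch of that definition for small samples. It is not how Scikit-learn computes the score, and it uses memory proportional to the number of positive/negative pairs.

```python
import numpy as np

def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    positives = scores[labels]
    negatives = scores[~labels]
    wins = (positives[:, None] > negatives[None, :]).sum()
    ties = (positives[:, None] == negatives[None, :]).sum()
    return (wins + 0.5 * ties) / (len(positives) * len(negatives))

# Three of the four positive/negative pairs are ordered correctly.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# A model that gives every row the same score is no better than chance.
chance = pairwise_auc([0, 1], [0.5, 0.5])
```

A score of 0.65, as in the run above, therefore means the model orders roughly two out of three such pairs correctly.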


We might want to play with our parameters above or try different data to improve our solution. The point here isn’t that we predicted airline delays well, it was that if you are a data scientist who knows Pandas and XGBoost, everything we did above seemed pretty familiar. There wasn’t a whole lot of new material in the example above. We’re using the same tools as before, just at a larger scale.

## Analysis

OK, now that we’ve demonstrated that this works, let’s talk a bit about what just happened and what that means generally for cooperation between distributed services.

### What dask-xgboost does

The dask-xgboost project is pretty small and pretty simple (200 TLOC). Given a Dask cluster of one central scheduler and several distributed workers it starts up an XGBoost scheduler in the same process running the Dask scheduler and starts up an XGBoost worker within each of the Dask workers. They share the same physical processes and memory spaces. Dask was built to support this kind of situation, so this is relatively easy.

Then we ask the Dask.dataframe to fully materialize in RAM and we ask where all of the constituent Pandas dataframes live. We tell each Dask worker to give all of the Pandas dataframes that it has to its local XGBoost worker and then just let XGBoost do its thing. Dask doesn’t power XGBoost, it just sets it up, gives it data, and lets it do its work in the background.

People often ask what machine learning capabilities Dask provides, how they compare with other distributed machine learning libraries like H2O or Spark’s MLLib. For gradient boosted trees the 200-line dask-xgboost package is the answer. Dask has no need to make such an algorithm because XGBoost already exists, works well and provides Dask users with a fully featured and efficient solution.

Because both Dask and XGBoost can live in the same Python process they can share bytes between each other without cost, can monitor each other, and so on. These two distributed systems co-exist in multiple processes in the same way that NumPy and Pandas operate together within a single process. Sharing distributed processes with multiple systems can be really beneficial if you want to use multiple specialized services easily and avoid large monolithic frameworks.

### Connecting to Other distributed systems

A while ago I wrote a similar blogpost about hosting TensorFlow from Dask in exactly the same way that we’ve done here. It was similarly easy to set up TensorFlow alongside Dask, feed it data, and let TensorFlow do its thing.

Generally speaking this “serve other libraries” approach is how Dask operates when possible. We’re only able to cover the breadth of functionality that we do today because we lean heavily on the existing open source ecosystem. Dask.arrays use Numpy arrays, Dask.dataframes use Pandas, and now the answer to gradient boosted trees with Dask is just to make it really really easy to use distributed XGBoost. Ta da! We get a fully featured solution that is maintained by other devoted developers, and the entire connection process was done over a weekend (see dmlc/xgboost #2032 for details).

Since this has come out we’ve had requests to support other distributed systems like Elemental and to do general hand-offs to MPI computations. If we’re able to start both systems with the same set of processes then all of this is pretty doable. Many of the challenges of inter-system collaboration go away when you can hand NumPy arrays from the workers of one system to the workers of the other system within the same processes.

## Acknowledgements

Thanks to Tianqi Chen and Olivier Grisel for their help when building and testing dask-xgboost. Thanks to Will Warner for his help in editing this post.

## March 23, 2017

### Matthew Rocklin

#### Dask Release 0.14.1

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.14.1. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on February 27th.

As always you can conda install from conda-forge

conda install -c conda-forge dask distributed


or you can pip install from PyPI

pip install dask[complete] --upgrade


## Arrays

Recent work in distributed computing and machine learning has motivated new performance-oriented and usability changes to how we handle arrays.

### Automatic chunking and operation on NumPy arrays

Many interactions between Dask arrays and NumPy arrays work smoothly. NumPy arrays are made lazy and are appropriately chunked to match the operation and the Dask array.

>>> import numpy as np
>>> import dask.array as da

>>> x = np.ones(10)                 # a numpy array
>>> y = da.arange(10, chunks=(5,))  # a dask array
>>> z = x + y                       # combined become a dask.array
>>> z
dask.array<add, shape=(10,), dtype=float64, chunksize=(5,)>

>>> z.compute()
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])


### Reshape

Reshaping distributed arrays is simple in simple cases, and can be quite complex in complex cases. Reshape now supports a much broader set of shape transformations where any dimension is collapsed or merged with other dimensions.

>>> x = da.ones((2, 3, 4, 5, 6), chunks=(2, 2, 2, 2, 2))
>>> x.reshape((6, 2, 2, 30, 1))
dask.array<reshape, shape=(6, 2, 2, 30, 1), dtype=float64, chunksize=(3, 1, 2, 6, 1)>


This operation ends up being quite useful in a number of distributed array cases.

### Optimize Slicing to Minimize Communication

Dask.array slicing optimizations are now careful to produce graphs that avoid situations that could cause excess inter-worker communication. The details of how they do this are a bit out of scope for a short blogpost, but the history here is interesting.

Historically dask.arrays were used almost exclusively by researchers with large on-disk arrays stored as HDF5 or NetCDF files. These users primarily used the single machine multi-threaded scheduler. We heavily tailored Dask array optimizations to this situation and made that community pretty happy. Now as some of that community switches to cluster computing on larger datasets the optimization goals shift a bit. We have tons of distributed disk bandwidth but really want to avoid communicating large results between workers. Supporting both use cases is possible and I think that we’ve achieved that in this release so far, but it’s starting to require increasing levels of care.

### Micro-optimizations

With distributed computing also comes larger graphs and a growing importance of graph-creation overhead. This has been optimized somewhat in this release. We expect this to be a focus going forward.

## DataFrames

### Set_index

Set_index is smarter in two ways:

1. If you set_index on a column that happens to be sorted then we’ll identify that and avoid a costly shuffle. This was always possible with the sorted= keyword but users rarely used this feature. Now this is automatic.
2. Similarly when setting the index we can look at the size of the data and determine if there are too many or too few partitions and rechunk the data while shuffling. This can significantly improve performance if there are too many partitions (a common case).
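The first point can be pictured with a small Pandas sketch. The helper `already_sorted` is hypothetical and is not Dask's implementation; it only illustrates the kind of check that makes a shuffle unnecessary: every partition is sorted on the column, and no partition starts below the end of the previous one.

```python
import pandas as pd

def already_sorted(partitions, column):
    """True if `column` is sorted within and across the given partitions."""
    previous_max = None
    for part in partitions:
        values = part[column]
        if values.empty:
            continue
        if not values.is_monotonic_increasing:
            return False                     # unsorted inside a partition
        if previous_max is not None and values.iloc[0] < previous_max:
            return False                     # partitions overlap
        previous_max = values.iloc[-1]
    return True

sorted_parts = [pd.DataFrame({"t": [1, 2, 3]}), pd.DataFrame({"t": [3, 5, 8]})]
unsorted_parts = [pd.DataFrame({"t": [1, 9, 3]}), pd.DataFrame({"t": [2, 5, 8]})]
overlapping_parts = [pd.DataFrame({"t": [1, 5]}), pd.DataFrame({"t": [3, 8]})]
```

Only `sorted_parts` passes the check; the other two would still need a shuffle before the column could serve as a sorted index.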

### Shuffle performance

We’ve micro-optimized some parts of dataframe shuffles. Big thanks to the Pandas developers for the help here. This accelerates set_index, joins, groupby-applies, and so on.

### Fastparquet

The fastparquet library has seen a lot of use lately and has undergone a number of community bugfixes.

Importantly, Fastparquet now supports Python 2.

We strongly recommend Parquet as the standard data storage format for Dask dataframes (and Pandas DataFrames).

dask/fastparquet #87

## Distributed Scheduler

### Replay remote exceptions

Debugging is hard in part because exceptions happen on remote machines where normal debugging tools like pdb can’t reach. Previously we were able to bring back the traceback and exception, but you couldn’t dive into the stack trace to investigate what went wrong:

def div(x, y):
    return x / y

>>> future = client.submit(div, 1, 0)
>>> future
<Future: status: error, key: div-4a34907f5384bcf9161498a635311aeb>

>>> future.result()  # getting result re-raises exception locally
<ipython-input-3-398a43a7781e> in div()
1 def div(x, y):
----> 2     return x / y

ZeroDivisionError: division by zero


Now Dask can bring a failing task and all necessary data back to the local machine and rerun it so that users can leverage the normal Python debugging toolchain.

>>> client.recreate_error_locally(future)
<ipython-input-3-398a43a7781e> in div(x, y)
1 def div(x, y):
----> 2     return x / y
ZeroDivisionError: division by zero


Now if you’re in IPython or a Jupyter notebook you can use the %debug magic to jump into the stacktrace, investigate local variables, and so on.

In [8]: %debug
> <ipython-input-3-398a43a7781e>(2)div()
1 def div(x, y):
----> 2     return x / y

ipdb> pp x
1
ipdb> pp y
0


dask/distributed #894
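The same pattern can be tried without a cluster using the standard library's `concurrent.futures`, whose API Dask's futures mirror. This is an analogy under that assumption, not Dask code: the exception is stored on the future, and rerunning the same function on the same arguments in the calling thread is what makes local debugging possible.

```python
from concurrent.futures import ThreadPoolExecutor

def div(x, y):
    return x / y

args = (1, 0)
with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(div, *args)
    error = future.exception()   # waits, then returns the stored exception

# "Recreate" the failure locally: rerun the same function on the same inputs
# so that a debugger started here sees the real frames and local variables.
try:
    div(*args)
    recreated = None
except ZeroDivisionError as exc:
    recreated = exc
```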

### Async/await syntax

Dask.distributed uses Tornado for network communication and Tornado coroutines for concurrency. Normal users rarely interact with Tornado coroutines; they aren’t familiar to most people so we opted instead to copy the concurrent.futures API. However some complex situations are much easier to solve if you know a little bit of async programming.

Fortunately, the Python ecosystem seems to be embracing this change towards native async code with the async/await syntax in Python 3. In an effort to motivate people to learn async programming and to gently nudge them towards Python 3, Dask.distributed now supports async/await in a few cases.

You can wait on a dask Future

async def f():
    future = client.submit(func, *args, **kwargs)
    result = await future


You can put the as_completed iterator into an async for loop

async for future in as_completed(futures):
    result = await future
    ...  # do stuff with result


And, because Tornado supports the await protocols you can also use the existing shadow concurrency API (everything prepended with an underscore) with await. (This was doable before.)

results = client.gather(futures)         # synchronous
...
results = await client._gather(futures)  # asynchronous


If you’re in Python 2 you can always do this with normal yield and the tornado.gen.coroutine decorator.

dask/distributed #952
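For readers new to this style, here is a small runnable sketch using only the standard library's `asyncio`. It does not use Dask futures; the coroutine `inc` and its `delay` argument are made up for illustration. It shows the two ideas above: awaiting a future, and consuming results as they complete. Note that `asyncio.as_completed` is used here with a plain `for` loop plus `await`, whereas the Dask iterator above is used with `async for`.

```python
import asyncio

async def inc(x, delay):
    await asyncio.sleep(delay)   # stand-in for remote work
    return x + 1

async def main():
    # Start three tasks; later inputs are given shorter delays.
    futures = [asyncio.ensure_future(inc(i, 0.03 - 0.01 * i)) for i in range(3)]
    results = []
    for next_done in asyncio.as_completed(futures):
        results.append(await next_done)   # results arrive in completion order
    return results

results = asyncio.run(main())
```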

### Inproc transport

In the last release we enabled Dask to communicate over more things than just TCP. In practice this doesn’t come up (TCP is pretty useful). However in this release we now support single-machine “clusters” where the clients, scheduler, and workers are all in the same process and transfer data cost-free over in-memory queues.

This allows the in-memory user community to use some of the more advanced features (asynchronous computation, spill-to-disk support, web-diagnostics) that are only available in the distributed scheduler.

This is on by default if you create a cluster with LocalCluster without using Nanny processes.

>>> from dask.distributed import LocalCluster, Client

>>> cluster = LocalCluster(nanny=False)

>>> client = Client(cluster)

>>> client
<Client: scheduler='inproc://192.168.1.115/8437/1' processes=1 cores=4>

>>> from threading import Lock         # Not serializable
>>> lock = Lock()                      # Won't survive going over a socket
>>> [future] = client.scatter([lock])  # Yet we can send to a worker
>>> future.result()                    # ... and back
<unlocked _thread.lock object at 0x7fb7f12d08a0>


dask/distributed #919
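The reason a lock makes a good test object can be checked with the standard library alone. This sketch is independent of Dask: a `threading.Lock` cannot be pickled, so it could not cross a socket, but an in-memory queue inside one process simply hands over a reference to the very same object.

```python
import pickle
import queue
from threading import Lock

lock = Lock()

# Serializing the lock fails, so it could not be sent over a network socket.
try:
    pickle.dumps(lock)
    picklable = True
except TypeError:
    picklable = False

# An in-process queue passes the object itself, with no copy or serialization.
channel = queue.Queue()
channel.put(lock)
same_object = channel.get() is lock
```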

### Connection pooling for inter-worker communications

Workers now maintain a pool of sustained connections between each other. This pool is of a fixed size and removes connections with a least-recently-used policy. It avoids re-connection delays when transferring data between workers. In practice this shaves off a millisecond or two from every communication.

This is actually a revival of an old feature that we had turned off last year when it became clear that the performance here wasn’t a problem.
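A fixed-size pool with least-recently-used eviction, as described above, can be sketched in a few lines with `collections.OrderedDict`. The class `ConnectionPool`, its `connect` callback and the worker addresses are hypothetical and only illustrate the policy; this is not the code used in dask/distributed.

```python
from collections import OrderedDict

class ConnectionPool:
    """Keep at most `limit` open connections, evicting the least recently used."""

    def __init__(self, limit, connect):
        self.limit = limit
        self.connect = connect        # callable that opens a new connection
        self.open = OrderedDict()     # address -> connection, oldest first
        self.closed = []              # evicted addresses, kept for the demo

    def get(self, address):
        if address in self.open:
            self.open.move_to_end(address)            # mark as recently used
        else:
            if len(self.open) >= self.limit:
                evicted, _ = self.open.popitem(last=False)
                self.closed.append(evicted)
            self.open[address] = self.connect(address)
        return self.open[address]

pool = ConnectionPool(limit=2, connect=lambda address: "conn-to-" + address)
pool.get("worker-a")
pool.get("worker-b")
pool.get("worker-a")   # reuse: no reconnection delay
pool.get("worker-c")   # pool is full, so the least recently used one is dropped
```

Here `worker-b` is evicted rather than `worker-a`, because `worker-a` was used more recently.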

Along with other enhancements, this takes our round-trip latency down to 11ms on my laptop.

In [10]: %%time
...: for i in range(1000):
...:     future = client.submit(inc, i)
...:     result = future.result()
...:
CPU times: user 4.96 s, sys: 348 ms, total: 5.31 s
Wall time: 11.1 s


There may be room for improvement here though. For comparison here is the same test with the concurrent.futures.ProcessPoolExecutor.

In [14]: e = ProcessPoolExecutor(8)

In [15]: %%time
...: for i in range(1000):
...:     future = e.submit(inc, i)
...:     result = future.result()
...:
CPU times: user 320 ms, sys: 56 ms, total: 376 ms
Wall time: 442 ms


Also, just to be clear, this measures total roundtrip latency, not overhead. Dask’s distributed scheduler overhead remains in the low hundreds of microseconds.

dask/distributed #935

There has been activity around Dask and machine learning:

• dask-learn is undergoing some performance enhancements. It turns out that when you offer distributed grid search people quickly want to scale up their computations to hundreds of thousands of trials.
• dask-glm now has a few decent algorithms for convex optimization. The authors of this wrote a blogpost very recently if you’re interested: Developing Convex Optimization Algorithms in Dask
• dask-xgboost lets you hand distributed data in Dask dataframes or arrays directly to a distributed XGBoost system (that Dask will nicely set up and tear down for you). This was a nice example of easy hand-off between two distributed services running in the same processes.

## Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.0 release on February 27th

• Antoine Pitrou
• Brian Martin
• Elliott Sales de Andrade
• Erik Welch
• Francisco de la Peña
• jakirkham
• Jim Crist
• Jitesh Kumar Jha
• Julien Lhermitte
• Martin Durant
• Matthew Rocklin
• Markus Gonser
• Talmaj

The following people contributed to the dask/distributed repository since the 1.16.0 release on February 27th

• Antoine Pitrou
• Ben Schreck
• Elliott Sales de Andrade
• Martin Durant
• Matthew Rocklin
• Phil Elson

## March 22, 2017

### Matthew Rocklin

#### Developing Convex Optimization Algorithms in Dask

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

## Summary

We build distributed optimization algorithms with Dask.  We show both simple examples and also benchmarks from a nascent dask-glm library for generalized linear models.  We also talk about the experience of learning Dask to do this kind of work.

This blogpost is co-authored by Chris White (Capital One) who knows optimization and Matthew Rocklin (Continuum Analytics) who knows distributed computing.

## Introduction

Many machine learning and statistics models (such as logistic regression) depend on convex optimization algorithms like Newton’s method, stochastic gradient descent, and others.  These optimization algorithms are both pragmatic (they’re used in many applications) and mathematically interesting.  As a result these algorithms have been the subject of study by researchers and graduate students around the world for years both in academia and in industry.

Things got interesting about five or ten years ago when datasets grew beyond the size of working memory and “Big Data” became a buzzword.  Parallel and distributed solutions for these algorithms have become the norm, and a researcher’s skillset now has to extend beyond linear algebra and optimization theory to include parallel algorithms and possibly even network programming, especially if you want to explore and create more interesting algorithms.

However, relatively few people understand both mathematical optimization theory and the details of distributed systems. Typically algorithmic researchers depend on the APIs of distributed computing libraries like Spark or Flink to implement their algorithms. In this blogpost we explore the extent to which Dask can be helpful in these applications. We approach this from two perspectives:

1. Algorithmic researcher (Chris): someone who knows optimization and iterative algorithms like Conjugate Gradient, Dual Ascent, or GMRES but isn’t so hot on distributed computing topics like sockets, MPI, load balancing, and so on
2. Distributed systems developer (Matt): someone who knows how to move bytes around and keep machines busy but doesn’t know the right way to do a line search or handle a poorly conditioned matrix

## Prototyping Algorithms in Dask

Given knowledge of algorithms and of NumPy array computing it is easy to write parallel algorithms with Dask. For a range of complicated algorithmic structures we have two straightforward choices:

1. Use parallel multi-dimensional arrays to construct algorithms from common operations like matrix multiplication, SVD, and so on. This mirrors mathematical algorithms well but lacks some flexibility.
2. Create algorithms by hand that track operations on individual chunks of in-memory data and dependencies between them. This is very flexible but requires a bit more care.

Coding up either of these options from scratch can be a daunting task, but with Dask it can be as simple as writing NumPy code.

Let’s build up an example of fitting a large linear regression model using both built-in array parallelism and fancier, more customized parallelization features that Dask offers. The dask.array module helps us to easily parallelize standard NumPy functionality using the same syntax – we’ll start there.

### Data Creation

Dask has many ways to create dask arrays; to get us started quickly prototyping let’s create some random data in a way that should look familiar to NumPy users.

import dask
import dask.array as da
import numpy as np

from dask.distributed import Client

client = Client()

## create inputs with a bunch of independent normals
beta = np.random.random(100)  # random beta coefficients, no intercept
X = da.random.normal(0, 1, size=(1000000, 100), chunks=(100000, 100))
y = X.dot(beta) + da.random.normal(0, 1, size=1000000, chunks=(100000,))

## make sure all chunks are ~equally sized
X, y = dask.persist(X, y)
client.rebalance([X, y])


Observe that X is a dask array stored in 10 chunks, each of size (100000, 100). Also note that X.dot(beta) runs smoothly for both numpy and dask arrays, so we can write code that basically works in either world.
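What the chunking buys us can be seen in plain NumPy. In this sketch (smaller sizes than above, chosen only for illustration) the rows of `X` are split into ten blocks, each block is multiplied by `beta` independently, and the pieces are concatenated. The result matches the single `X.dot(beta)` call, which is why the work can be spread over many cores or machines.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
beta = rng.random(5)

row_chunks = np.array_split(X, 10)                    # ten blocks of 100 rows
partial = [chunk.dot(beta) for chunk in row_chunks]   # one independent task per block
chunked_result = np.concatenate(partial)
```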

Caveat: If X is a numpy array and beta is a dask array, X.dot(beta) will output an in-memory numpy array. This is usually not desirable as you want to carefully choose when to load something into memory. One fix is to use multipledispatch to handle odd edge cases; for a starting example, check out the dot code here.

Dask also has convenient visualization features built in that we will leverage; below we visualize our data in its 10 independent chunks:

### Array Programming

If you can write iterative array-based algorithms in NumPy, then you can write iterative parallel algorithms in Dask

As we’ve already seen, Dask inherits much of the NumPy API that we are familiar with, so we can write simple NumPy-style iterative optimization algorithms that will leverage the parallelism dask.array has built-in already. For example, if we want to naively fit a linear regression model on the data above, we are trying to solve the following convex optimization problem:

$$\min_{\beta} \|y - X\beta\|_2^2$$

Recall that in non-degenerate situations this problem has a closed-form solution that is given by:

$$\beta^* = (X^T X)^{-1} X^T y$$

We can compute $\beta^*$ using the above formula with Dask:

## naive solution
beta_star = da.linalg.solve(X.T.dot(X), X.T.dot(y))

>>> abs(beta_star.compute() - beta).max()
0.0024817567237768179
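The closed-form solution parallelizes well because both $X^T X$ and $X^T y$ are sums of per-chunk contributions. The NumPy sketch below, on small made-up data, accumulates those contributions one row block at a time and then solves the small system once. It illustrates the structure only and is not a description of how Dask schedules the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
beta = rng.random(5)
y = X.dot(beta) + 0.1 * rng.normal(size=1000)

XtX = np.zeros((5, 5))
Xty = np.zeros(5)
for X_chunk, y_chunk in zip(np.array_split(X, 10), np.array_split(y, 10)):
    XtX += X_chunk.T.dot(X_chunk)   # each block contributes a small 5 x 5 matrix
    Xty += X_chunk.T.dot(y_chunk)   # ... and a length-5 vector

beta_star = np.linalg.solve(XtX, Xty)
```

Only the small accumulated matrices ever need to be combined, so the large blocks of `X` never have to move.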


Sometimes a direct solve is too costly, and we want to solve the above problem using only simple matrix-vector multiplications. To this end, let’s take this one step further and actually implement a gradient descent algorithm which exploits parallel matrix operations. Recall that gradient descent iteratively refines an initial estimate of beta via the update:

$$\beta^{+} = \beta - 2 \alpha X^T (X\beta - y)$$

where $\alpha$ can be chosen based on a number of different “step-size” rules; for the purposes of exposition, we will stick with a constant step-size:

## quick step-size calculation to guarantee convergence
_, s, _ = np.linalg.svd(2 * X.T.dot(X))
step_size = 1 / s - 1e-8

## define some parameters
max_steps = 100
tol = 1e-8
beta_hat = np.zeros(100) # initial guess

for k in range(max_steps):
    Xbeta = X.dot(beta_hat)
    func = ((y - Xbeta)**2).sum()
    gradient = 2 * X.T.dot(Xbeta - y)

    ## Update
    obeta = beta_hat
    beta_hat = beta_hat - step_size * gradient
    new_func = ((y - X.dot(beta_hat))**2).sum()
    beta_hat, func, new_func = dask.compute(beta_hat, func, new_func)  # <--- Dask code

    ## Check for convergence
    change = np.absolute(beta_hat - obeta).max()

    if change < tol:
        break

>>> abs(beta_hat - beta).max()
0.0024817567259038942


It’s worth noting that almost all of this code is exactly the same as the equivalent NumPy code. Because Dask.array and NumPy share the same API it’s pretty easy for people who are already comfortable with NumPy to get started with distributed algorithms right away. The only thing we had to change was how we produce our original data (da.random.normal instead of np.random.normal) and the call to dask.compute at the end of the update step. The dask.compute call tells Dask to go ahead and actually evaluate everything we’ve told it to do so far (Dask is lazy by default). Otherwise, all of the mathematical operations, matrix multiplies, slicing, and so on are exactly the same as with NumPy, except that Dask.array builds up a chunk-wise parallel computation for us and Dask.distributed can execute that computation in parallel.
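For comparison, here is a NumPy-only sketch of the same loop on a small made-up problem that fits in memory. Two things differ from the code above and are assumptions of this sketch: it uses a single scalar step size derived from the largest singular value of $2 X^T X$, and it checks the answer against `np.linalg.lstsq` rather than against the true `beta`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
beta = rng.random(5)
y = X.dot(beta) + 0.1 * rng.normal(size=2000)

# The gradient 2 X^T (X beta - y) changes at most this fast, so the
# reciprocal is a safe constant step size.
lipschitz = np.linalg.svd(2 * X.T.dot(X), compute_uv=False).max()
step_size = 1 / lipschitz

beta_hat = np.zeros(5)  # initial guess
for k in range(100):
    gradient = 2 * X.T.dot(X.dot(beta_hat) - y)
    previous = beta_hat
    beta_hat = beta_hat - step_size * gradient
    if np.absolute(beta_hat - previous).max() < 1e-8:   # converged
        break

beta_exact = np.linalg.lstsq(X, y, rcond=None)[0]
```

Swapping `rng.normal` for `da.random.normal` and adding a `dask.compute` call inside the loop is essentially the change described in the paragraph above.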

To better appreciate all the scheduling that is happening in one update step of the above algorithm, here is a visualization of the computation necessary to compute beta_hat and the new function value new_func:

Each rectangle is an in-memory chunk of our distributed array and every circle is a numpy function call on those in-memory chunks. The Dask scheduler determines where and when to run all of these computations on our cluster of machines (or just on the cores of our laptop).

#### Array Programming + dask.delayed

Now that we’ve seen how to use the built-in parallel algorithms offered by Dask.array, let’s go one step further and talk about writing more customized parallel algorithms. Many distributed “consensus” based algorithms in machine learning are based on the idea that each chunk of data can be processed independently in parallel, with each worker sending its guess for the optimal parameter value to some master node. The master then computes a consensus estimate for the optimal parameters and reports that back to all of the workers. Each worker then processes its chunk of data given this new information, and the process continues until convergence.

From a parallel computing perspective this is a pretty simple map-reduce procedure. Any distributed computing framework should be able to handle this easily. We’ll use this as a very simple example for how to use Dask’s more customizable parallel options.
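A single round of that map-reduce pattern can be sketched without Dask at all. The example below is only an illustration: it uses a thread pool from the standard library, the hypothetical helper `local_guess` fits ordinary least squares on one chunk, and the "master" simply averages the guesses. The data, chunk count, and helper name are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def local_guess(X_chunk, y_chunk):
    # Hypothetical per-chunk work: an ordinary least-squares fit on one chunk
    return np.linalg.lstsq(X_chunk, y_chunk, rcond=None)[0]


rng = np.random.default_rng(1)
beta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(400, 3))
y = X @ beta + 0.05 * rng.normal(size=400)

# Split the rows into four chunks, one per "worker"
X_chunks = np.array_split(X, 4)
y_chunks = np.array_split(y, 4)

# Map: every chunk produces its own guess independently
with ThreadPoolExecutor() as pool:
    guesses = list(pool.map(local_guess, X_chunks, y_chunks))

# Reduce: the master combines the guesses into one consensus estimate
consensus = np.mean(guesses, axis=0)
```

A real consensus algorithm would send `consensus` back to the workers and repeat; this sketch stops after one round.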

One such algorithm is the Alternating Direction Method of Multipliers, or ADMM for short. For the sake of this post, we will consider the work done by each worker to be a black box.

We will also be considering a regularized version of the problem above, namely:

$$\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

At the end of the day, all we will do is:

• create NumPy functions which define how each chunk updates its parameter estimates
• wrap those functions in dask.delayed
• call dask.compute and process the individual estimates, again using NumPy

First we need to define some local functions that the chunks will use to update their individual parameter estimates, and import the black box local_update step from dask_glm; we will also need the so-called shrinkage operator (which is the proximal operator for the $\ell_1$-norm in our problem):

from dask_glm.algorithms import local_update

def local_f(beta, X, y, z, u, rho):
    return ((y - X.dot(beta)) **2).sum() + (rho / 2) * np.dot(beta - z + u,
                                                               beta - z + u)

def local_grad(beta, X, y, z, u, rho):
    return 2 * X.T.dot(X.dot(beta) - y) + rho * (beta - z + u)

def shrinkage(beta, t):
    return np.maximum(0, beta - t) - np.maximum(0, -beta - t)

## set some algorithm parameters
max_steps = 10
lamduh = 7.2
rho = 1.0

(n, p) = X.shape
nchunks = X.npartitions

XD = X.to_delayed().flatten().tolist()  # A list of pointers to remote numpy arrays
yD = y.to_delayed().flatten().tolist()  # ... one for each chunk

# the initial consensus estimate
z = np.zeros(p)

# an array of the individual "dual variables" and parameter estimates,
# one for each chunk of data
u = np.array([np.zeros(p) for i in range(nchunks)])
betas = np.array([np.zeros(p) for i in range(nchunks)])

for k in range(max_steps):

    # process each chunk in parallel, using the black-box 'local_update' magic
    new_betas = [dask.delayed(local_update)(xx, yy, bb, z, uu, rho,
                                            f=local_f,
                                            fprime=local_grad)
                 for xx, yy, bb, uu in zip(XD, yD, betas, u)]
    new_betas = np.array(dask.compute(*new_betas))

    # everything else is NumPy code occurring at "master"
    beta_hat = 0.9 * new_betas + 0.1 * z

    # create consensus estimate
    zold = z.copy()
    ztilde = np.mean(beta_hat + np.array(u), axis=0)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))

    # update dual variables
    u += beta_hat - z

>>> # Number of coefficients zeroed out due to L1 regularization
>>> print((z == 0).sum())
12


There is of course a little bit more work occurring in the above algorithm, but it should be clear that the distributed operations are not one of the difficult pieces. Using dask.delayed we were able to express a simple map-reduce algorithm like ADMM with similarly simple Python for loops and delayed function calls. Dask.delayed is keeping track of all of the function calls we wanted to make and what other function calls they depend on. For example all of the local_update calls can happen independent of each other, but the consensus computation blocks on all of them.
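For readers who want to run the consensus loop without a cluster, here is a NumPy-only sketch under stated assumptions. It replaces the black-box local_update with a closed-form linear solve (the hypothetical `local_update_closed_form`, which is not how dask-glm necessarily does it), drops the 0.9/0.1 relaxation step, and uses made-up data and parameter values.

```python
import numpy as np


def shrinkage(beta, t):
    # Soft-thresholding: the proximal operator of t * ||.||_1
    return np.maximum(0, beta - t) - np.maximum(0, -beta - t)


def local_update_closed_form(X, y, z, u, rho):
    # Hypothetical stand-in for the black-box local_update. It minimizes
    # ||y - X b||^2 + (rho / 2) * ||b - z + u||^2 exactly by solving the
    # normal equations (2 X^T X + rho I) b = 2 X^T y + rho (z - u).
    p = X.shape[1]
    A = 2 * X.T @ X + rho * np.eye(p)
    b = 2 * X.T @ y + rho * (z - u)
    return np.linalg.solve(A, b)


# Illustrative data: three of the six true coefficients are zero
rng = np.random.default_rng(0)
n, p, nchunks = 200, 6, 4
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0])
X = rng.normal(size=(n, p))
y = X @ beta_true + 0.1 * rng.normal(size=n)
X_chunks = np.array_split(X, nchunks)
y_chunks = np.array_split(y, nchunks)

lamduh, rho = 20.0, 50.0      # assumed values, chosen for this toy problem
z = np.zeros(p)               # consensus estimate
u = np.zeros((nchunks, p))    # one dual variable per chunk

for k in range(300):
    # "map": each chunk solves its own small regularized problem
    betas = np.array([local_update_closed_form(Xc, yc, z, uc, rho)
                      for Xc, yc, uc in zip(X_chunks, y_chunks, u)])
    # "reduce": average, then shrink toward zero to get the consensus
    ztilde = np.mean(betas + u, axis=0)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    # dual update
    u += betas - z
```

The list comprehension is the only part that touches per-chunk data, which is why wrapping that call in dask.delayed is enough to distribute the loop.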

We hope that both parallel algorithms shown above (gradient descent, ADMM) were straightforward to someone reading with an optimization background. These implementations run well on a laptop, a single multi-core workstation, or a thousand-node cluster if necessary. We’ve been building somewhat more sophisticated implementations of these algorithms (and others) in dask-glm. They are more sophisticated from an optimization perspective (stopping criteria, step size, asynchronicity, and so on) but remain as simple from a distributed computing perspective.

## Experiment

We compare dask-glm implementations against Scikit-learn on a laptop, and then show them running on a cluster.

Reproducible notebook is available here

We’re building more sophisticated versions of the algorithms above in dask-glm.  This project has convex optimization algorithms for gradient descent, proximal gradient descent, Newton’s method, and ADMM.  These implementations extend the implementations above by also thinking about stopping criteria, step sizes, and other niceties that we avoided above for simplicity.

In this section we show off these algorithms by performing a simple numerical experiment that compares the numerical performance of proximal gradient descent and ADMM alongside Scikit-Learn’s LogisticRegression and SGD implementations on a single machine (a personal laptop) and then follows up by scaling the dask-glm options to a moderate cluster.

Disclaimer: These experiments are crude. We’re using artificial data, we’re not tuning parameters or even finding parameters at which these algorithms are producing results of the same accuracy. The goal of this section is just to give a general feeling of how things compare.

We create data

## size of problem (no. observations)
N = int(8e6)       # array shapes and chunk sizes must be integers
chunks = int(1e6)
seed = 20009
beta = (np.random.random(15) - 0.5) * 3

X = da.random.random((N,len(beta)), chunks=chunks)
y = make_y(X, beta=np.array(beta), chunks=chunks)

X, y = dask.persist(X, y)
client.rebalance([X, y])
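The make_y helper is not shown in this post. As a rough in-memory illustration, the sketch below assumes it draws binary labels whose success probability is the logistic function of Xβ; `make_y_numpy` is a hypothetical NumPy stand-in, not the dask-glm function, and the problem size is reduced so it runs instantly.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def make_y_numpy(X, beta, rng):
    # Hypothetical stand-in for make_y: Bernoulli labels with
    # success probability sigmoid(X @ beta)
    return rng.random(X.shape[0]) < sigmoid(X @ beta)


rng = np.random.default_rng(20009)
beta = (rng.random(15) - 0.5) * 3          # coefficients in [-1.5, 1.5)
X = rng.random((10_000, len(beta)))        # uniform features, as above
y = make_y_numpy(X, beta, rng)             # boolean label vector
```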


And run each of our algorithms as follows:

# Dask-GLM Proximal Gradient
result = proximal_grad(X, y, lamduh=alpha)

# Dask-GLM ADMM
X2 = X.rechunk((1e5, None)).persist()  # ADMM prefers smaller chunks
y2 = y.rechunk(1e5).persist()
result = admm(X2, y2, lamduh=alpha)

# Scikit-Learn LogisticRegression
nX, ny = dask.compute(X, y)  # sklearn wants numpy arrays
result = LogisticRegression(penalty='l1', C=1).fit(nX, ny).coef_

# Scikit-Learn Stochastic Gradient Descent
result = SGDClassifier(loss='log',
                       penalty='l1',
                       l1_ratio=1,
                       n_iter=10,
                       fit_intercept=False).fit(nX, ny).coef_


We then compare with the $L_{\infty}$ norm (the largest absolute difference between estimated and true coefficients).

abs(result - beta).max()
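As a small worked example of this error measure, with made-up numbers: the expression is the $L_{\infty}$ norm of the difference vector, and it also works when `result` has the `(1, p)` shape that Scikit-Learn's `coef_` uses for binary problems.

```python
import numpy as np

beta = np.array([1.0, -2.0, 0.5])            # illustrative "true" coefficients
result = np.array([[0.9, -2.05, 0.5]])       # illustrative estimate, shape (1, p)

# Largest absolute coefficient-wise difference
err = np.abs(result - beta).max()

# The same quantity written as an explicit infinity norm
err_norm = np.linalg.norm((result - beta).ravel(), ord=np.inf)
```

Here the largest discrepancy is in the first coefficient, so `err` is approximately 0.1.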


Times and $L_\infty$ distance from the true “generative beta” for these parameters are shown in the table below:

| Algorithm | Error | Duration (s) |
|---|---|---|
| Proximal Gradient | 0.0227 | 128 |
| ADMM | 0.0125 | 34.7 |
| LogisticRegression | 0.0132 | 79 |
| SGDClassifier | 0.0456 | 29.4 |

Again, please don’t take these numbers too seriously: these algorithms all solve regularized problems, so we don’t expect the results to necessarily be close to the underlying generative beta (even asymptotically). The numbers above are meant to demonstrate that they all return results which were roughly the same distance from the beta above. Also, Dask-glm is using a full four-core laptop while SKLearn is restricted to use a single core.

In the sections below we include profile plots for proximal gradient and ADMM. These show the operations that each of eight threads was doing over time. You can mouse-over rectangles/tasks and zoom in using the zoom tools in the upper right. You can see the difference in complexity of the algorithms. ADMM is much simpler from Dask’s perspective but also saturates hardware better for this chunksize.

#### Profile Plot for ADMM

The general takeaway here is that dask-glm performs comparably to Scikit-Learn on a single machine. If your problem fits in memory on a single machine you should continue to use Scikit-Learn and Statsmodels. The real benefit to the dask-glm algorithms is that they scale and can run efficiently on data that is larger-than-memory by operating from disk on a single computer or on a cluster of computers working together.

### Cluster Computing

As a demonstration, we run a larger version of the data above on a cluster of eight m4.2xlarges on EC2 (8 cores and 30GB of RAM each).

We create a larger dataset with 800,000,000 rows and 15 columns across eight processes.

N = int(8e8)       # array shapes and chunk sizes must be integers
chunks = int(1e7)
seed = 20009
beta = (np.random.random(15) - 0.5) * 3

X = da.random.random((N,len(beta)), chunks=chunks)
y = make_y(X, beta=np.array(beta), chunks=chunks)

X, y = dask.persist(X, y)


We then run the same proximal_grad and admm operations from before:

# Dask-GLM Proximal Gradient
result = proximal_grad(X, y, lamduh=alpha)

# Dask-GLM ADMM
X2 = X.rechunk((1e6, None)).persist()  # ADMM prefers smaller chunks
y2 = y.rechunk(1e6).persist()
result = admm(X2, y2, lamduh=alpha)


Proximal grad completes in around seventeen minutes while ADMM completes in around four minutes. Profiles for the two computations are included below:

#### Profile Plot for Proximal Gradient Descent

We include only the first few iterations here. Otherwise this plot is several megabytes.

Link to fullscreen plot

#### Profile Plot for ADMM

Link to fullscreen plot

These both obtained similar $L_{\infty}$ errors to what we observed before.

| Algorithm | Error | Duration (s) |
|---|---|---|
| Proximal Gradient | 0.0306 | 1020 |
| ADMM | 0.00159 | 270 |

This time, though, we had to be careful about a couple of things:

1. We explicitly deleted the old data after rechunking (ADMM prefers different chunksizes than proximal_gradient) because our full dataset, 100GB, is close enough to our total distributed RAM (240GB) that it’s a good idea to avoid keeping replicas around needlessly. Things would have run fine, but spilling excess data to disk would have negatively affected performance.
2. We set the OMP_NUM_THREADS=1 environment variable to avoid over-subscribing our CPUs. Surprisingly, not doing so led both to worse performance and to non-deterministic results, an issue that we’re still tracking down.
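Both precautions are easy to sanity-check in plain Python. The sketch below assumes float64 storage for X and that the environment variable is set before any OpenMP- or BLAS-backed library is imported, since such libraries typically read it when they load; on a real cluster it would need to be set in every worker's environment.

```python
import os

# Assumption: this runs before NumPy or similar libraries are imported
os.environ["OMP_NUM_THREADS"] = "1"

# Back-of-the-envelope memory check for the feature matrix X
rows, cols, bytes_per_float64 = 800_000_000, 15, 8
x_gb = rows * cols * bytes_per_float64 / 1e9     # size of X in GB

cluster_ram_gb = 8 * 30                          # eight machines, 30GB each
copies_that_fit = cluster_ram_gb / x_gb          # how many full copies fit in RAM
```

With roughly two and a half copies of X fitting in memory, holding both the original and the rechunked array at once leaves little headroom, which is why the old copy was deleted.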

### Analysis

The algorithms in Dask-GLM are new and need development, but are usable by people comfortable operating at this technical level. Additionally, we would like to attract other mathematical and algorithmic developers to this work. We’ve found that Dask provides a nice balance between being flexible enough to support interesting algorithms, while being managed enough to be usable by researchers without a strong background in distributed systems. In this section we’re going to discuss the things that we learned from both Chris’ (mathematical algorithms) and Matt’s (distributed systems) perspective and then talk about possible future work. We encourage people to pay attention to future work; we’re open to collaboration and think that this is a good opportunity for new researchers to meaningfully engage.

#### Chris’s perspective

1. Creating distributed algorithms with Dask was surprisingly easy; there is still a small learning curve around when to call things like persist, compute, rebalance, and so on, but that can’t be avoided. Using Dask for algorithm development has been a great learning environment for understanding the unique challenges associated with distributed algorithms (including communication costs, among others).
2. Getting the particulars of algorithms correct is non-trivial; there is still work to be done in better understanding the tolerance settings vs. accuracy tradeoffs that are occurring in many of these algorithms, as well as fine-tuning the convergence criteria for increased precision.
3. On the software development side, reliably testing optimization algorithms is hard. Finding provably correct optimality conditions that should be satisfied which are also numerically stable has been a challenge for me.
4. Working on algorithms in isolation is not nearly as fun as collaborating on them; please join the conversation and contribute!
5. Most importantly from my perspective, I’ve found there is a surprisingly large amount of misunderstanding in “the community” surrounding what optimization algorithms do in the world of predictive modeling, what problems they each individually solve, and whether or not they are interchangeable for a given problem. For example, Newton’s method can’t be used to optimize an l1-regularized problem, and the coefficient estimates from an l1-regularized problem are fundamentally (and numerically) different from those of an l2-regularized problem (and from those of an unregularized problem). My own personal goal is that the API for dask-glm exposes these subtle distinctions more transparently and leads to more thoughtful modeling decisions “in the wild”.
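The last point can be illustrated with a closed-form special case. Assuming an orthonormal design (XᵀX = I) and the objectives ‖y − Xβ‖² + λ‖β‖₁ and ‖y − Xβ‖² + λ‖β‖², the l1 solution soft-thresholds the unregularized coefficients by λ/2 while the l2 solution rescales them by 1/(1 + λ). The coefficient values below are made up.

```python
import numpy as np

# Unregularized coefficients under an orthonormal design (illustrative values)
beta_ols = np.array([3.0, -0.4, 0.2, -2.0])
lam = 1.0

# l1 penalty: shift every coefficient toward zero by lam / 2, clipping at zero
beta_l1 = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

# l2 penalty: shrink every coefficient by the same factor, never exactly to zero
beta_l2 = beta_ols / (1.0 + lam)
```

The l1 estimate sets the two small coefficients exactly to zero, while the l2 estimate keeps all four nonzero, so the two penalties answer different modeling questions.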

#### Matt’s perspective

This work triggered a number of concrete changes within the Dask library:

1. We can convert Dask.dataframes to Dask.arrays. This is particularly important because people want to do pre-processing with dataframes but then switch to efficient multi-dimensional arrays for algorithms.
2. We had to unify the single-machine scheduler and distributed scheduler APIs a bit, notably adding a persist function to the single machine scheduler. This was particularly important because Chris generally prototyped on his laptop but we wanted to write code that was effective on clusters.
3. Scheduler overhead can be a problem for the iterative dask-array algorithms (gradient descent, proximal gradient descent, BFGS). This is particularly a problem because NumPy is very fast. Often our tasks take only a few milliseconds, which makes Dask’s overhead of 200us per task become very relevant (this is why you see whitespace in the profile plots above). We’ve started resolving this problem in a few ways like more aggressive task fusion and lower overheads generally, but this will be a medium-term challenge. In practice for dask-glm we’ve started handling this just by choosing chunksizes well. I suspect that for dask-glm in particular we’ll just develop auto-chunksize heuristics that will mostly solve this problem. However we expect this problem to recur in other work with scientists on HPC systems who have similar situations.
4. A couple of things can be tricky for algorithmic users:
    1. Placing the calls to asynchronously start computation (persist, compute). In practice Chris did a good job here and then I came through and tweaked things afterwards. The web diagnostics ended up being crucial to identify issues.
    2. Avoiding accidentally calling NumPy functions on dask.arrays and vice versa. We’ve improved this on the dask.array side, and they now operate intelligently when given numpy arrays. Changing this on the NumPy side is harder until NumPy protocols change (which is planned).
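To see why a 200us per-task overhead matters for millisecond-scale tasks, here is a rough calculation. It assumes the overhead is paid serially on top of each task, which is a simplification of how a real scheduler behaves; the task durations are illustrative.

```python
OVERHEAD_S = 200e-6  # per-task scheduler overhead quoted above


def overhead_fraction(task_seconds, overhead_seconds=OVERHEAD_S):
    # Share of wall time spent on scheduling rather than on computation
    return overhead_seconds / (task_seconds + overhead_seconds)


# Overhead share for tasks lasting 1 ms, 5 ms, and 100 ms
fractions = {ms: overhead_fraction(ms / 1000) for ms in (1, 5, 100)}
```

Under this simplification a 1 ms task loses about a sixth of its wall time to scheduling, while a 100 ms task loses a fraction of a percent, which is the intuition behind choosing larger chunks.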

#### Future work

There are a number of things we would like to do, both in terms of measurement and for the dask-glm project itself. We welcome people to voice their opinions (and join development) on the following issues:

1. Asynchronous Algorithms
2. User APIs
3. Extend GLM families
4. Write more extensive rigorous algorithm testing - for satisfying provable optimality criteria, and for robustness to various input data
5. Begin work on smart initialization routines

What is your perspective here, gentle reader? Both Matt and Chris can use help on this project. We hope that some of the issues above provide seeds for community engagement. We welcome other questions, comments, and contributions either as github issues or comments below.

## Acknowledgements

Thanks also go to Hussain Sultan (Capital One) and Tom Augspurger for collaboration on Dask-GLM and to Will Warner (Continuum) for reviewing and editing this post.

## March 21, 2017

### Matthieu Brucher

#### Announcement: Audio TK 1.5.0

ATK is updated to 1.5.0 with new features oriented around preamplifiers and optimizations. It is also now compiled on Appveyor: https://ci.appveyor.com/project/mbrucher/audiotk.

Thanks to Travis and Appveyor, binaries for the releases are now updated on Github. On all platforms we compile static and shared libraries. On Linux, builds are generated with gcc 5, gcc 6, clang 3.8 and clang 3.9; on OS X, XCode 7 and XCode 8 builds are available as universal binaries; and on Windows, 32-bit and 64-bit builds are generated with dynamic or static (no shared libraries in this case) runtime.

Download link: ATK 1.5.0

Changelog:
1.5.0
* Adding a follower class solid state preamplifier with Python wrappers
* Adding a Dempwolf model for tube filters with Python wrappers
* Adding a Munro-Piazza model for tube filters with Python wrappers
* Optimized distortion and preamplifier filters by using fmath exp calls

1.4.1
* Vectorized x4 the IIR part of the IIR filter
* Vectorized delay filters
* Fixed bug in gain filters

## March 20, 2017

### Continuum Analytics news

#### Announcing Anaconda Project: Data Science Project Encapsulation and Deployment, the Easy Way!

Monday, March 20, 2017
Christine Doig
Continuum Analytics

Kristopher Overholt
Continuum Analytics

One year ago, we presented Anaconda and Docker: Better Together for Reproducible Data Science. In that blog post, we described our vision and a foundational approach to portable and reproducible data science using Anaconda and Docker.

This approach embraced the philosophy of Open Data Science in which data scientists can connect the powerful data science experience of Anaconda with the tools that they know and love, which today includes Jupyter notebooks, machine learning frameworks, data analysis libraries, big data computations and connectivity, visualization toolkits, high-performance numerical libraries and more.

We also discussed how data scientists could use Anaconda to develop data science analyses on their local machine, then use Docker to deploy those same data science analyses into production. This was the state of data science encapsulation and deployment that we presented last year:

[Image: project-1.png]

In this blog post, we’ll be diving deeper into how we’ve created a standard data science project encapsulation approach that helps data scientists deploy secure, scalable and reproducible projects across an entire team with Anaconda.

This blog post also provides more details about how we’re using Anaconda and Docker for encapsulation and containerization of data science projects to power the data science deployment functionality in the next generation of Anaconda Enterprise, which augments our truly end-to-end data science platform.

### Supercharge Your Data Science with More Than Just Dockerfiles!

The reality is, as much as Docker is loved and used by the DevOps community, it is not the preferred tool or entrypoint for data scientists looking to deploy their applications. Using Docker alone as a data science encapsulation strategy still requires coordination with their IT and DevOps teams to write their Dockerfiles, install the required system libraries in their containers, and orchestrate and deploy their Docker containers into production.

Having data scientists worry about infrastructure details and DevOps tooling takes away time from their most valuable skills: finding insights in data, modeling and running experiments, and delivering consumable data-driven applications to their team and end-users.

Data scientists enjoy using the packages they know and love with Anaconda along with conda environments, and wish it were as easy to deploy data science projects as it is to get Anaconda running on their laptop.

By working directly with our amazing customers and users and listening to the needs of their data science teams over the last five years, we have clearly identified how Anaconda and Docker can be used together for data science project encapsulation and as a more useful abstraction layer for data scientists: Anaconda Projects.

### The Next Generation of Portable and Reproducible Data Science with Anaconda

As part of the next generation of data science encapsulation, reproducibility and deployment, we are happy to announce the release of Anaconda Project with the latest release of Anaconda! Download the latest version of Anaconda 4.3.1 to get started with Anaconda Project today.

Or, if you already have Anaconda, you can install Anaconda Project using the following command:

conda install anaconda-project

Anaconda Project makes it easy to encapsulate data science projects and makes them fully portable and deployment-ready. It automates the configuration and setup of data science projects, such as installing the necessary packages and dependencies, downloading data sets and required files, setting environment variables for credentials or runtime configuration, and running commands.

Anaconda Project is an open source tool created by Continuum Analytics that delivers light-weight, efficient encapsulation and portability of data science projects. Learn more by checking out the Anaconda Project documentation.

Anaconda Project makes it easy to reproduce your data science analyses, share data science projects with others, run projects across different platforms, or deploy data science applications with a single click in Anaconda Enterprise.

Whether you’re running a project locally or deploying a project with Anaconda Enterprise, you are using the same project encapsulation standard: an Anaconda Project. We’re bringing you the next generation of true Open Data Science deployment in 2017 with Anaconda:

[Image: project-2.png]

#### New Release of Anaconda Navigator with Support for Anaconda Projects

As part of this release of Anaconda Project, we’ve integrated easy data science project creation and encapsulation to the familiar Anaconda Navigator experience, which is a graphical interface for your Anaconda environments and data science tools. You can easily create, edit, and upload Anaconda Projects to Anaconda Cloud through a graphical interface:

[Image: anaconda-project-a (1).gif]

Download the latest version of Anaconda 4.3.1 to get started with Anaconda Navigator and Anaconda Project today.

Or, if you already have Anaconda, you can install the latest version of Anaconda Navigator using the following command:

conda install anaconda-navigator

When you’re using Anaconda Project with Navigator, you can create a new project and specify its dependencies, or you can import an existing conda environment file (environment.yaml) or pip requirements file (requirements.txt).

#### Anaconda Project examples:

• Image classifier web application using Tensorflow and Flask
• Live Python and R notebooks that retrieve the latest stock market data
• Interactive Bokeh and Shiny applications for data clustering, cross filtering, and data exploration
• Interactive visualizations of data sets with Bokeh, including streaming data
• Machine learning models with REST APIs

To get started even quicker with portable data science projects, refer to the example Anaconda Projects on Anaconda Cloud.

### Deploying Secure and Scalable Data Science Projects with Anaconda Enterprise

The new data science deployment and collaboration functionality in Anaconda Enterprise leverages Anaconda Project plus industry-standard containerization with Docker and enterprise-ready container orchestration technology with Kubernetes.

This productionization and deployment strategy makes it easy to create and deploy data science projects with a single click for projects that use Python 2, Python 3, R (including their dependencies in C++, Fortran, Java, etc.), or anything else you can build with the 730+ packages in Anaconda.

[Image: project-3.png]

### From Data Science Development to Deployment with Anaconda Projects and Anaconda Enterprise

All of this is possible without having to edit Dockerfiles directly, install system packages in your Docker containers, or manually deploy Docker containers into production. Anaconda Enterprise handles all of that for you, so you can get back to doing data science analysis.

The result is that any project that a data scientist can create on their machine with Anaconda can be deployed to an Anaconda Enterprise cluster in a secure, scalable, and highly-available manner with just a single click, including live notebooks, interactive applications, machine learning models with REST APIs, or any other projects that leverage the 730+ packages in Anaconda.

[Image: anaconda-project-b (1).gif]

Anaconda is such a foundational and ubiquitous data science platform that other lightweight data science workspaces and workbenches are using Anaconda as a necessary core component for their portable and reproducible data science. Anaconda is the leading Open Data Science platform powered by Python and empowers data scientists with a truly integrated experience and support for end-to-end workflows. Why would you want your data science team using Anaconda in production with anything other than Anaconda Enterprise?

Anaconda Enterprise is a true end-to-end data science platform that integrates with all of the most popular tools and platforms and provides your data science team with an on-premises package repository, secure enterprise notebook collaboration, data science and analytics on Hadoop/Spark, and secure and scalable data science deployment.

Anaconda Enterprise also includes support for all of the 730+ Open Data Science packages in Anaconda. Finally, Anaconda Scale is the only recommended and certified method for deploying Anaconda to a Hadoop cluster for PySpark or SparkR jobs.

### Getting Started with Anaconda Enterprise and Anaconda Projects

Anaconda Enterprise uses Anaconda Project and Docker as its standard project encapsulation and deployment format to enable simple one-click deployments of secure and scalable data science applications for your entire data science team.

Are you interested in using Anaconda Enterprise in your organization to deploy data science projects, including live notebooks, machine learning models, dashboards, and interactive applications?

Access to the next generation of Anaconda Enterprise v5, which features one-click secure and scalable data science deployments, is now available as a technical preview as part of the Anaconda Enterprise Innovator Program.

Join the Anaconda Enterprise v5 Innovator Program today to discover the powerful data science deployment capabilities for yourself. Anaconda Enterprise handles your secure and scalable data science project encapsulation and deployment requirements so that your data science team can focus on data exploration and analysis workflows and spend less time worrying about infrastructure and DevOps tooling.

## March 16, 2017

### Titus Brown

#### Registration reminder for our two-week summer workshop on high-throughput sequencing data analysis!

Our two-week summer workshop (announcement, direct link) is shaping up quite well, but the application deadline is today! So if you're interested, you should apply sometime before the end of the day. (We'll leave applications open as long as it's March 17th somewhere in the world.)

Some updates and expansions on the original announcement --

• we'll be training attendees in high-performance computing, in the service of doing bioinformatics analyses. To that end, we've received a large grant from NSF XSEDE, and we'll be using JetStream for our analyses.
• we have limited financial support that will be awarded after acceptances are issued in a week.

Here's the original announcement below:

## ANGUS: Analyzing High Throughput Sequencing Data

June 26-July 8, 2017

University of California, Davis

• Zero-entry - no experience required or expected!
• Hands-on training in using the UNIX command line to analyze your sequencing data.
• Friendly, helpful instructors and TAs!
• Summer sequencing camp - meet and talk science with great people!
• Now in its eighth year!

The workshop fee will be $500 for the two weeks, and on-campus room and board is available for $500/week. Applications will close March 17th. International and industry applicants are more than welcome!

Please see http://ivory.idyll.org/dibsi/ANGUS.html for more information, and contact dibsi.training@gmail.com if you have questions or suggestions.

--titus

## March 14, 2017

### Thomas Wiecki

#### Random-Walk Bayesian Deep Networks: Dealing with Non-Stationary Data

(c) 2017 by Thomas Wiecki -- Quantopian Inc.

Most problems solved by Deep Learning are stationary. A cat is always a cat. The rules of Go have remained stable for 2,500 years, and will likely stay that way. However, what if the world around you is changing? This is common, for example when applying Machine Learning in Quantitative Finance. Markets are constantly evolving, so features that are predictive in some time-period might lose their edge while other patterns emerge. Usually, quants would just retrain their classifiers every once in a while. This approach of just re-estimating the same model on more recent data is very common. I find that to be a pretty unsatisfying way of modeling, as there are certain shortfalls:

• The estimation window should be long so as to incorporate as much training data as possible.
• The estimation window should be short so as to incorporate only the most recent data, as old data might be obsolete.
• When you have no estimate of how fast the world around you is changing, there is no principled way of setting the window length to balance these two objectives.

Certainly there is something to be learned even from past data; we just need to instill our models with a sense of time and recency.

Enter random-walk processes. Ever since I learned about them in the stochastic volatility model they have become one of my favorite modeling tricks. Basically, it allows you to turn every static model into a time-sensitive one.

You can read more about the details of random-walk priors here, but the central idea is that, in any time-series model, rather than assuming a parameter to be constant over time, we allow it to change gradually, following a random walk. For example, take a logistic regression:

$$Y_i = f(\beta X_i)$$

Here $f$ is the logistic function and $\beta$ is our learnable parameter. If we assume that our data is not iid and that $\beta$ is changing over time, we need a different $\beta$ for every $i$:

$$Y_i = f(\beta_i X_i)$$

Of course, this will just overfit, so we need to constrain our $\beta_i$ somehow. We will assume that while $\beta_i$ is changing over time, it will do so rather gradually by placing a random-walk prior on it:

$$\beta_t \sim \mathcal{N}(\beta_{t-1}, s^2)$$

So $\beta_t$ is allowed to only deviate a little bit (determined by the step-width $s$) from its previous value $\beta_{t-1}$. $s$ can be thought of as a stability parameter -- how fast is the world around you changing.
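To get a feel for this prior, one can simulate a single draw from it with plain NumPy. This is only a sketch: the number of time steps, the step-width, the seed, and the choice to start the walk at zero are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
T, s = 200, 0.1                      # number of time steps and step-width

# Each increment is an independent N(0, s^2) draw ...
steps = rng.normal(0.0, s, size=T)

# ... and the walk is their running sum, so beta[t] = beta[t-1] + steps[t]
# (the walk is assumed to start at zero before the first step)
beta = np.cumsum(steps)
```

A smaller `s` gives a path that barely moves between neighbouring time points, while a larger `s` lets the parameter drift quickly.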

Let's first generate some toy data and then implement this model in PyMC3. We will then use this same trick in a Neural Network with hidden layers.

If you would like a more complete introduction to Bayesian Deep Learning, see my recent ODSC London talk. This blog post takes things one step further so definitely read further below.

In [1]:
%matplotlib inline
import pymc3 as pm
import theano.tensor as T
import theano
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
from sklearn import datasets
from sklearn.preprocessing import scale

import warnings
from scipy import VisibleDeprecationWarning
warnings.filterwarnings("ignore", category=VisibleDeprecationWarning)

sns.set_context('notebook')


### Generating data

First, let's generate some toy data -- a simple binary classification problem that's linearly separable. To introduce the non-stationarity, we will rotate this data around the center across time. Safely skip over the next few code cells.

In [2]:
X, Y = sklearn.datasets.make_blobs(n_samples=1000, centers=2, random_state=1)
X = scale(X)
colors = Y.astype(str)
colors[Y == 0] = 'r'
colors[Y == 1] = 'b'

interval = 20
subsample = X.shape[0] // interval
chunk = np.arange(0, X.shape[0]+1, subsample)
degs = np.linspace(0, 360, len(chunk))

sep_lines = []

for ii, (i, j, deg) in enumerate(list(zip(np.roll(chunk, 1), chunk, degs))[1:]):
    theta = np.radians(deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.matrix([[c, -s], [s, c]])

    X[i:j, :] = X[i:j, :].dot(R)
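
If you do want to check the rotation step in isolation, here is a small sketch. The helper `rotate_rows` and the two test points are made up for illustration; only the matrix `R` and the right-multiplication follow the cell above.

```python
import numpy as np

def rotate_rows(points, deg):
    """Rotate 2-D row vectors by right-multiplying with R."""
    theta = np.radians(deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    # Right-multiplying row vectors by R turns them clockwise for positive deg
    return points.dot(R)

pts = np.array([[1.0, 0.0], [0.0, 1.0]])
rotated = rotate_rows(pts, 90)
```

Rotations leave distances to the center unchanged, which is why each chunk stays linearly separable on its own.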

In [4]:
import base64
from tempfile import NamedTemporaryFile

VIDEO_TAG = """<video controls>
<source src="data:video/x-m4v;base64,{0}" type="video/mp4">
Your browser does not support the video tag.
</video>"""

def anim_to_html(anim):
    if not hasattr(anim, '_encoded_video'):
        anim.save("test.mp4", fps=20, extra_args=['-vcodec', 'libx264'])

        video = open("test.mp4","rb").read()

        anim._encoded_video = base64.b64encode(video).decode('utf-8')
    return VIDEO_TAG.format(anim._encoded_video)

from IPython.display import HTML

def display_animation(anim):
    plt.close(anim._fig)
    return HTML(anim_to_html(anim))
from matplotlib import animation

# First set up the figure, the axis, and the plot element we want to animate
fig, ax = plt.subplots()
ims = [] #l, = plt.plot([], [], 'r-')
for i in np.arange(0, len(X), 10):
    ims.append([(ax.scatter(X[:i, 0], X[:i, 1], color=colors[:i]))])

ax.set(xlabel='X1', ylabel='X2')
# call the animator.  blit=True means only re-draw the parts that have changed.
anim = animation.ArtistAnimation(fig, ims,
                                 interval=500,
                                 blit=True);

display_animation(anim)

Out[4]:
Your browser does not support the video tag.

The last frame of the video, where all data is plotted, is what a classifier that has no sense of time would see. Thus, the problem we set up is impossible to solve when ignoring time, but trivial once you take it into account.

How would we classically solve this? You could just train a different classifier on each subset. But as I wrote above, you need to get the frequency right and you use less data overall.

## Random-Walk Logistic Regression in PyMC3

In [5]:
from pymc3 import HalfNormal, GaussianRandomWalk, Bernoulli
from pymc3.math import sigmoid
import theano.tensor as tt

X_shared = theano.shared(X)
Y_shared = theano.shared(Y)

n_dim = X.shape[1] # 2

with pm.Model() as random_walk_perceptron:
    step_size = pm.HalfNormal('step_size', sd=np.ones(n_dim),
                              shape=n_dim)

    # This is the central trick, PyMC3 already comes with this distribution
    w = pm.GaussianRandomWalk('w', sd=step_size,
                              shape=(interval, 2))

    weights = tt.repeat(w, X_shared.shape[0] // interval, axis=0)

    class_prob = sigmoid(tt.batched_dot(X_shared, weights))

    # Binary classification -> Bernoulli likelihood
    pm.Bernoulli('out', class_prob, observed=Y_shared)


OK, if you understand the stochastic volatility model, the first two lines should look fairly familiar. We are creating 2 random-walk processes. As allowing the weights to change on every new data point is overkill, we subsample. The repeat turns the vector [t, t+1, t+2] into [t, t, t, t+1, t+1, ...] so that it matches the number of data points.
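
The effect of the repeat is easy to see in plain NumPy. This is only a sketch with made-up sizes (3 time steps and 12 data points instead of 20 and 1000); `np.repeat` stands in for `tt.repeat` here.

```python
import numpy as np

interval = 3    # number of random-walk steps (20 in the model above)
n_points = 12   # number of data points (1000 in the model above)

# One weight vector per time step: shape (interval, 2)
w = np.array([[0.0, 0.1],
              [1.0, 1.1],
              [2.0, 2.1]])

# Repeat each row so that every data point gets the weights of its time step
weights = np.repeat(w, n_points // interval, axis=0)  # shape (n_points, 2)
```

The first four rows of `weights` are copies of `w[0]`, the next four are copies of `w[1]`, and so on.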

Next, we would usually just apply a single dot-product, but here we have many weights we're applying to the input data, so we need to call dot in a loop. That is what tt.batched_dot does. In the end, we just get probabilities (predictions) for our Bernoulli likelihood.
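
As a rough NumPy analogue of that step (a sketch, assuming 2-D inputs where row `i` of the data is paired with row `i` of the weights; the helper names and numbers are made up):

```python
import numpy as np

def row_wise_dot(X, W):
    """Dot product of X[i] with W[i] for every row i."""
    return np.einsum('ij,ij->i', X, W)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]])

logits = row_wise_dot(X, W)     # one number per data point
class_prob = sigmoid(logits)    # squashed into (0, 1)
```

Each data point is multiplied with its own weight vector, which is exactly what a time-varying classifier needs.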

On to the inference. In PyMC3 we recently improved NUTS in many different places. One of those is automatic initialization. If you just call pm.sample(n_iter), we will first run ADVI to estimate the diagonal mass matrix and find a starting point. This usually makes NUTS run quite robustly.

In [6]:
with random_walk_perceptron:
    trace_perceptron = pm.sample(2000)

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -90.867: 100%|██████████| 200000/200000 [01:13<00:00, 2739.70it/s]
Finished [100%]: Average ELBO = -90.869
100%|██████████| 2000/2000 [00:39<00:00, 50.58it/s]


Let's look at the learned weights over time:

In [7]:
plt.plot(trace_perceptron['w'][:, :, 0].T, alpha=.05, color='r');
plt.plot(trace_perceptron['w'][:, :, 1].T, alpha=.05, color='b');
plt.xlabel('time'); plt.ylabel('weights'); plt.title('Optimal weights change over time'); sns.despine();


As you can see, the weights are slowly changing over time. What does the learned hyperplane look like? In the plot below, the points are still the training data but the background color codes the class probability learned by the model.

In [8]:
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
grid_2d = np.tile(grid_2d, (interval, 1))
dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)

X_shared.set_value(grid_2d)
Y_shared.set_value(dummy_out)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace_perceptron, model=random_walk_perceptron, samples=500)

def create_surface(X, Y, grid, ppc, fig=None, ax=None):
    artists = []
    cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
    contour = ax.contourf(*grid, ppc, cmap=cmap)
    artists.extend(contour.collections)
    artists.append(ax.scatter(X[Y==0, 0], X[Y==0, 1], color='b'))
    artists.append(ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r'))
    _ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X1', ylabel='X2');
    return artists

fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip((list(zip(np.roll(chunk, 1), chunk))[1:]), (list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:])):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].mean(axis=0).reshape(100, 100), fig=fig, ax=ax)
    axs.append(a)

anim2 = animation.ArtistAnimation(fig, axs,
                                  interval=1000);
display_animation(anim2)

100%|██████████| 500/500 [00:23<00:00, 24.47it/s]

Out[8]:
Your browser does not support the video tag.

Nice, we can see that the random-walk logistic regression adapts its weights to perfectly separate the two point clouds.

## Random-Walk Neural Network

In the previous example, we had a very simple linearly classifiable problem. Can we extend this same idea to non-linear problems and build a Bayesian Neural Network with weights adapting over time?

If you haven't, I recommend you read my original post on Bayesian Deep Learning where I more thoroughly explain how a Neural Network can be implemented and fit in PyMC3.

Let's generate some toy data that is not linearly separable and again rotate it around its center.

In [9]:
from sklearn.datasets import make_moons
X, Y = make_moons(noise=0.2, random_state=0, n_samples=5000)
X = scale(X)

colors = Y.astype(str)
colors[Y == 0] = 'r'
colors[Y == 1] = 'b'

interval = 20
subsample = X.shape[0] // interval
chunk = np.arange(0, X.shape[0]+1, subsample)
degs = np.linspace(0, 360, len(chunk))

sep_lines = []

for ii, (i, j, deg) in enumerate(list(zip(np.roll(chunk, 1), chunk, degs))[1:]):
    theta = np.radians(deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.matrix([[c, -s], [s, c]])

    X[i:j, :] = X[i:j, :].dot(R)

In [28]:
fig, ax = plt.subplots()
ims = []
for i in np.arange(0, len(X), 10):
    ims.append((ax.scatter(X[:i, 0], X[:i, 1], color=colors[:i]),))

ax.set(xlabel='X1', ylabel='X2')
anim = animation.ArtistAnimation(fig, ims,
                                 interval=500,
                                 blit=True);

display_animation(anim)

Out[28]:
Your browser does not support the video tag.

Looks a bit like Yin and Yang; who knew we'd be creating art in the process.

On to the model. Rather than have all the weights in the network follow random-walks, we will just have the first hidden layer change its weights. The idea is that the higher layers learn stable higher-order representations while the first layer is transforming the raw data so that it appears stationary to the higher layers. We can of course also place random-walk priors on all weights, or only on those of higher layers, whatever assumptions you want to build into the model.

In [11]:
np.random.seed(123)

ann_input = theano.shared(X)
ann_output = theano.shared(Y)

n_hidden = [2, 5]

# Initialize random weights between each layer
init_1 = np.random.randn(X.shape[1], n_hidden[0]).astype(theano.config.floatX)
init_2 = np.random.randn(n_hidden[0], n_hidden[1]).astype(theano.config.floatX)
init_out = np.random.randn(n_hidden[1]).astype(theano.config.floatX)

with pm.Model() as neural_network:
    # Weights from input to hidden layer
    step_size = pm.HalfNormal('step_size', sd=np.ones(n_hidden[0]),
                              shape=n_hidden[0])

    weights_in_1 = pm.GaussianRandomWalk('w1', sd=step_size,
                                         shape=(interval, X.shape[1], n_hidden[0]),
                                         testval=np.tile(init_1, (interval, 1, 1))
                                        )

    weights_in_1_rep = tt.repeat(weights_in_1,
                                 ann_input.shape[0] // interval, axis=0)

    weights_1_2 = pm.Normal('w2', mu=0, sd=1.,
                            shape=(1, n_hidden[0], n_hidden[1]),
                            testval=init_2)

    weights_1_2_rep = tt.repeat(weights_1_2,
                                ann_input.shape[0], axis=0)

    weights_2_out = pm.Normal('w3', mu=0, sd=1.,
                              shape=(1, n_hidden[1]),
                              testval=init_out)

    weights_2_out_rep = tt.repeat(weights_2_out,
                                  ann_input.shape[0], axis=0)

    # Build neural-network using tanh activation function
    act_1 = tt.tanh(tt.batched_dot(ann_input,
                                   weights_in_1_rep))
    act_2 = tt.tanh(tt.batched_dot(act_1,
                                   weights_1_2_rep))
    act_out = tt.nnet.sigmoid(tt.batched_dot(act_2,
                                             weights_2_out_rep))

    # Binary classification -> Bernoulli likelihood
    out = pm.Bernoulli('out',
                       act_out,
                       observed=ann_output)


Hopefully that's not too incomprehensible. It is basically applying the principles from the random-walk logistic regression but adding another hidden layer.
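
If the shapes are hard to follow, here is a NumPy sketch of the same forward pass with random weights. It is only meant to show the shapes; the sizes and values are made up, and the time-invariant layers use a plain dot product instead of repeating the weights for every data point, which should give the same result.

```python
import numpy as np

rng = np.random.RandomState(0)
n_points, interval = 40, 4       # 5000 and 20 in the model above
n_in, n_h1, n_h2 = 2, 2, 5

X = rng.randn(n_points, n_in)

# First layer: one weight matrix per time step, repeated per data point
w1 = rng.randn(interval, n_in, n_h1)
w1_rep = np.repeat(w1, n_points // interval, axis=0)  # (n_points, n_in, n_h1)

# Higher layers: a single set of weights shared across time
w2 = rng.randn(n_h1, n_h2)
w3 = rng.randn(n_h2)

# Every data point is multiplied with the first-layer weights of its time step
act_1 = np.tanh(np.einsum('ni,nih->nh', X, w1_rep))   # (n_points, n_h1)
act_2 = np.tanh(act_1.dot(w2))                        # (n_points, n_h2)
act_out = 1.0 / (1.0 + np.exp(-act_2.dot(w3)))        # (n_points,)
```

Only `w1` carries a time index, so only the first layer can follow the rotation of the data.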

I also want to take the opportunity to look at what the Bayesian approach to Deep Learning offers. Usually, we fit these models using point-estimates like the MLE or the MAP. Let's see how well that works on a structurally more complex model like this one:

In [12]:
import scipy.optimize
with neural_network:
    map_est = pm.find_MAP(fmin=scipy.optimize.fmin_l_bfgs_b)

In [13]:
plt.plot(map_est['w1'].reshape(20, 4));


Some of the weights are changing, maybe it worked? How well does it fit the training data:

In [14]:
ppc = pm.sample_ppc([map_est], model=neural_network, samples=1)
print('Accuracy on train data = {:.2f}%'.format((ppc['out'] == Y).mean() * 100))

100%|██████████| 1/1 [00:00<00:00,  6.32it/s]
Accuracy on train data = 76.64%


Now on to estimating the full posterior, as a proper Bayesian would:

In [15]:
with neural_network:
    trace = pm.sample(1000, tune=200)

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -538.86: 100%|██████████| 200000/200000 [13:06<00:00, 254.43it/s]
Finished [100%]: Average ELBO = -538.69
100%|██████████| 1000/1000 [1:22:05<00:00,  4.97s/it]

In [16]:
plt.plot(trace['w1'][200:, :, 0, 0].T, alpha=.05, color='r');
plt.plot(trace['w1'][200:, :, 0, 1].T, alpha=.05, color='b');
plt.plot(trace['w1'][200:, :, 1, 0].T, alpha=.05, color='g');
plt.plot(trace['w1'][200:, :, 1, 1].T, alpha=.05, color='c');

plt.xlabel('time'); plt.ylabel('weights'); plt.title('Optimal weights change over time'); sns.despine();


That already looks quite different. What about the accuracy:

In [17]:
ppc = pm.sample_ppc(trace, model=neural_network, samples=100)
print('Accuracy on train data = {:.2f}%'.format(((ppc['out'].mean(axis=0) > .5) == Y).mean() * 100))

100%|██████████| 100/100 [00:00<00:00, 112.04it/s]
Accuracy on train data = 96.72%


I think this is worth highlighting. The point-estimate did not do well at all, but by estimating the whole posterior we were able to model the data much more accurately. I'm not quite sure why that is the case. It's possible that we either did not find the true MAP because the optimizer can't deal with the correlations in the posterior as well as NUTS can, or the MAP is just not a good point. See my other blog post on hierarchical models for why the MAP is a terrible choice for some models.
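
Note that the two accuracy numbers above are computed slightly differently: the MAP check compares a single posterior predictive draw to the labels, while the posterior check thresholds the average over 100 draws. Here is a minimal sketch of both computations; the `samples` and `Y_true` arrays are invented for illustration and do not come from the model.

```python
import numpy as np

Y_true = np.array([1, 0, 0])

# Rows are posterior predictive draws, columns are data points
samples = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [1, 1, 0],
                    [0, 0, 1]])

# Single draw compared directly with the labels
single_accuracy = (samples[0] == Y_true).mean() * 100

# Majority vote: average the draws, then threshold at 0.5
vote = samples.mean(axis=0) > 0.5
vote_accuracy = (vote == Y_true).mean() * 100
```

A single draw is a noisy 0/1 sample, so part of the gap between the two numbers may come from this difference in how the predictions are summarized.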

On to the fireworks. What does this actually look like:

In [18]:
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
grid_2d = np.tile(grid_2d, (interval, 1))
dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)

ann_input.set_value(grid_2d)
ann_output.set_value(dummy_out)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)

fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip((list(zip(np.roll(chunk, 1), chunk))[1:]), (list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:])):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].mean(axis=0).reshape(100, 100), fig=fig, ax=ax)
    axs.append(a)

anim2 = animation.ArtistAnimation(fig, axs,
                                  interval=1000);
display_animation(anim2)

100%|██████████| 500/500 [00:58<00:00,  7.82it/s]

Out[18]:
Your browser does not support the video tag.

Holy shit! I can't believe that actually worked. Just for fun, let's also make use of the fact that we have the full posterior and plot our uncertainty of our prediction (the background now encodes posterior standard-deviation where red means high uncertainty).

In [19]:
fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip((list(zip(np.roll(chunk, 1), chunk))[1:]), (list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:])):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].std(axis=0).reshape(100, 100),
                       fig=fig, ax=ax)
    axs.append(a)

anim2 = animation.ArtistAnimation(fig, axs,
                                  interval=1000);
display_animation(anim2)

Out[19]:
Your browser does not support the video tag.
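
A short note on what this standard deviation measures. Since the posterior predictive samples are 0/1 draws, their standard deviation at each grid point is fully determined by their mean, and it is largest where the predicted class probability is close to 0.5. A small sketch with made-up samples:

```python
import numpy as np

# Rows are posterior predictive draws, columns are grid points
samples = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [1, 0, 1],
                    [1, 0, 0]])

p_hat = samples.mean(axis=0)   # estimated class probability per grid point
spread = samples.std(axis=0)   # what the uncertainty plot shows

# For binary draws the two are linked: std = sqrt(p * (1 - p))
expected = np.sqrt(p_hat * (1 - p_hat))
```

So the red band in the animation should sit along the decision boundary, where the model is closest to a coin flip.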

## Conclusions

In this blog post I explored the possibility of extending Neural Networks in new ways (to my knowledge), enabled by expressing them in a Probabilistic Programming framework. Using a classic point-estimate did not provide a good fit for the data; only full posterior inference using MCMC allowed us to fit this model adequately. What is quite nice is that we did not have to do anything special for the inference in PyMC3: just calling pymc3.sample() gave stable results on this complex model.

Initially I built the model allowing all parameters to change, but realizing that we can selectively choose which layers to change felt like a profound insight. If you expect the raw data to change, but the higher-level representations to remain stable, as was the case here, we allow the bottom hidden layers to change. If we instead imagine e.g. handwriting recognition, where your handwriting might change over time, we would expect lower level features (lines, curves) to remain stable but allow changes in how we combine them. Finally, if the world remains stable but the labels change, we would place a random-walk process on the output layer. Of course, if you don't know, you can have every layer change its weights over time and give each one a separate step-size parameter which would allow the model to figure out which layers change (high step-size), and which remain stable (low step-size).

In terms of quantitative finance, this type of model allows us to train on much larger data sets ranging back a long time. A lot of that data is still useful to build up stable hidden representations, even if for prediction you still want your model to predict using its most up-to-date state of the world. No need to define a window-length or discard valuable training data.

In [24]:
%load_ext watermark
%watermark -v -m -p numpy,scipy,sklearn,theano,pymc3,matplotlib

The watermark extension is already loaded. To reload it, use:
%reload_ext watermark
CPython 3.6.0
IPython 5.1.0

numpy 1.11.3
scipy 0.18.1
sklearn 0.18.1
theano 0.9.0beta1.dev-9f1aaacb6e884ebcff9e249f19848db8aa6cb1b2
pymc3 3.0
matplotlib 2.0.0

compiler   : GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)
system     : Darwin
release    : 16.4.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit


### Matthieu Brucher

#### AudioToolkit: creating a simple plugin with WDL-OL

Audio Toolkit was started several years ago now, and there are more than a dozen plugins based on the platform as well as applications using it, but I never wrote a tutorial explaining how to use it. Users had to find out for themselves. This changes today.

# Building Audio Toolkit

Let’s start with building Audio ToolKit. It uses CMake to ease the pain of supporting several platforms, although you can build it yourself if you generate config.h.

You will require Boost, Eigen and FFTW if you want to test the library and ensure that everything is all right.

## Windows

Windows may be the most complicated platform. This stems from the fact that the runtime is different for each version of the Microsoft compiler (except after 2015), and usually that’s not the one you have with your DAW (and thus probably not the one you have with your users’ DAW).

So the first question is which kind of build you need. For a plugin, I think it is clearly a static runtime that you require; for an app, I would suggest the dynamic runtime. For this, in the CMake GUI, set MSVC_RUNTIME to Static or Dynamic. Enable the same output, static for a plugin and shared libraries for an application.

Note that tests require the shared libraries.

## macSierra/OS X

On OS X, just create the default Xcode project. You may also want to generate ATK with CMAKE_OSX_ARCHITECTURES set to i386 to get a 32-bit version, or to i386;x86_64 for a universal binary (I’ll use i386 in this tutorial).

The same rules for static/shared apply here.

## Linux

For Linux, I don’t have plugin support in WDL-OL, but suffice it to say that the ideas in the next section are the ones that are actually relevant.

# Building a plugin with WDL-OL

I’ll use the same simple code to generate a simple plugin that does more or less nothing except copy data from the input to the output inside a plugin.

## Common code

Start by using the duplicate.py script to create your own plugin. Use a “1-1” PLUG_CHANNEL_IO value to create a mono plugin (this is in resource.h). More advanced configurations can be seen on the ATK plugin repository.

Now, we need an input and an output filter for our pipeline. Let’s add them to our plugin class:

#include <ATK/Core/InPointerFilter.h>
#include <ATK/Core/OutPointerFilter.h>

and new members:

  ATK::InPointerFilter<double> inFilter;
  ATK::OutPointerFilter<double> outFilter;

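Now, in the constructor's initialization list, add the two filters, then connect them and call Reset() in the constructor body:

  // initialization list
  inFilter(nullptr, 1, 0, false), outFilter(nullptr, 1, 0, false)
  // constructor body
  outFilter.set_input_port(0, &inFilter, 0);
  Reset();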

This is required to set up the pipeline and initialize the internal variables.
In Reset() put the following:

  int sampling_rate = GetSampleRate();

  if(sampling_rate != outFilter.get_output_sampling_rate())
  {
    inFilter.set_input_sampling_rate(sampling_rate);
    inFilter.set_output_sampling_rate(sampling_rate);
    outFilter.set_input_sampling_rate(sampling_rate);
    outFilter.set_output_sampling_rate(sampling_rate);
  }

This ensures that all the sampling rates are consistent. While this is not required for a copy pipeline, it is mandatory for EQs and modeling filters. Also, ATK requires the pipeline to be consistent, so you can’t connect filters that don’t have matching input/output sampling rates. Some of them can change rates, like oversampling and undersampling ones, but they are the exception, not the rule.

And now, the only thing that remains is to actually trigger the pipeline:

  inFilter.set_pointer(inputs[0], nFrames);
  outFilter.set_pointer(outputs[0], nFrames);
  outFilter.process(nFrames);

Now, the WDL-OL projects must be adapted.

In both cases, it is quite straightforward: set include paths and libraries for the link stage.

## Windows

For Windows, you need to have a matching ATK build for Debug/Release. In the project properties, add ATK include folder in Project->Properties->C++->Preprocessor->AdditionalIncludeDirectories.

Then add in the link page the ATK libraries you require (Project->Properties->Link->AdditionalDependencies) for each configuration.

## macSierra/OS X

On OS X, it is easier to add the include/library folders, by adding them to ADDITIONAL_INCLUDES and ADDITIONAL_LIBRARY_PATHS.

The second step is to add the libraries to the project by adding them to the Link Binary With Libraries list for each target you want to build.

# Conclusion

That’s it!

In the end, I hope that I showed that it is easy to build something with Audio ToolKit.

### Enthought

#### Webinar: Using Python and LabVIEW Together to Rapidly Solve Engineering Problems

When: On-Demand (Live webcast took place March 28, 2017)
What: Presentation, demo, and Q&A with Collin Draughon, Software Product Manager, National Instruments, and Andrew Collette, Scientific Software Developer, Enthought

View Now  If you missed the live session, fill out the form and we’ll send you a recording!

Engineers and scientists all over the world are using Python and LabVIEW to solve hard problems in manufacturing and test automation, by taking advantage of the vast ecosystem of Python software.  But going from an engineer’s proof-of-concept to a stable, production-ready version of Python, smoothly integrated with LabVIEW, has long been elusive.

Join us for a live webinar and demo, as we take a LabVIEW data acquisition app and extend it with Python’s machine learning capabilities, to automatically detect and classify equipment vibration.  Using a modern Python platform and the Python Integration Toolkit for LabVIEW, we’ll show how easy and fast it is to install heavy-hitting Python analysis libraries, take advantage of them from live LabVIEW code, and finally deploy the entire solution, Python included, using LabVIEW Application Builder.

In this webinar, you’ll see how easy it is to solve an engineering problem by using LabVIEW and Python together.

## What You’ll Learn:

• How Python’s machine learning libraries can simplify a hard engineering problem
• How to extend an existing LabVIEW VI using Python analysis libraries
• How to quickly bundle Python and LabVIEW code into an installable app

## Who Should Attend:

• Engineers and managers interested in extending LabVIEW with Python’s ecosystem
• People who need to easily share and deploy software within their organization
• Current LabVIEW users who are curious what Python brings to the table
• Current Python users in organizations where LabVIEW is used

### How LabVIEW users can benefit from Python:

• High-level, general purpose programming language ideally suited to the needs of engineers, scientists, and analysts
• Huge, international user base representing industries such as aerospace, automotive, manufacturing, military and defense, research and development, biotechnology, geoscience, electronics, and many more
• Tens of thousands of available packages, ranging from advanced 3D visualization frameworks to nonlinear equation solvers
• Simple, beginner-friendly syntax and fast learning curve


Presenters:

• Collin Draughon, National Instruments Software Product Manager
• Andrew Collette, Enthought Scientific Software Developer and Python Integration Toolkit for LabVIEW core developer

## FAQs and Additional Resources

Quickly and efficiently access scientific and engineering tools for signal processing, machine learning, image and array processing, web and cloud connectivity, and much more. With only minimal coding on the Python side, this extraordinarily simple interface provides access to all of Python’s capabilities.

• What is the Python Integration Toolkit for LabVIEW?

The Python Integration Toolkit for LabVIEW provides a seamless bridge between Python and LabVIEW. With fast two-way communication between environments, your LabVIEW project can benefit from thousands of mature, well-tested software packages in the Python ecosystem.

Run Python and LabVIEW side by side, and exchange data live. Call Python functions directly from LabVIEW, and pass arrays and other numerical data natively. Automatic type conversion virtually eliminates the “boilerplate” code usually needed to communicate with non-LabVIEW components.

Develop and test your code quickly with Enthought Canopy, a complete integrated development environment and supported Python distribution included with the Toolkit.

• What is LabVIEW?

LabVIEW is a software platform made by National Instruments, used widely in industries such as semiconductors, telecommunications, aerospace, manufacturing, electronics, and automotive for test and measurement applications. In August 2016, Enthought released the Python Integration Toolkit for LabVIEW, which is a “bridge” between the LabVIEW and Python environments.

• Who is Enthought?

Enthought is a global leader in software, training, and consulting solutions using the Python programming language.

The post Webinar: Using Python and LabVIEW Together to Rapidly Solve Engineering Problems appeared first on Enthought Blog.

## March 13, 2017

### Continuum Analytics news

#### Pi Day 2017: Why Celebrating Science & Mathematics is More Critical Than Ever

Tuesday, March 14, 2017
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

While Pi Day is typically about cheap pizza and other retail stunts, this year the day is being used by the tech community to influence industry leaders to stand up against new policies that affect the future. In an era of "alternative facts" and "fake news," it's more important than ever that data-driven projects are a priority for businesses and government bodies. Without it, the tech community risks losing decades of data-backed progress across the board.

The proof is undeniable - collecting, organizing and learning from data generated in today’s world improves problem solving for everyone (and I mean everyone). More and more people each day are pushing for an increasingly data-driven society, and at Continuum Analytics, we believe that data and Open Data Science empower people with the tools to solve the world’s greatest challenges—boosting tech diversity, treating rare diseases, eradicating human trafficking, predicting the effects of public policy and even improving city infrastructure.

This year, we’ve seen the technology industry stand together on issues it couldn’t have anticipated. We’ve heard tech leaders share their opinions on changing policies. So, on this very Pi Day, let’s celebrate those who are driven by science, mathematics and data, to make the world a better place. Oh, and let’s eat some pie.

Happy Pi Day, all!

## March 09, 2017

### Titus Brown

#### A draft bit of text on open science communities

This is early draft text that Anita and I put together from a bunch of brainstorming done at the Imagining Tomorrow's University workshop. Comments welcome!

Communities are the fabric of open research, and serve as the basis for development and sharing of best practices, building effective open source tools, and engaging with researchers newly interested in practicing open research. Effective communities often emerge from bottom up interactions, and can serve as a support network for individual open researchers. A few points:

• These communities can consist of virtual clusters of likeminded individuals; they can include scholars, librarians, developers and tech staff or open research advocates at all levels of experience and with different backgrounds; the communities themselves can be short-lived and focused on a specific issue, tool, or approach, or they can have more long-term goals and aspirations.
• A key defining feature of these groups is that the principles of open science permeate their practice, meaning they are inherently inclusive, and aim to open up the process of scholarly exploration to the widest possible audience.
• We recommend that all stakeholders take steps to create an ecosystem that encourages these communities to develop. This means supporting common standards, funding "connective tissue" between different efforts, and sharing practices, tools, and people between communities.

After collecting a series of narratives on effective and intentional approaches to creating, growing, and nurturing such communities, we recommended the following actions for different stakeholders to support the formation of adaptive and organic, bottom-up, distributed and open research communities:

Institutions:

• Provide physical space and/or admin support for community interactions.
• Recognize the need for explicit training in principles and practice of open research.
• Explore what "design by a community" looks like in areas where it’s not traditional, e.g. (mechanical) engineering, to change views of what constitutes excellence in a discipline.
• Reward incremental steps: provide incentives for aspects of open science (e.g. only share code, not data, or vice versa), then make it really easy to continue down a "sharing trajectory".

Funders:

• Recognize how "disciplinary shackles" can hinder adoption of Open Science practices (e.g. development of common software/workflows and other community resources may not be respected as part of disciplinary work).
• Award interdisciplinary and team efforts alongside, or instead of, individual competition. Inclusivity is a defining feature of Open Science, as are extensibility and reproducibility: the goal is not solely to further individual rewards but to facilitate the involvement of others. Avoid lock-in economics and explore other reward methodologies.
• Reward incremental steps: provide incentives for aspects of open science (e.g. only share code, not data, or vice versa), then make it really easy to continue down a "sharing trajectory".

Platforms and publishers:

• Integrate training materials into platforms.
• Support development of platform specialists inside institutions.
• Start "pop-up" open science communities around e.g. datatype manipulation.
• Build support for openness into tools.
• Create communities around specific tools and practices; build norms and codes of conduct into these platforms endemically.
• Lower barriers of entry to sharing practices; tools can support the "automatic" creation of communities (cf. social media tools, where platforms help define communities, e.g. "My Facebook friends", "My Jupyter Friends").

Community organizers:

• Build openness into governance.
• Recognize the value of simple narratives for roping people into community participation.
• Fund culture changers: hire people who are tasked with changing e.g. data dissemination processes/practices.
• Praise incremental steps towards openness by community members.
• Establish a code of conduct and community interaction expectations.

### Matthew Rocklin

#### Biased Benchmarks

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

## Summary

Performing benchmarks to compare software is surprisingly difficult to do fairly, even assuming the best intentions of the author. Technical developers can fall victim to a few natural human failings:

1. We judge other projects by our own objectives rather than the objectives under which that project was developed
2. We fail to use other projects with the same expertise that we have for our own
3. We naturally gravitate towards cases at which our project excels
4. We improve our software during the benchmarking process
5. We don’t release negative results

We discuss each of these failings in the context of current benchmarks I’m working on comparing Dask and Spark Dataframes.

## Introduction

Last week I started comparing performance between Dask Dataframes (a project that I maintain) and Spark Dataframes (the current standard). My initial results showed that Dask.dataframes were overall much faster, somewhere like 5x.

These results were wrong. They weren’t wrong in a factual sense, and the experiments that I ran were clear and reproducible, but there was so much bias in how I selected, set up, and ran those experiments that the end result was misleading. After checking results myself and then having other experts come in and check my results I now see much more sensible numbers. At the moment both projects are within a factor of two most of the time, with some interesting exceptions either way.

This blogpost outlines the ways in which library authors can fool themselves when performing benchmarks, using my recent experience as an anecdote. I hope that this encourages authors to check themselves, and encourages readers to be more critical of numbers that they see in the future.

This problem exists as well in academic research. For a pop-science rendition I recommend “The Experiment Experiment” on the Planet Money Podcast.

## Skewed Objectives

Feature X is so important. I wonder how the competition fares?

Every project is developed with different applications in mind and so has different strengths and weaknesses. If we approach a benchmark caring only about our original objectives and dismissing the objectives of the other projects then we’ll very likely trick ourselves.

For example consider reading CSV files. Dask’s CSV reader is based off of Pandas’ CSV reader, which was the target of great effort and love; this is because CSV was so important to the finance community where Pandas grew up. Spark’s CSV solution is less awesome, but that’s less about the quality of Spark and more a statement about how Spark users tend not to use CSV. When they use text-based formats they’re much more likely to use line-delimited JSON, which is typical in Spark’s common use cases (web diagnostics, click logs, and so on). Pandas/Dask came from the scientific and finance worlds where CSV is king while Spark came from the web world where JSON reigns.

Conversely, Dask.dataframe hasn’t bothered to hook up the pandas.read_json function to Dask.dataframe yet. Surprisingly it rarely comes up. Both projects can correctly say that the other project’s solution to what they consider the standard text-based file format is less-than-awesome. Comparing performance here either way will likely lead to misguided conclusions.
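To make the two text formats concrete, here is a minimal sketch using only the Python standard library. It writes the same records once as CSV and once as line-delimited JSON. The records and helper names are invented for illustration; neither Dask nor Spark is involved.

```python
import csv
import io
import json

# Hypothetical click-log records, invented purely for illustration.
records = [
    {"user": "alice", "page": "/home", "ms": 120},
    {"user": "bob", "page": "/cart", "ms": 340},
]

def to_csv(rows):
    """Serialize rows as CSV: one header line, then one line per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json_lines(rows):
    """Serialize rows as line-delimited JSON: one self-describing object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)

def from_json_lines(text):
    """Parse line-delimited JSON back into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line]

csv_text = to_csv(records)
jsonl_text = to_json_lines(records)
print(csv_text)
print(jsonl_text)
```

CSV carries the schema once in the header and leaves types implicit, while each JSON line repeats the field names and keeps numbers as numbers. This is part of why a reader tuned for one format says little about how a library handles the other.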

So when benchmarking data ingestion maybe we look around a bit, see that both claim to support Parquet well, and use that as the basis for comparison.

## Skewed Experience

Whoa, this other project has a lot of configuration parameters! Let’s just use the defaults.

Software is often easy to set up, but often requires experience to set up optimally. Authors are naturally more adept at setting up their own software than the software of their competition.

My original (and flawed) solution to this was to “just use the defaults” on both projects. Given my inability to tune Spark (there are several dozen parameters) I decided to also not tune Dask and run under default settings. I figured that this would be a good benchmark not only of the software, but also of its choice of sane defaults, which is a good design principle in itself.

This failed spectacularly because I was making unconscious decisions like the size of machines that I was using for the experiment, CPU/memory ratios, and so on. It turns out that Spark’s defaults are optimized for very small machines (or more likely, small YARN containers) and use only 1GB of memory per executor by default, while Dask is typically run on larger boxes or has the full use of a single machine in a single shared-memory process. My standard cluster configurations were biased towards Dask before I even considered running a benchmark.
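One way to keep such hidden decisions visible is to write the resource settings into the benchmark itself rather than inheriting them. The sketch below only builds a `spark-submit` command line; it does not launch anything. `spark.executor.memory` and `spark.executor.cores` are real Spark properties, but the script name, the helper function, and the tuned values are assumptions made up for this example.

```python
def spark_submit_args(script, executor_memory="1g", executor_cores=1):
    """Build a spark-submit command line with resources stated explicitly.

    The "1g" default mirrors the per-executor memory default mentioned in
    the text; the core count is an illustrative assumption.
    """
    return [
        "spark-submit",
        "--conf", f"spark.executor.memory={executor_memory}",
        "--conf", f"spark.executor.cores={executor_cores}",
        script,
    ]

# Hypothetical benchmark script and machine sizing.
default_cmd = spark_submit_args("benchmark.py")
tuned_cmd = spark_submit_args("benchmark.py", executor_memory="16g", executor_cores=4)

print(" ".join(default_cmd))
print(" ".join(tuned_cmd))
```

Recording both command lines next to the results lets a reader see which configuration produced which number.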

Similarly, the APIs of software projects are complex, and for any given problem there is often both a fast way and a general-but-slow way. Authors naturally choose the fast way on their own system but inadvertently choose the general way that comes up first when reading documentation for the other project. It often takes months of hands-on experience to understand a project well enough to definitively say that you’re not doing things in a dumb way.

In both cases I think the only solution is to collaborate with someone that primarily uses the other system.

## Preference towards strengths

Oh hey, we’re doing really well here. This is great! Let’s dive into this a bit more.

It feels great to see your project doing well. This emotional pleasure response is powerful. It’s only natural that we pursue that feeling more, exploring different aspects of it. This can skew our writing as well. We’ll find that we’ve decided to devote 80% of the text to what originally seemed like a small set of features, but which now seems like the main point.

It’s important that we define a set of things we’re going to study ahead of time and then stick to those things. When we run into cases where our project fails we should take that as an opportunity to raise an issue for future (though not current) development.
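As a rough sketch of what "define ahead of time and stick to it" could look like in code, the suite below is declared once, every case is timed, and every result is reported, flattering or not. The two workloads are trivial stand-ins invented for this example, not real Dask or Spark benchmarks.

```python
import timeit

def sum_builtin(n=10_000):
    """Stand-in workload: sum with the builtin."""
    return sum(range(n))

def sum_loop(n=10_000):
    """Stand-in workload: the same sum with an explicit loop."""
    total = 0
    for i in range(n):
        total += i
    return total

# Declared before any measurement and not edited once timings come in.
SUITE = {
    "sum_builtin": sum_builtin,
    "sum_loop": sum_loop,
}

NUMBER = 20  # calls per timing run

def run_suite(suite, repeat=3, number=NUMBER):
    """Time every declared case and return all results, good or bad."""
    results = {}
    for name, func in suite.items():
        # The minimum over repeats is the conventional timeit summary.
        results[name] = min(timeit.repeat(func, repeat=repeat, number=number))
    return results

results = run_suite(SUITE)
for name in SUITE:
    print(f"{name}: {results[name]:.6f} s for {NUMBER} calls")
```

The point of the structure is that dropping a case requires editing `SUITE`, which is a visible change, rather than quietly leaving a number out of the write-up.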

## Tuning during experimentation

Oh, I know why this is slow. One sec, let me change something in the code.

I’m doing this right now. Dask dataframe shuffles are generally slower than Spark dataframe shuffles. On numeric data this used to be around a 2x difference; now it’s more like a 1.2x difference (at least on my current problem and machine). Overall this is great: seeing that another project was beating Dask motivated me to dive in (see dask/distributed #932), and this will result in a better experience for users in the future. As a developer this is also how I operate. I define a benchmark, profile my code, identify bottlenecks, and optimize. Business as usual.
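For readers who haven't run this loop themselves, here is a minimal sketch of the "profile and identify bottlenecks" step using only the standard library's `cProfile` and `pstats`. The workload functions are hypothetical stand-ins and have nothing to do with the actual Dask shuffle code.

```python
import cProfile
import io
import pstats

def slow_part(n):
    """Deliberately quadratic stand-in for a bottleneck."""
    return sum(i * j for i in range(n) for j in range(n))

def fast_part(n):
    """Cheap linear stand-in."""
    return sum(range(n))

def workload():
    return slow_part(200) + fast_part(200)

def profile_top(func, limit=5):
    """Profile func once and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(limit)
    return out.getvalue()

report = profile_top(workload)
print(report)
```

The report lists the functions where cumulative time concentrates, which is where optimization effort would go first.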

However as an author of a comparative benchmark this is also somewhat dishonest; I’m not giving the Spark developers the same opportunity to find and fix similar performance issues in their software before I publish my results. I’m also giving a biased picture to my readers. I’ve made all of the pieces that I’m going to show off fast while neglecting the others. Picking benchmarks, optimizing the project to make them fast, and then publishing those results gives the incorrect impression that the entire project has been optimized to that level.

## Omission

So, this didn’t go as planned. Let’s wait a few months until the next release.

There is no motivation to publish negative results. Unless of course you’ve just written a blogpost announcing that you plan to release benchmarks in the near future. Then you’re really forced to release numbers, even if they’re mixed.

That’s ok. Mixed numbers can be informative. They build trust and community. And we all talk about open source community driven software, so these should be welcome.

## Straight up bias

Look, we’re not in grad-school any more. We’ve got to convince companies to actually use this stuff.

Everything we’ve discussed so far assumes best intentions, and that the author is acting in good faith, but falling victim to basic human failings.

However many developers today (including myself) are paid and work for for-profit companies that need to make money. To an increasing extent making this money depends on community mindshare, which means publishing benchmarks that sway users to our software. Authors have bosses that they’re trying to impress or the content and tone of an article may be influenced by people within the company other than the stated author.

I’ve been pretty lucky working with Continuum Analytics (my employer) in that they’ve been pretty hands-off with technical writing. For other employers that may be reading: in some cases we’ve actually had an easier time getting business because of the honest tone in these blogposts. Potential clients generally have the sense that we’re trustworthy.

Technical honesty goes a surprisingly long way towards implying technical proficiency.

## March 07, 2017

### Continuum Analytics news

#### Self-Service Open Data Science: Custom Anaconda Management Packs for Hortonworks HDP and Apache Ambari

Monday, March 6, 2017
Kristopher Overholt
Continuum Analytics

Daniel Rodriguez
Continuum Analytics

As part of our partnership with Hortonworks, we’re excited to announce a new self-service feature of the Anaconda platform that can be used to generate custom Anaconda management packs for the Hortonworks Data Platform (HDP) and Apache Ambari. This functionality is now available in the Anaconda platform as part of the Anaconda Scale and Anaconda Repository platform components.

The ability to generate custom Anaconda management packs makes it easy for system administrators to provide data scientists and analysts with the data science libraries from Anaconda that they already know and love. The custom management packs allow Anaconda to integrate with a Hortonworks HDP cluster along with Hadoop, Spark, Jupyter Notebooks, and Apache Zeppelin.

## anaconda-mpack-a.png

Data scientists working with big data workloads want to use different versions of Anaconda, Python, R, and custom conda packages on their Hortonworks HDP clusters. Using custom management packs to manage and distribute multiple Anaconda installations across a Hortonworks HDP cluster is convenient because they work natively with Hortonworks HDP 2.3, 2.4, and 2.5+ and Ambari 2.2 and 2.4+ without the need to install additional software or services on the HDP cluster nodes.

Deploying multiple custom versions of Anaconda on a Hortonworks HDP cluster with Hadoop and Spark has never been easier! In this blog post, we’ll take a closer look at how we can create and install a custom Anaconda management pack using Anaconda Repository and Ambari then configure and run PySpark jobs in notebooks, including Jupyter and Zeppelin.

### Generating Custom Anaconda Management Packs for Hortonworks HDP

For this example, we’ve installed Anaconda Repository (which is part of the Anaconda Enterprise subscription) and created an on-premises mirror of more than 730 conda packages that are available in the Anaconda distribution and repository. We’ve also installed Hortonworks HDP 2.5.3 along with Ambari 2.4.2, Spark 1.6.2, Zeppelin 0.6.0, and Jupyter 4.3.1 on a cluster.

In Anaconda Repository, we can see a feature for Installers, which can be used to generate custom Anaconda management packs for Hortonworks HDP.

## anaconda-mpack-b.png

The Installers page describes how we can create custom Anaconda management packs for Hortonworks HDP that are served directly by Anaconda Repository from a URL.

## anaconda-mpack-c.png

After selecting the Create New Installer button, we can then specify the packages that we want to include in our custom Anaconda management pack, which we’ll name anaconda_hdp.

Then, we specify the latest version of Anaconda (4.3.0) and Python 2.7. We’ve added the anaconda package to include all of the conda packages that are included by default in the Anaconda installer. Specifying the anaconda package is optional, but it’s a great way to kickstart your custom Anaconda management pack with more than 200 of the most popular Open Data Science packages, including NumPy, Pandas, SciPy, matplotlib, scikit-learn and more.

## anaconda-mpack-d.png

In addition to the packages available in Anaconda, additional Python and R conda packages can be included in the custom management pack, including libraries for natural language processing, visualization, data I/O and other data analytics libraries such as azure, bcolz, boto3, datashader, distributed, gensim, hdfs3, holoviews, impyla, seaborn, spacy, tensorflow or xarray.

We could have also included conda packages from other channels in our on-premises installation of Anaconda Repository, including community-built packages from conda-forge or other custom-built conda packages from different users within our organization.

When you’re ready to generate the custom Anaconda management pack, press the Create Management Pack button.

After creating the custom Anaconda management pack, we’ll see a list of files that were generated, including the management pack file that can be used to install Anaconda with Hortonworks HDP and Ambari.

## anaconda-mpack-e.png

You can install the custom management pack directly from the HDP node running the Ambari server using a URL provided by Anaconda Repository. Alternatively, the anaconda_hdp-mpack-1.0.0.tar.gz file can be manually downloaded and transferred to the Hortonworks HDP cluster for installation.

Now we’re ready to install the newly created custom Anaconda management pack using Ambari.

### Installing Custom Anaconda Management Packs Using Ambari

Now that we’ve generated a custom Anaconda management pack, we can install it on our Hortonworks HDP cluster and make it available to all of the HDP cluster users for PySpark and SparkR jobs.

The management pack can be installed into Ambari by using the following command on the machine running the Ambari server.

    # ambari-server install-mpack --mpack=http://54.211.228.253:8080/anaconda/installers/anaconda/download/1.0.0/anaconda-mpack-1.0.0.tar.gz
    Using python  /usr/bin/python
    Installing management pack
    Ambari Server 'install-mpack' completed successfully.

After installing a management pack, the Ambari server must be restarted:

    # ambari-server restart

After the Ambari server restarts, navigate to the Ambari Cluster Dashboard UI in a browser:

## anaconda-mpack-f.png

Scroll down to the bottom of the list of services on the left sidebar, then click on the Actions > Add Services button:

## anaconda-mpack-g.png

This will open the Add Service Wizard:

## anaconda-mpack-h.png

In the Add Service Wizard, you can scroll down in the list of services until you see the name of the custom Anaconda management pack that you installed. Select the custom Anaconda management pack and click the Next button:

## anaconda-mpack-i.png

On the Assign Slaves and Clients screen, select the Client checkbox for each HDP node that you want to install the custom Anaconda management pack onto, then click the Next button:

## anaconda-mpack-j.png

On the Review screen, review the proposed configuration changes, then click the Deploy button:

## anaconda-mpack-k.png

Over the next few minutes, the custom Anaconda management pack will be distributed and installed across the HDP cluster:

## anaconda-mpack-l.png

And you’re done! The custom Anaconda management pack has installed Anaconda in /opt/continuum/anaconda on each HDP node that you selected, and Anaconda is active and ready to be used by Spark or other distributed frameworks across your Hortonworks HDP cluster.

Refer to the Ambari documentation for more information about using Ambari server with management packs, and refer to the HDP documentation for more information about using and administering your Hortonworks HDP cluster with Ambari.

### Using the Custom Anaconda Management Pack with spark-submit

Now that we’ve generated and installed the custom Anaconda management pack, we can use libraries from Anaconda with Spark, PySpark, SparkR or other distributed frameworks.

You can use the spark-submit command along with the PYSPARK_PYTHON environment variable to run Spark jobs that use libraries from Anaconda across the HDP cluster, for example:

    $ PYSPARK_PYTHON=/opt/continuum/anaconda/bin/python spark-submit pyspark_script.py

## anaconda-mpack-m.png

### Using the Custom Anaconda Management Pack with Jupyter

To work with Spark jobs interactively on the Hortonworks HDP cluster, you can use Jupyter Notebooks via Anaconda Enterprise Notebooks, which is a multi-user notebook server with collaborative features for your data science team and integration with enterprise authentication. Refer to our previous blog post on Using Anaconda with PySpark for Distributed Language Processing on a Hadoop Cluster for more information about configuring Jupyter with PySpark.

## anaconda-mpack-n.png

### Using the Custom Anaconda Management Pack with Zeppelin

You can also use Anaconda with Zeppelin on your HDP cluster. In HDP 2.5 and Zeppelin 0.6, you’ll need to configure Zeppelin to point to the custom version of Anaconda installed on the HDP cluster by navigating to Zeppelin Notebook > Configs > Advanced zeppelin-env in the Ambari Cluster Dashboard UI in your browser:

## anaconda-mpack-o_0.png

Scroll down to the zeppelin_env_content property, uncomment, and set the following line to match the location of the Anaconda installation on your HDP cluster nodes:

    export PYSPARK_PYTHON="/opt/continuum/anaconda/bin/python"

## anaconda-mpack-p.png

Then restart the Zeppelin service when prompted.

You should also configure the zeppelin.pyspark.python property in the Zeppelin PySpark interpreter to point to Anaconda (/opt/continuum/anaconda/bin/python):

## anaconda-mpack-q.png

Then restart the Zeppelin interpreter when prompted. Note that the PySpark interpreter configuration process will be improved and centralized in Zeppelin in a future version.
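After changing these settings, a quick sanity check is to ask Python which interpreter is actually running. The snippet below is a minimal sketch using only the standard library. The helper name is made up for this example, the prefix is the install path quoted above, and the check only inspects the process that runs it (for instance a notebook paragraph or a script passed to spark-submit), not every node in the cluster.

```python
import os
import sys

# Install path taken from the text above; adjust if Anaconda lives elsewhere.
ANACONDA_PREFIX = "/opt/continuum/anaconda"

def interpreter_report(expected_prefix=ANACONDA_PREFIX):
    """Describe which Python is running and whether it lives under expected_prefix."""
    executable = os.path.realpath(sys.executable)
    prefix = os.path.realpath(expected_prefix)
    return {
        "executable": executable,
        "version": sys.version.split()[0],
        "under_expected_prefix": executable.startswith(prefix + os.sep),
    }

print(interpreter_report())
```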
Once you’ve configured Zeppelin to point to the location of Anaconda on your HDP cluster, data scientists can run interactive Zeppelin notebooks with Anaconda and use all of the data science libraries they know and love in Anaconda with their PySpark and SparkR jobs:

## anaconda-mpack-r.png

### Get Started with Custom Anaconda Management Packs for Hortonworks in Your Enterprise

If you’re interested in generating custom Anaconda management packs for Hortonworks HDP and Ambari to empower your data science team, we can help! Get in touch with us by using our contact us page for more information about this functionality and our enterprise Anaconda platform subscriptions.

If you’d like to test-drive the enterprise features of Anaconda on a bare-metal, on-premises or cloud-based cluster, please contact us at sales@continuum.io.

### Matthieu Brucher

#### Announcement: Audio Unit updates

I’m happy to announce the updates of all OS X plugins based on the Audio Toolkit. They are available on OS X (min. 10.11) in AU, VST2 and VST3 formats.

This update is due to different reports on Logic Pro where these plugins were failing. This should now be fixed.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

## March 06, 2017

### Fernando Perez

#### "Literate computing" and computational reproducibility: IPython in the age of data-driven journalism

As "software eats the world" and we become awash in the flood of quantitative information denoted by the "Big Data" buzzword, it's clear that informed debate in society will increasingly depend on our ability to communicate information that is based on data. And for this communication to be a truly effective dialog, it is necessary that the arguments made based on data can be deconstructed, analyzed, rebutted or expanded by others.
Since these arguments in practice often rely critically on the execution of code (whether an Excel spreadsheet or a proper program), it means that we really need tools to effectively communicate narratives that combine code, data and the interpretation of the results.

I will point out here two recent examples, taken from events in the news this week, where IPython has helped this kind of discussion, in the hopes that it can motivate a more informed style of debate where all the moving parts of a quantitative argument are available to all participants.

## Insight, not numbers: from literate programming to literate computing

The computing community has for decades known about the "literate programming" paradigm introduced by Don Knuth in the 70's and fully formalized in his famous 1992 book. Briefly, Knuth's approach proposes writing computer programs in a format that mixes the code and a textual narrative together, and from this format generating separate files that will contain either actual code that can be compiled/executed by the computer, or a narrative document that explains the program and is meant for human consumption. The idea is that by allowing the authors to maintain a close connection between code and narrative, a number of benefits will ensue (clearer code, fewer programming errors, more meaningful descriptions than mere comments embedded in the code, etc).

I don't take any issue with this approach per se, but I don't personally use it because it's not very well suited to the kinds of workflows that I need in practice. These require the frequent execution of small fragments of code, in an iterative cycle where code is run to obtain partial results that inform the next bit of code to be written. Such is the nature of interactive exploratory computing, which is the bread and butter of many practicing scientists.
This is the kind of workflow that led me to creating IPython over a decade ago, and it continues to inform basically every decision we make in the project today.

As Hamming famously said in 1962, "The purpose of computing is insight, not numbers." IPython tries to help precisely in this kind of usage pattern of the computer, in contexts where there is no clear notion in advance of what needs to be done, so the user is the one driving the computation. However, IPython also tries to provide a way to capture this process, and this is where we join back with the discussion above: while literate programming (LP) focuses on providing a narrative description of the structure of an algorithm, our working paradigm is one where the act of computing occupies the center stage.

From this perspective, we therefore refer to the workflow exposed by these kinds of computational notebooks (not just IPython, but also Sage, Mathematica and others), as "literate computing": it is the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components. For the goals of communicating results in scientific computing and data analysis, I think this model is a better fit than the literate programming one, which is rather aimed at developing software in tight concert with its design and explanatory documentation. I should note that we have some ideas on how to make IPython stronger as a tool for "traditional" literate programming, but it's a bit early for us to focus on that, as we first want to solidify the computational workflows possible with IPython.

As I mentioned in a previous blog post about the history of the IPython notebook, the idea of a computational notebook is neither new nor ours. Several IPython developers used other similar systems extensively for a long time and we took lots of inspiration from them.
What we have tried to do, however, is to take a fresh look at these ideas, so that we can build a computational notebook that provides the best possible experience for computational work today. That means taking the existence of the Internet as a given in terms of using web technologies, an architecture based on well-specified protocols and reusable low-level formats (JSON), a language-agnostic view of the problem and a concern about the entire cycle of computing from the beginning. We want to build a tool that is just as good for individual experimentation as it is for collaboration, communication, publication and education.

## Government debt, economic growth and a buggy Excel spreadsheet: the code behind the politics of fiscal austerity

In the last few years, extraordinarily contentious debates have raged in the circles of political power and fiscal decision making around the world, regarding the relation between government debt and economic growth. One of the centerpieces of this debate was a paper from Harvard economists C. Reinhart and K. Rogoff, later turned into a best-selling book, that argued that beyond 90% debt ratios, economic growth would plummet precipitously. This argument was used (amongst others) by politicians to justify some of the extreme austerity policies that have been foisted upon many countries in the last few years.

On April 15, a team of researchers from U. Massachusetts published a re-analysis of the original data where they showed how Reinhart and Rogoff had made both fairly obvious coding errors in their original Excel spreadsheets as well as some statistically questionable manipulations of the data. Herndon, Ash and Pollin (the U. Mass authors) published all their scripts in R so that others could inspect their calculations.

Two posts from the Economist and the Roosevelt Institute nicely summarize the story with a more informed policy and economics discussion than I can make.
James Kwak has a series of posts that dive into technical detail and question the horrible choice of using Excel, a tool that should for all intents and purposes be banned from serious research as it entangles code and data in ways that more or less guarantee serious errors in anything but trivial scenarios. Victoria Stodden just wrote an excellent new post with specific guidance on practices for better reproducibility; here I want to take a narrow view of these same questions focusing strictly on the tools.

As reported in Mike Konczal's piece at the Roosevelt Institute, Herndon et al. had to reach out to Reinhart and Rogoff for the original code, which hadn't been made available before (apparently causing much frustration in economics circles). It's absolutely unacceptable that major policy decisions that impact millions worldwide had until now hinged effectively on the unverified word of two scientists: no matter how competent or honorable they may be, we know everybody makes mistakes, and in this case there were both egregious errors and debatable assumptions. As Konczal says, "all I can hope is that future historians note that one of the core empirical points providing the intellectual foundation for the global move to austerity in the early 2010s was based on someone accidentally not updating a row formula in Excel." To that I would add the obvious: this should never have happened in the first place, as we should have been able to inspect that code and data from the start.

Now, moving over to IPython, something interesting happened: when I saw the report about the Herndon et al. paper and realized they had published their R scripts for all to see, I posted this request on Twitter:

It seemed to me that the obvious thing to do would be to create a document that explained together the analysis and a bit of narrative using IPython, hopefully more easily used as a starting point for further discussion.
What I didn't really expect is that it would take less than three hours for Vincent Arel-Bundock, a PhD Student in Political Science at U. Michigan, to come through with a solution:

I suggested that he turn this example into a proper repository on github with the code and data, which he quickly did:

So now we have a full IPython notebook, kept in a proper github repository. This repository can enable an informed debate about the statistical methodologies used for the analysis, and now anyone who simply installs the SciPy stack can not only run the code as-is, but explore new directions and contribute to the debate in a properly informed way.

## On to the heavens: the New York Times' infographic on NASA's Kepler mission

As I was discussing the above with Vincent on Twitter, I came across this post by Jonathan Corum, an information designer who works as NY Times science graphics editor:

The post links to a gorgeous, animated infographic that summarizes the results that NASA's Kepler spacecraft has obtained so far, and which accompanies a full article at the NYT on Kepler's most recent results: a pair of planets that seem to have just the right features to possibly support life, a quick 1200 light-years hop from us.

Jonathan indicated that he converted his notebook to a Python script later on for version control and automation, though I explained to him that he could have continued using the notebook, since the --script flag would give him a .py file if needed, and it's also possible to execute a notebook just like a script, with a bit of additional support code:

In this case Jonathan's code isn't publicly available, but I am still very happy to see this kind of usage: it's a step in the right direction already and as more of this analysis is done with open-source tools, we move further towards the possibility of an informed discussion around data-driven journalism.
I also hope he'll release perhaps some of the code later on, so that others can build upon it for similar analyses. I'm sure lots of people would be interested and it wouldn't detract in any way from the interest in his own work which is strongly tied to the rest of the NYT editorial resources and strengths.

## Looking ahead from IPython's perspective

Our job with IPython is to think deeply about questions regarding the intersection of computing, data and science, but it's clear to me at this point that we can contribute in contexts beyond pure scientific research. I hope we'll be able to provide folks who have a direct intersection with the public, such as journalists, with tools that help a more informed and productive debate.

Coincidentally, UC Berkeley will be hosting on May 4 a symposium on data and journalism, and in recent days I've had very productive interactions with folks in this space on campus. Cathryn Carson currently directs the newly formed D-Lab, whose focus is precisely the use of quantitative and data methods in the social sciences, and her team has recently been teaching workshops on using Python and R for social scientists. And just last week I lectured in Raymond Yee's course (from the School of Information) where they are using the notebook extensively, following Wes McKinney's excellent Python for Data Analysis as the class textbook. Given all this, I'm fairly optimistic about the future of a productive dialog and collaborations on campus, given that we have a lot of the IPython team working full-time here.

Note: as usual, this post is available as an IPython notebook in my blog repo.

#### The IPython notebook: a historical retrospective

On December 21 2011, we released IPython 0.12 after an intense 4 1/2 months of development.
Along with a number of new features and bug fixes, the main highlight of this release is our new browser-based interactive notebook: an environment that retains all the features of the familiar console-based IPython but provides a cell-based execution workflow and can contain not only code but any element a modern browser can display. This means you can create interactive computational documents that contain explanatory text (including LaTeX equations rendered in-browser via MathJax), results of computations, figures, video and more. These documents are stored in a version-control-friendly JSON format that is easy to export as a pure Python script, reStructuredText, LaTeX or HTML.

For the IPython project this was a major milestone, as we had wanted for years to have such a system, and it has generated a fair amount of interest online. In particular, on our mailing list a user asked us about the relationship between this effort and the well-known and highly capable Sage Notebook. In responding to the question, I ended up writing up a fairly detailed retrospective of our path to get to the IPython notebook, and it seemed like a good idea to put this up as a blog post to encourage discussion beyond the space of a mailing list, so here it goes (the original email that formed the base of this post, in case anyone is curious about the context).

The question that was originally posed by Oleg Mikulchenklo was:

What is the relation and comparison between the IPython notebook and the Sage notebook? Can someone provide motivation and roadmap for the IPython notebook as an alternative to the Sage notebook?

I'll try to answer that now...

Early efforts: 2001-2005

Let me provide some perspective on this, since it's a valid question that is probably in the minds of others as well. This is a long post, but I'm trying to do justice to over 10 years of development, multiple interactions between the two projects and the contributions of many people.
I apologize in advance to anyone I've forgotten, and please do correct me in the comments, as I want to have a full record that's reasonably trustworthy.

Let's go back to the beginning: when I started IPython in late 2001, I was a graduate student in physics at CU Boulder, and had used extensively first Maple, then Mathematica, both of which have notebook environments. I also used Pascal (earlier) then C/C++, but those two (plus IDL for numerics) were the interactive environments that I knew well, and my experience with them shaped my views on what a good system for everyday scientific computing should look like. In particular, I was a heavy user of the Mathematica notebooks and liked them a lot.

I started using Python in 2001 and liked the language, but its interactive prompt felt like a crippled toy compared to the systems mentioned above or to a Unix shell. When I found out about sys.displayhook, I realized that by putting in a callable object, I would be able to hold state and capture previous results for reuse. I then wrote a python startup file to provide these features and some other niceties such as loading Numeric and Gnuplot, giving me a 'mini-mathematica' in Python (femto- might be a better description, in fairness). Thus was my 'ipython-0.0.1' born, a mere 259 lines to be loaded as $PYTHONSTARTUP.
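The sys.displayhook idea can be illustrated with a few lines of modern Python. This is only a sketch of the general technique (a callable object that holds state and records results), not a reconstruction of the original startup file. The class name and the `Out[n]` label format are assumptions chosen for this example.

```python
import builtins
import sys

class HistoryDisplayHook:
    """Callable display hook that remembers every non-None result."""

    def __init__(self):
        self.history = []

    def __call__(self, value):
        if value is None:
            return  # like the default hook, None is neither shown nor stored
        self.history.append(value)
        builtins._ = value  # keep the usual "last result" shortcut
        print(f"Out[{len(self.history)}]: {value!r}")

hook = HistoryDisplayHook()
sys.displayhook = hook  # in an interactive session, results now pass through hook

# Outside a REPL the hook can be called directly to see the effect.
hook(2 + 2)
hook("spam".upper())
hook(None)

sys.displayhook = sys.__displayhook__  # restore the default hook
print(hook.history)
```

Because the hook is an object rather than a plain function, earlier results stay available in `hook.history` for reuse, which is the state-holding behaviour the paragraph above describes.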

I also read an article that mentioned two good interactive systems for Python, LazyPython and IPP, not surprisingly also created by scientists.  I say this because the natural flow of scientific computing pretty much mandates a solid interactive environment, so while other Python users and developers may like having occasional access to interactive facilities, scientists more or less demand them.  I contacted their authors,  Nathan Gray and Janko Hauser, seeking to join forces to create IPython;  they were both very gracious and let me use their code, but didn't have the time to participate in the effort.  As any self-respecting graduate student with a dissertation deadline looming would do, I threw myself full-time into building the first 'real' IPython by merging my code with both of theirs (eventually I did graduate, by the way).

The point of this little trip down memory lane is to show how from the very beginning, Mathematica and its notebooks (and the Maple worksheets before) were in my mind as the ideal environment for daily scientific work. In 2005 we had two Google SoC students and we took a stab at building, using Wx, a notebook system.  Robert Kern then put some more work into the problem, but unfortunately that prototype never really became fully usable.

Sage bursts into the scene

In early 2006, William Stein organized the first Sage Days at UCSD and invited me; William and I had been in touch since 2005 as he was using IPython for the Sage terminal interface.  I  suggested Robert Kern come as well, and he demoed the notebook prototype he had at that point. It was very clear that the system wasn't production ready, and William was already starting to think about a notebook-like system for Sage as well. Eventually he started working on a browser-based system, and by Sage Days 2 in October 2006, as shown by the coding sprint topics, the Sage notebook was already usable.

For Sage, going at it separately was completely reasonable and justified: we were moving slowly and by that point we weren't even convinced the Wx approach would go anywhere. William is a force of nature and was trying to get Sage to be very usable very fast, so building something integrated for his needs was certainly the right choice.

We continued slowly working on IPython, and actually had another attempt at a notebook-type system in 2006-2007. By that point Brian Granger and Min Ragan-Kelley had come on board and we had built the Twisted-based parallel tools. Using this, Min got a notebook prototype working using an SQL/SQLAlchemy backend.  We had the opportunity to work on many of these ideas during a workshop on Interactive Parallel Computation that William and I co-organized (along with others).  Like Sage, this prototype used a browser for the client but it tried to retain the 'IPython experience', something the Sage notebook didn't provide.

Keeping the IPython experience in the notebook

This is a key difference between our approach and the Sage notebook, so it's worth clarifying what I mean, the key point being the execution model and its relation to the filesystem.  The Sage notebook took the route of using the filesystem for notebook operations, so you can't meaningfully use 'ls' in it or move around the filesystem yourself with 'cd', because Sage will always execute your code in hidden directories with each cell actually being a separate subdirectory.  This is a perfectly valid approach and has a number of very good consequences for the Sage notebook, but it is also very different from the IPython model where we always keep the user very close to the filesystem and OS.  For us, it's really important that you can access local scripts, use %run, see arbitrary files conveniently, etc., as these are routine needs in data analysis and numerical simulation.

Furthermore, we wanted a notebook that would provide the entire IPython experience, meaning that magics, aliases, syntax extensions and all other special IPython features worked the same in the notebook and terminal.  The Sage notebook reimplemented some of these things in its own way: they reused the % syntax but it has a different meaning, they took some of the IPython introspection code and built their own x?/?? object introspection system, etc. In some cases it's almost like IPython but in others the behavior is fairly different; this is fine for Sage but doesn't work for us.

So we continued with our own efforts, even though by then the Sage notebook was fairly mature.  For a number of reasons (I honestly don't recall all the details), Min's browser-based notebook prototype also never reached production quality.

Breaking through our bottleneck and ZeroMQ

Eventually, in the summer of 2009 we were able to fund Brian to work full-time on IPython, thanks to Matthew Brett and Jarrod Millman, with resources from the NiPy project.  Brian could then dig into the heart of the beast, and attack the fundamental problem that made IPython development so slow and hard: the fact that the main codebase was an outgrowth of that original merge from 2001 of my hack, IPP and LazyPython, by now having become an incomprehensible and terribly interconnected mess with barely any test suite.  Brian was able to devote a summer full-time to dismantling these pieces and reassembling them so that they would continue to work as before (with only minimal regressions), but now in a vastly more approachable and cleanly modularized codebase.

This is where early 2010 found us, and then zerendipity struck: while on a month-long teaching trip to Colombia I read an article about ZeroMQ and talked to Brian about it, as it seemed to provide the right abstractions for us with a simpler model than Twisted.  Brian then blew me away, coming back in two days with a new set of clean Cython-based bindings: we now had pyzmq! It became clear that we had the right tools to build a two-process implementation of IPython that could give us the 'real IPython' but communicating with a different frontend, and this is precisely what we wanted for cleaner parallel computing, multiprocess clients and a notebook.

When I returned from Colombia I had a free weekend and drove down from Berkeley to San Luis Obispo.  Upon arriving at Brian's place I didn't even have zeromq installed nor had I read any docs about it.  I installed it, and Brian simply told me what to type in IPython to import the library and open a socket, while he had another one open on his laptop.  We then started exchanging messages from our IPython sessions.  The fact that we could be up and running this fast was a good sign that the library was exactly what we wanted.  We coded frantically in parallel: one of us wrote the kernel and the other the client, and we'd debug one of them while leaving the other running in the meantime.  It was the perfect blend of pair programming and simultaneous development, and in just two days we had a prototype of a python shell over zmq working, proving that we could indeed build everything we needed.  Incidentally, that code may still be useful to someone wanting to understand our basic ideas or how to build an interactive client over ZeroMQ, so I've posted it for reference as a standalone github repository.
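
The kernel/client split described above can be sketched without ZeroMQ at all. The example below swaps in standard-library queues and a thread in place of zmq sockets and separate processes, and the message fields (type, code, status, result) are made up for illustration; it is not the weekend prototype nor the real IPython messaging protocol, only the shape of the idea.

```python
import json
import queue
import threading


def kernel(requests, replies):
    """Evaluate code strings from `requests` until a shutdown message arrives."""
    namespace = {}  # one namespace shared by all requests
    while True:
        msg = json.loads(requests.get())
        if msg["type"] == "shutdown":
            break
        try:
            # Hypothetical message layout; expressions only, for brevity.
            result = eval(msg["code"], namespace)
            reply = {"status": "ok", "result": repr(result)}
        except Exception as exc:
            reply = {"status": "error", "result": repr(exc)}
        replies.put(json.dumps(reply))


def execute(code, requests, replies):
    """Client side: send one request and block until the reply arrives."""
    requests.put(json.dumps({"type": "execute", "code": code}))
    return json.loads(replies.get())


requests, replies = queue.Queue(), queue.Queue()
worker = threading.Thread(target=kernel, args=(requests, replies))
worker.start()

first = execute("6 * 7", requests, replies)
second = execute("1 / 0", requests, replies)

requests.put(json.dumps({"type": "shutdown"}))
worker.join()
```

Because the two sides only share serialized messages, either one can be replaced independently, which is what later allowed different frontends to drive the same kernel.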

Shortly thereafter, we had discussions with Eric Jones and Travis Oliphant at Enthought, who offered to support Brian and me to work in collaboration with Evan Patterson, and build a Qt console for IPython using this new design. Our little weekend prototype had been just a proof of concept, but their support allowed us to spend the time necessary to apply the same ideas to the real IPython. Brian and I would build a zeromq kernel with all the IPython functionality, while Evan built a Qt console that would drive it using our communications protocol.  This worked extremely well, and by late 2010 we had a more or less complete Qt console working:

Over the summer of 2010, Omar Zapata and Gerardo Gutierrez worked as part of the Google Summer of Code project and started building both terminal- and Qt-based clients for IPython on top of ZeroMQ.  Their task was made much harder because we hadn't yet refactored all of IPython to use zmq, but the work they did provided critical understanding of the problem at this point, and eventually by 0.12 much of it has been finally merged.

The value and correctness of this architecture became clear when Brian, Min and I met with the Enthought folks and Shahrokh Mortazavi and Dino Viehland from Microsoft.  After a single session explaining to Dino and Shahrokh our design and pointing them to our github repository, they were able to build support for IPython into the new Python Tools for Visual Studio, without ever asking us a single question:

In October 2010 James Gao (a Berkeley neuroscience graduate student) wrote up a quick prototype of a web notebook, demonstrating again that this design really worked well and could be easily used by a completely different client:

And finally, in the summer of 2011 Brian took James' prototype and built up a fully working system, this time using websockets, the Tornado web server, JQuery for Javascript, CodeMirror for code editing, and MathJax for LaTeX rendering.  Ironically, we had looked at Tornado in early 2010 along with ZeroMQ as a candidate for our communications, but dismissed it as it wasn't really the tool for that job; it now turned out to be the perfect fit for an asynchronous http server with Websockets support.

We merged Brian's work in late August while working on IRC from a boarding room at the San Francisco airport, just in time for me to present it at the EuroSciPy 2011 conference.  We  then polished it over the next few months to finally release it as part of IPython 0.12:

Other differences with the Sage notebook

We deliberately wrote the IPython notebook to be a lightweight, single-user program that feels like any other local application.  The Sage notebook draws many parallels with the google docs model, by default requiring a login and showing all of your notebooks together, kept in a location separate from the rest of your files.  In contrast, we want the notebook to just start like any other program and for the ipynb files to be part of your normal workflow, ready to be version-controlled just like any other, stored in your normal folders and easy to manage on their own. Update: as noted by Jason Grout, the Sage notebook was designed from the start to scale to big centralized multi-user servers (sagenb.org, with about 76,000 accounts, is a good example).  The notebook that runs in the local user's computer is the same as the one in these large public servers.

There are other deliberate differences of interface and workflow:

• We keep our In/Out prompts explicit because we have an entire system of caching variables that uses those numbers, and because those numbers give the user a visual clue of the execution order of cells, which may differ from the document's order.
• We deliberately chose a structured JSON format for our documents. It's clear enough for human reading while allowing easy and powerful machine manipulation without having to write our own parsing.  So writing utilities like a reStructuredText or LaTeX converter is very easy, as we recently showed.
• Our move to zmq allowed us (thanks to Thomas Kluyver's tireless work) to ship the notebook working both on Python2 and Python3 out of the box.  The current version of the  Sage notebook only works on Python2, in part due to its use of Twisted.  Update: William pointed out to me that the upcoming 5.0 version of the notebook will have a vastly reduced dependency on Twisted, so this will soon be less of an issue for Sage.
• Because our notebook works in the normal filesystem, and lets you create .py files right next to the .ipynb just by passing --script at startup, you can reuse your notebooks like normal scripts, import one notebook from another or a normal python script, etc.  I'm not sure how to import a Sage notebook from a normal python file, or if it's even possible.
• We have a long list of plans for the document format: multi-sheet capabilities, LaTeX-style preamble, per-cell metadata, structural cells to allow outline-level navigation and manipulation such as in LyX, improved literate programming and validation/reproducibility support, ... For that, we need to control the document format ourselves so we can evolve it according to our needs and ideas.
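
As a rough illustration of the JSON point in the list above, the sketch below turns a notebook-like JSON document into a plain script using only the standard library. The document layout here (a flat cells list with cell_type and source keys) is a deliberately simplified assumption, not the actual ipynb schema, and to_script is a hypothetical helper rather than part of IPython.

```python
import json

# Hypothetical, heavily simplified notebook document (NOT the real ipynb schema).
document = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": "A short explanation."},
        {"cell_type": "code", "source": "x = 2\nprint(x ** 10)"},
        {"cell_type": "code", "source": "y = x + 1"},
    ]
})


def to_script(text):
    """Return the cells of a notebook-like JSON string as one Python script."""
    cells = json.loads(text)["cells"]
    chunks = []
    for cell in cells:
        if cell["cell_type"] == "code":
            chunks.append(cell["source"])
        else:
            # Keep prose as comments so the script stays readable.
            chunks.extend("# " + line for line in cell["source"].splitlines())
    return "\n\n".join(chunks) + "\n"


script = to_script(document)
```

No custom parser is involved: json.loads does all the reading, so a converter is mostly a loop over cells.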

As you see, there are indeed a number of key differences between our notebook and the sage one, but there are very good technical reasons for this.  The notebook integrates with our architecture and leverages it; you can for example use the interactive debugger via a console or qtconsole against a notebook kernel, something not possible with the sage notebook.

In addition, Sage is GPL licensed while IPython is BSD licensed.  This means we can not directly reuse their code, though when we have asked them to relicense specific pieces of code to us, they have always agreed to do so. But large-scale reuse of Sage code in IPython is not really viable.

The value of being the slowest in the race

As this long story shows, it has taken us a very long time to get here. But what we have now makes a lot of sense for us, even considering the existence of the Sage notebook and how good it is for many use cases. Our notebook is just one particular aspect of a large and rich architecture built around the concept of a Python interpreter abstracted over a JSON-based, explicitly defined communications protocol.  Even considering purely http clients, the notebook is still just one of many possible: you can easily build an interface that only evaluates a single cell with a tiny bit of javascript like the Sage single cell server, for example.

Furthermore, since Min also reimplemented our parallel machinery completely with pyzmq, now we have one truly common codebase for all of IPython. We still need to finish up a bit of integration between the interactive kernels and the parallel ones, but we plan to finish that soon.

In many ways, our slow pace of development paid off:
• We had multiple false starts that helped us better understand the hard parts of the problem and where the dead ends would lie.
• We were still thinking about this all the time: even when we couldn't spare the time to actively work on it, we had no end of discussions on these things over the years (esp. Brian, Min and I, but also with others at meetings and conferences).
• The Sage notebook was a great trailblazer showing both what could be done, and also how there were certain decisions that we wanted to make differently.
• The technology of some critical third-party tools caught up in an amazing way: ZeroMQ, Tornado, WebSockets, MathJax, and the fast and capable Javascript engines in modern browsers along with good JS libraries. Without these tools we couldn't possibly have implemented what we have now.
As much as we would have loved to have a solid notebook years ago in IPython, I'm actually happy at how things turned out.  We have now a very nice mix of our own implementation for the things that are really within our scope, and leveraging third party tools for critical parts that we wouldn't want to implement ourselves.

What next?

We have a lot of ideas for the notebook, as we want it to be the best possible environment for modern computational work (scientific work is our focus, but not its only use), including research, education and publication, with consistent support for clean and reproducible practices throughout.  We are fairly confident that the core design and architecture are extremely solid, and we already have a long list of ideas and improvements we want to make.  We are limited only by manpower and time, so please join us on github and pitch in!

Since this post was motivated by questions about Sage, I'd like to emphasize that we have had multiple, productive collaborations with William and other Sage developers in the past, and I expect that to continue to be the case.  On certain points that collaboration has already led to convergence; e.g. the new Sage single cell server uses the IPython messaging protocol, after we worked closely with Jason Grout during Sage Days 29 in March 2011 thanks to William's invitation.  Furthermore, William's invitations to several Sage Days events, as well as the workshops we have organized together over the years, offered multiple opportunities for collaboration and discussion that proved critical on the way to today's results.

In the future we may find other areas where we can reuse tools or approaches common to Sage and IPython.  It is clear to us that the Sage notebook is a fantastic system, it just wasn't the right fit for IPython. I hope this very long post illustrates why, as well as providing some insights into our vision for scientific computing.

Last, but not least

From this post it should be obvious that today's IPython is the result of the work of many talented people over the years, and I would like to thank all the developers and users who contribute to the project.  But it's especially important to recognize the stunning quality and quantity of work that Brian Granger and Min Ragan-Kelley have done for this to be possible.  Brian and I did our PhDs together at CU and we have been close friends since then. Min was an undergraduate student of Brian's while he was a professor at U. Santa Clara and the first IPython parallel implementation using Twisted was his senior thesis project; he is now a PhD student at Berkeley (where I work) so we continue to be able to easily collaborate.  Building a project like IPython with partners of such talent, dedication, tenacity and generous spirit is a wonderful experience. Thanks, guys!

Please notify me in the comments of any inaccuracies in the above, especially if I failed to credit someone.

#### Python goes to Reno: SIAM CSE 2011

In what's becoming a bit of a tradition, Simula's Hans-Petter Langtangen, U. Washington's Randy LeVeque and I co-organized yet another minisymposium on Python for Scientific computing at a SIAM conference.

At the Computational Science and Engineering 2011 meeting, held in Reno February 28-March 4, we had 2 sessions with 4 talks each (part I and II).  I have put together a page with all the slides I got from the various speakers, that also includes slides from python-related talks in other minisymposia.  I have also posted some pictures from our sessions and from the minisymposium on reproducible research that my friend and colleague Jarrod Millman organized during the same conference.

We had great attendance, with a standing-room-only crowd for the first session, something rather unusual during the parallel sessions of a SIAM conference.  But more importantly, this year there were three other sessions entirely devoted to Python in scientific computing at the conference, organized completely independently from ours.  One focused on PDEs and the other on optimization.  Furthermore, there were scattered talks at several other sessions where Python was explicitly discussed in the title or abstract.  For all of these, I have collected the slides I was able to get; if you have slides for one such talk I failed to include, please contact me and I'll be happy to post them there.

Unfortunately for our audience, we had last-minute logistical complications that prevented Robert Bradshaw and John Hunter from attending, so I had to deliver the Cython and matplotlib talks (in addition to my IPython one).  Having a speaker give three back-to-back talks isn't ideal, but both of them kindly prepared all the materials and "delivered" them to me over skype the day before, so hopefully the audience got a reasonable simile of their original intent. It's a shame, since I know first-hand how good both of them are as speakers, but canceling talks on these two key tools would really have been a disservice to everyone; my thanks go to the SIAM organizers who were flexible enough to allow for this to happen.  Given how packed the room was, I'm sure we made the right choice.

It's now abundantly clear from this level of interest that Python is being very successful in solving real problems in scientific computing.  We've come a long way from the days when some of us (I have painful memories of this) had to justify to our colleagues/advisors why we wanted to 'play' with this newfangled 'toy' instead of just getting on with our job using the existing tools (in my case it was IDL, a hodgepodge of homegrown shell/awk/sed/perl scripting, custom C and some Gnuplot thrown in the mix for good measure).  Things are by no means perfect, and there's plenty of problems to solve, but we have a great foundation, a number of good quality tools that continue to improve as well as our most important asset: a rapidly growing community that is solving new problems, creating new libraries and coming up with innovative approaches to computational and mathematical questions, often facilitated by Python's tremendous flexibility. It's been a fun ride so far, but I suspect the next decade is going to be even more interesting.  If you missed this, try to make it to SciPy 2011 or EuroSciPy 2011!


#### Blogging with the IPython notebook

Update (May 2014): Please note that these instructions are outdated. While it is still possible (and in fact easier) to blog with the Notebook, the exact process has changed now that IPython has an official conversion framework. However, Blogger isn't the ideal platform for that (though it can be made to work). If you are interested in using the Notebook as a tool for technical blogging, I recommend looking at Jake VanderPlas's Pelican support or Damián Avila's support in Nikola.

Update: made full github repo for blog-as-notebooks, and updated instructions on how to more easily configure everything and use the newest nbconvert for a more streamlined workflow.

Since the notebook was introduced with IPython 0.12, it has proved to be very popular, and we are seeing great adoption of the tool and the underlying file format in research and education. One persistent question we've had since the beginning (even prior to its official release) was whether it would be possible to easily write blog posts using the notebook. The combination of easy editing in markdown with the notebook's ability to contain code, figures and results, makes it an ideal platform for quick authoring of technical documents, so being able to post to a blog is a natural request.

Today, in answering a query about this from a colleague, I decided to check again the status of our conversion pipeline, and I'm happy to report that with a bit of elbow-grease, at least on Blogger things work pretty well!

This post was entirely written as a notebook, and in fact I have now created a github repo, which means that you can see it directly rendered in IPython's nbviewer app.

The purpose of this post is to quickly provide a set of instructions on how I got it to work, and to test things out. Please note: this requires code that isn't quite ready for prime-time and is still under heavy development, so expect some assembly.

## Converting your notebook to html with nbconvert

The first thing you will need is our nbconvert tool that converts notebooks across formats. The README file in the repo contains the requirements for nbconvert (basically python-markdown, pandoc, docutils from SVN and pygments).

Once you have nbconvert installed, you can convert your notebook to Blogger-friendly html with:

nbconvert -f blogger-html your_notebook.ipynb


This will leave two files on your computer, one named your_notebook.html and one named your_notebook_header.html; it might also create a directory called your_notebook_files if needed for ancillary files. The first file will contain the body of your post and can be pasted wholesale into the Blogger editing area. The second file contains the CSS and Javascript material needed for the notebook to display correctly; you should only need to use this once to configure your Blogger setup (see below):

# Only one notebook so far
(master)longs[blog]> ls
120907-Blogging with the IPython Notebook.ipynb  fig/  old/

# Now run the conversion:
(master)longs[blog]> nbconvert.py -f blogger-html 120907-Blogging\ with\ the\ IPython\ Notebook.ipynb

# This creates the header and html body files
(master)longs[blog]> ls
120907-Blogging with the IPython Notebook_header.html  fig/
120907-Blogging with the IPython Notebook.html         old/
120907-Blogging with the IPython Notebook.ipynb


## Configuring your Blogger blog to accept notebooks

The notebook uses a lot of custom CSS for formatting input and output, as well as Javascript from MathJax to display mathematical notation. You will need all this CSS and the Javascript calls in your blog's configuration for your notebook-based posts to display correctly:

1. Once authenticated, go to your blog's overview page by clicking on its title.
2. Click on templates (left column) and customize using the Advanced options.
3. Scroll down the middle column until you see an "Add CSS" option.
4. Copy the entire contents of the _header file into the CSS box.

That's it, and you shouldn't need to do anything else as long as the CSS we use in the notebooks doesn't drastically change. This customization of your blog needs to be done only once.

While you are at it, I recommend you change the width of your blog so that cells have enough space for clean display; in experimenting I found out that the default template was too narrow to properly display code cells, producing a lot of text wrapping that impaired readability. I ended up using a layout with a single column for all blog contents, putting the blog archive at the bottom. Otherwise, if I kept the right sidebar, code cells got too squished in the post area.

I also had problems using some of the fancier templates available from 'Dynamic Views', in that I could never get inline math to render. But sticking to those from the Simple or 'Picture Window' categories worked fine and they still allow for a lot of customization.

Note: if you change blog templates, Blogger does destroy your custom CSS, so you may need to repeat the above steps in that case.

## Adding the actual posts

Now, whenever you want to write a new post as a notebook, simply convert the .ipynb file to blogger-html and copy its entire contents to the clipboard. Then go to the 'raw html' view of the post, remove anything Blogger may have put there by default, and paste. You should also click on the 'options' tab (right hand side) and select both Show HTML literally and Use <br> tag, else your paragraph breaks will look all wrong.

That's it!

## What can you put in?

I will now add a few bits of code, plots, math, etc, to show which kinds of content can be put in and work out of the box. These are mostly bits copied from our example notebooks so the actual content doesn't matter, I'm just illustrating the kind of content that works.

In [1]:
# Let's initialize pylab so we can plot later
%pylab inline

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.


With pylab loaded, the usual matplotlib operations work

In [2]:
x = linspace(0, 2*pi)
plot(x, sin(x), label=r'$\sin(x)$')
plot(x, cos(x), 'ro', label=r'$\cos(x)$')
title(r'Two familiar functions')
legend()

Out [2]:
<matplotlib.legend.Legend at 0x3128610>

The notebook, thanks to MathJax, has great LaTeX support, so that you can type inline math $(1,\gamma,\ldots, \infty)$ as well as displayed equations:

$$e^{i \pi}+1=0$$

but by loading the sympy extension, it's easy to showcase math output from Python computations, where we don't type the math expressions in text, and instead the results of code execution are displayed in mathematical format:

In [3]:
%load_ext sympyprinting
import sympy as sym
from sympy import *
x, y, z = sym.symbols("x y z")


From simple algebraic expressions

In [4]:
Rational(3,2)*pi + exp(I*x) / (x**2 + y)

Out [4]:
$$\frac{3}{2} \pi + \frac{e^{\mathbf{\imath} x}}{x^{2} + y}$$
In [5]:
eq = ((x+y)**2 * (x+1))
eq

Out [5]:
$$\left(x + 1\right) \left(x + y\right)^{2}$$
In [6]:
expand(eq)

Out [6]:
$$x^{3} + 2 x^{2} y + x^{2} + x y^{2} + 2 x y + y^{2}$$

To calculus

In [7]:
diff(cos(x**2)**2 / (1+x), x)

Out [7]:
$$- 4 \frac{x \operatorname{sin}\left(x^{2}\right) \operatorname{cos}\left(x^{2}\right)}{x + 1} - \frac{\operatorname{cos}^{2}\left(x^{2}\right)}{\left(x + 1\right)^{2}}$$

For more examples of how to use sympy in the notebook, you can see our example sympy notebook or go to the sympy website for much more documentation.

## You can easily include formatted text and code with markdown

You can italicize, boldface

• build
• lists

and embed code meant for illustration instead of execution in Python:

def f(x):
    """a docstring"""
    return x**2


or other languages:

for (i=0; i<n; i++) {
    printf("hello %d\n", i);
    x += 4;
}


And since the notebook can store displayed images in the file itself, you can show images which will be embedded in your post:

In [8]:
from IPython.display import Image
Image(filename='fig/img_4926.jpg')

Out [8]:

You can embed YouTube videos using the IPython YouTubeVideo object; this is my recent talk at SciPy'12 about IPython:

In [9]:
from IPython.display import YouTubeVideo
YouTubeVideo('iwVvqwLDsJo')

Out [9]:

## Including code examples from other languages

Using our various script cell magics, it's easy to include code in a variety of other languages

In [10]:
%%ruby
puts "Hello from Ruby #{RUBY_VERSION}"

Hello from Ruby 1.8.7

In [11]:
%%bash
echo "hello from $BASH"

hello from /bin/bash

And tools like the Octave and R magics let you interface with entire computational systems directly from the notebook; this is the Octave magic for which our example notebook contains more details:

In [12]:
%load_ext octavemagic

In [13]:
%%octave -s 500,500
# butterworth filter, order 2, cutoff pi/2 radians
b = [0.292893218813452 0.585786437626905 0.292893218813452];
a = [1 0 0.171572875253810];
freqz(b, a, 32);

The rmagic extension does a similar job, letting you call R directly from the notebook, passing variables back and forth between Python and R.

In [14]:
%load_ext rmagic

Start by creating some data in Python

In [15]:
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])

Which can then be manipulated in R, with results available back in Python (in XYcoef):

In [16]:
%%R -i X,Y -o XYcoef
XYlm = lm(Y~X)
XYcoef = coef(XYlm)
print(summary(XYlm))
par(mfrow=c(2,2))
plot(XYlm)

Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5
-0.2  0.9 -1.0  0.1  0.2

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared: 0.81, Adjusted R-squared: 0.7467
F-statistic: 12.79 on 1 and 3 DF, p-value: 0.03739

In [17]:
XYcoef

Out [17]:
[ 3.2 0.9]

And finally, in the same spirit, the cython magic extension lets you call Cython code directly from the notebook:

In [18]:
%load_ext cythonmagic

In [19]:
%%cython -lm
from libc.math cimport sin
print 'sin(1)=', sin(1)

sin(1)= 0.841470984808

## Keep in mind, this is still experimental code!

Hopefully this post shows that the system is already useful to communicate technical content in blog form with a minimal amount of effort. But please note that we're still in heavy development of many of these features, so things are susceptible to changing in the near future. By all means join the IPython dev mailing list if you'd like to participate and help us make IPython a better tool!

#### An ambitious experiment in Data Science takes off: a biased, Open Source view from Berkeley

Today, during a White House OSTP event combining government, academia and industry, the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation announced a $37.8M funding commitment to build new data science environments. This caps a year's worth of hard work for us at Berkeley, and even more for the Moore and Sloan teams, led by Vicki Chandler, Chris Mentzel and Josh Greenberg: they ran a very thorough selection process to choose three universities to participate in this effort. The Berkeley team was led by Saul Perlmutter, and we are now thrilled to join forces with teams at the University of Washington and NYU, respectively led by Ed Lazowska and Yann LeCun. We have worked very hard on this in private, so it's great to finally be able to publicly discuss what this ambitious effort is all about.

Most of the UC Berkeley BIDS team, from left to right: Josh Bloom, Cathryn Carson, Jas Sekhon, Saul Perlmutter, Erik Mitchell, Kimmen Sjölander, Jim Sethian, Mike Franklin, Fernando Perez. Not present: Henry Brady, David Culler, Philip Stark and Ion Stoica (photo credit: Kaja Sehrt, VCRO).

As Joshua Greenberg from the Sloan Foundation says, "What this partnership is trying to do is change the culture of universities to create a data science culture." For us at Berkeley, the whole story has two interlocking efforts:

1. The Moore and Sloan foundations are supporting a cross-institution initiative, where we will tackle the challenges that the rise of data-intensive science is posing.

2. Spurred by this, Berkeley is announcing the creation of the new Berkeley Institute for Data Science (BIDS), scheduled to start full operations in Spring 2014 (once the renovations of the Doe 190 space are completed). BIDS will be the hub of our activity in the broader Moore/Sloan initiative, as a partner with the UW eScience Institute and the newly minted NYU Center for Data Science.

Since the two Foundations, Berkeley and our university partners will provide ample detail elsewhere (see link summary at the bottom), I want to give my own perspective. This process has been, as one can imagine, a complex one: we were putting together a campus-wide effort that was very different from any traditional grant proposal, as it involved not only a team of PIs from many departments who normally don't work together, but also serious institutional commitment. But I have seen real excitement in the entire team: there is a sense that we have been given the chance to tackle a big and meaningful problem, and that people are willing to move way out of their comfort zone and take risks. The probability of failure is non-negligible, but these are the kinds of problems worth failing on.

### Using the Custom Anaconda Management Pack with Jupyter

To work with Spark jobs interactively on the Hortonworks HDP cluster, you can use Jupyter Notebooks via Anaconda Enterprise Notebooks, which is a multi-user notebook server with collaborative features for your data science team and integration with enterprise authentication. Refer to our previous blog post on Using Anaconda with PySpark for Distributed Language Processing on a Hadoop Cluster for more information about configuring Jupyter with PySpark.

## anaconda-mpack-n.png

### Using the Custom Anaconda Management Pack with Zeppelin

You can also use Anaconda with Zeppelin on your HDP cluster. In HDP 2.5 and Zeppelin 0.6, you’ll need to configure Zeppelin to point to the custom version of Anaconda installed on the HDP cluster by navigating to Zeppelin Notebook > Configs > Advanced zeppelin-env in the Ambari Cluster Dashboard UI in your browser:

## anaconda-mpack-o.png

Scroll down to the zeppelin_env_content property, then uncomment the following line and set it to match the location of Anaconda on your HDP cluster nodes:

export PYSPARK_PYTHON="/opt/continuum/anaconda/bin/python"

## anaconda-mpack-p.png

Then restart the Zeppelin service when prompted.

You should also configure the zeppelin.pyspark.python property in the Zeppelin PySpark interpreter to point to Anaconda (/opt/continuum/anaconda/bin/python):

## anaconda-mpack-q.png

Then restart the Zeppelin interpreter when prompted. Note that the PySpark interpreter configuration process will be improved and centralized in Zeppelin in a future version.

Once you’ve configured Zeppelin to point to the location of Anaconda on your HDP cluster, data scientists can run interactive Zeppelin notebooks with Anaconda and use all of the data science libraries they know and love in Anaconda with their PySpark and SparkR jobs:

## anaconda-mpack-r.png
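One quick way to confirm that the configuration took effect is to print the interpreter path from a notebook paragraph. The snippet below is a minimal sketch rather than part of the management pack: the helper name `uses_anaconda` is hypothetical, and the prefix is assumed to be the `/opt/continuum/anaconda` location used in the settings above.

```python
import sys

# Assumed install location, copied from the PYSPARK_PYTHON setting above.
ANACONDA_PREFIX = "/opt/continuum/anaconda"


def uses_anaconda(executable=None, prefix=ANACONDA_PREFIX):
    """Return True if the interpreter path lives under the Anaconda prefix.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    executable = sys.executable if executable is None else executable
    return executable.startswith(prefix.rstrip("/") + "/")


print("driver interpreter:", sys.executable)
print("running under Anaconda:", uses_anaconda())
```

This only checks the driver process. To check the executors as well, the same idea can be applied inside a Spark job, for example by collecting `sys.executable` from a small RDD with `sc.parallelize(range(2)).map(lambda _: sys.executable).distinct().collect()`, assuming a SparkContext named `sc` is available in the paragraph.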

### Get Started with Custom Anaconda Management Packs for Hortonworks in Your Enterprise

If you’re interested in generating custom Anaconda management packs for Hortonworks HDP and Ambari to empower your data science team, we can help! Get in touch with us by using our contact us page for more information about this functionality and our enterprise Anaconda platform subscriptions.

If you’d like to test-drive the enterprise features of Anaconda on a bare-metal, on-premises or cloud-based cluster, please contact us at sales@continuum.io.

### Jake Vanderplas

#### Reproducible Data Analysis in Jupyter

Jupyter notebooks provide a useful environment for interactive exploration of data. A common question I get, though, is how you can progress from this nonlinear, interactive, trial-and-error style of exploration to a more linear and reproducible analysis based on organized, packaged, and tested code. This series of videos presents a case study in how I personally approach reproducible data analysis within the Jupyter notebook.

Each video is approximately 5-8 minutes; the videos are available in a YouTube Playlist. Alternatively, below you can find the videos with some description and links to relevant resources.

In [1]:
# Quick utility to embed the videos below
from IPython.display import YouTubeVideo
def embed_video(index, playlist='PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ'):
    return YouTubeVideo('', index=index - 1, list=playlist, width=600, height=350)


## Part 1: Loading and Visualizing Data

In this video, I introduce the dataset, and use the Jupyter notebook to download and visualize it.

In [2]:
embed_video(1)

Out[2]:

Relevant resources:

## Part 2: Further Data Exploration

In this video, I do some slightly more sophisticated visualization with the data, using matplotlib and pandas.

In [3]:
embed_video(2)

Out[3]:

Relevant Resources:

## Part 3: Version Control with Git & GitHub

In this video, I set up a repository on GitHub and commit the notebook into version control.

In [4]:
embed_video(3)

Out[4]:

Relevant Resources:

## Part 4: Working with Data and GitHub

In this video, I refactor the data download script so that it only downloads the data when needed.

In [5]:
embed_video(4)

Out[5]:
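The download-only-when-needed pattern can be sketched roughly as follows. This is not the code from the video: the function name `get_data`, its parameters, and the injectable `fetch` argument (included here only so the behaviour can be exercised without a network connection) are assumptions made for illustration.

```python
import os
from urllib.request import urlretrieve


def get_data(url, filename, force_download=False, fetch=urlretrieve):
    """Download `url` to `filename` only if the file is not already cached.

    Parameters
    ----------
    url : str
        Location of the remote file.
    filename : str
        Local path used as the cache.
    force_download : bool
        If True, download again even when the cached file exists.
    fetch : callable
        Function taking (url, filename); defaults to urllib's urlretrieve.
    """
    if force_download or not os.path.exists(filename):
        fetch(url, filename)
    return filename
```

Calling `get_data` twice with the same `filename` triggers a single download; the second call returns the cached path straight away, which keeps re-running the notebook cheap.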

## Part 5: Creating a Python Package

In this video, I move the data download utility into its own separate package.

In [6]:
embed_video(5)

Out[6]:

Relevant Resources:

## Part 6: Unit Testing with PyTest

In this video, I add unit tests for the data download utility.

In [7]:
embed_video(6)

Out[7]:
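To give a flavour of what such tests can look like, here is a small, self-contained sketch in the pytest style, where any function named `test_*` containing plain `assert` statements is collected and run. The `parse_counts` helper, the column names, and the sample data are all hypothetical stand-ins, not the utility or dataset from the video.

```python
import csv
import io


def parse_counts(text):
    """Parse CSV text with Date, West and East columns into a list of dicts.

    Hypothetical stand-in for a data-loading utility; adds a Total column.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        west, east = int(row["West"]), int(row["East"])
        rows.append({"Date": row["Date"], "West": west, "East": east,
                     "Total": west + east})
    return rows


# Tiny made-up sample so the tests do not depend on a download.
SAMPLE = "Date,West,East\n2017-03-01 00:00,4,9\n2017-03-01 01:00,4,6\n"


def test_columns():
    # Every parsed row should expose exactly the expected fields.
    rows = parse_counts(SAMPLE)
    assert all(set(r) == {"Date", "West", "East", "Total"} for r in rows)


def test_totals():
    # Totals are checked against values worked out by hand.
    assert [r["Total"] for r in parse_counts(SAMPLE)] == [13, 10]
```

Saved in a file whose name starts with `test_`, these functions are picked up by running `pytest` in the same directory.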

Relevant resources:

## Part 7: Refactoring for Speed

In this video, I refactor the data download function to be a bit faster.

In [8]:
embed_video(7)

Out[8]:

Relevant Resources:

## Part 8: Debugging a Broken Function

In this video, I discover that my refactoring has caused a bug. I debug it and fix it.

In [9]:
embed_video(8)

Out[9]:

## Part 8.5: Finding and Fixing a scikit-learn bug

In this video, I discover a bug in the scikit-learn codebase, and go through the process of submitting a GitHub Pull Request fixing the bug.

In [10]:
embed_video(9)

Out[10]:

## Part 9: Further Data Exploration: PCA and GMM

In this video, I apply unsupervised learning techniques to the data to explore what we can learn from it.

In [11]:
embed_video(10)

Out[11]:

Relevant Resources:

## Part 10: Cleaning Up the Notebook

In this video, I clean up the unsupervised learning analysis to make it more reproducible and presentable.

In [12]:
embed_video(11)

Out[12]:

Relevant Resources:

This post was composed within an IPython notebook; you can view a static version here or download the full source here.

## March 02, 2017

### Enthought

#### Webinar – Python for Professionals: The Complete Guide to Enthought’s Technical Training Courses

What: Presentation and Q&A with Dr. Michael Connell, VP, Enthought Training Solutions
Who Should Watch: Anyone who wants to develop proficiency in Python for scientific, engineering, analytic, quantitative, or data science applications, including team leaders considering Python training for a group, learning and development coordinators supporting technical teams, or individuals who want to develop their Python skills for professional applications

View Recording

Python is a uniquely flexible language – it can be used for everything from software engineering (writing applications) to web app development, system administration to “scientific computing” – which includes scientific analysis, engineering, modeling, data analysis, data science, and the like.

Unlike some “generalist” providers who teach generic Python to the lowest common denominator across all these roles, Enthought specializes in Python training for professionals in scientific and analytic fields. In fact, that’s our DNA, as we are first and foremost scientists, engineers, and data scientists ourselves, who just happen to use Python to drive our daily data wrangling, modeling, machine learning, numerical analysis, simulation, and more.

If you’re a professional using Python, you’ve probably had the thought, “how can I be better, smarter, and faster in using Python to get my work done?” That’s where Enthought comes in – we know that you don’t just want to learn generic Python syntax, but instead you want to learn the key tools that fit the work you do, you want hard-won expert insights and tips without having to discover them yourself through trial and error, and you want to be able to immediately apply what you learn to your work.

Bottom line: you want results and you want the best value for your invested time and money. These are some of the guiding principles in our approach to training.

In this webinar, we’ll give you the information you need to decide whether Enthought’s Python training is the right solution for your or your team’s unique situation, helping answer questions such as:

• What kinds of Python training does Enthought offer? Who is it designed for?
• Who will benefit most from Enthought’s training (current skill levels, roles, job functions)?
• What are the key things that make Enthought’s training different from other providers and resources?
• What are the differences between Enthought’s training courses and who is each one best for?
• What specific skills will I have after taking an Enthought training course?
• Will I enjoy the curriculum, the way the information is presented, and the instructor?
• Why do people choose to train with Enthought? Who has Enthought worked with and what is their feedback?

We’ll also provide a guided tour and insights about our five primary course offerings to help you understand the fit for you or your team.


Presenter: Dr. Michael Connell, VP, Enthought Training Solutions

Ed.D, Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT

About Enthought’s Python Training

Enthought’s Python training is designed to accelerate the development of skill and confidence for people using Python in their work.  In addition to the Python language, we develop proficiency in the core tools used by all scientists, engineers, analysts, and data scientists, such as NumPy (fast array computing), Matplotlib (data visualization and plotting), and Pandas (data wrangling and analysis), as well as tools and techniques that are more specialized for different technical roles.

Our courses are created by our Python experts and instructors based on their extensive experience in using Python and its many technical packages to solve real-world problems across domains ranging from geophysics to biotechnology to aeronautical engineering to marketing analysis and everything in between. The courses have also been refined based on lessons learned over more than a decade of teaching thousands of people to use Python effectively in their everyday work.

About Enthought’s Python Instructors

More than “trainers,” our instructors are professional peers, which means you won’t just go through a checklist of programming or computer science topics, you’ll learn from a Python expert who can guide you through the ins and outs of applying specific concepts to scientific and analytic problems.

## Additional Resources

Onsite Python Training

Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.

Upcoming Open Courses

See the syllabi and schedule for:

Learn More

Download Enthought’s Pandas Cheat Sheets

See the “A Peek Under the Hood of the Pandas Mastery Workshop” webinar

## March 01, 2017

### Continuum Analytics news

#### Secure and Scalable Data Science Deployments with Anaconda

Monday, February 27, 2017
Kristopher Overholt
Continuum Analytics

Christine Doig
Continuum Analytics

In our previous blog post about Productionizing and Deploying Data Science Projects, we discussed best practices and recommended tools that can be used in the production and deployment stages of collaborative Open Data Science workflows.

Traditional data science project deployments involve lengthy and complex processes to deliver secure and scalable applications in enterprise environments. The result is that data scientists spend a nontrivial amount of time setting up, configuring, and maintaining deployment infrastructure, which takes valuable time away from data exploration and analysis tasks in the data science workflow.

When deploying data science applications in an enterprise environment, there are a number of implementation details that must be considered, including:

• Managing runtime dependencies and project environments for each application
• Ensuring application availability, uptime, and monitoring status
• Engineering data science applications for scalability
• Sharing compute resources across an organization
• Securing data access and network communication in applications
• Managing authentication and access control of deployed applications

In this blog post, we’ll introduce the next generation of Anaconda Enterprise v5, which enables you to deploy a wide range of data science applications with a single click, including live Python and R notebooks, interactive applications and dashboards, and models with REST APIs. Anaconda Enterprise handles all of the enterprise security, scalability, and encapsulation details related to application deployments so that your data science team doesn’t have to.

### Introducing the Next Generation Open Data Science Platform

Over the last few years, Anaconda and Anaconda Enterprise have been supercharging data science teams and empowering enterprise organizations with Open Data Science by enabling on-premise package management, secure enterprise notebook collaboration, data science project management, and scalable cluster workflows with Hadoop, Spark, Dask, machine learning, streaming analytics, and more.

At AnacondaCON 2017, we announced the newest capability of end-to-end data science workflows powered by Anaconda in the next generation of Anaconda Enterprise v5: secure and scalable Open Data Science deployments as part of an integrated data science experience!

Using Anaconda Enterprise, anyone on the data science team can encapsulate and deploy their data science projects as live applications with a single click, including:

• Live Python and R notebooks
• Interactive applications and dashboards using Bokeh, Datashader, Shiny and more
• Machine learning models or applications with REST APIs, including Tensorflow, scikit-learn, H2O, Theano, Keras, Caffe, and more

Using the power and flexibility of Anaconda and Open Data Science, any application, notebook, or model can be encapsulated and deployed on a server or scalable cluster, and the deployed applications can be easily and securely shared within your data science team or enterprise organization.

## anaconda-enterprise-deploy-a.gif

### Data Science Deployment Functionality in Anaconda Enterprise

With a single click of the Deploy button, data scientists will be able to leverage powerful application deployment functionality in Anaconda Enterprise v5, including:

• Deploy data science projects using the same powerful 730+ libraries in Anaconda (machine learning, visualization, optimization, data analysis, and more) that your data science team already knows and loves
• Scalable on-premise or cloud-based deployment server with configurable cluster sizes
• Single-click deployment functionality for secure data science project deployments, complete with enterprise authentication/authorization and secure end-to-end encryption
• Sharing and collaboration of deployed applications that integrates with enterprise authentication and identity management protocols and services
• Data science application encapsulation, containerization, and cluster orchestration using industry-standard tooling
• Centralized administration and control of deployed applications and cluster utilization across your organization
• Connectivity to various data storage backends, databases, and formats

The new data science deployment capability in Anaconda Enterprise builds on existing features in the Anaconda platform to enable powerful end-to-end Open Data Science workflows, including on-premise package management/governance, secure enterprise notebook collaboration and project management, and scalable cluster workflows with Hadoop, Spark, Dask, machine learning, streaming analytics, and more.

## anaconda-enterprise-deploy-b.gif

The functionality in the current version of Anaconda Enterprise v4, including Anaconda Enterprise Notebooks, Anaconda Repository, and Anaconda Scale, is being migrated to Anaconda Enterprise v5, which will be available as a GA release later this year.

Additionally, we’ll be implementing even more enterprise features to enable complete end-to-end Open Data Science workflows for your data science team, including model management, model scoring, scheduled execution of notebooks and applications, and more.

### Discover Effortless Open Data Science Deployments with Anaconda Enterprise

Are you interested in using Anaconda Enterprise in your organization to deploy data science projects, including live notebooks, machine learning models, dashboards, and interactive applications?

The next generation of Anaconda Enterprise v5, which features one-click secure and scalable data science deployments, is now available as a technical preview as part of the Anaconda Enterprise Innovator Program.

Join the Anaconda Enterprise v5 Innovator Program today to discover the powerful data science deployment capabilities for yourself. Anaconda Enterprise handles your secure and scalable data science project encapsulation and deployment requirements so that your data science team can focus on data exploration and analysis workflows.

## anaconda-enterprise-deploy-c.gif

Get in touch with us if you’d like to learn more about how Anaconda Subscriptions can supercharge your data science team and empower your enterprise with Open Data Science, including data science deployments, an on-premise package repository, collaborative notebooks, scalable cluster workflows, and custom consulting/training solutions.

## February 28, 2017

### Titus Brown

#### Advancing metagenome classification and comparison by MinHash fingerprinting of IMG/M data sets.

This is our just-submitted proposal for the JGI-NERSC "Facilities Integrating Collaborations for User Science" call. Enjoy!

1. Brief description: (Limit 1 page)

Abstract: Sourmash is a command-line tool and Python library that calculates and compares MinHash signatures from sequence data. Sourmash "compare" and "gather" functionality enables comparison and characterization of signatures. Using sourmash we have calculated and indexed all of the microbial genomes in the NCBI Reference Sequences database and stored them in a searchable Sequence Bloom Tree (SBT). The utility of the SBT is dependent upon the diversity of the sequences used to generate the signatures it contains. We propose calculation of signatures for the approximately 5,200 private microbial genomes and all of the private and public metagenomes in the IMG/M database. Genome signatures will be indexed in an SBT and subsequently used to discern the taxonomic breakdown of these metagenomes with sourmash "gather". Because MinHash signatures are one-way and do not include full sample information (e.g. metadata, contigs), they will only be usable to discover relevant samples at JGI, without revealing unpublished information about the samples.

Scope of Work:

We propose to calculate MinHash signatures for all the private genomes in the Integrated Microbial Genomes and Microbiomes (IMG/M) database and to index them in a searchable Sequence Bloom Tree (SBT) for genome comparisons and taxonomic classification using sourmash. We have successfully calculated and indexed signatures for the NCBI Reference Sequences (RefSeq) database and the Sequence Read Archive using sourmash (Brown 2016, Irber 2016) on the Michigan State University High Performance Computer and the Amazon Cloud. We will extend those methods to the approximately 5,200 private genomes in the IMG/M database to increase the number and diversity of the genomes in our publicly available SBT. We request an initial allocation of 50,000 CPU hours for calculation of MinHash signatures for the IMG/M microbial genomes and metagenomes. Calculating signatures for 60,000 microbial genomes from the NCBI RefSeq database took approximately 36 CPU hours; however, the time required to calculate metagenome signatures and to classify them with "gather" is unknown. Confidential information and full details about private data will remain private, although we do propose to link the signatures to the samples on the JGI Web site so that protected data sets can be accessed appropriately.

2. Background information: (Limit 1 page)

Technical Information:

Data: Non-public collection of microbial genomes (~5,200), and private and public metagenomes (~6,700) in the Integrated Microbial Genomes and Microbiomes (IMG/M) database

Format: fastq/fasta format

Metadata: all metadata recorded, especially taxonomy

Technical Challenges:

We have tested the method (MinHash + SBT) implemented in Sourmash on many public genomes, and the basic technical issues have all been resolved. Calculation of signatures is well understood computationally, is roughly linear in the amount of sequence data, and only needs to be done once. We believe that less than 5,000 CPU hours will be needed to calculate signatures for 5,000 genomes and fewer than 50,000 metagenomes. However, we have not yet done all-by-all clustering of metagenomes, which will be considerably more CPU intensive and may present CPU and memory challenges.
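As a rough sanity check on the genome part of that estimate, the RefSeq figure quoted in the Scope of Work (60,000 genomes in about 36 CPU hours) can be scaled linearly. The sketch below assumes the private genomes are similar in size to the RefSeq ones, which is our simplification; it says nothing about metagenomes, which are much larger and whose cost is still unknown.

```python
# Figures quoted in the proposal for the RefSeq run.
REFSEQ_GENOMES = 60_000
REFSEQ_CPU_HOURS = 36.0


def estimate_cpu_hours(n_genomes):
    """Linear extrapolation of signature-calculation cost from the RefSeq run.

    Assumes cost per genome is constant, i.e. genomes of comparable size.
    """
    return REFSEQ_CPU_HOURS / REFSEQ_GENOMES * n_genomes


private_genomes = 5_200
hours = estimate_cpu_hours(private_genomes)
print(f"{private_genomes} genomes: about {hours:.2f} CPU hours")
```

Under this assumption the private genomes account for only a few CPU hours, so almost all of the requested allocation is headroom for metagenome signatures and classification.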

MinHash signature calculation is highly parallelizable, since each signature doesn't depend on other datasets. Our experience with calculating signatures from the SRA (streaming data through the network) and RefSeq (reading data from a Lustre filesystem) shows that the computation is I/O bound and uses very little memory. If we have access to NERSC and the data is available locally at Cori, we can avoid transferring large volumes of data to other clusters or the cloud, and since the signatures are very small we can transfer only the results.

3. Project Description: (Limit 4 pages)

Description:

We propose to incorporate the Integrated Microbial Genomes and Microbiomes (IMG/M) database's private genome collection into a searchable Sequence Bloom Tree (SBT) index to facilitate taxonomic classification of DNA sequence sets. Reimplementation and adaptation of the MinHash computational method (Broder 1997) allows for compression of DNA sequence into a set of hashes that comprise a signature; this signature can be used for fast and accurate comparison of multiple sequence sets. To date, we have successfully downloaded and calculated signatures for all of the microbial genomes in the NCBI Reference Sequences (RefSeq) database (Brown 2016, Irber 2016). The addition of approximately 5,200 private microbial genome sequences and many metagenomes archived in the IMG/M database to our current index will enhance the utility of the archive and provide a great resource for rapid comparison of DNA sequence data. This should provide a valuable resource to biologists, microbial ecologists, bioinformaticians and others interested in comparative metagenomics and taxonomic classification.

Introduction:

Comparative sequence analysis, be it between genomes or metagenomes (Oulas, Pavloudi et al. 2015), can significantly expand our biological knowledge by letting us ask questions directly of ecosystems and their inhabitants. Advances in sequencing technologies have revolutionized biology by enabling scientists to generate terabytes or more of data for a single experiment. Unlike targeted approaches such as 16S, whole genome sequence comparisons let scientists assess multiple genes simultaneously, and predict the functional repertoire of organisms and/or entire communities in the case of bacteria (Oulas, Pavloudi et al. 2015).

Despite rapid advancements in sequencing technologies over the last decade, many challenges in the analysis of sequence data remain. Sequence data is spread across multiple archives (EMBL, JGI, NCBI), data files are large and not easily downloaded or stored, and options for comparison of multiple sequence files are limited. To address these issues we adapted the MinHash dimensionality reduction technique (Broder 1997) for large scale database search and metagenome taxonomy breakdown in Sourmash (Brown and Irber 2016). Built around the same technology as mash (Ondov, Treangen et al. 2016), Sourmash is a command-line tool and Python library for computing MinHash sketches from DNA sequences, comparing them to each other, plotting comparisons, and breaking metagenomes into their composition in terms of known signatures.
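For readers unfamiliar with the technique, the core idea of a MinHash sketch can be illustrated in a few lines of Python. This is a toy bottom-k sketch written for this explanation, not sourmash's implementation: the function names, the use of MD5 as the hash, and the default of 500 hashes are all illustrative assumptions, and real tools typically also handle reverse complements, which is omitted here.

```python
import hashlib


def kmers(sequence, k=21):
    """Yield every overlapping k-mer of a DNA sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (MD5 is an arbitrary, stable choice)."""
    digest = hashlib.md5(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(sequence, k=21, num_hashes=500):
    """Bottom-k sketch: keep the num_hashes smallest distinct k-mer hashes."""
    hashes = {hash_kmer(km) for km in kmers(sequence.upper(), k)}
    return set(sorted(hashes)[:num_hashes])


def jaccard_estimate(sig_a, sig_b, num_hashes=500):
    """Estimate Jaccard similarity of the k-mer sets from two signatures."""
    merged = sorted(sig_a | sig_b)[:num_hashes]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in sig_a and h in sig_b)
    return shared / len(merged)


a = minhash_signature("ACGT" * 50)
b = minhash_signature("AACC" * 50)
print(jaccard_estimate(a, a))  # 1.0: identical signatures
print(jaccard_estimate(a, b))  # 0.0: the sequences share no 21-mers
```

The signature is a small set of integers whose size is capped by `num_hashes` regardless of input length, which is why signatures are cheap to store, transfer, and compare.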

So far we have constructed an SBT from approximately 60,000 microbial genomes in the NCBI RefSeq database and demonstrated the ability to search the SBT, yielding matches based upon similarities in signatures. Calculating 60,000 signatures took approximately 36 CPU hours and yielded an uncompressed 1 GB data file. Indexing the signatures in an SBT with the sourmash sbt index function took approximately 5 and ? hours, required 6.5 GB of RAM, and yielded a 3.2 GB index. Searching the index is rapid: it takes less than 3 seconds and requires less than 100 MB of RAM. This SBT is available for download (http://spacegraphcats.ucdavis.edu.s3.amazonaws.com/microbe-sbt-k21-2016-11-27.tar.gz). The SBT is dynamic; new genomes can be added to expand the tree, resulting in enhanced searchability.

We have also extended MinHash to support fine-grained resolution of metagenomes (Brooks et al., unpublished).

One of the main uses for the SBT is "gather", an extension of the SBT search that calculates the intersection between the signatures in an SBT and a mixed signature from a metagenome (Figure 1). This is a mildly novel extension of the MinHash approach for decomposing a metagenome into its constituent known genomes (and estimating the unknown content).

Figure 1. Representation of sourmash sbt gather functionality. Sourmash SBT gather calculates the intersection between the input (blue circle) and the SBT (green circle). In this case our SBT contains signatures for all of the microbial genomes in the NCBI RefSeq database. The intersection (blue stars) represents shared hashes between the input and the index, resulting in taxonomic classification of organisms represented by the input reads. Yellow and red stars represent unique hashes in the input and SBT respectively.

Utilization:

SBTs containing all of the genomes and metagenomes from the IMG/M database will be used to compare samples with unknown composition to determine similarity and classification. These comparisons will serve multiple purposes including optimization of comparative genomic and metagenomic methods, characterization of samples with unknown composition, and discovery of novel organisms. Implementation of sequence bloom trees in sourmash search and gather facilitates identification and classification against a database of genomes. The output of both SBT functions includes the fraction of the query present in the input and the ID associated with that genome (Figure 2). Collectively, these functions let us determine if a sequence set is present in a query and also identify the component genomes (i.e. specific species/strains) present in a metagenome.

Figure 2. Sourmash SBT gather output. Sourmash was run on the command line with a k-mer size (-k) of 21 for a 12-genome synthetic metagenome. Output includes (left to right) genome fraction in the input and taxonomic classification of the match. Default parameters compare the input to the index and discard hashes when a match is found. All matches are compared to the original index to determine the fraction of the genome present in the input.
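The greedy decomposition described in the two figure captions can be sketched as follows. This is our simplified reading, not the sourmash code: the index is a plain dictionary rather than an SBT, signatures are bare sets of integer hashes, and the function name and return format are illustrative assumptions.

```python
def gather(query_hashes, index):
    """Greedily decompose a mixed signature into known genome signatures.

    query_hashes: set of hashes from the mixed (metagenome) signature.
    index: dict mapping a genome name to its set of hashes.
    Returns (matches, unassigned), where matches is a list of
    (name, fraction_of_genome_found_in_query) tuples in order of discovery.
    """
    original = set(query_hashes)
    remaining = set(query_hashes)
    matches = []
    while remaining:
        # Pick the genome that explains the most still-unassigned hashes.
        name, overlap = max(
            ((n, len(sig & remaining)) for n, sig in index.items()),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if overlap == 0:
            break  # nothing in the index matches what is left
        sig = index[name]
        # Report the fraction relative to the original, undepleted query.
        matches.append((name, len(sig & original) / len(sig)))
        remaining -= sig  # discard hashes once a match is found
    return matches, remaining


index = {"genome_A": {1, 2, 3, 4}, "genome_B": {4, 5, 6}, "genome_C": {7, 8}}
query = {1, 2, 3, 4, 5, 99}
matches, unknown = gather(query, index)
for name, fraction in matches:
    print(f"{fraction:.2f}  {name}")
print("unassigned hashes:", sorted(unknown))
```

In this toy run `genome_A` is found in full, `genome_B` partially, `genome_C` not at all, and the hash `99` is left over as unknown content, mirroring the "known genomes plus unknown remainder" breakdown described above.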

MinHashing sequence data for "gather" functionality reduces the dataset by a constant size (approximately 10,000-100,000-fold) (Ondov, Treangen et al. 2016). Using this technique, complex communities such as soil metagenomes can be compared and taxonomically classified without assembly. MinHashing the genomes in IMG/M, especially those from underexplored branches in the tree of life such as those from JGI's Genomic Encyclopedia of Bacteria and Archaea (GEBA), will enrich the current index of genomes and improve the taxonomic analysis capabilities of sourmash. Note that sourmash signatures can also be converted to mash signatures without loss, using a newly evolved standard format, so the fruits of this project will be usable well beyond our software.

Since calculating MinHash signatures produces an irreversible summary of the data set, they do not compromise the content of protected data sets. Thus we will make data searchable without revealing the entire dataset's content publicly. We will provide a link in the signatures to the data source in JGI that would require users of Sourmash to log in to JGI IMG/M to access the whole dataset. Our proposal would connect a broader community of researchers potentially interested in those private datasets with JGI, and should encourage new collaborations and increased use of/access to JGI resources.

In addition to indexing genomic datasets for taxonomic analysis, Sourmash can also index transcriptomic, metagenomic or metatranscriptomic datasets for comparative analysis. Thus it has the potential to be a one-stop shop for searching all archives, including SRA, ENA, and IMG/M, while preserving data privacy as mentioned above. Towards this end we intend to index all of the metagenomic datasets in the IMG/M database, determine the taxonomic breakdown of these datasets with "gather", and make this data available in a public data set with links to JGI resources. We have also developed a decentralized system for SBT storage and MinHash signature sharing using IPFS (InterPlanetary File System) and calculated MinHash signatures for all of the WGS microbial sequences in the SRA (Brown 2016, Irber 2016).

The Lab for Data Intensive Biology at UC Davis is uniquely well suited for this project. We have demonstrated the ability to achieve similar goals with the NCBI RefSeq database. Furthermore, the Lab for Data Intensive Biology practices open science, and is actively engaging the community in the development of software, as can be seen in Sourmash, so we expect community contributions to be stimulated by this proposal.

Community Interest:

A number of scientific communities will benefit from this project. Currently, we lack means for fast, accurate, and lightweight comparative analysis of metagenomes. We expect that these indexes will lead to the development of new technologies that take advantage of the MinHash dimensionality reduction technique and the SBT for comparative sequence analysis. Blog posts regarding the RefSeq index and results from Sourmash have generated comments from many scientists in the field and spurred contributions to the development of Sourmash, which is publicly available on GitHub.

DOE mission:

Our proposal directly addresses the overall mission of the Joint Genome Institute by optimizing methodology required for the analysis of bioenergy, carbon cycle, and biogeochemistry relevant data sets. The products of our proposal will add approximately 5,200 microbial genomes - many novel - to a public database. Subsequently, we will use this SBT to characterize the metagenomes from IMG/M with SBT gather. This substantial addition will enable researchers to enrich the data and conclusions they are able to draw from their sequencing efforts. Preliminary data for this proposal has enabled researchers including ourselves to explore hypotheses in environmentally relevant metagenomes. Applications of sequence comparison directly relevant to the DOE mission include assessing the impacts of heavy metals (Algora, Vasileiadis et al. 2015), natural disasters (Hiraoka, Machiyama et al. 2016), climate change (Hultman, Waldrop et al. 2015), and crops (Jesus, Liang et al. 2016) on microbial communities. In accordance with the mission of the DOE, and in collaboration with Jiarong Guo and Jim Tiedje, we have begun the analysis of sequence data from 3 biofuel crops -- corn, switchgrass, and Miscanthus. These data are part of one of the largest metagenome sequencing projects to date (4.5 trillion base pairs of sequence). Sourmash will enable fast composition analysis and comparison of these large metagenomic samples. Note that the breadth of genomes included in our database is critical for this analysis.

We have demonstrated that MinHashing facilitates taxonomic classification of novel metagenomes by identifying the intersection between signatures from novel datasets and those contained in the index. The addition of non-public genomes from the IMG/M database will enrich our index and facilitate the identification of previously unidentifiable organisms in DNA sequence sets.
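The signature intersection described above can be illustrated with a minimal bottom-k MinHash sketch. This is not sourmash's implementation: the function names, the SHA-1 hash, the k-mer size, and the sketch size are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 is an arbitrary stand-in hash)."""
    digest = hashlib.sha1(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def signature(seq, k=21, num=500):
    """Bottom-k signature: the `num` smallest distinct k-mer hashes, sorted."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:num]


def jaccard_estimate(sig_a, sig_b, num=500):
    """Estimate Jaccard similarity from two bottom-k signatures.

    The `num` smallest hashes of the merged signatures form a sketch of the
    union; the fraction of them present in both signatures estimates
    |A intersect B| / |A union B|.
    """
    a, b = set(sig_a), set(sig_b)
    union_sketch = sorted(a | b)[:num]
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in a and h in b)
    return shared / len(union_sketch)
```

Under these assumptions, comparing a novel dataset against an index amounts to computing `jaccard_estimate` between its signature and each stored signature, which is far cheaper than comparing the underlying sequences.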

1. References: (No page limit)

Algora, C., S. Vasileiadis, K. Wasmund, M. Trevisan, M. Krüger, E. Puglisi and L. Adrian (2015). "Manganese and iron as structuring parameters of microbial communities in Arctic marine sediments from the Baffin Bay." FEMS Microbiol Ecol 91(6).

Broder, A. Z. (1997). On the resemblance and containment of documents. Proceedings.

Brown, C. T. (2016). "Categorizing 400,000 microbial genome shotgun datasets from the SRA." http://ivory.idyll.org/blog/tag/sourmash.html.

Brown, C. T. (2016). "Quickly searching all of the microbial genomes, mark 2 - now with archaea, phage, fungi, and protists!" http://ivory.idyll.org/blog/2016-sourmash-sbt-more.html.

Brown, C. T. and L. Irber (2016). "sourmash: a library for MinHash sketching of DNA." The Journal of Open Source Software.

Hiraoka, S., A. Machiyama, M. Ijichi, K. Inoue, K. Oshima, M. Hattori, S. Yoshizawa, K. Kogure and W. Iwasaki (2016). "Genomic and metagenomic analysis of microbes in a soil environment affected by the 2011 Great East Japan Earthquake tsunami." BMC Genomics 17: 53.

Hultman, J., M. P. Waldrop, R. Mackelprang, M. M. David, J. McFarland, S. J. Blazewicz, J. Harden, M. R. Turetsky, A. D. McGuire, M. B. Shah, N. C. VerBerkmoes, L. H. Lee, K. Mavrommatis and J. K. Jansson (2015). "Multi-omics of permafrost, active layer and thermokarst bog soil microbiomes." Nature 521(7551): 208-212.

Irber, L. (2016). "Minhashing all the things (part 1): microbial genomes." http://blog.luizirber.org/2016/12/28/soursigs-arch-1/.

Jesus, E. d. C., C. Liang, J. F. Quensen, E. Susilawati, R. D. Jackson, T. C. Balser and J. M. Tiedje (2016). "Influence of corn, switchgrass, and prairie cropping systems on soil microbial communities in the upper Midwest of the United States." GCB Bioenergy 8(2): 481-494.

Ondov, B. D., T. J. Treangen, P. Melsted, A. B. Mallonee, N. H. Bergman, S. Koren and A. M. Phillippy (2016). "Mash: fast genome and metagenome distance estimation using MinHash." Genome Biol 17(1): 132.

Oulas, A., C. Pavloudi, P. Polymenakou, G. A. Pavlopoulos, N. Papanikolaou, G. Kotoulas, C. Arvanitidis and I. Iliopoulos (2015). "Metagenomics: tools and insights for analyzing next-generation sequencing data derived from biodiversity studies." Bioinform Biol Insights 9: 75-88.

## February 27, 2017

### Continuum Analytics news

#### Secure and Scalable Data Science Deployments with Anaconda

Monday, February 27, 2017
Kristopher Overholt
Continuum Analytics

Christine Doig
Continuum Analytics

In our previous blog post about Productionizing and Deploying Data Science Projects, we discussed best practices and recommended tools that can be used in the production and deployment stages of collaborative Open Data Science workflows.

Traditional data science project deployments involve lengthy and complex processes to deliver secure and scalable applications in enterprise environments. The result is that data scientists spend a nontrivial amount of time setting up, configuring, and maintaining deployment infrastructure, which takes valuable time away from data exploration and analysis tasks in the data science workflow.

When deploying data science applications in an enterprise environment, there are a number of implementation details that must be considered, including:

• Managing runtime dependencies and project environments for each application
• Ensuring application availability, uptime, and monitoring status
• Engineering data science applications for scalability
• Sharing compute resources across an organization
• Securing data access and network communication in applications
• Managing authentication and access control of deployed applications

In this blog post, we’ll introduce the next generation of Anaconda Enterprise v5, which enables you to deploy a wide range of data science applications with a single click, including live Python and R notebooks, interactive applications and dashboards, and models with REST APIs. Anaconda Enterprise handles all of the enterprise security, scalability, and encapsulation details related to application deployments so that your data science team doesn’t have to.

### Introducing the Next Generation Open Data Science Platform

Over the last few years, Anaconda and Anaconda Enterprise have been supercharging data science teams and empowering enterprise organizations with Open Data Science by enabling on-premise package management, secure enterprise notebook collaboration, data science project management, and scalable cluster workflows with Hadoop, Spark, Dask, machine learning, streaming analytics, and more.

At AnacondaCON 2017, we announced the newest capability of end-to-end data science workflows powered by Anaconda in the next generation of Anaconda Enterprise v5: secure and scalable Open Data Science deployments as part of an integrated data science experience!

Using Anaconda Enterprise, anyone on the data science team can encapsulate and deploy their data science projects as live applications with a single click, including:

• Live Python and R notebooks
• Interactive applications and dashboards using Bokeh, Datashader, Shiny and more
• Machine learning models or applications with REST APIs, including Tensorflow, scikit-learn, H2O, Theano, Keras, Caffe, and more

Using the power and flexibility of Anaconda and Open Data Science, any application, notebook, or model can be encapsulated and deployed on a server or scalable cluster, and the deployed applications can be easily and securely shared within your data science team or enterprise organization.


### Data Science Deployment Functionality in Anaconda Enterprise

With a single click of the Deploy button, data scientists will be able to leverage powerful application deployment functionality in Anaconda Enterprise v5, including:

• Deploy data science projects using the same powerful 730+ libraries in Anaconda (machine learning, visualization, optimization, data analysis, and more) that your data science team already knows and loves
• Scalable on-premise or cloud-based deployment server with configurable cluster sizes
• Single-click deployment functionality for secure data science project deployments, complete with enterprise authentication/authorization and secure end-to-end encryption
• Sharing and collaboration of deployed applications that integrates with enterprise authentication and identity management protocols and services
• Data science application encapsulation, containerization, and cluster orchestration using industry-standard tooling
• Centralized administration and control of deployed applications and cluster utilization across your organization
• Connectivity to various data storage backends, databases, and formats

The new data science deployment capability in Anaconda Enterprise builds on existing features in the Anaconda platform to enable powerful end-to-end Open Data Science workflows, including on-premise package management/governance, secure enterprise notebook collaboration and project management, and scalable cluster workflows with Hadoop, Spark, Dask, machine learning, streaming analytics, and more.


The functionality in the current version of Anaconda Enterprise v4, including Anaconda Enterprise Notebooks, Anaconda Repository, and Anaconda Scale, is currently being migrated to Anaconda Enterprise v5, which will be available as a GA release later this year.

Additionally, we’ll be implementing even more enterprise features to enable complete end-to-end Open Data Science workflows for your data science team, including model management, model scoring, scheduled execution of notebooks and applications, and more.

### Discover Effortless Open Data Science Deployments with Anaconda Enterprise

Are you interested in using Anaconda Enterprise in your organization to deploy data science projects, including live notebooks, machine learning models, dashboards, and interactive applications?

The next generation of Anaconda Enterprise v5, which features one-click secure and scalable data science deployments, is now available as a technical preview as part of the Anaconda Enterprise Innovator Program.

Join the Anaconda Enterprise v5 Innovator Program today to discover the powerful data science deployment capabilities for yourself. Anaconda Enterprise handles your secure and scalable data science project encapsulation and deployment requirements so that your data science team can focus on data exploration and analysis workflows.


Get in touch with us if you’d like to learn more about how Anaconda Subscriptions can supercharge your data science team and empower your enterprise with Open Data Science, including data science deployments, an on-premise package repository, collaborative notebooks, scalable cluster workflows, and custom consulting/training solutions.

### Matthew Rocklin

#### Dask Release 0.14.0

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

## Summary

Dask just released version 0.14.0. This release contains some significant internal changes as well as the usual set of increased API coverage and bug fixes. This blogpost outlines some of the major changes since the last release on January 27th, 2017.

1. Structural sharing of graphs between collections
2. Refactor communications system
3. Many small dataframe improvements
4. Top-level persist function

You can install new versions using Conda or Pip:

    conda install -c conda-forge dask distributed

or

    pip install dask[complete] distributed --upgrade


## Share Graphs between Collections

Dask collections (arrays, bags, dataframes, delayed) hold onto task graphs that have all of the tasks necessary to create the desired result. For larger datasets or complex calculations these graphs may have thousands, or sometimes even millions of tasks. In some cases the overhead of handling these graphs can become significant.

This is especially true because dask collections don’t modify their graphs in place; they make new graphs with updated computations. Copying graph data structures with millions of nodes can take seconds and interrupt interactive workflows.

To address this, dask.array and dask.delayed collections now use special graph data structures with structural sharing. This significantly cuts down on the amount of overhead when building repetitive computations.

    import dask.array as da

    x = da.ones(1000000, chunks=(1000,))  # 1000 chunks of size 1000


### Version 0.13.0

    %time for i in range(100): x = x + 1
    CPU times: user 2.69 s, sys: 96 ms, total: 2.78 s
    Wall time: 2.78 s


### Version 0.14.0

    %time for i in range(100): x = x + 1
    CPU times: user 756 ms, sys: 8 ms, total: 764 ms
    Wall time: 763 ms


The difference in this toy problem is moderate, but for real-world cases this difference can grow fairly large. This was also one of the blockers identified by the climate science community stopping them from handling petabyte-scale analyses.

We chose to roll this out for arrays and delayed first just because those are the two collections that typically produce large task graphs. Dataframes and bags remain as before for the time being.
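To illustrate the idea, here is a minimal sketch of a graph mapping with structural sharing. It is not Dask's actual data structure: the `SharedGraph` class, its `with_layer` method, and the layer names are hypothetical. Each operation adds one small dictionary of new tasks and reuses the existing dictionaries by reference, so creating a new graph costs time proportional to the number of layers rather than the number of tasks.

```python
from collections.abc import Mapping
from operator import add


class SharedGraph(Mapping):
    """Read-only task graph made of named layers that are shared, not copied."""

    def __init__(self, layers=None):
        # Copies only the outer dict of layer references, never the tasks.
        self.layers = dict(layers or {})

    def with_layer(self, name, tasks):
        """Return a new graph holding all existing layers plus one new layer."""
        new = SharedGraph(self.layers)
        new.layers[name] = tasks
        return new

    def __getitem__(self, key):
        for tasks in self.layers.values():
            if key in tasks:
                return tasks[key]
        raise KeyError(key)

    def __iter__(self):
        seen = set()
        for tasks in self.layers.values():
            for key in tasks:
                if key not in seen:
                    seen.add(key)
                    yield key

    def __len__(self):
        return sum(1 for _ in self)


# One layer creating 1000 chunks, then one layer adding 1 to each chunk.
base = SharedGraph().with_layer(
    "ones", {("ones", i): (float, 1) for i in range(1000)}
)
plus = base.with_layer(
    "add-1", {("add-1", i): (add, ("ones", i), 1) for i in range(1000)}
)
```

Here `plus` sees all 2000 tasks, but its "ones" layer is the very same dictionary object that `base` holds. The tradeoff in this sketch is that key lookup scans the layers in turn, which is acceptable while the number of layers stays small compared with the number of tasks.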

## Communications System

Dask communicates over TCP sockets. It uses Tornado’s IOStreams to handle non-blocking communication, framing, etc. We’ve run into some performance issues with Tornado when moving large amounts of data. Some of this has been improved upstream in Tornado directly, but we still want the ability to optionally drop Tornado’s byte-handling communication stack in the future. This is especially important as dask gets used in institutions with faster and more exotic interconnects (supercomputers). We’ve been asked a few times to support other transport mechanisms like MPI.

The first step (and probably the hardest step) was to make Dask’s communication system pluggable so that we can use different communication options without significant source-code changes. We managed this a month ago and now it is possible to add other transports to Dask relatively easily. TCP remains the only real choice today, though there is also an experimental ZeroMQ option (which provides little-to-no performance benefit over TCP) as well as a fully in-memory option in development.
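One common way to make a communication layer pluggable is a registry keyed by URI scheme. The sketch below shows that pattern only; it is not the API of the distributed package. The names `Comm`, `register`, `connect`, and the `inproc` scheme are hypothetical, and the only backend included is an in-memory queue so the example runs without a network.

```python
from abc import ABC, abstractmethod
from collections import deque

_backends = {}  # maps a URI scheme such as "tcp" to a Comm subclass


def register(scheme):
    """Class decorator that makes a backend available under `scheme://`."""
    def decorator(cls):
        _backends[scheme] = cls
        return cls
    return decorator


class Comm(ABC):
    """Minimal interface that every transport backend has to provide."""

    @abstractmethod
    def send(self, msg):
        ...

    @abstractmethod
    def recv(self):
        ...


@register("inproc")
class InProcComm(Comm):
    """In-memory transport: endpoints naming the same location share a queue."""

    _queues = {}

    def __init__(self, location):
        self.queue = self._queues.setdefault(location, deque())

    def send(self, msg):
        self.queue.append(msg)

    def recv(self):
        return self.queue.popleft()


def connect(address):
    """Pick a backend from the scheme of an address like 'inproc://name'."""
    scheme, sep, location = address.partition("://")
    if not sep or scheme not in _backends:
        raise ValueError(f"no transport registered for {address!r}")
    return _backends[scheme](location)
```

With this layout, supporting another transport means registering one more class under a new scheme; code that calls `connect` does not change.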

For users the main difference you’ll see is that tcp:// is now prepended in many places. For example:

    $ dask-scheduler
    distributed.scheduler - INFO - -----------------------------------------------
    distributed.scheduler - INFO -   Scheduler at:  tcp://192.168.1.115:8786
    ...

## Variety of Dataframe Changes

As usual the Pandas API has been more fully covered by community contributors. Some representative changes include the following:

1. Support non-uniform categoricals: We no longer need to do a full pass through the data when categorizing a column. Instead we categorize each partition independently (even if they have different category values) and then unify these categories only when necessary

       df['x'] = df['x'].astype('category')  # this is now fast

2. Groupby cumulative reductions

       df.groupby('x').cumsum()

3. Support appending to Parquet collections

       df.to_parquet('/path/to/foo.parquet', append=True)

4. A new string and HTML representation of dask.dataframes. Typically Pandas prints dataframes on the screen by rendering the first few rows of data. However, because Dask.dataframes are lazy we don’t have this data and so typically render some metadata about the dataframe

       >>> df  # version 0.13.0
       dd.DataFrame<make-ti..., npartitions=366, divisions=(Timestamp('2000-01-01 00:00:00', freq='D'), Timestamp('2000-01-02 00:00:00', freq='D'), Timestamp('2000-01-03 00:00:00', freq='D'), ..., Timestamp('2000-12-31 00:00:00', freq='D'), Timestamp('2001-01-01 00:00:00', freq='D'))>

   This rendering, while informative, can be improved. Now we render dataframes as a Pandas dataframe, but place metadata in the dataframe instead of the actual data.

       >>> df  # version 0.14.0
       Dask DataFrame Structure:
                              x        y      z
       npartitions=366
       2000-01-01       float64  float64  int64
       2000-01-02           ...      ...    ...
       ...                  ...      ...    ...
       2000-12-31           ...      ...    ...
       2001-01-01           ...      ...    ...
       Dask Name: make-timeseries, 366 tasks

   Additionally this renders nicely as an HTML table in a Jupyter notebook

## Variety of Distributed System Changes

There have also been a wide variety of changes to the distributed system.
I’ll include a representative sample here to give a flavor of what has been happening:

1. Ensure first-come-first-served priorities when dealing with multiple clients
2. Send small amounts of data through Channels. Channels are a way for multiple clients/users connected to the same scheduler to publish and exchange data between themselves. Previously they only transmitted Futures (which could in turn point to larger data living on the cluster). However we found that it was useful to communicate small bits of metadata as well, for example to signal progress or stopping criteria between clients collaborating on the same workloads. Now you can publish any msgpack serializable data on Channels.

       # Publishing Client
       scores = client.channel('scores')
       scores.append(123.456)

       # Subscribing Client
       scores = client.channel('scores')
       while scores.data[-1] < THRESHOLD:
           ... continue working ...

3. We’re better at estimating the size in data of SciPy Sparse matrices and Keras models. This allows Dask to make smarter choices about when it should and should not move data around for load balancing. Additionally Dask can now also serialize Keras models.
4. To help people deploying on clusters that have a shared network file system (as is often the case in scientific or academic institutions) the scheduler and workers can now communicate connection information using the --scheduler-file keyword

       dask-scheduler --scheduler-file /path/to/scheduler.json
       dask-worker --scheduler-file /path/to/scheduler.json
       dask-worker --scheduler-file /path/to/scheduler.json

       >>> client = Client(scheduler_file='/path/to/scheduler.json')

   Previously we needed to communicate the address of the scheduler, which could be challenging when we didn’t know on which node the scheduler would be run.

## Other

There are a number of smaller details not mentioned in this blogpost.
For more information visit the changelogs and documentation.

Additionally a great deal of Dask work over the last month has happened outside of these core dask repositories.

You can install or upgrade using Conda or Pip

    conda install -c conda-forge dask distributed

or

    pip install dask[complete] distributed --upgrade

## Acknowledgements

Since the last 0.13.0 release on January 27th the following developers have contributed to the dask/dask repository:

• Antoine Pitrou
• Chris Barber
• Daniel Davis
• Elmar Ritsch
• Erik Welch
• jakirkham
• Jim Crist
• John Crickett
• jspreston
• Juan Luis Cano Rodríguez
• kayibal
• Kevin Ernst
• Markus Gonser
• Matthew Rocklin
• Martin Durant
• Nir
• Sinhrks
• Talmaj Marinc
• Vlad Frolov
• Will Warner

And the following developers have contributed to the dask/distributed repository:

• Antoine Pitrou
• Ben Schreck
• bmaisonn
• Brett Naul
• Demian Wassermann
• Israel Saeta Pérez
• John Crickett
• Joseph Crail
• Malte Gerken
• Martin Durant
• Matthew Rocklin
• Min RK
• strets123

## February 23, 2017

### Continuum Analytics news

#### Continuum Analytics to Speak at Galvanize New York

Friday, February 17, 2017

Chief Data Scientist and Co-founder Travis Oliphant to Discuss the Open Data Science Innovations That Will Change Our World

NEW YORK, NY - February 17, 2017 - Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that Chief Data Scientist and Co-founder Travis Oliphant will be speaking at Galvanize New York. During his “Reaching the full-potential of a data-driven world in the Anaconda Community” presentation, taking place on February 21 at 7:00 p.m. EST, Oliphant will discuss how the Anaconda platform is bringing together Python and other Open Data Science tools to bring about innovations that will change our world.

Oliphant will discuss how the rise of Python and data science has driven tremendous growth of the Open Data Science community.
In addition, he will describe the open source technology developed at Continuum Analytics, including a preview of Anaconda Enterprise 5.0, and explain how attendees can participate in the growing business opportunities around the Anaconda ecosystem.

WHO: Travis Oliphant, chief data scientist and co-founder, Continuum Analytics
WHAT: “Reaching the full-potential of a data-driven world in the Anaconda Community”
WHEN: February 21, 7:00 p.m. - 9:00 p.m. EST
WHERE: Galvanize New York - West Soho - 315 Hudson St. New York, NY 10013
REGISTER: HERE

Oliphant has a Ph.D. from the Mayo Clinic and B.S. and M.S. degrees in Mathematics and Electrical Engineering from Brigham Young University. Since 1997, he has worked extensively with Python for numerical and scientific programming, most notably as the primary developer of the NumPy package, and as a founding contributor of the SciPy package. He is also the author of the definitive Guide to NumPy. He has served as a director of the Python Software Foundation and as a director of NumFOCUS.

About Anaconda Powered by Continuum Analytics

Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io

### Media Contact:

Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

## February 21, 2017

### Matthieu Brucher

#### Playing with a Bela (1.bis): Compile last Audio Toolkit on Bela

A few months ago, I started playing with the Bela board. At the time, I had issues compiling Audio ToolKit with clang.
Since then and thanks to Travis-CI, I figured out what was going on. Unfortunately, the Beagle Board doesn’t have complete C++11 support, so I’ve added the remaining pieces, and you also need a newer Boost.

# What not to do

I started by trying to compile a new Clang with libc++, but it seems that I need more than 8GB on the SD card! So I’ll wait until I can get such a card to try again.

Then, the other thing I tried was compiling a full Boost 1.61 (because that’s what I use on Travis CI), but this froze the board…

# What to do

So the only thing to do is to compile Boost test:

    ./b2 --with-test --with-system link=shared stage

and then point the Boost root folder in the Audio Toolkit CMake file to this Boost folder.

# Conclusion

I’ll post a longer post on the Clang update when it’s done, but meanwhile, I can already start playing with Audio ToolKit on the Bela!

## February 20, 2017

### Matthew Rocklin

#### Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I’m blogging weekly(ish) about the work done on Dask and related projects during the previous week. This log covers work done between 2017-02-01 and 2017-02-20. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

1. Profiling experiments with Dask-GLM
2. Subsequent graph optimizations, both non-linear fusion and avoiding repeatedly creating new graphs
3. Tensorflow and Keras experiments
4. XGBoost experiments
5. Dask tutorial refactor
6. Google Cloud Storage support
7. Cleanup of Dask + SKLearn project

### Dask-GLM and iterative algorithms

Dask-GLM is currently just a bunch of solvers like Newton, Gradient Descent, BFGS, Proximal Gradient Descent, and ADMM. These are useful in solving problems like logistic regression, but also several others.
The mathematical side of this work is mostly done by Chris White and Hussain Sultan at Capital One.

We’ve been using this project also to see how Dask can scale out machine learning algorithms. To this end we ran a few benchmarks here: https://github.com/dask/dask-glm/issues/26 . This just generates and solves some random problems, but at larger scales.

What we found is that some algorithms, like ADMM, perform beautifully, while for others, like gradient descent, scheduler overhead can become a substantial bottleneck at scale. This is mostly just because the actual in-memory NumPy operations are so fast; any sluggishness on Dask’s part becomes very apparent. Here is a profile of gradient descent:

Notice all the white space. This is Dask figuring out what to do during different iterations. We’re now working to bring this down to make all of the colored parts of this graph squeeze together better. This will result in general overhead improvements throughout the project.

### Graph Optimizations - Aggressive Fusion

We’re approaching this in two ways:

1. More aggressively fuse tasks together so that there are fewer blocks for the scheduler to think about
2. Avoid repeated work when generating very similar graphs

In the first case, Dask already does standard task fusion. For example, if you have the following tasks:

    x = f(w)
    y = g(x)
    z = h(y)

Dask (along with every other compiler-like project since the 1980’s) already turns this into the following:

    z = h(g(f(w)))

What’s tricky with a lot of these mathematical or optimization algorithms though is that they are mostly, but not entirely linear. Consider the following example:

    y = exp(x) - 1/x

Visualized as a node-link diagram, this graph looks like a diamond like the following:

             o  exp(x) - 1/x
            / \
    exp(x) o   o   1/x
            \ /
             o  x

Graphs like this generally don’t get fused together because we could compute both exp(x) and 1/x in parallel.
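To make the distinction concrete, here is a minimal sketch of linear task fusion on a plain dictionary graph. It is not Dask’s optimizer: `fuse_linear`, `evaluate`, and the helper functions are hypothetical names, and string keys stand in for Dask’s task keys. A task is merged into its consumer only when it has exactly one consumer and that consumer has exactly one input, so the chain collapses into a single task while the diamond is left untouched.

```python
from math import exp


def dependencies(graph, task):
    """Return the keys of `graph` referenced inside a (possibly nested) task."""
    if isinstance(task, tuple):
        found = set()
        for arg in task[1:]:
            found |= dependencies(graph, arg)
        return found
    if isinstance(task, str) and task in graph:
        return {task}
    return set()


def substitute(task, key, replacement):
    """Replace every reference to `key` inside `task` with `replacement`."""
    if isinstance(task, tuple):
        return (task[0],) + tuple(substitute(a, key, replacement) for a in task[1:])
    return replacement if isinstance(task, str) and task == key else task


def fuse_linear(graph):
    """Inline each task whose only consumer has no other inputs."""
    graph = dict(graph)
    while True:
        deps = {key: dependencies(graph, task) for key, task in graph.items()}
        dependents = {key: set() for key in graph}
        for key, inputs in deps.items():
            for dep in inputs:
                dependents[dep].add(key)
        for child, consumers in dependents.items():
            if len(consumers) != 1:
                continue
            (consumer,) = consumers
            if deps[consumer] == {child}:
                graph[consumer] = substitute(graph[consumer], child, graph[child])
                del graph[child]
                break  # the graph changed, so recompute dependencies
        else:
            return graph  # nothing left to fuse


def evaluate(graph, key):
    """Compute one key by recursively running its task (no caching)."""
    def run(task):
        if isinstance(task, tuple):
            return task[0](*[run(arg) for arg in task[1:]])
        if isinstance(task, str) and task in graph:
            return run(graph[task])
        return task
    return run(graph[key])


def f(w): return w + 1
def g(x): return x * 2
def h(y): return y - 3
def reciprocal(x): return 1 / x
def sub(a, b): return a - b


chain = {"w": 1, "x": (f, "w"), "y": (g, "x"), "z": (h, "y")}
diamond = {"x": 2.0, "e": (exp, "x"), "r": (reciprocal, "x"), "out": (sub, "e", "r")}

fused_chain = fuse_linear(chain)      # a single task: (h, (g, (f, 1)))
fused_diamond = fuse_linear(diamond)  # unchanged: "x" has two consumers
```

In this sketch the diamond keeps all four tasks because `x` feeds two consumers and `out` has two inputs, which is exactly the case the linear rule refuses to touch.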
However when we’re bound by scheduling overhead and when we have plenty of parallel work to do, we’d prefer to fuse these into a single task, even though we lose some potential parallelism. There is a tradeoff here and we’d like to be able to exchange some parallelism (of which we have a lot) for less overhead.

PR here dask/dask #1979 by Erik Welch (Erik has written and maintained most of Dask’s graph optimizations).

### Graph Optimizations - Structural Sharing

Additionally, we no longer make copies of graphs in dask.array. Every collection like a dask.array or dask.dataframe holds onto a Python dictionary holding all of the tasks that are needed to construct that array. When we perform an operation on a dask.array we get a new dask.array with a new dictionary pointing to a new graph. The new graph generally has all of the tasks of the old graph, plus a few more. As a result, we frequently make copies of the underlying task graph.

    y = (x + 1)
    assert set(y.dask).issuperset(x.dask)

Normally this doesn’t matter (copying graphs is usually cheap) but it can become very expensive for large arrays when you’re doing many mathematical operations.

Now we keep dask graphs in a custom mapping (dict-like object) that shares subgraphs with other arrays. As a result, we rarely make unnecessary copies and some algorithms incur far less overhead. Work done in dask/dask #1985.

### TensorFlow and Keras experiments

Two weeks ago I gave a talk with Stan Seibert (Numba developer) on Deep Learning (Stan’s bit) and Dask (my bit). As part of that talk I decided to launch tensorflow from Dask and feed Tensorflow from a distributed Dask array. See this blogpost for more information.

That experiment was nice in that it showed how easy it is to deploy and interact with other distributed services from Dask. However from a deep learning perspective it was immature.
Fortunately, it succeeded in attracting the attention of other potential developers (the true goal of all blogposts) and now Brett Naul is using Dask to manage his GPU workloads with Keras. Brett contributed code to help Dask move around Keras models. He seems to particularly value Dask’s ability to manage resources to help him fully saturate the GPUs on his workstation.

### XGBoost experiments

After deploying Tensorflow we asked what it would take to do the same for XGBoost, another very popular (though very different) machine learning library. The conversation for that is here: dmlc/xgboost #2032 with prototype code here mrocklin/dask-xgboost. As with TensorFlow, the integration is relatively straightforward (if perhaps a bit simpler in this case). The challenge for me is that I have little concrete experience with the applications that these libraries were designed to solve. Feedback and collaboration from open source developers who use these libraries in production is welcome.

### Dask tutorial refactor

The dask/dask-tutorial project on Github was originally written for PyData Seattle in July 2015 (roughly 19 months ago). Dask has evolved substantially since then but this is still our only educational material. Fortunately Martin Durant is doing a pretty serious rewrite, both correcting parts that are no longer modern API, and also adding in new material around distributed computing and debugging.

### Google Cloud Storage

Dask developers (mostly Martin) maintain libraries to help Python users connect to distributed file systems like HDFS (with hdfs3), S3 (with s3fs), and Azure Data Lake (with adlfs), which subsequently become usable from Dask. Martin has been working on support for Google Cloud Storage (with gcsfs) with another small project that uses the same API.

### Cleanup of Dask+SKLearn project

Last year Jim Crist published three great blogposts about using Dask with SKLearn.
The result was a small library, dask-learn, that had a variety of features, some incredibly useful, like a cluster-ready Pipeline and GridSearchCV, others less so. Because of the experimental nature of this work we had labeled the library “not ready for use”, which drew some curious responses from potential users.

Jim is now busy dusting off the project, removing less-useful parts and generally reducing scope to strictly model-parallel algorithms.

## February 19, 2017

### Titus Brown

#### Request for Compute Infrastructure to Support the Data Intensive Biology Summer Institute for Sequence Analysis at UC Davis

Note: we were just awarded this allocation on Jetstream for DIBSI. Huzzah!

## Abstract:

Large datasets have become routine in biology. However, performing a computational analysis of a large dataset can be overwhelming, especially for novices. From June 18 to July 21, 2017 (30 days), the Lab for Data Intensive Biology will be running several different computational training events at the University of California, Davis for 100 people and 25 instructors. In addition, there will be a week-long instructor training in how to reuse our materials, and focused workshops, such as: GWAS for veterinary animals, shotgun environmental -omics, binder, non-model RNAseq, introduction to Python, and lesson development for undergraduates. The materials for the workshop were previously developed and tested by approximately 200 students on Amazon Web Services cloud compute services at Michigan State University's Kellogg Biological Station from 2010 to 2016, with support from the USDA and NIH. Materials are and will continue to be CC-BY, with scripts and associated code under BSD; the material will be adapted for Jetstream cloud usage and made available for future use.

Keywords: Sequencing, Bioinformatics, Training

Principal investigator: C.
Titus Brown Field of science: Genomics Resource Justification: We are requesting 100 m.medium instances with 6 cores, 16 GB RAM, and 130 GB VM space each for each instructor and student for 4 weeks. The total request is for 432,000 service units (6 cores * 24 hrs/day * 30 days * 100 people). To accommodate large size data files, an additional 100 GB of storage volumes are requested for each person. Persistent storage beyond the duration is not necessary for this training workshop. These calculations are based on running the course for seven years with approximately 200 students total over the past six years on AWS cloud services. Syllabus: http://ivory.idyll.org/dibsi/ http://angus.readthedocs.io/en/2016/ Resources: IU/TACC (Jetstream) ## February 16, 2017 ### Continuum Analytics news #### Continuum Analytics to Speak at Galvanize New York Friday, February 17, 2017 Chief Data Scientist and Co-founder Travis Oliphant to Discuss the Open Data Science Innovations That Will Change Our World NEW YORK, NY—February 17, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that Chief Data Scientist and Co-founder Travis Oliphant will be speaking at Galvanize New York. During his Reaching the full-potential of a data-driven world in the Anaconda Community presentation, taking place on February 21 at 7:00 p.m. EST, Oliphant will discuss how the Anaconda platform is bringing together Python and other Open Data Science tools to bring about innovations that will change our world. Oliphant will discuss how the rise of Python and data science has driven tremendous growth of the Open Data Science community. In addition, he will describe the open source technology developed at Continuum Analytics––including a preview of Anaconda Enterprise 5.0––and explain how attendees can participate in the growing business opportunities around the Anaconda ecosystem. 

WHO: Travis Oliphant, chief data scientist and co-founder, Continuum Analytics

WHAT: “Reaching the full-potential of a data-driven world in the Anaconda Community”

WHEN: February 21, 7:00 p.m. - 9:00 p.m. EST

WHERE: Galvanize New York - West Soho - 315 Hudson St. New York, NY 10013

REGISTER: HERE

Oliphant has a Ph.D. from the Mayo Clinic and B.S. and M.S. degrees in Mathematics and Electrical Engineering from Brigham Young University. Since 1997, he has worked extensively with Python for numerical and scientific programming, most notably as the primary developer of the NumPy package, and as a founding contributor of the SciPy package. He is also the author of the definitive Guide to NumPy. He has served as a director of the Python Software Foundation and as a director of NumFOCUS.

About Anaconda Powered by Continuum Analytics

Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io

### Media Contact:

Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

## February 15, 2017

### Continuum Analytics news

#### New Research eBook - Winning at Data Science: How Teamwork Leads to Victory

Wednesday, February 15, 2017
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

As I write this blog post, the entire Anaconda team (myself included) is recovering from an eye-opening, inspiring and downright incredible experience at our very first AnacondaCON event this past week.
The JW Marriott Austin was brimming with hundreds of people looking to immerse themselves in and learn more about Open Data Science in the enterprise. Being at the conference and chatting with customers, prospects, community members and others with inquiring minds and an interest in the Open Data Science movement further validated that data science has emerged more prominently in the enterprise, but not as quickly as we’d like. And not as quickly as enterprise leaders should like - and now is an ideal time to look carefully at priorities.

As we’re catching our breath from the whirlwind that was AnacondaCON, we’re excited to reveal the findings of a study we have been working on for many months, providing answers to some of the questions we saw surface at the event: How are enterprises responding to the undiscovered value and advances in data science? How can they use data science to its full advantage?

We asked company decision leaders (200, to be exact) and data scientists (500+) to help us understand the current beliefs and attitudes on data science. Some highlights include:

• While 96 percent of company execs say data science is critical to the success of their business and 73 percent rank it as one of their top three most valuable technologies, 22 percent aren’t making full use of the data available to them

• A whopping 94 percent of enterprises are using open source for Data Science but only 50 percent are using the results in the frontlines of their business

• Only 31 percent of execs are using data science daily, and less than half have implemented data science teams

• These hesitations to adopt are ultimately due to companies being satisfied with the status quo (38 percent), struggling to calculate ROI (27 percent) and budgetary restrictions (24 percent)

So, what’s missing? Collaboration. Our survey revealed that 69 percent of respondents associate “Open Data Science” with collaboration.
No longer just a one-person job, teams are clearly what’s needed in order to capitalize on the volume of data: data science is a team sport. As we saw at both AnacondaCON and within our survey results, collaboration helps enterprises harness their data faster and extract more value to ultimately give people superpowers to change the world.

Download our full eBook Winning at Data Science: How Teamwork Leads to Victory and read our press release to learn more.

*This study was conducted by research firm Vanson Bourne, surveying 200 company executives and 500 data scientists at U.S. organizations.

### Enthought

#### Traits and TraitsUI: Reactive User Interfaces for Rapid Application Development in Python

The Enthought Tool Suite team is pleased to announce the release of Traits 4.6. Together with the release of TraitsUI 5.1 last year, these core packages of Enthought’s open-source rapid application development tools are now compatible with Python 3 as well as Python 2.7. Long-time fans of Enthought’s open-source offerings will be happy to hear about the recent updates and modernization we’ve been working on, including the recent release of Mayavi 4.5 with Python 3 support, while newcomers to Python will be pleased that there is an easy way to get started with GUI programming which grows to allow you to build applications with sophisticated, interactive 2D and 3D visualizations.

## A Brief Introduction to Traits and TraitsUI

Traits is a mature reactive programming library for Python that allows application code to respond to changes on Python objects, greatly simplifying the logic of an application. TraitsUI is a tool for building desktop applications on top of the Qt or WxWidgets cross-platform GUI toolkits. Traits, together with TraitsUI, provides a programming model for Python that is similar in concept to modern and popular Javascript frameworks like React, Vue and Angular but targeting desktop applications rather than the browser.
Traits is also the core of Enthought’s open source 2D and 3D visualization libraries Chaco and Mayavi, drives the internal application logic of Enthought products like Canopy, Canopy Geoscience and Virtual Core, and Enthought’s consultants appreciate the way it facilitates the rapid development of desktop applications for our consulting clients. It is also used by several open-source scientific software projects such as the HyperSpy multidimensional data analysis library and the pi3Diamond application for controlling diamond nitrogen-vacancy quantum physics experiments, and in commercial projects such as the PyRX Virtual Screening software for computational drug discovery.

The open-source pi3Diamond application built with Traits, TraitsUI and Chaco by Swabian Instruments.

Traits is part of the Enthought Tool Suite of open source application development packages and is available to install through Enthought Canopy’s Package Manager (you can download Canopy here) or via Enthought’s new edm command line package and environment management tool. Running edm install traits at the command line will install Traits into your current environment.

## Traits

The Traits library provides a new type of Python object which has an event stream associated with each attribute (or “trait”) of the object that tracks changes to the attribute. This means that you can decouple your application model much more cleanly: rather than an object having to know all the work which might need to be done when it changes its state, other parts of the application register the pieces of work that each of them needs when the state changes, and Traits automatically takes care of running that code. This results in simpler, more modular and loosely-coupled code that is easier to develop and maintain.
Traits also provides optional data validation and initialization that dramatically reduces the amount of boilerplate code that you need to write to set up objects into a working state and ensure that the state remains valid. This makes it more likely that your code is correct and does what you expect, resulting in fewer subtle bugs and more immediate and useful errors when things do go wrong.

When you consider all the things that Traits does, it would be reasonable to expect that it may have some impact on performance, but the heart of Traits is written in C and knows more about the structure of the data it is working with than general Python code. This means that it can make some optimizations that the Python interpreter can’t, the net result of which is that code written with Traits is often faster than equivalent pure Python code.

## Example: A To-Do List in Traits

To be more concrete, let’s look at writing some code to model a to-do list. For this, we are going to have a “to-do item” which represents one task and a “to-do list” which keeps track of all the tasks and which ones still need to be done.

Each “to-do item” should have a text description and a boolean flag which indicates whether or not it has been done.
In standard Python you might write this something like:

```python
class ToDoItem(object):
    def __init__(self, description='Something to do', completed=False):
        self.description = description
        self.completed = completed
```

But with Traits, this would look like:

```python
from traits.api import Bool, HasTraits, Unicode

class ToDoItem(HasTraits):
    description = Unicode('Something to do')
    completed = Bool
```

You immediately notice that Traits is declarative – all we have to do is declare that the ToDoItem has attributes description and completed and Traits will set those up for us automatically with default values – no need to write an __init__ method unless you want to, and you can override the defaults by passing keyword arguments to the constructor:

```python
>>> to_do = ToDoItem(description='Something else to do')
>>> print(to_do.description)
Something else to do
>>> print(to_do.completed)
False
```

Not only is this code simpler, but we’ve declared that the description attribute’s type is Unicode and the completed attribute’s type is Bool, which means that Traits will validate the type of new values set to these Traits:

```python
>>> to_do.completed = 'yes'
TraitError: The 'completed' trait of a ToDoItem instance must be a boolean, but a value of 'yes' <type 'str'> was specified.
```

Moving on to the second class, the “to-do list” which tracks which items are completed. With standard Python classes each ToDoItem would need to know the list which it belonged to and have a special method that handles changing the completed state, which at its simplest might look something like:

```python
class ToDoItem(object):
    def __init__(self, to_do_list, description='', completed=False):
        self.to_do_list = to_do_list
        self.description = description
        self.completed = completed

    def update_completed(self, completed):
        self.completed = completed
        self.to_do_list.update()
```

And this would be even more complex if an item might be a member of multiple “to do list” instances.
Or worse, some other class which doesn’t have an update() method, but still needs to know when a task has been completed.

Traits solves this problem by making each attribute reactive: there is an associated stream of change events that interested code can subscribe to. You can use the on_trait_change method to hook up a function that reacts to changes:

```python
>>> def observer(new_value):
...     print("Value changed to: {}".format(new_value))
...
>>> to_do.on_trait_change(observer, 'completed')
>>> to_do.completed = True
Value changed to: True
>>> to_do.completed = False
Value changed to: False
```

It would be easy to have the “to-do list” class set up update observers for each of its items. But setting up these listeners manually for everything that you want to listen to can get tedious. For example, we’d need to track when we add new items and remove old items so we could add and remove listeners as appropriate. Traits has a couple of mechanisms to automatically observe the streams of changes and avoid that sort of bookkeeping code. A class holding a list of our ToDoItems which automatically reacts to changes both in the list and in the completed state of each of these items might look something like this:

```python
from traits.api import HasTraits, Instance, Int, List, Property, on_trait_change

class ToDoList(HasTraits):
    items = List(Instance(ToDoItem))
    remaining_items = List(Instance(ToDoItem))
    remaining = Property(Int, depends_on='remaining_items')

    @on_trait_change('items.completed')
    def update(self):
        self.remaining_items = [item for item in self.items
                                if not item.completed]

    def _get_remaining(self):
        return len(self.remaining_items)
```

The @on_trait_change decorator sets up an observer on the items list and the completed attribute of each of the objects in the list which calls the method whenever a change occurs, updating the value of the remaining_items list.
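
To make the bookkeeping mentioned above more concrete, here is a rough pure-Python sketch of what wiring up listeners by hand might involve. It is only an illustration and not how Traits is implemented: the class names ObservableItem and ManualToDoList and their add_listener/remove_listener methods are hypothetical and do not come from the Traits API.

```python
class ObservableItem(object):
    """Hypothetical stand-in for ToDoItem with hand-rolled change events."""

    def __init__(self, description='Something to do', completed=False):
        self.description = description
        self._completed = completed
        self._listeners = []

    def add_listener(self, callback):
        self._listeners.append(callback)

    def remove_listener(self, callback):
        self._listeners.remove(callback)

    @property
    def completed(self):
        return self._completed

    @completed.setter
    def completed(self, value):
        self._completed = value
        # Notify every subscriber by hand whenever the value is set.
        for callback in list(self._listeners):
            callback()


class ManualToDoList(object):
    """Hypothetical list that must attach and detach listeners itself."""

    def __init__(self):
        self.items = []
        self.remaining_items = []

    def add(self, item):
        self.items.append(item)
        item.add_listener(self.update)
        self.update()

    def remove(self, item):
        self.items.remove(item)
        item.remove_listener(self.update)
        self.update()

    def update(self):
        self.remaining_items = [i for i in self.items if not i.completed]


manual_list = ManualToDoList()
first = ObservableItem('Unify relativity and quantum mechanics')
second = ObservableItem('Prove Riemann Hypothesis')
manual_list.add(first)
manual_list.add(second)

first.completed = True
print(len(manual_list.remaining_items))

manual_list.remove(first)
first.completed = False  # no longer observed, so the list does not react
print(len(manual_list.remaining_items))
```

Every add and remove has to be paired with the matching listener call, and forgetting one leaves a stale or missing subscription. That pairing is the kind of code the decorator-based version is meant to spare you.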
An alternative way of reacting is to have a Property, which is similar to a regular Python property, but which is lazily recomputed as needed when a dependency changes. In this case the remaining property listens for when the remaining_items list changes and will be recomputed by the specially-named _get_remaining method when the value is next asked for.

```python
>>> todo_list = ToDoList(items=[
...     ToDoItem(description='Unify relativity and quantum mechanics'),
...     ToDoItem(description='Prove Riemann Hypothesis')])
>>> print(todo_list.remaining)
2
>>> todo_list.items[0].completed = True
>>> print(todo_list.remaining)
1
```

Perhaps the most important fact about this is that we didn’t need to modify our original ToDoItem in any way to support the ToDoList functionality. In fact we can have multiple ToDoLists sharing ToDoItems, or even have other objects which listen for changes to the ToDoItems, and everything still works with no further modifications. Each class can focus on what it needs to do without worrying about the internals of the other classes.

Hopefully you can see how Traits allows you to do more with less code, making your applications and libraries simpler, more robust and flexible. Traits has many more features than we can show in a simple example like this, but comprehensive documentation is available at http://docs.enthought.com/traits and, being BSD-licensed open-source, the Traits code is available at https://github.com/enthought/traits.

## TraitsUI

One place where reactive frameworks really shine is in building user interfaces. When a user interacts with a GUI they change the state of the UI and a reactive system can use those changes to update the state of the business model appropriately. In fact all of the reactive Javascript frameworks mentioned earlier in the article come with integrated UI systems that make it very easy to describe a UI view declaratively with HTML and hook it up to a model in Javascript.
In the same way Traits comes with strong integration with TraitsUI, a desktop GUI-building library that allows you to describe a UI view declaratively with Python and hook it up to a Traits model. TraitsUI itself sits on top of either the wxPython, PyQt or PySide GUI library wrappers, and in principle could have other backends written for it if needed. Between the facilities that the Traits and TraitsUI libraries provide, it is possible to quickly build desktop applications with clear separation of concerns between UI and business logic.

TraitsUI uses the standard Model-View-Controller or Model-View-ViewModel patterns for building GUI applications, and it allows you to add complexity as needed. Often all that you require is a model class written in Traits and a simple declarative view on that class, and TraitsUI will handle the rest for you.

## Example: A To-Do List UI

Getting started with TraitsUI is simple. If you have TraitsUI and a compatible GUI toolkit installed in your working environment, such as by running at the command line:

```
edm install pyqt traitsui
```

then any Traits object has a default GUI available with no additional work:

```python
>>> todo_item.configure_traits()
```

With a little more finesse we can improve the view. In TraitsUI you do this by creating a View for your HasTraits class:

```python
from traitsui.api import HGroup, Item, VGroup, View

todo_item_view = View(
    VGroup(
        Item('description', style='custom', show_label=False),
        HGroup(Item('completed')),
    ),
    title='To Do',
    width=360,
    height=240,
    resizable=True,
)
```

Views are defined declaratively, and are independent of the model: we can have multiple Views for the same model class, or even have a View which works with several different model classes. You can even declare a default view as part of your class definition if you want. In any case, once you have a view you can use it by passing it as the view parameter:

```python
>>> todo_item.configure_traits(view=todo_item_view)
```

This produces a fairly nice, if basic, UI to edit an object.
If you run these examples within an interactive IPython terminal session (such as from the Canopy editor, or the IPython QtConsole), you’ll see that these user interfaces are hooked up “live” to the underlying Traits objects: when you type into the text field or toggle the “completed” checkbox, the values of attributes change automatically. Coupled with the ability to write Traits code that reacts to those changes you can write powerful applications with comparatively little code. For a complete example of a TraitsUI application, have a look at the full to-do list application on Github.

These examples only scratch the surface of what TraitsUI is capable of. With more work you can create UIs with complex views including tables and trees, or add menu bars and toolbars which can drive the application. And 2D and 3D plotting is available via the Chaco and Mayavi libraries. Full documentation for TraitsUI is available at http://docs.enthought.com/traitsui including a complete example of writing an image capture application using TraitsUI.

## Enthought Tool Suite

The Enthought Tool Suite is a battle-tested rapid application development system. If you need to make your science or business code more accessible, it provides a toolkit that you can use to build applications for your users with a lot less difficulty than using low-level wxPython or Qt code. And if you want to focus on what you do best, Enthought Consulting can provide scientist-developers to work on your project.

But best of all the Enthought Tool Suite is open source and licensed with the same BSD-style license as Python itself, so it is free to use, and if you love it and find you want to improve it, we welcome your participation and contributions! Join us at http://www.github.com/enthought.

## Traits and TraitsUI Resources

The post Traits and TraitsUI: Reactive User Interfaces for Rapid Application Development in Python appeared first on Enthought Blog.

### Continuum Analytics news

#### New Research Proves Increased Awareness in the Value of Open Data Science, but Enterprises are Slow to Respond

Wednesday, February 15, 2017

Data science is critical for success, but Continuum Analytics finds just 49 percent have data science teams in place

AUSTIN, Texas - February 15, 2017 - New research announced today by Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, finds that 96 percent of data science and analytics decision makers agree that data science is critical to the success of their business, yet a whopping 22 percent are failing to make full use of the data available. These findings are included in Continuum Analytics’ new eBook, Winning at Data Science: How Teamwork Leads to Victory, based on the company’s inaugural study that explores the state of Open Data Science in the enterprise. Download the eBook here.

The research, conducted by independent research firm Vanson Bourne, surveyed 200 data science and analytics decision makers at U.S. organizations of all sizes and industries, to examine the state of Open Data Science in the enterprise. Continuum Analytics also surveyed more than 500 data scientists to uncover similarities and disparities between the two groups. Topics ranged from the value of data science, challenges around adoption and how data science is being utilized in the enterprise.

Key takeaways and findings from the research include:

• The benefits of data science in the enterprise are undisputed; 73 percent of respondents ranked it as one of the top three most valuable technologies they use. Conversely, findings show that a disparity exists between understanding the impact of data science and actually executing it in the enterprise: 62 percent said data science is used at least on a weekly basis, but just 31 percent of that group are using it daily.

• When comparing the beliefs of executives/IT managers with data scientists, nearly all respondents from both groups agree on the critical impact of data science in the enterprise. However, a divide exists around where companies are in the data science lifecycle. Just 24 percent of data scientists feel their companies have reached the “teen” stage (developed enough to hold its own with room to mature), as opposed to the 40 percent of executives who feel confident they have arrived at this stage of development.

• Despite the benefits offered by data science, 22 percent of enterprise respondents report that their teams are failing to use the data to its potential. What’s more, 14 percent use data science very minimally or not at all, due to three primary adoption barriers: executive teams that are satisfied with the status quo (38 percent), a struggle to calculate ROI (27 percent) and budgetary restrictions (24 percent).

While obstacles persist, an increasingly data-driven world calls for data science teams in the enterprise; it’s not a one-person job. Though 89 percent of organizations have at least one data scientist, less than half have data science teams. Findings revealed that 69 percent of respondents associate Open Data Science with collaboration, proving that teamwork is essential to exploit the power of the data, requiring a combination of skills best tackled by a strong team.

“Over 94 percent of the enterprises in the survey rely on open source for data science. Open Data Science is the Rosetta Stone to unlocking the value locked away in data, especially Big Data,” said Michele Chambers, EVP Anaconda Business Unit, Continuum Analytics. “Our research shows that data science is no longer just for competitive advantage; it needs to be infused into day-to-day operations to maximize the value of data. Data science is business and the best run businesses run Open Data Science.”

For more information about the survey results, read the Anaconda blog post here.
To view the full eBook, download here.

About Anaconda powered by Continuum Analytics

Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io.

## February 14, 2017

### Titus Brown

#### My thoughts for "Imagining Tomorrow's University"

So I've been invited to Imagining Tomorrow's University, and they have this series of questions they'd like me to answer. (Note that you can follow the conversation at #TomorrowsUni on Twitter.)

Conveniently I already answered many of these questions in my "What is Open Science?" blog post. I've copy/pasted from that for the first two answers.

Q: What is your two sentence definition of open science (or open research)?

A: Open science is the philosophical perspective that sharing is good and that barriers to sharing should be lowered as much as possible. The practice of open science is concerned with the details of how to lower or erase the technical, social, and cultural barriers to sharing.

Q: Why is open science important for transforming research and learning?

A: The potential value of open science should be immediately obvious: easier and faster access to ideas, methods, and data should drive science forward faster! But open science can also aid with reproducibility and replication, decrease the effects of economic inequality in the sciences by liberating ideas from subscription paywalls, and provide reusable materials for teaching and training.

Q: How can open science increase the societal impact of university research?

A: I have two answers. First, if open science accelerates research progress, then that increases the societal impact intrinsically. Second, serendipity will strike. Most of my "wins" from open science have been unexpected - people using our research products in ways I never could have predicted or intended. This is really only possible if those research products are made fully available.

Q: How is open science part of, and important for your own research, teaching, and service agendas?

A: I think it's philosophically central to my view of how research should work. In that sense, it's integral to our research agenda, and it increases the impacts of our research and teaching. For service, I'm not sure what to say, although I prefer to donate my time to open organizations.

Q: What are the important activities, structures, etc. that have supported you in pursuing open science?

A: If I had to pick one, it would be the Moore Foundation. Without question, the Moore Foundation Data Driven Discovery Investigator award (links here) validated my decision to do open science in the past, and in turn gives me the freedom to try new things in the future.

I think blogging and Twitter have been integral to my pursuit of open science, my development of perspectives, and my discovery of a community of thought and practice around open science.

Q: What are the major technical, organizational, social, or cultural challenges you face, particularly as related to openness and sharing within your university and academia?

A: While most scientists are supportive of open science in theory, in fact most scientists are leery of actually sharing things widely before publication. This is disappointing but understandable in light of the incentive systems in place.

At my Assistant Professor job, I received a lot of administrator pushback on the time I was expending on open science, and this even made its way into a tenure letter.
That having been said, in publication and funding reviews, I've received nothing but positive comments, so I think that's more important than what my administrative chain says.

My colleagues have been nothing but supportive (see above, "theory" vs "practice".)

Q: If you had a senior leadership role in a university, what would you do to promote change and improve your university?

A: I'm not convinced there's anything that can be done by a university leader. University leadership is largely irrelevant to the daily practice of research, teaching, and service, in my experience. (I think university leadership is very important in facilitating a good environment at their institution, so they're not useless at all; they just don't have anything to do with my research directly, and nor should they.)

I think we need community leaders to effect change, and by community leaders I mean research leaders (senior folk with strong research careers - members of the National Academy, Nobel laureates, etc.). These folk need to visibly and loudly abandon the broken "journal prestige" system, forcefully push back against university administration on matters of research evaluation and tenure, and be a loud presence on grant panels and editorial boards.

The other thing we need is more open science practice. I feel like too much time is spent talking about how wonderful open science would be if we could just mandate foo bar and baz, and not enough time is spent actually doing science.

Conveniently, Bjorn Brembs has written up this problem in detail.

Q: What $10M or more, risky and potentially transformative, big idea research proposal would you be writing if you had the right open science resources, and institutional support?

A: What a coincidence! I happen to have written something up here, What about all those genes of unknown function?. But it would cost $50m. One particularly relevant bit: More importantly, I'd insist on pre-publication sharing of all the data within a walled garden of all the grantees, together with regular meetings at which all the grad students and postdocs could mix to talk about how to make use of the data. (This is an approach that Sage Biosciences has been pioneering for biomedical research.) I'd probably also try to fund one or two groups to facilitate the data storage and analysis -- maybe at $250k a year or so? -- so that all of the technical details could be dealt with.

But, while this approach could have massive impact on biology, I can answer the question a different way, too: what would I do with $10m if it landed in my lap? I'd probably try to build something like Manylabs. I was pretty inspired by the environment there during a recent visit, and I think it could translate into a slightly more academic setting easily. I envision an institute that combines open space for brainstorming, collaboration, and networking, with regular short-term training events (a la Software & Data Carpentry) and long-term data science fellows (a la the Moore/Sloan Data Science Environments) while providing grants for a bunch of sabbatical folk. I'd park it in downtown Davis (good coffee, beer, food, bicycling), fill it with interesting people, and stir well. However, let's be honest -- $10m isn't enough to effect real change in our university system, and in any case my experience with big grants is you have to over-promise in order to get the necessary funding. (There are one or two exceptions to this, but it's a pretty good rule ;).

If you wanted me to effect interesting change on the university level, I'd need about $10m a year for 10 years to run an incubator institute as a proof of concept. And given $1 bn/10 years to spend, I think we could do something really interesting by building a decentralized university for teaching and research. Happy to chat...

I have more to say but maybe I'll save it for the post-event blogging :)

--titus

### Continuum Analytics news

#### The Dominion: An Open Data Science Film

Thursday, February 9, 2017
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

The inaugural AnacondaCON event was full of surprises. Personalized Legos, delicious Texan BBQ, Anaconda “swag” and a preview of what might be the most dramatic, life-altering data science movie of all time: The Dominion.

The Dominion tells the story of a world similar to ours, one full of possibilities but threatened by old machines and bad code. Leo, the main character, joins forces with Matrix-like heroes to save businesses from alien-infected software. Leo and his team take it upon themselves to fight the Dominion's agents, the Macros, who want to stop the world from open source innovation.

In an effort to defeat the Macros, they board the rebel fleet flagship, The Anaconda, loaded with data science packages and payload to help the team drop into any environment to free people from the Macros’ hindering code.

Leo and his team believe that the world is moving to Open Data Science, and that distributed cloud data is going to liberate humanity—all they have to do is work together to make it happen.

The question is—do you? Are you ready for the future? Board The Anaconda to begin the journey with Leo, his team and us. The time is now.

Couldn’t make it to AnacondaCON? There’s always next year. In the meantime, watch the full trailer below.

#### AnacondaCON Recap: A Community Comes Together

Tuesday, February 14, 2017
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

Last week, more than 400 Open Data Science community members descended on the city of Austin to attend our inaugural AnacondaCON event. From data scientists to engineers to business analysts, community members shared best practices, fostered new connections and worked to accelerate both their personal and their organizations’ paths to Open Data Science. One thing is certain: by the end of the three days, everyone felt the unmistakable buzz and excitement around a growing market on the edge of transforming the world.

In case you weren’t able to join us (or are just looking to relive the fun), below are some highlights from AnacondaCON 2017:

Peter Wang, our CTO and co-founder, kicked off the event with a keynote about the ongoing need for better access to data, a theme that was echoed throughout a number of breakout sessions over the two days.

Sessions varied between business and technology tracks, enabling everyone to find a topic that resonated with them. Industry leaders from organizations such as Forrester, Capital One, General Electric, Clover Health, Bloomberg and many others shared personal insights on working with Open Data Science. These speakers also discussed areas that they hoped the community would give attention to.

We encouraged everyone to bring their data science teams, and many did—including our newest #AnacondaCREW members from Legoland! The theme of teamwork and collaboration permeated throughout the event. Since data science is a team sport—and we don’t want anyone working alone—we gave everyone their very own team…a team of Anaconda Legos to take home!

After an insightful and inspiring first day, we ended the night with some southern hospitality. We changed gears, spending time getting to know each other and having some fun! With a spread of classic Texan BBQ, a mariachi band and Anaconda-themed airbrush tattoos, I think it’s safe to say no one got a lot of sleep that night.

Day two was just as successful—between sessions, attendees engaged in lively discussions about different ways to capitalize on data science to drive new outcomes and impact organizations in new ways. The Artificial Intelligence panel and City of Boston presentation on data science in the public sector also helped spark exciting conversation among attendees.

Travis Oliphant, chief data scientist and co-founder of Continuum Analytics, closed out the conference by sharing a look at the tremendous growth of the Anaconda community and his thoughts on how the community can continue to grow and thrive, pushing the Open Data Science movement forward.

On behalf of our entire team, we want to thank everyone who attended and helped make this year’s event such a success. Below are a few of your own highlights from the event. Hope to see you again in 2018.

All photos courtesy of Casey Chapman-Ross Photography

### Matthieu Brucher

#### Review of Intel Parallel Studio 2017: Advisor

Recently, I got access to the latest release of Parallel Studio with an updated version of Advisor. Six years after my last review, let’s dive into it again!

First, lots of things have changed since the first release of Parallel Studio. Many of the dedicated tools were merged with the big ones (Amplifier is now VTune, Composer is Debugger…) but still kept their nicer GUIs and workflows. They also evolved quite a lot in a decade, and now focus on how to extract the maximum performance from Intel CPUs through vectorization and threading.

#### Context

Intel provides a large range of tools, from a compiler to a sampling profiler (based on hardware counters), from a debugger to tools that analyze program behavior. Of course, all these applications have a goal: selling more chips. As it’s not just about the chip, this is a fair fight: you need to be able to make the most of your hardware.

Of course, I’ll use Audio Toolkit for the demo and the screenshots.

#### Presentation

Let’s start at the beginning. You will need to set up a new project (or open an old one), which leads you to the following screenshot.

Setting up project properties in Advisor

Survey hotspots analysis is basically what you require. In Survey trip count analysis, you may also want to tick Collect information about FLOPS if that’s the kind of analysis you are looking for. In future versions, it will be required for the roofline analysis, which is currently not commercially available.

Once the configuration is done, let’s run the 1. Survey target, which would lead to the next screenshot.

Summary tab

I suggest saving snapshots (with the camera icon) after a run, as each run will actually overwrite e000, and bundling at least the source code with them.

Now it is possible to see the results:

Advisor results for an IIR filter

I guess it is time now for a quick demo on how we can decipher such results and improve on them.

#### Demo

The first interesting bit is that it is indeed the IIR filter that takes most of the relevant time. Advisor only works on loops, but as audio processing is all about loops, everything is fine. Each loop has different annotations, and the ones in the IIR filter have the note “Compiler lacks sufficient information to vectorize the loop”. The issue here is that the Visual Studio compiler can’t vectorize properly, so let’s use the Intel compiler (in CMake GUI, use -T “Intel C++ Compiler 17.0”).

Advisor results for an IIR filter (Intel compiler)

I added a pragma in the source code here to force the vectorization, so the results are quite interesting. We have a good speed-up compared to the previous version (6.9s to 4.7s), but the comparison is skewed because the order of the filter is odd, so there is an even number of coefficients for this loop (the FIR part of the filter), and this works great for SSE2. Only one loop is vectorized here: the one whose icon is orange instead of blue.

If I want to push and ask for AVX instructions, then we will start seeing indications that the loop may be inefficient. In the following screenshot, I reordered the FIR loop so that we are vectorized on the number of samples of the processing and not the number of coefficients (usually there are only a handful of coefficients but up to hundreds of samples, so far more opportunities for vectorization). So this loop is not marked as inefficient. But the second one (the IIR part) is inefficient as we can’t reorder the loop straight away.
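To make the reordering concrete, here is a minimal Python sketch. It is not Audio Toolkit's actual C++ code: the function names, the input block and the coefficients are all made up for illustration. Both functions compute the same FIR output, but the second one puts the long loop over samples innermost, which is the loop a compiler has a chance to vectorize.

```python
def fir_coeffs_inner(x, b):
    """FIR filter with the short loop over coefficients innermost."""
    order = len(b) - 1
    y = [0.0] * (len(x) - order)
    for n in range(len(y)):          # long loop: samples
        for k, bk in enumerate(b):   # short loop: coefficients
            y[n] += bk * x[n + order - k]
    return y


def fir_samples_inner(x, b):
    """Same filter, reordered so the long loop over samples is innermost."""
    order = len(b) - 1
    y = [0.0] * (len(x) - order)
    for k, bk in enumerate(b):       # short loop: coefficients
        for n in range(len(y)):      # long loop: samples (vectorizable)
            y[n] += bk * x[n + order - k]
    return y


x = [float(i % 7) for i in range(64)]   # hypothetical input block
b = [0.5, 0.25, 0.125, 0.125]           # hypothetical coefficients
assert fir_coeffs_inner(x, b) == fir_samples_inner(x, b)
```

The same trick is not directly available for the IIR part, because each output sample depends on previous output samples.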

Advisor results for an IIR filter (optimized Intel compiler)

Here, we see that Advisor tags all the calls to the loop as Remainder (or Vectorized Remainder), which is the part where the vectorized loop finishes. A vectorized loop starts with Peel, before the samples are aligned, continues with Body, when the data is aligned and the full content of the register is used, and ends with Remainder, when the data is aligned but only the first part of the vector register can be used. The efficiency of this loop is poor, only 9%, compared to the 76% of the reordered loop.
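The Peel/Body/Remainder split can be illustrated with a little arithmetic. The sketch below is a simplification under stated assumptions: alignment is counted in elements rather than bytes, and `split_loop` is a hypothetical helper, not something Advisor or the compiler exposes.

```python
def split_loop(start_index, n, width):
    """Split n iterations starting at start_index into (peel, body, remainder).

    width is the number of elements per vector register; an index is
    treated as aligned when it is a multiple of width.
    """
    peel = min(n, (-start_index) % width)   # scalar iterations until aligned
    body = ((n - peel) // width) * width    # iterations run on full registers
    remainder = n - peel - body             # leftover, partially filled register
    return peel, body, remainder


# 8 single-precision floats fit in a 256-bit AVX register
print(split_loop(start_index=3, n=100, width=8))   # (5, 88, 7)
print(split_loop(start_index=0, n=5, width=8))     # (0, 0, 5)
```

The second call mimics a loop over a handful of coefficients: it never fills a register, so every iteration lands in the remainder, which is consistent with the poor efficiency reported above.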

#### Conclusion

This was a small tutorial on Advisor. I also aligned the arrays in a filter so that Peel would be reduced, and made other optimizations. I didn’t talk about the rest of the analytics Advisor provides, but you get the idea, and part of the fun of these tools is exploring them.

One final note: Advisor doesn’t like huge applications. It thrives on small applications (with a small number of loops), so try to extract your kernels with representative data.

## February 11, 2017

### Matthew Rocklin

#### Experiment with Dask and TensorFlow

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

## Summary

This post briefly describes potential interactions between Dask and TensorFlow and then goes through a concrete example using them together for distributed training with a moderately complex architecture.

This post was written in haste and the attached experiment is of low quality; see the disclaimers below. A similar and much better example with XGBoost is included in the comments at the end.

## Introduction

Dask and TensorFlow both provide distributed computing in Python. TensorFlow excels at deep learning applications while Dask is more generic. We can combine both together in a few applications:

1. Simple data parallelism: hyper-parameter searches during training and predicting already-trained models against large datasets are both trivial to distribute with Dask as they would be trivial to distribute with any distributed computing system (Hadoop/Spark/Flink/etc.). We won’t discuss this topic much. It should be straightforward.
2. Deployment: A common pain point with TensorFlow is that setup isn’t well automated. This plagues all distributed systems, especially those that are run on a wide variety of cluster managers (see cluster deployment blogpost for more information). Fortunately, if you already have a Dask cluster running it’s trivial to stand up a distributed TensorFlow network on top of it running within the same processes.
3. Pre-processing: We pre-process data with dask.dataframe or dask.array, and then hand that data off to TensorFlow for training. If Dask and TensorFlow are co-located on the same processes then this movement is efficient. Working together we can build efficient and general use deep learning pipelines.

In this blogpost we look very briefly at the first case of simple parallelism, then go into more depth on an experiment that uses Dask and TensorFlow in a more complex situation. We’ll find we can accomplish a fairly sophisticated workflow easily, both due to how sensible TensorFlow is to set up and how flexible Dask can be in advanced situations.

## Motivation and Disclaimers

Distributed deep learning is fundamentally changing the way humanity solves some very hard computing problems like natural language translation, speech-to-text transcription, image recognition, etc. However, distributed deep learning also suffers from public excitement, which may distort our image of its utility. Distributed deep learning is not always the correct choice. This is for two reasons:

1. Focusing on single machine computation is often a better use of time. Model design, GPU hardware, etc. can have a more dramatic impact than scaling out. For newcomers to deep learning, watching online video lecture series may be a better use of time than reading this blogpost.
2. Traditional machine learning techniques like logistic regression and gradient boosted trees can be more effective than deep learning if you have finite data. They can also sometimes provide valuable interpretability results.

Regardless, there are some concrete take-aways, even if distributed deep learning is not relevant to your application:

1. TensorFlow is straightforward to set up from Python
2. Dask is sufficiently flexible out of the box to support complex settings and workflows
3. We’ll see an example of a typical distributed learning approach that generalizes beyond deep learning.

Additionally the author does not claim expertise in deep learning and wrote this blogpost in haste.

## Simple Parallelism

Most parallel computing is simple. We easily apply one function to lots of data, perhaps with slight variation. In the case of deep learning this can enable a couple of common workflows:

1. Build many different models, train each on the same data, choose the best performing one. Using dask’s concurrent.futures interface, this looks something like the following:

# Hyperparameter search
from dask.distributed import Client

client = Client('dask-scheduler-address:8786')
scores = client.map(train_and_evaluate, hyper_param_list, data=data)
best = client.submit(max, scores)
best.result()

2. Given an already-trained model, use it to predict outcomes on lots of data. Here we use a big data collection like dask.dataframe:

# Distributed prediction
import dask.dataframe as dd

df = dd.read_parquet('...')
... # do some preprocessing here
df['outcome'] = df.map_partitions(predict)


These techniques are relatively straightforward if you have modest exposure to Dask and TensorFlow (or any other machine learning library like scikit-learn), so I’m going to ignore them for now and focus on more complex situations.

Interested readers may find this blogpost on TensorFlow and Spark of interest. It is a nice writeup that goes over these two techniques in more detail.
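The hyper-parameter search pattern above can also be tried without a cluster. The sketch below swaps Dask's client for the standard library's `concurrent.futures` executor, which exposes a similar submit/result interface. The `train_and_evaluate` function here is a hypothetical stand-in that returns a made-up score, not a real training routine.

```python
from concurrent.futures import ThreadPoolExecutor


def train_and_evaluate(params, data):
    """Hypothetical stand-in: the score is higher when rate is closer to 0.1."""
    return -abs(params["rate"] - 0.1) * len(data)


data = list(range(10))
hyper_param_list = [{"rate": r} for r in (0.001, 0.01, 0.1, 1.0)]

with ThreadPoolExecutor(max_workers=4) as executor:
    # One task per hyper-parameter setting, all sharing the same data
    futures = [executor.submit(train_and_evaluate, p, data=data)
               for p in hyper_param_list]
    scores = [f.result() for f in futures]

# Pick the setting with the highest score
best_score, best_params = max(zip(scores, hyper_param_list),
                              key=lambda pair: pair[0])
print(best_params)
```

With Dask the structure is the same, except that the tasks run on remote workers and the results stay on the cluster until you ask for them.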

## A Distributed TensorFlow Application

We’re going to replicate this TensorFlow example which uses multiple machines to train a model that fits in memory using parameter servers for coordination. Our TensorFlow network will have three different kinds of servers:

1. Workers: which will get updated parameters, consume training data, and use that data to generate updates to send back to the parameter servers
2. Parameter Servers: which will hold onto model parameters, synchronizing with the workers as necessary
3. Scorer: which will periodically test the current parameters against validation/test data and emit a current cross_entropy score to see how well the system is running.

This is a fairly typical approach when the model fits on one machine but we want to use multiple machines, either to accelerate training or because data volumes are too large.
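To show how the three roles interact, here is a toy, single-process simulation. It assumes a one-parameter linear model and plain gradient descent rather than a neural network, and the class and function names are invented for this sketch; it involves neither TensorFlow nor Dask.

```python
class ParameterServer:
    """Holds the single model parameter w and applies gradient updates."""

    def __init__(self, w=0.0, learning_rate=0.1):
        self.w = w
        self.learning_rate = learning_rate

    def pull(self):
        return self.w

    def push(self, gradient):
        self.w -= self.learning_rate * gradient


def worker_step(ps, batch):
    """Worker: pull parameters, compute a gradient on one batch, push it back."""
    w = ps.pull()
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    ps.push(grad)


def score(ps, data):
    """Scorer: mean squared error of the current parameter."""
    w = ps.pull()
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


# Toy data for y = 3 * x, split into one batch per worker
data = [(i / 10, 3 * i / 10) for i in range(1, 21)]
batches = [data[i:i + 5] for i in range(0, len(data), 5)]

ps = ParameterServer()
for epoch in range(10):
    for batch in batches:          # each worker takes a turn
        worker_step(ps, batch)

print(ps.pull(), score(ps, data))
```

In the real setup the workers run concurrently on different machines, so the parameters a worker pulls may already be slightly stale by the time it pushes its update.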

We’ll use TensorFlow to do all of the actual training and scoring. We’ll use Dask to do everything else. In particular, we’re about to do the following:

1. Prepare data with dask.array
2. Set up TensorFlow workers as long-running tasks
3. Feed data from Dask to TensorFlow while scores remain poor
4. Let TensorFlow handle training using its own network

## Prepare Data with Dask.array

For this toy example we’re just going to use the mnist data that comes with TensorFlow. However, we’ll artificially inflate this data by concatenating it to itself many times across a cluster:

def get_mnist():
    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets('/tmp/mnist-data', one_hot=True)
    return mnist.train.images, mnist.train.labels

import dask.array as da
from dask import delayed

datasets = [delayed(get_mnist)() for i in range(20)]  # 20 versions of same dataset
images = [d[0] for d in datasets]
labels = [d[1] for d in datasets]

images = [da.from_delayed(im, shape=(55000, 784), dtype='float32') for im in images]
labels = [da.from_delayed(la, shape=(55000, 10), dtype='float32') for la in labels]

images = da.concatenate(images, axis=0)
labels = da.concatenate(labels, axis=0)

>>> images
dask.array<concate..., shape=(1100000, 784), dtype=float32, chunksize=(55000, 784)>

images, labels = client.persist([images, labels])  # persist data in memory


This gives us a moderately large distributed array of around a million tiny images. If we wanted to we could inspect or clean up this data using normal dask.array constructs:
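As a quick sanity check on "moderately large", the numbers from the array representation above can be worked out by hand. This only counts the raw float32 values of the images and ignores the labels and any per-chunk overhead.

```python
copies = 20                   # number of times MNIST is repeated
rows_per_copy = 55000         # training images in one copy
pixels = 784                  # 28 * 28 pixels per image
bytes_per_value = 4           # float32

total_rows = copies * rows_per_copy
total_bytes = total_rows * pixels * bytes_per_value

print(total_rows)             # 1100000
print(total_bytes / 1e9)      # 3.4496 (gigabytes)
```

So the images take roughly 3.4 GB spread across the cluster.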

import matplotlib.pyplot as plt

im = images[1].compute().reshape((28, 28))
plt.imshow(im, cmap='gray')


im = images.mean(axis=0).compute().reshape((28, 28))
plt.imshow(im, cmap='gray')


im = images.var(axis=0).compute().reshape((28, 28))
plt.imshow(im, cmap='gray')


This shows off how one can use Dask collections to clean up and provide pre-processing and feature generation on data in parallel before sending it to TensorFlow. In our simple case we won’t actually do any of this, but it’s useful in more real-world situations.

Finally, after doing our preprocessing on the distributed array of all of our data we’re going to collect images and labels together and batch them into smaller chunks. Again we use some dask.array constructs and dask.delayed when things get messy.

images = images.rechunk((10000, 784))
labels = labels.rechunk((10000, 10))

images = images.to_delayed().flatten().tolist()
labels = labels.to_delayed().flatten().tolist()
batches = [delayed([im, la]) for im, la in zip(images, labels)]

batches = client.compute(batches)


Now we have 110 pairs of NumPy arrays (1,100,000 rows in chunks of 10,000) in distributed memory waiting to be sent to a TensorFlow worker.

## Setting up TensorFlow workers alongside Dask workers

Dask workers are just normal Python processes. TensorFlow can launch itself from a normal Python process. We’ve made a small function here that launches TensorFlow servers alongside Dask workers using Dask’s ability to run long-running tasks and maintain user-defined state. All together, this is about 80 lines of code (including comments and docstrings) and allows us to define our TensorFlow network on top of Dask as follows:

$ pip install git+https://github.com/mrocklin/dask-tensorflow

from dask.distributed import Client  # we already had this above
client = Client('dask-scheduler-address:8786')

from dask_tensorflow import start_tensorflow
tf_spec, dask_spec = start_tensorflow(client, ps=1, worker=4, scorer=1)

>>> tf_spec.as_dict()
{'ps': ['192.168.100.1:2227'],
 'scorer': ['192.168.100.2:2222'],
 'worker': ['192.168.100.3:2223',
            '192.168.100.4:2224',
            '192.168.100.5:2225',
            '192.168.100.6:2226']}

>>> dask_spec
{'ps': ['tcp://192.168.100.1:34471'],
 'scorer': ['tcp://192.168.100.2:40623'],
 'worker': ['tcp://192.168.100.3:33075',
            'tcp://192.168.100.4:37123',
            'tcp://192.168.100.5:32839',
            'tcp://192.168.100.6:36822']}


This starts three groups of TensorFlow servers in the Dask worker processes. TensorFlow will manage its own communication but co-exist right alongside Dask in the same machines and in the same shared memory spaces (note that in the specs above the IP addresses match but the ports differ).

This also sets up a normal Python queue along which Dask can safely send information to TensorFlow. This is how we’ll send those batches of training data between the two services.
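The queue hand-off can be pictured with the standard library alone. The following is a minimal sketch of the producer/consumer pattern, not the actual dask-tensorflow implementation: the `consume` function, the stop flag and the string "batches" are assumptions made for illustration.

```python
import threading
from queue import Queue, Empty


def consume(queue, received, stop):
    """Stand-in for a TensorFlow worker: pull batches until told to stop."""
    while not stop.is_set():
        try:
            batch = queue.get(timeout=0.1)
        except Empty:
            continue               # nothing yet, check the stop flag again
        received.append(batch)     # a real worker would run a training step
        queue.task_done()


queue = Queue()
received = []
stop = threading.Event()
consumer = threading.Thread(target=consume, args=(queue, received, stop))
consumer.start()

for i in range(5):                 # stand-in for Dask handing over batches
    queue.put(("images-%d" % i, "labels-%d" % i))

queue.join()                       # wait until every batch was processed
stop.set()
consumer.join()
```

Because `queue.Queue` is thread-safe, the producer and the consumer never need to share any other state, which is what makes it a safe bridge between two systems living in the same process.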

## Define TensorFlow Model and Distribute Roles

Now is the part of the blogpost where my expertise wanes. I’m just going to copy-paste-and-modify a canned example from the TensorFlow documentation. This is a simplistic model for this problem and it’s entirely possible that I’m making transcription errors. But still, it should get the point across. You can safely ignore most of this code. Dask stuff gets interesting again towards the bottom:

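import math
import tempfile
import time
from queue import Empty

import tensorflow as tf

IMAGE_PIXELS = 28
hidden_units = 100
learning_rate = 0.01
sync_replicas = False
# Assumed definition: num_workers is used below when sync_replicas is True
# but is not defined anywhere in the original listing
num_workers = len(dask_spec['worker'])
replicas_to_aggregate = len(dask_spec['worker'])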

def model(server):
    worker_device = "/job:%s/task:%d" % (server.server_def.job_name,
                                         server.server_def.task_index)
    task_index = server.server_def.task_index
    is_chief = task_index == 0

    with tf.device(tf.train.replica_device_setter(
            worker_device=worker_device,
            ps_device="/job:ps/cpu:0",
            cluster=tf_spec)):

        global_step = tf.Variable(0, name="global_step", trainable=False)

        # Variables of the hidden layer
        hid_w = tf.Variable(
            tf.truncated_normal(
                [IMAGE_PIXELS * IMAGE_PIXELS, hidden_units],
                stddev=1.0 / IMAGE_PIXELS),
            name="hid_w")
        hid_b = tf.Variable(tf.zeros([hidden_units]), name="hid_b")

        # Variables of the softmax layer
        sm_w = tf.Variable(
            tf.truncated_normal(
                [hidden_units, 10],
                stddev=1.0 / math.sqrt(hidden_units)),
            name="sm_w")
        sm_b = tf.Variable(tf.zeros([10]), name="sm_b")

        # Ops: located on the worker specified with task_index
        x = tf.placeholder(tf.float32, [None, IMAGE_PIXELS * IMAGE_PIXELS])
        y_ = tf.placeholder(tf.float32, [None, 10])

        hid_lin = tf.nn.xw_plus_b(x, hid_w, hid_b)
        hid = tf.nn.relu(hid_lin)

        y = tf.nn.softmax(tf.nn.xw_plus_b(hid, sm_w, sm_b))
        cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))

        opt = tf.train.AdamOptimizer(learning_rate)

        if sync_replicas:
            if replicas_to_aggregate is None:
                replicas_to_aggregate = num_workers
            else:
                replicas_to_aggregate = replicas_to_aggregate

            opt = tf.train.SyncReplicasOptimizer(
                opt,
                replicas_to_aggregate=replicas_to_aggregate,
                total_num_replicas=num_workers,
                name="mnist_sync_replicas")

        train_step = opt.minimize(cross_entropy, global_step=global_step)

        if sync_replicas:
            local_init_op = opt.local_step_init_op
            if is_chief:
                local_init_op = opt.chief_init_op

            ready_for_local_init_op = opt.ready_for_local_init_op

            # Initial token and chief queue runners required by the sync_replicas mode
            chief_queue_runner = opt.get_chief_queue_runner()
            sync_init_op = opt.get_init_tokens_op()

        init_op = tf.global_variables_initializer()
        train_dir = tempfile.mkdtemp()

        if sync_replicas:
            sv = tf.train.Supervisor(
                is_chief=is_chief,
                logdir=train_dir,
                init_op=init_op,
                local_init_op=local_init_op,
                ready_for_local_init_op=ready_for_local_init_op,
                recovery_wait_secs=1,
                global_step=global_step)
        else:
            sv = tf.train.Supervisor(
                is_chief=is_chief,
                logdir=train_dir,
                init_op=init_op,
                recovery_wait_secs=1,
                global_step=global_step)

        sess_config = tf.ConfigProto(
            allow_soft_placement=True,
            log_device_placement=False,
            device_filters=["/job:ps", "/job:worker/task:%d" % task_index])

        # The chief worker (task_index==0) session will prepare the session,
        # while the remaining workers will wait for the preparation to complete.
        if is_chief:
            print("Worker %d: Initializing session..." % task_index)
        else:
            print("Worker %d: Waiting for session to be initialized..." %
                  task_index)

        sess =