## April 29, 2016

### Continuum Analytics news

#### Open Data Science: Bringing “Magic” to Modern Analytics

Posted Friday, April 29, 2016

Science fiction author Arthur C. Clarke once wrote, “Any sufficiently advanced technology is indistinguishable from magic.”

We’re nearer than ever to that incomprehensible, magical future. Our gadgets understand our speech, driverless cars have made their debut and we’ll soon be viewing virtual worlds at home.

These “magical” technologies spring from a 21st-century spirit of innovation—but not only from big companies. Thanks to the Internet—and to the open source movement—companies of all sizes are able to spur advancements in science and technology.

It’s no different for advanced analytics. And it’s about time.

In the past, our analytics tools were proprietary, product-oriented solutions. These were necessarily limited in flexibility and they locked customers into the slow innovation cycles and whims of vendors. These closed-source solutions forced a “one size fits all” approach to analytics with monolithic tools that did not offer easy customization for different needs.

Open Data Science has changed that. It offers innovative software—free of proprietary restrictions and tailorable for all varieties of data science teams—created in the transparent collaboration that is driving today’s tech boom.

### The Magic 8-Ball of Automated Modeling

One of Open Data Science's most visible points of innovation is in the sphere of data science modeling.

Initially, models were created exclusively by statisticians and analysts for business professionals. But demand from the business sector for software that could do this job gave rise to automatic model fitting—often called “black box” analytics—in which analysts let software algorithmically generate predictive models that fit the data.

Such a system creates models, but, much like a magic 8-ball, it offers its users answers without business explanations. Mysteries are fun for toys, but no business will bet on them. Quite understandably, no marketing or product manager wants to approach the CEO with predictions, only to be stumped when asked how they arrived at them. As Clarke knew, it’s not really magic creating the models; it’s advanced technology, and it too operates under assumptions that may or may not make sense for the business.

### App Starters Mean More Transparent Modeling

Today’s business professionals want faster time-to-value and are dazzled by advanced technologies like automated model fitting, but they also want to understand exactly how and why they work.

That’s why Continuum Analytics is hard at work on Open Data Science solutions such as Anaconda App Starters, expected to debut later this year. App Starters are solution “templates” that provide roughly 60–80 percent of a data science solution, giving businesses an easy starting point. They serve the same purpose as the “black box”: faster time-to-value. But they are not a “black box,” because they allow analysts to see exactly how each model was created and to tweak models as desired.

Because App Starters are based on Open Data Science, they carry no proprietary restrictions that keep business professionals or data scientists in the dark about the analytics pipeline, including the algorithms. They still provide the value of “automagically” creating models, but the details of how they do so are transparent and accessible to the team. With App Starters, business professionals will finally have confidence in the models they’re using to formulate business strategies, while getting faster time-to-value from their growing data.

Over time, App Starters will grow more sophisticated and will include recommendations, much as Netflix offers movie and TV show recommendations for your watching pleasure: they will learn to suggest the algorithms and visualizations that best fit the data. Unlike with “black boxes,” the entire narrative of why a recommendation was offered will be available, so the business analyst can learn from it and gain confidence in it. Analysts can accept a recommendation, tweak it, use the template without recommendations, or tune the suggested models to find a perfect fit. This type of innovation will further the advancement of sophisticated data science solutions that realize more business value, while instilling confidence in the solution.

### Casting Spells with Anaconda

Although App Starters are about to shake up automated modeling, businesses also need to meld new ideas with tried-and-true solutions. In business analytics, for instance, tools like Microsoft Excel are a staple of the field, and being able to integrate them with newer “magic” is highly desirable.

Fortunately, interoperability is one of the keystones of the Open Data Science philosophy and Anaconda provides a way to bridge the reliable old world with the magical new one. With Anaconda, analysts who are comfortable using Excel have an entry point into the world of predictive analytics from the comfort of their spreadsheets. By using the same familiar interface, analysts can access powerful Python libraries to apply cutting-edge analytics to their data. Anaconda recognizes that business analysts want to improve—not disrupt—a proven workflow.

Because Anaconda leverages the Python ecosystem, analysts using Anaconda will achieve powerful results. They might apply a formula to an Excel sheet with a million data rows to predict repeat customers or they may create beautiful, informative visualizations to show how sales have shifted to a new demographic after the company’s newest marketing campaign kicked off. With Anaconda, business analysts can continue using Excel as their main interface, while harnessing the newest “magic” available in the open source community.

### Open Data Science for Wizards…and Apprentices

Open Data Science is an inclusive movement. Although open source languages like Python and R dominate data science and allow for the most advanced—and therefore “magical”—analytics technology available, the community is open to all levels of expertise.

Anaconda is a great way for business analysts, for example, to embark on the road toward advanced analytics, while solutions like App Starters give advanced wizards the algorithmic visibility to alter and improve models as they see fit.

Open Data Science gives us the “sufficiently advanced technology” that Arthur C. Clarke mentioned—but it puts the power of that magic in our hands.

## April 26, 2016

### Matthieu Brucher

#### Analog modeling of a diode clipper (3b): Simulation

Let’s dive directly into the second diode clipper and follow exactly the same pattern.

# Second diode clipper

So first let’s remember the equation:

$\frac{dV_o}{dt} = \frac{V_i - V_o}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t})$

## Forward Euler

The forward Euler approximation is then:

$V_{on+1} = V_{on} + h(\frac{V_{in+1} - V_{on}}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_{on}}{nV_t}))$

## Backward Euler

Backward Euler approximation is now:

$V_{on+1} - V_{on} = h(\frac{V_{in+1} - V_{on+1}}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_{on+1}}{nV_t}))$

(The equations are definitely easier to derive…)

## Trapezoidal rule

And finally trapezoidal rule gives:

$V_{on+1} - V_{on} = h(\frac{V_{in+1} + V_{in}}{2 R_1 C_1} - \frac{V_{on+1} + V_{on}}{2 R_1 C_1} - \frac{I_s}{C_1}(sinh(\frac{V_{on+1}}{nV_t}) + sinh(\frac{V_{on}}{nV_t})))$

## Starting estimates

For the estimates, we use exactly the same methods as the previous clipper, so I won’t recall them.
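As a sketch of how these updates run per sample, here is the explicit forward Euler step and a Newton-Raphson solve of the implicit backward Euler equation. The component values (R1, C1, Is, n, Vt) and the sample rate are placeholders chosen for illustration, not the ones used in this series.

```python
import numpy as np

# Assumed component values, for illustration only:
R1 = 2200.0         # resistor (ohms)
C1 = 22e-9          # capacitor (farads)
Is = 1e-12          # diode saturation current (A)
n, Vt = 1.0, 26e-3  # ideality factor, thermal voltage (V)
h = 1.0 / 96000.0   # time step at a 96 kHz sample rate

def f(vi, vo):
    # Right-hand side of dVo/dt = (Vi - Vo)/(R1 C1) - (2 Is/C1) sinh(Vo/(n Vt))
    return (vi - vo) / (R1 * C1) - 2.0 * Is / C1 * np.sinh(vo / (n * Vt))

def forward_euler(vi1, vo):
    # Explicit update: next output computed from the current state only.
    return vo + h * f(vi1, vo)

def backward_euler(vi1, vo):
    # Implicit update: solve g(vo1) = vo1 - vo - h*f(vi1, vo1) = 0
    # with Newton-Raphson, starting from the previous output (copy estimate).
    vo1 = vo
    for _ in range(50):
        g = vo1 - vo - h * f(vi1, vo1)
        dg = 1.0 + h * (1.0 / (R1 * C1)
                        + 2.0 * Is / (C1 * n * Vt) * np.cosh(vo1 / (n * Vt)))
        step = g / dg
        vo1 -= step
        if abs(step) < 1e-12:
            break
    return vo1
```

With values in this range, a few volts of input drive the output into the sinh term and the waveform clips; Newton-Raphson needs only a handful of iterations per sample.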

## Graphs

Let’s start with the comparison of the three different methods:

Numerical optimization comparison

The first obvious change is that forward Euler can now give pretty good results. This makes me think I may have made a mistake in the previous circuit; since I had to derive that equation before applying the approximation, that may be the reason.

For the original estimates, just like last time, the results are identical:

Original estimates comparison

OK, let’s compare the result of the first iteration with different original estimates:

One step comparison

All estimates give a similar result, but the affine estimate is better than the linear one, which in turn is far better than the default/copying estimate.

# Conclusion

Just for fun, let’s display the difference between the two clippers:

Diode clippers comparison

Obviously, the second clipper is more symmetric than the first one and thus creates fewer harmonics (which is confirmed by a spectrogram). It is also easier to optimize: the second clipper needs at least one fewer iteration than the first one.

All things considered, the Newton-Raphson algorithm is consistently efficient, needing around three or fewer iterations for these circuits. Trying bisection or something else may not be that interesting, except if you are heavily using SIMD instructions; in that case, the optimization may be faster because you have a similar number of iterations across lanes.

Original estimates based on the last optimized value always work well, although affine estimates are usually faster. The tricky part is deriving the equations, and more often than not, you make mistakes when implementing them!

Next step: DK method…

## April 25, 2016

### Continuum Analytics news

#### Accelerate 2.2 Released!

Posted Monday, April 25, 2016

We're happy to announce the latest update to Accelerate with the release of version 2.2. This version of Accelerate adds compatibility with the recently released Numba 0.25, and also expands the Anaconda Platform in two new directions:

• Data profiling
• MKL-accelerated ufuncs

I'll discuss each of these in detail below.

### Data Profiling

We've built up quite a bit of experience over the years optimizing numerical Python code for our customers, and these projects follow some common patterns. First, the most important step in the optimization process is profiling a realistic test case. You can't improve what you can't measure, and profiling is critical to identify the true bottlenecks in an application. Even experienced developers are often surprised by profiling results when they see which functions are consuming the most time. Ensuring the test case is realistic (but not necessarily long) is also very important, as unit and functional tests for applications tend to use smaller, or differently shaped, input data sets. The scaling behavior of many algorithms is non-linear, so profiling with a very small input can give misleading results.

The second step in optimization is to consider alternative implementations for the critical functions identified in the first step, possibly adopting a different algorithm, parallelizing the calculation to make use of multiple cores or a GPU, or moving up a level to eliminate or batch unnecessary calls to the function. In this step of the process, we often found ourselves lacking a critical piece of information: what data types and sizes were being passed to this function? The best approach often depends on this information. Are these NumPy arrays or custom classes? Are the arrays large or small? 32-bit or 64-bit float? What dimensionality? Large arrays might benefit from GPU acceleration, but small arrays often require moving up the call stack in order to see if calculations can be batched.

Rather than having to manually modify the code to collect this data type information in an ad-hoc way, we've added a new profiling tool to Accelerate that can record this type information as a part of normal profiling. For lack of a better term, we're calling this "data profiling."

We collect this extra information using a modified version of the built-in Python profiling mechanism, and can display it using the standard pstats-style table:

ncalls  tottime percall cumtime percall filename:lineno(function)
300/100 0.01313 0.0001313 0.03036 0.0003036  linalg.py:532(cholesky(a:ndarray(dtype=float64, shape=(3, 3))))
200/100 0.004237 4.237e-05 0.007189 7.189e-05 linalg.py:139(_commonType())
200/100 0.003431 3.431e-05 0.005312 5.312e-05 linalg.py:106(_makearray(a:ndarray(dtype=float64, shape=(3, 3))))
400/200 0.002663 1.332e-05 0.002663 1.332e-05 linalg.py:111(isComplexType(t:type))
300/100 0.002185 2.185e-05 0.002185 2.185e-05 linalg.py:209(_assertNdSquareness())
200/100 0.001592 1.592e-05 0.001592 1.592e-05 linalg.py:124(_realType(t:type, default:NoneType))
200/100 0.00107 1.07e-05 0.00107 1.07e-05 linalg.py:198(_assertRankAtLeast2())
100 0.000162 1.62e-06 0.000162 1.62e-06 linalg.py:101(get_linalg_error_extobj(callback:function))

The recorded function signatures now include data types, and NumPy arrays also have dtype and shape information. In the above example, we've selected only the linear algebra calls from the execution of a PyMC model. Here we can clearly see the Cholesky decomposition is being done on 3x3 matrices, which would dictate our optimization strategy if cholesky was the bottleneck in the code (in this case, it is not).

We've also integrated the SnakeViz profile visualization tool into the Accelerate profiler, so you can easily collect and view profile information right inside your Jupyter notebooks:

![SnakeViz profile view in a Jupyter notebook](profiling.png)

All it takes to profile a function and view it in a notebook is a few lines:

from accelerate import profiler

p = profiler.Profile()
p.run('my_function_to_profile()')
profiler.plot(p)

### MKL-Accelerated Ufuncs

MKL is perhaps best known for high performance, multi-threaded linear algebra functionality, but MKL also provides highly optimized math functions, like sin() and cos() for arrays. Anaconda already ships with the numexpr library, which is linked against MKL to provide fast array math support. However, we have future plans for Accelerate that go beyond what numexpr can provide, so in the latest release of Accelerate, we've exposed the MKL array math functions as NumPy ufuncs you can call directly.

For code that makes extensive use of special math functions on arrays with many thousands of elements, the performance speedup is quite amazing:

import numpy as np
from accelerate.mkl import ufuncs as mkl_ufuncs

def spherical_to_cartesian_numpy(r, theta, phi):
    cos_theta = np.cos(theta)
    sin_theta = np.sin(theta)
    cos_phi = np.cos(phi)
    sin_phi = np.sin(phi)

    x = r * sin_theta * cos_phi
    y = r * sin_theta * sin_phi
    z = r * cos_theta
    return x, y, z

def spherical_to_cartesian_mkl(r, theta, phi):
    cos_theta = mkl_ufuncs.cos(theta)
    sin_theta = mkl_ufuncs.sin(theta)
    cos_phi = mkl_ufuncs.cos(phi)
    sin_phi = mkl_ufuncs.sin(phi)

    x = r * sin_theta * cos_phi
    y = r * sin_theta * sin_phi
    z = r * cos_theta
    return x, y, z

n = 100000
r = np.random.uniform(1, 10, n)
theta = np.random.uniform(0, np.pi, n)
phi = np.random.uniform(-np.pi, np.pi, n)

%timeit spherical_to_cartesian_numpy(r, theta, phi)
%timeit spherical_to_cartesian_mkl(r, theta, phi)

    100 loops, best of 3: 7.01 ms per loop

    1000 loops, best of 3: 978 µs per loop

A speedup of 7x is not bad for a 2.3 GHz quad core laptop CPU from 2012. In future releases, we are looking to expand and integrate this functionality further into the Anaconda Platform, so stay tuned!

### Summary

For more information about these new features, take a look at the Accelerate manual.

You can install Accelerate with conda and use it free for 30 days:

conda install accelerate

Try it out, and let us know what you think. Academic users can get a free subscription to Anaconda (including several useful tools, like Accelerate) by following these instructions. Contact sales@continuum.io to find out how to get a subscription to Anaconda at your organization.

## April 20, 2016

### Matthew Rocklin

#### Ad Hoc Distributed Random Forests

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

A screencast version of this post is available here: https://www.youtube.com/watch?v=FkPlEqB8AnE

## TL;DR.

Dask.distributed lets you submit individual tasks to the cluster. We use this ability combined with Scikit Learn to train and run a distributed random forest on distributed tabular NYC Taxi data.

Our machine learning model does not perform well, but we do learn how to execute ad-hoc computations easily.

## Motivation

In the past few posts we analyzed data on a cluster with Dask collections:

Often our computations don’t fit neatly into the bag, dataframe, or array abstractions. In these cases we want the flexibility of normal code with for loops, but still with the computational power of a cluster. With the dask.distributed task interface, we achieve something close to this.

## Application: Naive Distributed Random Forest Algorithm

As a motivating application we build a random forest algorithm from the ground up using the single-machine Scikit Learn library, and dask.distributed’s ability to quickly submit individual tasks to run on the cluster. Our algorithm will look like the following:

1. Pull data from some external source (S3) into several dataframes on the cluster
2. For each dataframe, create and train one RandomForestClassifier
3. Scatter single testing dataframe to all machines
4. For each RandomForestClassifier predict output on test dataframe
5. Aggregate independent predictions from each classifier together by a majority vote. To avoid bringing too much data to any one machine, perform this majority vote as a tree reduction.

## Data: NYC Taxi 2015

As in our blogpost on distributed dataframes we use the data on all NYC Taxi rides in 2015. This is around 20GB on disk and 60GB in RAM.

We predict the number of passengers in each cab given the other numeric columns, like pickup and destination location, fare breakdown, distance, etc.

We do this first on a small bit of data on a single machine and then on the entire dataset on the cluster. Our cluster is composed of twelve m4.xlarges (4 cores, 15GB RAM each).

Disclaimer and Spoiler Alert: I am not an expert in machine learning. Our algorithm will perform very poorly. If you’re excited about machine learning you can stop reading here. However, if you’re interested in how to build distributed algorithms with Dask then you may want to read on, especially if you happen to know enough machine learning to improve upon my naive solution.

## API: submit, map, gather

We use a small number of dask.distributed functions to build our computation:

futures = executor.scatter(data)                     # scatter data
future = executor.submit(function, *args, **kwargs)  # submit single task
futures = executor.map(function, sequence)           # submit many tasks
results = executor.gather(futures)                   # gather results
executor.replicate(futures, n=number_of_replications)


In particular, functions like executor.submit(function, *args) let us send individual functions out to our cluster thousands of times a second. Because submitted functions can consume the results of earlier submissions, we can create complex workflows that stay entirely on the cluster and trust the distributed scheduler to move data around intelligently.

## Load Pandas from S3

First we load data from Amazon S3. We use the s3.read_csv(..., collection=False) function to load 178 Pandas DataFrames on our cluster from CSV data on S3. We get back a list of Future objects that refer to these remote dataframes. The use of collection=False gives us this list of futures rather than a single cohesive Dask.dataframe object.

from distributed import Executor, s3
e = Executor('52.91.1.177:8786')

dfs = s3.read_csv('dask-data/nyc-taxi/2015',
                  parse_dates=['tpep_pickup_datetime',
                               'tpep_dropoff_datetime'],
                  collection=False)
dfs = e.compute(dfs)


Each of these is a lightweight Future pointing to a pandas.DataFrame on the cluster.

>>> dfs[:5]
[<Future: status: finished, type: DataFrame, key: finalize-a06c3dd25769f434978fa27d5a4cf24b>,
<Future: status: finished, type: DataFrame, key: finalize-7dcb27364a8701f45cb02d2fe034728a>,
<Future: status: finished, type: DataFrame, key: finalize-b0dfe075000bd59c3a90bfdf89a990da>,
<Future: status: finished, type: DataFrame, key: finalize-1c9bb25cefa1b892fac9b48c0aef7e04>,
<Future: status: finished, type: DataFrame, key: finalize-c8254256b09ae287badca3cf6d9e3142>]


If we’re willing to wait a bit, then we can pull data from any future back to our local process using the .result() method. We don’t want to do this too much, though: data transfer can be expensive, and we can’t hold the entire dataset in the memory of a single machine. Here we just bring back one of the dataframes:

>>> df = dfs[0].result()

VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2015-01-15 19:05:39 2015-01-15 19:23:42 1 1.59 -73.993896 40.750111 1 N -73.974785 40.750618 1 12.0 1.0 0.5 3.25 0 0.3 17.05
1 1 2015-01-10 20:33:38 2015-01-10 20:53:28 1 3.30 -74.001648 40.724243 1 N -73.994415 40.759109 1 14.5 0.5 0.5 2.00 0 0.3 17.80
2 1 2015-01-10 20:33:38 2015-01-10 20:43:41 1 1.80 -73.963341 40.802788 1 N -73.951820 40.824413 2 9.5 0.5 0.5 0.00 0 0.3 10.80
3 1 2015-01-10 20:33:39 2015-01-10 20:35:31 1 0.50 -74.009087 40.713818 1 N -74.004326 40.719986 2 3.5 0.5 0.5 0.00 0 0.3 4.80
4 1 2015-01-10 20:33:39 2015-01-10 20:52:58 1 3.00 -73.971176 40.762428 1 N -74.004181 40.742653 2 15.0 0.5 0.5 0.00 0 0.3 16.30

## Train on a single machine

To start, let’s go through the standard Scikit Learn fit/predict/score cycle with this small bit of data on a single machine.

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split

df_train, df_test = train_test_split(df)

columns = ['trip_distance', 'pickup_longitude', 'pickup_latitude',
'dropoff_longitude', 'dropoff_latitude', 'payment_type',
'fare_amount', 'mta_tax', 'tip_amount', 'tolls_amount']

est = RandomForestClassifier(n_estimators=4)
est.fit(df_train[columns], df_train.passenger_count)


This builds a RandomForestClassifier with four decision trees and then trains it against the numeric columns in the data, trying to predict the passenger_count column. It takes around 10 seconds to train on a single core. We now see how well we do on the holdout testing data:

>>> est.score(df_test[columns], df_test.passenger_count)
0.65808188654721012


This 65% accuracy is actually pretty poor. About 70% of the rides in NYC have a single passenger, so the model of “always guess one” would out-perform our fancy random forest.

>>> from sklearn.metrics import accuracy_score
>>> import numpy as np
>>> accuracy_score(df_test.passenger_count,
...                np.ones_like(df_test.passenger_count))
0.70669390028780987


This is where my ignorance in machine learning really kills us. There is likely a simple way to improve this. However, because I’m more interested in showing how to build distributed computations with Dask than in actually doing machine learning I’m going to go ahead with this naive approach. Spoiler alert: we’re going to do a lot of computation and still not beat the “always guess one” strategy.

## Fit across the cluster with executor.map

First we build a function that does just what we did before, builds a random forest and then trains it on a dataframe.

def fit(df):
    est = RandomForestClassifier(n_estimators=4)
    est.fit(df[columns], df.passenger_count)
    return est


Second we call this function on all of our training dataframes on the cluster using the standard e.map(function, sequence) function. This sends out many small tasks for the cluster to run. We use all but the last dataframe for training data and hold out the last dataframe for testing. There are more principled ways to do this, but again we’re going to charge ahead here.

train = dfs[:-1]
test = dfs[-1]

estimators = e.map(fit, train)


This takes around two minutes to train on all of the 177 dataframes and now we have 177 independent estimators, each capable of guessing how many passengers a particular ride had. There is relatively little overhead in this computation.

## Predict on testing data

Recall that we kept separate a future, test, that points to a Pandas dataframe on the cluster that was not used to train any of our 177 estimators. We’re going to replicate this dataframe across all workers on the cluster and then ask each estimator to predict the number of passengers for each ride in this dataset.

e.replicate([test], n=48)

def predict(est, X):
    return est.predict(X[columns])

predictions = [e.submit(predict, est, test) for est in estimators]


Here we used the executor.submit(function, *args, **kwargs) function in a list comprehension to individually launch many tasks. The scheduler determines when and where to run these tasks for optimal computation time and minimal data transfer. As with all functions, this returns futures that we can use to collect data later if we want.

Developers note: we explicitly replicate here in order to take advantage of efficient tree-broadcasting algorithms. This is purely a performance consideration; everything would have worked fine without it, but the explicit broadcast turns a 30s communication+computation into a 2s one.

## Aggregate predictions by majority vote

For each estimator we now have an independent prediction of the passenger counts for all of the rides in our test data. In other words for each ride we have 177 different opinions on how many passengers were in the cab. By averaging these opinions together we hope to achieve a more accurate consensus opinion.

For example, consider the first four prediction arrays:

>>> a_few_predictions = e.gather(predictions[:4])  # remote futures -> local arrays
>>> a_few_predictions
[array([1, 2, 1, ..., 2, 2, 1]),
array([1, 1, 1, ..., 1, 1, 1]),
array([2, 1, 1, ..., 1, 1, 1]),
array([1, 1, 1, ..., 1, 1, 1])]


For the first ride/column we see that three of the four predictions are for a single passenger while one prediction disagrees and is for two passengers. We create a consensus opinion by taking the mode of the stacked arrays:

from scipy.stats import mode
import numpy as np

def mymode(*arrays):
    array = np.stack(arrays, axis=0)
    return mode(array)[0][0]

>>> mymode(*a_few_predictions)
array([1, 1, 1, ..., 1, 1, 1])


And so when we combine these four prediction arrays, the majority opinion of one passenger dominates for all six of the rides visible here.

## Tree Reduction

We could call our mymode function on all of our predictions like this:

>>> mode_prediction = e.submit(mymode, *predictions)  # this doesn't scale well


Unfortunately this would move all of our results to a single machine to compute the mode there. This might swamp that single machine.

Instead we batch our predictions into groups of size 10, take the mode of each group, and then repeat the process with the smaller set of predictions until we have only one left. This sort of multi-step reduction is called a tree reduction. We can write it up with a couple of nested loops and executor.submit. This is only an approximation of the mode, but it’s a much more scalable computation. It finishes in about 1.5 seconds.

from toolz import partition_all

while len(predictions) > 1:
    predictions = [e.submit(mymode, *chunk)
                   for chunk in partition_all(10, predictions)]

result = e.gather(predictions)[0]

>>> result
array([1, 1, 1, ..., 1, 1, 1])


## Final Score

Finally, after completing all of our work on our cluster we can see how well our distributed random forest algorithm does.

>>> accuracy_score(result, test.result().passenger_count)
0.67061974451423045


Still worse than the naive “always guess one” strategy. This just goes to show that, no matter how sophisticated your Big Data solution is, there is no substitute for common sense and a little bit of domain expertise.

## What didn’t work

As always I’ll have a section like this that honestly says what doesn’t work well and what I would have done with more time.

• Clearly this would have benefited from more machine learning knowledge. What would have been a good approach for this problem?
• I’ve been thinking a bit about memory management of replicated data on the cluster. In this exercise we specifically replicated out the test data. Everything would have worked fine without this step but it would have been much slower as every worker gathered data from the single worker that originally had the test dataframe. Replicating data is great until you start filling up distributed RAM. It will be interesting to think of policies about when to start cleaning up redundant data and when to keep it around.
• Several people from both open source users and Continuum customers have asked about a general Dask library for machine learning, something akin to Spark’s MLlib. Ideally a future Dask.learn module would leverage Scikit-Learn in the same way that Dask.dataframe leverages Pandas. It’s not clear how to cleanly break up and parallelize Scikit-Learn algorithms.

## Conclusion

This blogpost gives a concrete example using basic task submission with executor.map and executor.submit to build a non-trivial computation. This approach is straightforward and not restrictive. Personally this interface excites me more than collections like Dask.dataframe; there is a lot of freedom in arbitrary task submission.

## April 19, 2016

### Matthieu Brucher

#### Book review: The Culture Map: Decoding How People Think and Get Things Done in a Global World

I work in an international company, with lots of people from different cultures around me with whom I need to interact. At first glance, it feels easy to work with all of them; I mean, how difficult could it be?

Actually, it is easy, but sometimes interactions are intriguing and people do not react the way you expect them to. And why is that? Lots of reasons, of course, but one of them is that they have a different culture and do not expect you to explicitly tell them what they did wrong (which is something I do. A lot).

#### Content and opinions

Enter Erin Meyer. She has had to navigate between these cultures herself, as she’s an American married to a Frenchman and living in France. In her book, she presents eight scales, and each culture is placed differently on each scale.

I won’t go into all the details of the different scales, but they cover the different ways people interact with one another. Whether it’s scheduling, feedback or decision-making, all cultures are different. And sometimes, even if two cultures are geographically close, they can be quite far apart on some scales. After all, each is influenced by its short or long history, its philosophers…

All the scales are illustrated with stories from Meyer’s experience teaching them and stories from her students, and they are always spot on.

#### Conclusion

Of course, the book only tells you the differences, what to look for. It doesn’t teach you to do the right thing. That takes practice, and it requires work.

Nor does it solve all interaction problems. Everyone differs within their own culture (not even talking about people with several cultures…), sitting to the left or the right of their culture’s average on each scale. So you can’t reduce someone to a culture. But if you want to learn more about interacting with people, you already know that.

## April 18, 2016

### Continuum Analytics news

#### Conda + Spark

Posted Tuesday, April 19, 2016

In my previous post, I described different scenarios for bootstrapping Python on a multi-node cluster. I offered a general solution using Anaconda for cluster management, and a solution using a custom conda env deployed with Knit.

In a follow-up to that post, I was asked if the machinery in Knit would also work for Spark. Sure--of course! In fact, much of Knit's design comes from Spark's deploy codebase. Here, I am going to demonstrate how we can ship a Python environment, complete with desired dependencies, as part of a Spark job without installing Python on every node.

## Spark YARN Deploy

First, I want to briefly describe key points in Spark's YARN deploy methodology. After negotiating which resources to provision with YARN's Resource Manager, Spark asks for a directory to be constructed on HDFS: /user/ubuntu/.sparkStaging/application_1460665326796_0065/. The directory will always be in the user's home, and the application ID issued by YARN is appended to the directory name (thinking about this now, perhaps this is obvious and straightforward to Java/JVM folks, where bundling uber JARs has long been the practice in traditional Map-Reduce jobs). In any case, Spark then uploads itself to the staging directory, and when YARN provisions a container, the contents of the directory are pulled down and the spark-assembly JAR is executed. If you are using PySpark or SparkR, a corresponding pyspark.zip and sparkr.zip will be found in the staging directory as well.

Occasionally, users see FileNotFoundException errors -- this can be caused by a few things: incorrect Spark contexts, an incorrect SPARK_HOME, and I have a faint recollection that there was once a packaging problem where pyspark.zip or sparkr.zip was missing, or could not be created due to permissions. Anyway -- below is the output you will see when Spark works cleanly.

16/04/15 13:01:03 INFO Client: Uploading resource file:/opt/anaconda/share/spark-1.6.0/lib/spark-assembly-1.6.0-hadoop2.6.0.jar -> hdfs://ip-172-31-50-60:9000/user/ubuntu/.sparkStaging/application_1460665326796_0065/spark-assembly-1.6.0-hadoop2.6.0.jar

16/04/15 13:01:07 INFO Client: Uploading resource file:/opt/anaconda/share/spark-1.6.0/python/lib/pyspark.zip -> hdfs://ip-172-31-50-60:9000/user/ubuntu/.sparkStaging/application_1460665326796_0065/pyspark.zip

Not terribly exciting, but positive confirmation that Spark is uploading local files to HDFS.

## Bootstrap-Fu Redux

Most of what I described above is what the YARN framework allows developers to do -- it's less that Spark is doing magical things and more that Spark (and Knit as well!) implements a YARN application. If I were using Scala/Java, I would package everything up in a JAR and use spark-submit -- done!

Unfortunately, there's a little more work to be done for an Uber Python jar equivalent.

## Hard-Links Won't Travel

One of the killer features of conda is environment management. When conda creates a new environment, it uses hard-links when possible. Generally, this greatly reduces disk usage. But, if we move the directory to another machine, we're probably just moving a handful of hard-links and not the files themselves. Fortunately, we can tell conda: "No! Copy the files!"

For example:

conda create -p /home/ubuntu/dev --copy -y -q python=3 pandas scikit-learn

By using --copy, we "install all packages using copies instead of hard- or soft-linking." The shebang lines in various files in the bin/ directory may still read like #!/home/ubuntu/dev/bin/python. But we don't need to be concerned about that -- we're not going to be using 2to3, idle, pip, etc. If we zipped up the environment, we could move it onto another machine of a similar OS type, execute Python, and be able to load any library in the lib/python3.5/site-packages directory.
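The distinction between hard-links and real copies is easy to see with plain files. A minimal sketch with hypothetical throwaway paths (not an actual conda environment):

```python
import os
import tempfile

# a throwaway directory standing in for an environment prefix (hypothetical)
root = tempfile.mkdtemp()
orig = os.path.join(root, "original.txt")
with open(orig, "w") as f:
    f.write("payload")

linked = os.path.join(root, "linked.txt")
copied = os.path.join(root, "copied.txt")
os.link(orig, linked)  # hard-link: a second name for the same inode, no new data

with open(orig) as src, open(copied, "w") as dst:
    dst.write(src.read())  # a real copy, as --copy would make: a new inode

print(os.stat(orig).st_ino == os.stat(linked).st_ino)   # True: same file on disk
print(os.stat(orig).st_ino == os.stat(copied).st_ino)   # False: independent data
```

With `--copy`, every file in the environment carries its own data, so the zipped directory is self-contained when it lands on another machine.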

We're very close to our Uber Python jar -- now with a zipped conda directory in mind, let's proceed.

zip -r dev.zip dev

## Death by ENV Vars

We are going to need a handful of specific command line options and environment variables: Spark Yarn Configuration and Spark Environment Variables. We'll be using:

• PYSPARK_PYTHON: The Python binary Spark should use
• spark.yarn.appMasterEnv.PYSPARK_PYTHON (though this one could be wrong/unnecessary/only used for --master yarn-cluster)
• --archives: include local tgz/jar/zip in .sparkStaging directory and pull down into temporary YARN container

We'll also need a test script. The following is a reasonable way to prove which Python Spark is using -- a no-op function that returns the host name along with the paths Python searches for libraries and its environment variables.

# test_spark.py
import os
import sys

from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf()
conf.setAppName("get-hosts")
sc = SparkContext(conf=conf)

def noop(x):
    import socket
    import sys
    return socket.gethostname() + ' '.join(sys.path) + ' '.join(os.environ)

rdd = sc.parallelize(range(1000), 100)
hosts = rdd.map(noop).distinct().collect()
print(hosts)

And executing everything together:

PYSPARK_PYTHON=./ANACONDA/dev/bin/python spark-submit \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ANACONDA/dev/bin/python \
    --master yarn-cluster \
    --archives /home/ubuntu/dev.zip#ANACONDA \
    /home/ubuntu/test_spark.py

We'll get the following output in the yarn logs:

It's a little hard to parse -- what should be noted are file paths like:

.../container_1460665326796_0070_01_000002/ANACONDA/dev/lib/python3.5/site-packages

This is demonstrating that Spark is using the unzipped directory in the YARN container. Ta-da!

## Thoughts

Okay, perhaps that's not super exciting, so let's zoom out again:

1. We create a zipped conda environment with dependencies: pandas, python=3,...
2. We successfully launched a Python Spark job without any Python binaries or libraries previously installed on the nodes.

There is an open JIRA ticket discussing the option of having Spark ingest a requirements.txt and build the Python environment as a preamble to a Spark job. This is a fairly novel approach to the same end -- using Spark to bootstrap a runtime environment -- and it's even a bit more general, since the method described above relies on YARN. I first saw this strategy in use with streamparse. Similarly to the implementation proposed in the JIRA ticket, streamparse can ship a Python requirements.txt and construct a Python environment as part of a Storm job!

## Rrrrrrrrrrrrrr

Oh, and R conda environments work as well...but it's more involved.

## Create/Munge R Env

First, it's pretty cool that conda can install and manage R environments. Again, we create a conda environment with R binaries and libraries

conda create -p /home/ubuntu/r_env --copy -y -q r-essentials -c r

R is not exactly relocatable so we need to munge a bit:

sed -i "s/home\/ubuntu/.r_env.zip/g" /home/ubuntu/r_env/bin/R

zip -r r_env.zip r_env
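To see what the sed munge above does, here is the same substitution applied in Python to a throwaway file with hypothetical contents (the real bin/R launcher contains much more):

```python
import os
import tempfile

# a stand-in for r_env/bin/R with a hard-coded absolute prefix (hypothetical line)
path = os.path.join(tempfile.mkdtemp(), "R")
with open(path, "w") as f:
    f.write("R_HOME_DIR=/home/ubuntu/r_env/lib64/R\n")

# equivalent of: sed -i "s/home\/ubuntu/.r_env.zip/g" /home/ubuntu/r_env/bin/R
with open(path) as f:
    text = f.read()
with open(path, "w") as f:
    f.write(text.replace("home/ubuntu", ".r_env.zip"))

with open(path) as f:
    print(f.read().strip())   # R_HOME_DIR=/.r_env.zip/r_env/lib64/R
```

After the rewrite, the launcher points at the relative `./r_env.zip/r_env/...` layout that the unzipped archive will have inside a YARN container.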

My R skills are at a below-novice level, so the following test script could probably be improved

# /home/ubuntu/test_spark.R
library(SparkR)

sc <- sparkR.init(appName="get-hosts-R")

noop <- function(x) {
  path <- toString(.libPaths())
  host <- toString(Sys.info()['nodename'])
  host_path <- toString(cbind(host,path))
  host_path
}

rdd <- SparkR:::parallelize(sc, 1:1000, 100)
hosts <- SparkR:::map(rdd, noop)
d_hosts <- SparkR:::distinct(hosts)
out <- SparkR:::collect(d_hosts)
print(out)

Execute (and the real death by options):

SPARKR_DRIVER_R=./r_env.zip/r_env/lib/R spark-submit --master yarn-cluster \
    --conf spark.yarn.appMasterEnv.R_HOME=./r_env.zip/r_env/lib64/R \
    --conf spark.yarn.appMasterEnv.RHOME=./r_env.zip/r_env \
    --conf spark.yarn.appMasterEnv.R_SHARE_DIR=./r_env.zip/r_env/lib/R/share \
    --conf spark.yarn.appMasterEnv.R_INCLUDE_DIR=./r_env.zip/r_env/lib/R/include \
    --conf spark.executorEnv.R_HOME=./r_env.zip/r_env/lib64/R \
    --conf spark.executorEnv.RHOME=./r_env.zip/r_env \
    --conf spark.executorEnv.R_SHARE_DIR=./r_env.zip/r_env/lib/R/share \
    --conf spark.executorEnv.R_INCLUDE_DIR=./r_env.zip/r_env/lib/R/include \
    --conf spark.r.command=./r_env.zip/r_env/bin/Rscript \
    --archives r_env.zip \
    /home/ubuntu/test_spark.R

Example output:

[1] "ip-172-31-50-59, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/230/sparkr.zip, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/229/r_env.zip/r_env/lib64/R/library"

[1] "ip-172-31-50-61, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/183/sparkr.zip, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/182/r_env.zip/r_env/lib64/R/library"

This post is also published on Ben's website here.

#### High Performance Hadoop with Anaconda and Dask on Your Cluster

Posted Monday, April 18, 2016

## Overview

Dask is a flexible open source parallel computation framework that lets you comfortably scale up and scale out your analytics. If you’re running into memory issues, storage limitations, or CPU boundaries on a single machine when using Pandas, NumPy, or other computations with Python, Dask can help you scale up on all of the cores on a single machine, or scale out on all of the cores and memory across your cluster.

Dask enables distributed computing in pure Python and complements the existing numerical and scientific computing capability within Anaconda. Dask works well on a single machine to make use of all of the cores on your laptop and process larger-than-memory data, and it scales up resiliently and elastically on clusters with hundreds of nodes.

Dask works natively from Python with data in different formats and storage systems, including the Hadoop Distributed File System (HDFS) and Amazon S3. Anaconda and Dask can work with your existing enterprise Hadoop distribution, including Cloudera CDH and Hortonworks HDP.

In this post, we’ll show you how you can use Anaconda with Dask for distributed computations and workflows, including distributed dataframes, arrays, text processing, and custom parallel workflows that can help you make the most of Anaconda and Dask on your cluster. We’ll work with Anaconda and Dask interactively from the Jupyter Notebook while the heavy computations are running on the cluster.

## Installing Anaconda and Dask on your Hadoop or High Performance Computing (HPC) Cluster

There are many different ways to get started with Anaconda and Dask on your Hadoop or HPC cluster, including manual setup via SSH; by integrating with resource managers such as YARN, SGE, or Slurm; launching instances on Amazon EC2; or by using the enterprise-ready Anaconda for cluster management.

Anaconda for cluster management makes it easy to install familiar packages from Anaconda (including NumPy, SciPy, Pandas, NLTK, scikit-learn, scikit-image, and access to 720+ more packages in Anaconda) and the Dask parallel processing framework on all of your bare-metal or cloud-based cluster nodes. You can provision centrally managed installations of Anaconda, Dask and the Jupyter notebook using two simple commands with Anaconda for cluster management:

$ acluster create dask-cluster -p dask-cluster
$ acluster install dask notebook

Additional features of Anaconda for cluster management include:

• Easily install Python and R packages across multiple cluster nodes
• Manage multiple conda environments across a cluster
• Push local conda environments to all cluster nodes
• Works on cloud-based and bare-metal clusters with existing Hadoop installations
• Remotely SSH and upload/download files to and from cluster nodes

Once you’ve installed Anaconda and Dask on your cluster, you can perform many types of distributed computations, including text processing (similar to Spark), distributed dataframes, distributed arrays, and custom parallel workflows. We’ll show some examples in the following sections.

## Distributed Text and Language Processing (Dask Bag)

Dask works well with standard computations such as text processing and natural language processing and with data in different formats and storage systems (e.g., HDFS, Amazon S3, local files). The Dask Bag collection is similar to other parallel frameworks and supports operations like  filter, count, fold, frequencies, pluck, and take, which are useful for working with a collection of Python objects such as text.
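For intuition, here is what a few of these operations compute, mimicked on an ordinary Python list with made-up records (a local toy, not the Dask Bag API):

```python
from collections import Counter

# hypothetical records shaped like JSON comments: one dict per input line
records = [
    {"subreddit": "movies", "body": "great film"},
    {"subreddit": "news", "body": "breaking story"},
    {"subreddit": "movies", "body": "great cast"},
]

movies = [d for d in records if "movies" in d["subreddit"]]  # filter
bodies = [d["body"] for d in movies]                         # pluck('body')
words = [w for body in bodies for w in body.split()]         # map + concat
freqs = Counter(words)                                       # frequencies
print(freqs.most_common(1))                                  # topk(1) -> [('great', 2)]
```

A Dask Bag runs the same logical pipeline, but lazily and partition-by-partition across the cluster.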

For example, we can use the natural language processing toolkit (NLTK) in Anaconda to perform distributed language processing on a Hadoop cluster, all while working interactively in a Jupyter notebook.

In this example, we'll use a subset of a data set that contains comments from the reddit website from January 2015 to August 2015, which is about 242 GB on disk. This data set was made available in July 2015 in a reddit post. The data set is in JSON format (one comment per line) and consists of the comment body, author, subreddit, timestamp of creation, and other fields.

First, we import libraries from Dask and connect to the Dask distributed scheduler:

>>> import dask

>>> from distributed import Executor, hdfs, progress

>>> e = Executor('54.164.41.213:8786')

Next, we load 242 GB of JSON data from HDFS using pure Python:

>>> import json

>>> lines = hdfs.read_text('/user/ubuntu/RC_2015-*.json')

>>> js = lines.map(json.loads)

We can filter and load the data into distributed memory across the cluster:

>>> movies = js.filter(lambda d: 'movies' in d['subreddit'])

>>> movies = e.persist(movies)

Once we’ve loaded the data into distributed memory, we can import the NLTK library from Anaconda and construct stacked expressions to tokenize words, tag parts of speech, and filter out non-words from the dataset.

>>> import nltk
>>> pos = e.persist(movies.pluck('body')
...                       .map(nltk.word_tokenize)
...                       .map(nltk.pos_tag)
...                       .concat()
...                       .filter(lambda (word, pos): word.isalpha()))

In this example, we’ll generate a list of the top 10 proper nouns from the movies subreddit.

>>> f = e.compute(pos.filter(lambda (word, type): type == 'NNP')
...                  .pluck(0)
...                  .frequencies()
...                  .topk(10, lambda (word, count): count))
>>> f.result()
[(u'Marvel', 35452),
 (u'Star', 34849),
 (u'Batman', 31749),
 (u'Wars', 28875),
 (u'Man', 26423),
 (u'John', 25304),
 (u'Superman', 22476),
 (u'Hollywood', 19840),
 (u'Max', 19558),
 (u'CGI', 19304)]

Finally, we can use Bokeh to generate an interactive plot of the resulting data:

## Analysis with Distributed Dataframes (Dask DataFrame)

Dask allows you to work with familiar Pandas dataframe syntax on a single machine or on many nodes on a Hadoop or HPC cluster. You can work with data stored in different formats and storage systems (e.g., HDFS, Amazon S3, local files). The Dask DataFrame collection mimics the Pandas API, uses Pandas under the hood, and supports operations like head, groupby, value_counts, merge, and set_index.

For example, we can use Dask to perform computations with dataframes on a Hadoop cluster with data stored in HDFS, all while working interactively in a Jupyter notebook.

First, we import libraries from Dask and connect to the Dask distributed scheduler:

>>> import dask

>>> from distributed import Executor, hdfs, progress, wait, s3

>>> e = Executor('54.164.41.213:8786')

Next, we’ll load the NYC taxi data in CSV format from HDFS using pure Python and persist the data in memory:

>>> df = hdfs.read_csv('/user/ubuntu/nyc/yellow_tripdata_2015-*.csv',
...                    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
...                    header='infer')
>>> df = e.persist(df)

We can perform familiar operations such as computing value counts on columns and statistical correlations:

>>> df.payment_type.value_counts().compute()
1    91574644
2    53864648
3      503070
4      170599
5          28
Name: payment_type, dtype: int64

>>> df2 = df.assign(payment_2=(df.payment_type == 2),
...                 no_tip=(df.tip_amount == 0))
>>> df2.astype(int).corr().compute()
              no_tip  payment_2
no_tip     1.000000   0.943123
payment_2  0.943123   1.000000
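The numbers above are ordinary Pearson correlations on the 0/1-encoded columns. For intuition, the same statistic can be computed by hand on made-up values (a toy sketch, not the cluster computation):

```python
import math

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of std deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# 0/1 columns standing in for no_tip and payment_2 (hypothetical rides)
no_tip = [1, 1, 0, 0, 1, 0]
payment_2 = [1, 1, 0, 0, 1, 1]
print(round(pearson(no_tip, payment_2), 3))   # 0.707
```

Dask computes the same quantity, only chunkwise and in parallel across the partitions of the dataframe.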

Dask runs entirely asynchronously, leaving us free to explore other cells in the notebook while computations happen in the background. Dask also handles all of the messy CSV schema handling for us automatically.

Finally, we can use Bokeh to generate an interactive plot of the resulting data:

## Numerical, Statistical and Scientific Computations with Distributed Arrays (Dask Array)

Dask works well with numerical and scientific computations on n-dimensional array data. The Dask Array collection mimics a subset of the NumPy API, uses NumPy under the hood, and supports operations like dot, flatten, max, mean, and std.
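Dask arrays compute reductions like mean blockwise: each chunk contributes a partial result, and the partials are then combined. A pure-Python sketch of that idea, on made-up chunk values:

```python
# lists standing in for the blocks of a chunked array (hypothetical values)
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# per-chunk partials, as would run on each worker
partials = [(sum(c), len(c)) for c in chunks]

# combine step, producing the final result from the partials alone
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count
print(mean)   # 3.5
```

Because only the small (sum, count) pairs need to move between workers, the full data never has to fit on one machine.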

For example, we can use Dask to perform computations with arrays on a cluster with global temperature/weather data stored in NetCDF format (similar to HDF5), all while working interactively in a Jupyter notebook. The data files contain measurements that were taken every six hours at every quarter degree of latitude and longitude.

First, we import the netCDF4 library and point to the data files stored on disk:

>>> import netCDF4
>>> from glob import glob
>>> filenames = sorted(glob('2014-*.nc3'))
>>> t2m = [netCDF4.Dataset(fn).variables['t2m'] for fn in filenames]
>>> t2m[0]
<class 'netCDF4._netCDF4.Variable'>
int16 t2m(time, latitude, longitude)
    scale_factor: 0.00159734395579
    add_offset: 268.172358066
    _FillValue: -32767
    missing_value: -32767
    units: K
    long_name: 2 metre temperature
unlimited dimensions:
current shape = (4, 721, 1440)
filling off

We then import Dask and read in the data from the NumPy arrays:

>>> import dask.array as da

>>> xs = [da.from_array(t, chunks=t.shape) for t in t2m]

>>> x = da.concatenate(xs, axis=0)

We can then perform distributed computations on the cluster, such as computing the mean temperature, variance of the temperature over time, and normalized temperature. We can view the progress of the computations as they run on the cluster nodes and continue to work in other cells in the notebook:

>>> avg, std = da.compute(x.mean(axis=0), x.std(axis=0))

>>> z = (x - avg) / std

>>> progress(z)

We can plot the resulting normalized temperature using matplotlib:

We can also create interactive widgets in the notebook to interact with and visualize the data in real-time while the computations are running across the cluster:

## Creating Custom Parallel Workflows

When one of the standard Dask collections isn’t a good fit for your workflow, Dask gives you the flexibility to work with different file formats and custom parallel workflows. The Dask Imperative collection lets you wrap functions in existing Python code and run the computations on a single machine or across a cluster.

In this example, we have multiple files stored hierarchically in a custom file format (Feather for reading and writing Python and R dataframes on disk). We can build a custom workflow by wrapping the code with Dask Imperative and making use of the Feather library:

>>> import feather
>>> import pandas as pd
>>> from dask import delayed
>>> from glob import glob
>>> import os
>>> lazy_dataframes = []
>>> for directory in glob('2016-*'):
...     for symbol in os.listdir(directory):
...         filename = os.path.join(directory, symbol)
...         df = delayed(feather.read_dataframe)(filename)
...         df = delayed(pd.DataFrame.assign)(df,
...                     date=pd.Timestamp(directory),
...                     symbol=symbol)
...         lazy_dataframes.append(df)
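The trick behind delayed is simply to record the function and its arguments instead of calling them, so execution can happen later, anywhere. A toy sketch of the idea (not Dask's actual implementation):

```python
def delayed(func):
    """Wrap a function so that calling it builds a task instead of executing."""
    def wrapper(*args, **kwargs):
        return ("task", func, args, kwargs)   # a recorded, unexecuted call
    return wrapper

def compute(task):
    """Execute a recorded task and return its result."""
    _, func, args, kwargs = task
    return func(*args, **kwargs)

lazy = delayed(sum)([1, 2, 3])   # nothing is computed yet
print(lazy[0])                   # 'task'
print(compute(lazy))             # 6
```

Dask's real delayed additionally resolves nested tasks into a graph and hands the graph to a scheduler, which is what lets the loop above stay lazy until a final compute call.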

View more examples and documentation in the Dask documentation. For more information about using Anaconda and Dask to scale out Python on your cluster, check out our recent webinar on High Performance Hadoop with Python.

You can get started with Anaconda and Dask using Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:

$ conda install anaconda-client -n root
$ anaconda login

$ conda install anaconda-cluster -c anaconda-cluster

In addition to Anaconda subscriptions, there are many different ways that Continuum can help you get started with Anaconda and Dask to construct parallel workflows, parallelize your existing code, or integrate with your existing Hadoop or HPC cluster, including:

• Architecture consulting and review
• Manage Python packages and environments on a cluster
• Develop custom package management solutions on existing clusters
• Migrate and parallelize existing code with Python and Dask
• Architect parallel workflows and data pipelines with Dask
• Build proof of concepts and interactive applications with Dask
• Custom product/OSS core development
• Training on parallel development with Dask

For more information about the above solutions, or if you’d like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at sales@continuum.io.

## April 17, 2016

### Titus Brown

#### MinHash signatures as ways to find samples, and collaborators?

As I wrote last week, my latest enthusiasm is MinHash sketches, applied (for the moment) to RNAseq data sets. Briefly, these are small "signatures" of data sets that can be used to compare data sets quickly. In the previous blog post, I talked a bit about their effectiveness and showed that (at least in my hands, and on a small data set of ~200 samples) I could use them to cluster RNAseq data sets by species. What I didn't highlight in that blog post is that they could potentially be used to find samples of interest, as well as (maybe) collaborators.

## Finding samples of interest

The "samples of interest" idea is pretty clear - suppose we had a collection of signatures from all the RNAseq in the Sequence Read Archive? Then we could search the entire SRA for data sets that were "close" to ours, and then just use those to do transcriptome studies.

It's not yet clear how well this might work for finding RNAseq data sets with similar expression patterns, but if you're working with non-model species, then it might be a good way to pick out all the data sets that you should use to generate a de novo assembly. More generally, as we get more and more data, finding relevant samples may get harder and harder. This kind of approach lets you search on sequence content, not annotations or metadata, which may be incomplete or inaccurate for all sorts of reasons.

In support of this general idea, I have defined a provisional file format (in YAML) that can be used to transport these signatures around. It's rather minimal and fairly human readable - we would need to augment it with additional metadata fields for any serious use in databases (but see below for more discussion on that). Each record (and there can currently only be one record per signature file) can contain multiple different sketches, corresponding to different k-mer sizes used in generating the sketch. (For different sized sketches with the same k-mers, you just store the biggest one, because we're using bottom sketches, so the bigger sketches properly include the smaller sketches.)

If you want to play with some signatures, you can -- here's an executable binder with some examples of generating distance matrices between signatures, and plotting them. Note that by far the most time is spent in loading the signatures - the comparisons are super quick, and in any case could be sped up a lot by moving them from pure Python over to C. I've got a pile of all echinoderm SRA signatures already built, for those who are interested in looking at a collection -- look here.

## Finding collaborators

Searching public databases is all well and good, and is a pretty cool application to enable with a few dozen lines of code. But I'm also interested in enabling the search of pre-publication data and doing matchmaking between potential collaborators. How could this work?

Well, the interesting thing about these signatures is that they are irreversible signatures with a one-sided error (a match means something; no match means very little). This means that you can't learn much of anything about the original sample from the signature unless you have a matching sample, and even then all you know is the species and maybe something about the tissue/stage being sequenced. In turn, this means that it might be possible to convince people to publicly post signatures of pre-publication mRNAseq data sets.

Why would they do this?? An underappreciated challenge in the non-model organism world is that building reference transcriptomes requires a lot of samples. Sure, you can go sequence just the tissues you're interested in, but you have to sequence deeply and broadly in order to generate good enough data to produce a good reference transcriptome so that you can interpret your own mRNAseq. In part because of this (as well as many other reasons), people are slow to publish on their mRNAseq - and, generally, data isn't made available pre-publication.

What if you could go fishing for collaborators on building a reference transcriptome? Very few people are excited about just publishing a transcriptome (with some reason, when you see papers that publish 300), but those are really valuable building blocks for the field as a whole. So, suppose you had some RNAseq, and you wanted to find other people with RNAseq from the same organism, and there was this service where you could post your RNAseq signature and get notified when similar signatures were posted? You wouldn't need to do anything more than supply an e-mail address along with your signature, and if you're worried about leaking information about who you are, it's easy enough to make new e-mail addresses.

I dunno. Seems interesting. Could work. Right?

One fun point is that this could be a distributed service. The signatures are small enough (~1-2 kb) that you can post them on places like github, and then have aggregators that collect them. The only "centralized" service involved would be in searching all of them, and that's pretty lightweight in practice.

Another fun point is that we already have a good way to communicate RNAseq for the limited purpose of transcriptome assembly -- diginorm. Abundance-normalized RNAseq is useless for doing expression analysis, and if you normalize a bunch of samples together you can't even figure out what the original tissue was. So, if you're worried about other people having access to your expression levels, you can simply normalize the data all together before handing it over.

## Further thoughts

As I said in the first post, this was all nucleated by reading the mash and MetaPalette papers. In my review for MetaPalette, I suggested that they look at mash to see if MinHash signatures could be used to dramatically reduce their database size, and now that I actually understand MinHash a bit more, I think the answer is clearly yes.

Which leads to another question - the Mash folk are clearly planning to use MinHash & mash to search assembled genomes, with a side helping of unassembled short and long reads. If we can all agree on an interchange format or three, why couldn't we just start generating public signatures of all the things, mRNAseq and genomic and metagenomic all? I see many, many uses, all somewhat dimly...

(Lest anyone think I believe this to be a novel observation, clearly the Mash folk are well ahead of me here -- they undersold it in their paper, so I didn't notice until I re-read it with this in mind, but it's there :).

Anyway, it seems like a great idea and we should totally do it. Who's in? What are the use cases? What do we need to do? Where is it going to break?

--titus

p.s. Thanks to Luiz Irber for some helpful discussion about YAML formats!
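For readers who want a concrete feel for the bottom sketches discussed above: a sketch is just the n smallest hash values of a data set's k-mers, and the similarity of two data sets can be estimated from the sketches alone. A toy version in pure Python (an illustration of the idea, not the mash implementation):

```python
import hashlib

def kmers(seq, k=4):
    """All length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash_kmer(kmer):
    # a stable hash, so sketches are comparable across runs and machines
    return int(hashlib.md5(kmer.encode()).hexdigest(), 16)

def bottom_sketch(seq, k=4, n=20):
    # keep only the n smallest hash values: a "bottom sketch" MinHash
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:n]

def estimate_similarity(s1, s2, n=20):
    # fraction of the n smallest combined hashes present in both sketches
    combined = sorted(set(s1) | set(s2))[:n]
    shared = set(s1) & set(s2)
    return sum(1 for h in combined if h in shared) / len(combined)

a = bottom_sketch("GATTACAGATTACACATTACA")
b = bottom_sketch("GATTACAGATTACACATTACA")
c = bottom_sketch("CCCGGGAACCTTAACCGGTT")
print(estimate_similarity(a, b))        # identical sequences -> 1.0
print(estimate_similarity(a, c) < 1.0)  # unrelated sequences score lower
```

Because the sketch retains only a few dozen hash values, it is tiny, fast to compare, and irreversible in exactly the sense the post describes.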
## April 14, 2016 ### Matthew Rocklin #### Fast Message Serialization This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project Very high performance isn’t about doing one thing well, it’s about doing nothing poorly. This week I optimized the inter-node communication protocol used by dask.distributed. It was a fun exercise in optimization that involved several different and unexpected components. I separately had to deal with Pickle, NumPy, Tornado, MsgPack, and compression libraries. This blogpost is not advertising any particular functionality, rather it’s a story of the problems I ran into when designing and optimizing a protocol to quickly send both very small and very large numeric data between machines on the Python stack. We care very strongly about both the many small messages case (thousands of 100 byte messages per second) and the very large messages case (100-1000 MB). This spans an interesting range of performance space. We end up with a protocol that costs around 5 microseconds in the small case and operates at 1-1.5 GB/s in the large case. ## Identify a Problem This came about as I was preparing a demo using dask.array on a distributed cluster for a Continuum webinar. I noticed that my computations were taking much longer than expected. The Web UI quickly pointed me to the fact that my machines were spending 10-20 seconds moving 30 MB chunks of numpy array data between them. This is very strange because I was on 100MB/s network, and so I expected these transfers to happen in more like 0.3s than 15s. The Web UI made this glaringly apparent, so my first lesson was how valuable visual profiling tools can be when they make performance issues glaringly obvious. Thanks here goes to the Bokeh developers who helped the development of the Dask real-time Web UI. ## Problem 1: Tornado’s sentinels Dask’s networking is built off of Tornado’s TCP IOStreams. 
There are two common ways to delineate messages on a socket, sentinel values that signal the end of a message, and prefixing a length before every message. Early on we tried both in Dask but found that prefixing a length before every message was slow. It turns out that this was because TCP sockets try to batch small messages to increase bandwidth. Turning this optimization off ended up being an effective and easy solution, see the TCP_NODELAY parameter. However, before we figured that out we used sentinels for a long time. Unfortunately Tornado does not handle sentinels well for large messages. At the receipt of every new message it reads through all buffered data to see if it can find the sentinel. This makes lots and lots of copies and reads through lots and lots of bytes. This isn’t a problem if your messages are a few kilobytes, as is common in web development, but it’s terrible if your messages are millions or billions of bytes long. Switching back to prefixing messages with lengths and turning off the no-delay optimization moved our bandwidth up from 3MB/s to 20MB/s per node. Thanks goes to Ben Darnell (main Tornado developer) for helping us to track this down. ## Problem 2: Memory Copies A nice machine can copy memory at 5 GB/s. If your network is only 100 MB/s then you can easily suffer several memory copies in your system without caring. This leads to code that looks like the following: socket.send(header + payload)  This code concatenates two bytestrings, header and payload before sending the result down a socket. If we cared deeply about avoiding memory copies then we might instead send these two separately: socket.send(header) socket.send(payload)  But who cares, right? At 5 GB/s copying memory is cheap! Unfortunately this breaks down under either of the following conditions 1. You are sloppy enough to do this multiple times 2. 
You find yourself on a machine with surprisingly low memory bandwidth, like 10 times slower, as is the case on some EC2 machines. Both of these were true for me but fortunately it’s usually straightforward to reduce the number of copies down to a small number (we got down to three), with moderate effort. ## Problem 3: Unwanted Compression Dask compresses all large messages with LZ4 or Snappy if they’re available. Unfortunately, if your data isn’t very compressible then this is mostly lost time. Doubly unforutnate is that you also have to decompress the data on the recipient side. Decompressing not-very-compressible data was surprisingly slow. Now we compress with the following policy: 1. If the message is less than 10kB, don’t bother 2. Pick out five 10kB samples of the data and compress those. If the result isn’t well compressed then don’t bother compressing the full payload. 3. Compress the full payload, if it doesn’t compress well then just send along the original to spare the receiver’s side from compressing. In this case we use cheap checks to guard against unwanted compression. We also avoid any cost at all for small messages, which we care about deeply. ## Problem 4: Cloudpickle is not as fast as Pickle This was surprising, because cloudpickle mostly defers to Pickle for the easy stuff, like NumPy arrays. In [1]: import numpy as np In [2]: data = np.random.randint(0, 255, dtype='u1', size=10000000) In [3]: import pickle, cloudpickle In [4]: %time len(pickle.dumps(data, protocol=-1)) CPU times: user 8.65 ms, sys: 8.42 ms, total: 17.1 ms Wall time: 16.9 ms Out[4]: 10000161 In [5]: %time len(cloudpickle.dumps(data, protocol=-1)) CPU times: user 20.6 ms, sys: 24.5 ms, total: 45.1 ms Wall time: 44.4 ms Out[5]: 10000161  But it turns out that cloudpickle is using the Python implementation, while pickle itself (or cPickle in Python 2) is using the compiled C implemenation. 
Fortunately this is easy to correct: a quick typecheck on common large data formats in Python (NumPy and Pandas) gets us this speed boost.

## Problem 5: Pickle is still slower than you'd expect

Pickle runs at about half the speed of memcpy, which is slower than you'd expect from a protocol that is mostly just "serialize the dtype and strides, then tack on the data bytes". There must be an extraneous memory copy in there. See issue 7544.

## Problem 6: MsgPack is bad at large bytestrings

Dask serializes most messages with MsgPack, which is ordinarily very fast. Unfortunately, the MsgPack spec doesn't support bytestrings greater than 4 GB (which do come up for us) and the Python implementations don't pass through large bytestrings very efficiently. So we had to handle large bytestrings separately: any message that contains bytestrings over 1 MB in size will have them stripped out and sent along in a separate frame. This both avoids the MsgPack overhead and avoids a memory copy (we can send the bytes directly to the socket).

## Problem 7: Tornado makes a copy

Sockets on Windows don't accept payloads greater than 128 kB in size. As a result, Tornado chops up large messages into many small ones. On Linux this memory copy is extraneous. It can be removed with a bit of logic within Tornado. I might do this in the moderate future.

## Results

We serialize small messages in about 5 microseconds (thanks, msgpack!) and move large bytes around at the cost of three memory copies (about 1-1.5 GB/s), which is generally faster than most networks in use. Here is a profile of sending and receiving a gigabyte-sized NumPy array of random values through to the same process over localhost (500 MB/s on my machine):
    381360 function calls (381323 primitive calls) in 1.451 seconds

    Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         1    0.366    0.366    0.366    0.366  {built-in method dumps}
         8    0.289    0.036    0.291    0.036  iostream.py:360(write)
     15353    0.228    0.000    0.228    0.000  {method 'join' of 'bytes' objects}
     15355    0.166    0.000    0.166    0.000  {method 'recv' of '_socket.socket' objects}
     15362    0.156    0.000    0.398    0.000  iostream.py:1510(_merge_prefix)
      7759    0.101    0.000    0.101    0.000  {method 'send' of '_socket.socket' objects}
     17/14    0.026    0.002    0.686    0.049  gen.py:990(run)
     15355    0.021    0.000    0.198    0.000  iostream.py:721(_read_to_buffer)
         8    0.018    0.002    0.203    0.025  iostream.py:876(_consume)
        91    0.017    0.000    0.335    0.004  iostream.py:827(_handle_write)
        89    0.015    0.000    0.217    0.002  iostream.py:585(_read_to_buffer_loop)
    122567    0.009    0.000    0.009    0.000  {built-in method len}
     15355    0.008    0.000    0.173    0.000  iostream.py:1010(read_from_fd)
     38369    0.004    0.000    0.004    0.000  {method 'append' of 'list' objects}
      7759    0.004    0.000    0.104    0.000  iostream.py:1023(write_to_fd)
         1    0.003    0.003    1.451    1.451  ioloop.py:746(start)

Dominant unwanted costs include the following:

1. 400ms: Pickling the NumPy array
2. 400ms: Bytestring handling within Tornado

After this we're just bound by pushing bytes down a wire.

## Conclusion

Writing fast code isn't about writing any one thing particularly well, it's about mitigating everything that can get in your way. As you approach peak performance, previously minor flaws suddenly become your dominant bottleneck. Success here depends on frequent profiling and keeping your mind open to unexpected and surprising costs.

## April 13, 2016

### Titus Brown

#### Applying MinHash to cluster RNAseq samples

(I gave a talk on this on Monday, April 11th - you can see the slides here, on figshare. This is a Reproducible Blog Post. You can regenerate all the figures and play with this software yourself on binder.)

So, my latest enthusiasm is MinHash sketches.
A few weeks back, I had the luck to be asked to review both the mash paper (preprint here) and the MetaPalette paper (preprint here). The mash paper made me learn about MinHash sketches, while the MetaPalette paper made some very nice points about shared k-mers and species identification.

After reading, I got to thinking. I wondered to myself, hey, could I use MinHash signatures to cluster unassembled Illumina RNAseq samples? While the mash folk showed that MinHash could be applied to raw reads nicely, I guessed that the greater dynamic range of gene expression would cause problems - mainly because high-abundance transcripts would yield many, many erroneous k-mers. Conveniently, however, my lab has some not-so-secret sauce for dealing with this problem - would it work here? I thought it might.

Combined with all of this, my former grad student, Dr. Qingpeng Zhang (first author on the not-so-secret sauce, above) has some other still-unpublished work showing that the first ~1m reads of metagenome samples can be used to cluster samples together. So, I reasoned, perhaps it would work well to stream the first million or so reads from the beginning of RNAseq samples through our error trimming approach, compute a MinHash signature, and then use that signature to identify the species from which the RNAseq was isolated (and perhaps even closely related samples).

tl;dr? It seems to work, with some modifications.

For everything below, I used a k-mer hash size of 32 and only chose read data sets with reads of length 72 or higher. (Here's a nice presentation on MinHash, via Luiz Irber.)

## MinHash is super easy to implement

I implemented MinHash in only a few lines of Python; see the repository at https://github.com/dib-lab/sourmash/. The most relevant code is sourmash_lib.py. Here, I'm using a bottom sketch, and at the moment I'm building some of it on top of khmer, although I will probably remove that requirement soon.
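To make "a few lines of Python" concrete, here is a stripped-down toy bottom sketch of my own devising - not the sourmash code - which uses Python's built-in hash() where a real implementation would hash the canonical (reverse-complement-aware) k-mer with a fixed hash function:

```python
class BottomSketch:
    """Toy MinHash bottom sketch: keep the `size` smallest k-mer hashes."""

    def __init__(self, size=500, ksize=32):
        self.size = size
        self.ksize = ksize
        self.hashes = set()

    def add_sequence(self, seq):
        for i in range(len(seq) - self.ksize + 1):
            kmer = seq[i:i + self.ksize]
            # Built-in hash() is a stand-in for a proper k-mer hash.
            self.hashes.add(hash(kmer))
            # Keep only the smallest `size` hashes seen so far.
            if len(self.hashes) > self.size:
                self.hashes.remove(max(self.hashes))

    def similarity(self, other):
        # Estimated Jaccard similarity between the two bottom sketches.
        if not self.hashes or not other.hashes:
            return 0.0
        return len(self.hashes & other.hashes) / len(self.hashes | other.hashes)
```

Comparing two sketches is then just set intersection over union on the retained minimums, which is why all-vs-all comparisons of hundreds of samples take seconds.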
After lots of trial and error (some of it reported below), I settled on using a k-mer size of k=32 and a sketch size of 500. (You can go down to a sketch size of 100, but you lose resolution. Lower k-mer sizes have the expected effect of slowly decreasing resolution; odd k-mer sizes effectively halve the sketch size.)

## How fast is it, how much memory does it use, and how big are the sketches?

I haven't bothered benchmarking it, but

• everything but the hash function itself is in Python;
• on my 3-year-old laptop it takes about 5 minutes to add 1m reads;
• the memory usage of sourmash itself is negligible - error trimming the reads requires about 1 GB of RAM;
• the sketches are tiny - less than a few kB - and the runtime is dominated by Python overhead.

So it's super fast and super lightweight.

## Do you need to error trim the reads?

The figure below shows a dendrogram next to a distance matrix of 8 samples - four mouse samples, untrimmed, and the same four mouse samples, trimmed at low-abundance k-mers. (You can see the trimming command here, using khmer's trim-low-abund command.)

The two house mouse samples are replicates, and they always cluster together. However, they are much further apart without trimming. The effect of trimming on the disease mouse samples (which are independent biological samples, I believe) is much less; it rearranges the tree a bit, but the effect is not as convincing as with the replicates.

So you seem to get better resolution when you error trim the reads, which is expected. The signal isn't as strong as I thought it'd be, though. Have to think about that; I'm surprised MinHash is that robust to errors!

## Species group together pretty robustly with only 1m reads

How many reads do you need to use? If you're looking for species groupings, not that many -- 1m reads is enough to cluster mouse vs yeast separately. (Which is good, right? If that didn't work...)
Approximately 1m reads turns out to work equally well for ~200 echinoderm (sea urchin and sea star) samples, too. Here, I downloaded all 204 echinoderm HiSeq mRNAseq data sets from the SRA, trimmed them as above, computed the MinHash signatures, and then compared them all to each other. The blocks of similarity are all specific species; all the species groups cluster properly, and none of them (with one exception) cluster with other species.

This is also an impressive demonstration of the speed of MinHash - you can compare all 204 samples against each other in about 10 seconds. Most of that time is spent loading my YAML format into memory; the actual comparison takes < 1s! (The whole notebook for making all of these figures takes less than 30 seconds to run, since the signatures are already there; check it out!)

## Species that do group together may actually belong together

In the urchin clustering above, there's only one "confused" species grouping where one cluster contains more than one species - that's Patiria miniata and Patiria pectinifera, which are both bat stars. I posted this figure on Facebook and noted the grouping, and Dan Rokhsar pointed out that on Wikipedia, Patiria has been identified as a complex of three closely related species in the Pacific. So that's good - it seems like the only group that has cross-species clustering is, indeed, truly multi-species.

## You can sample any 1m reads and get pretty similar results

In theory, FASTQ files from shotgun sequencing are perfectly random, so you should be able to pick any 1m reads you want - including the first 1m. In practice, of course, this is not true. How similar are different subsamples? Answer: quite similar. All seven 1m-read subsamples (5 random, one from the middle, one from the end) are above 70% in similarity.

## (Very) widely divergent species don't cross-compare at all

If you look at (say) yeast and mouse, there's simply no similarity there at all.
32-mer signatures are apparently very specific. (The graph below is kind of stupid. It's just looking at similarity between mouse and yeast data sets as you walk through the two data streams. It's 0.2% all the way.)

## Species samples get more similar (or stay the same) as you go through the stream

What happens when you look at more than 1m reads? Do the streams get more or less similar? If you walk through two streams and update the MinHash signature regularly, you see either constant similarity or a general increase in similarity; in the mouse replicates, it's constant and high, and between disease mouse and house mouse, it grows as you step through the stream. (The inflection points are probably due to how we rearrange the reads during the abundance trimming. More investigation needed.)

Yeast replicates also maintain high similarity through the data stream.

## What we're actually doing is mostly picking k-mers from the transcriptome

(This is pretty much what we expected, but as my dad always said, "trust but verify.")

The next question is, what are we actually seeing signatures of? For example, in the above mouse example, we see growing similarity between two mouse data sets as we step through the data stream. Is this because we're counting more sequencing artifacts as we look at more data, or is this because we're seeing true signal?

To investigate, I calculated the MinHash signature of the mouse RNA RefSeq file, and then asked if the streams were getting closer to that as we walked through them. They are. So, it seems like part of what's happening here is that we are looking at the True Signature of the mouse transcriptome. Good to know.

And that's it for today, folks.

## What can this be used for?

So, it all seems to work pretty well - the mash folk are dead-on right, and this is a pretty awesome and simple way to look at sequences. Right now, my approach above seems like it's most useful for identifying what species some RNAseq is from.
If we can do that, then we can start thinking about other uses. If we can't do that pretty robustly, then that's a problem ;). So that's where I started.

It might be fun to run against portions of the SRA to identify mislabeled samples. Once we have the SRA digested, we can make that available to people who are looking for more samples from their species of interest; whether this is useful will depend. I'm guessing that it's not immediately useful, since the SRA species identifications seem pretty decent.

One simple idea is to simply run this on each new sample you get back from a sequencing facility. "Hey, this looks like Drosophila. ...did you intend to sequence Drosophila?" It won't work that well for identifying low-lying contamination, but it could identify mislabeled samples pretty quickly.

Tracy Teal suggested that this could be used in-house in large labs to find out if others in the lab have samples of interest to you. Hmm. More on that idea later.

## Some big remaining questions

• Do samples actually cluster by expression similarity? Maybe - more work needed.
• Can this be used to compare different metagenomes using raw reads? No, probably not very well. At least, the metagenomes I care about are too diverse; you will probably need a different strategy. I'm thinking about it.

## One last shoutout

I pretty much reimplemented parts of mash; there's nothing particularly novel here, other than exploring it in my own code on public data :). So, thanks, mash authors!

--titus

## April 12, 2016

### Continuum Analytics news

#### Using Anaconda with PySpark for Distributed Language Processing on a Hadoop Cluster

Posted Tuesday, April 12, 2016

### Overview

Working with your favorite Python packages along with distributed PySpark jobs across a Hadoop cluster can be difficult due to tedious manual setup and configuration issues - a problem that becomes more painful as the number of nodes in your cluster increases.
Anaconda makes it easy to manage packages (including Python, R and Scala) and their dependencies on an existing Hadoop cluster with PySpark, including data processing, machine learning, image processing and natural language processing.

In a previous post, we demonstrated how you can use libraries in Anaconda to query and visualize 1.7 billion comments on a Hadoop cluster. In this post, we'll use Anaconda to perform distributed natural language processing with PySpark using a subset of the same data set. We'll configure different enterprise Hadoop distributions, including Cloudera CDH and Hortonworks HDP, to work interactively on your Hadoop cluster with PySpark, Anaconda and a Jupyter Notebook.

In the remainder of this post, we'll:

1. Install Anaconda and the Jupyter Notebook on an existing Hadoop cluster.
2. Load the text/language data into HDFS on the cluster.
3. Configure PySpark to work with Anaconda and the Jupyter Notebook with different enterprise Hadoop distributions.
4. Perform distributed natural language processing on the data with the NLTK library from Anaconda.
5. Work locally with a subset of the data using Pandas and Bokeh for data analysis and interactive visualization.

### Provisioning Anaconda on a cluster

Because we're installing Anaconda on an existing Hadoop cluster, we can follow the bare-metal cluster setup instructions in Anaconda for cluster management from a Windows, Mac, or Linux machine. We can install and configure conda on each node of the existing Hadoop cluster with a single command:

$ acluster create cluster-hadoop --profile cluster-hadoop

After a few minutes, we’ll have a centrally managed installation of conda across our Hadoop cluster in the default location of /opt/anaconda.

### Installing Anaconda packages on the cluster

Once we’ve provisioned conda on the cluster, we can install the packages from Anaconda that we’ll need for this example to perform language processing, data analysis and visualization:

$ acluster conda install nltk pandas bokeh

We'll need to download the NLTK data on each node of the cluster. For convenience, we can do this using the distributed shell functionality in Anaconda for cluster management:

$ acluster cmd 'sudo /opt/anaconda/bin/python -m nltk.downloader -d /usr/share/nltk_data all'

In this post, we'll use a subset of the data set that contains comments from the reddit website from January 2015 to August 2015, which is about 242 GB on disk. This data set was made available in July 2015 in a reddit post. The data set is in JSON format (one comment per line) and consists of the comment body, author, subreddit, timestamp of creation and other fields.

Note that we could convert the data into different formats or load it into various query engines; however, since the focus of this blog post is using libraries with Anaconda, we will be working with the raw JSON data in PySpark.

We’ll load the reddit comment data into HDFS from the head node. You can SSH into the head node by running the following command from the client machine:

$ acluster ssh

The remaining commands in this section will be executed on the head node. If it doesn't already exist, we'll need to create a user directory in HDFS and assign the appropriate permissions:

$ sudo -u hdfs hadoop fs -mkdir /user/ubuntu

$ sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu

We can then move the data by running the following command with valid AWS credentials, which will transfer the reddit comment data from the year 2015 (242 GB of JSON data) from a public Amazon S3 bucket into HDFS on the cluster:

$ hadoop distcp s3n://AWS_KEY:AWS_SECRET@blaze-data/reddit/json/2015/*.json /user/ubuntu/

Replace AWS_KEY and AWS_SECRET in the above command with valid Amazon AWS credentials.

### Configuring the spark-submit command with your Hadoop Cluster

To use Python from Anaconda along with PySpark, you can set the PYSPARK_PYTHON environment variable on a per-job basis along with the spark-submit command. If you’re using the Anaconda parcel for CDH, you can run a PySpark script (e.g., spark-job.py) using the following command:

$ PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit spark-job.py

If you're using Anaconda for cluster management with Cloudera CDH or Hortonworks HDP, you can run the PySpark script using the following command (note the different path to Python):

$ PYSPARK_PYTHON=/opt/anaconda/bin/python spark-submit spark-job.py

### Installing and Configuring the Notebook with your Hadoop Cluster

Using the spark-submit command is a quick and easy way to verify that our PySpark script works in batch mode. However, it can be tedious to work with our analysis in a non-interactive manner as Java and Python logs scroll by.


Instead, we can use the Jupyter Notebook on our Hadoop cluster to work interactively with our data via Anaconda and PySpark.


Using Anaconda for cluster management, we can install Jupyter Notebook on the head node of the cluster with a single command, then open the notebook interface in our local web browser:

$ acluster install notebook

$ acluster open notebook

Once we’ve opened a new notebook, we’ll need to configure some environment variables for PySpark to work with Anaconda. The following sections include details on how to configure the environment variables for Anaconda to work with PySpark on Cloudera CDH and Hortonworks HDP.

#### Using the Anaconda Parcel with Cloudera CDH

If you’re using the Anaconda parcel with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and the Anaconda 4.0 parcel.

>>> import os

>>> import sys

>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"

>>> os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

#### Using Anaconda for cluster management with Cloudera CDH

If you’re using Anaconda for cluster management with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and Anaconda for cluster management 1.4.0.

>>> import os

>>> import sys

>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"

>>> os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

#### Using Anaconda for cluster management with Hortonworks HDP

If you’re using Anaconda for cluster management with Hortonworks HDP, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Hortonworks HDP running Spark 1.6.0 and Anaconda for cluster management 1.4.0.

>>> import os

>>> import sys

>>> os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
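Whichever distribution you're on, it can save some head-scratching to verify that these variables actually point at real paths before starting Spark. Here is a small hypothetical helper for that check - an illustration, not part of the original post:

```python
import os

def check_spark_env():
    # Fail fast if the Spark/Anaconda wiring is missing or points at
    # paths that don't exist on this node.
    for var in ("SPARK_HOME", "PYSPARK_PYTHON"):
        path = os.environ.get(var)
        if path is None:
            raise RuntimeError("%s is not set" % var)
        if not os.path.exists(path):
            raise RuntimeError("%s points at a missing path: %s" % (var, path))
```

Calling check_spark_env() in the first notebook cell surfaces configuration mistakes immediately, rather than as an opaque JVM traceback later on.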

### Initializing the SparkContext

After we’ve configured Anaconda to work with PySpark on our Hadoop cluster, we can initialize a SparkContext that we’ll use for distributed computations. In this example, we’ll be using the YARN resource manager in client mode:

>>> from pyspark import SparkConf

>>> from pyspark import SparkContext

>>> conf = SparkConf()

>>> conf.setMaster('yarn-client')

>>> conf.setAppName('anaconda-pyspark-language')

>>> sc = SparkContext(conf=conf)

Now that we’ve created a SparkContext, we can load the JSON reddit comment data into a Resilient Distributed Dataset (RDD) from PySpark:

>>> lines = sc.textFile("/user/ubuntu/*.json")

Next, we decode the JSON data and decide that we want to filter comments from the movies subreddit:

>>> import json

>>> data = lines.map(json.loads)

>>> movies = data.filter(lambda x: x['subreddit'] == 'movies')

We can then persist the RDD in distributed memory across the cluster so that future computations and queries will be computed quickly from memory. Note that this operation only marks the RDD to be persisted; the data will be persisted in memory after the first computation is triggered:

>>> movies.persist()

We can count the total number of comments in the movies subreddit (about 2.9 million comments):

>>> movies.count()

2905085

We can inspect the first comment in the dataset, which shows fields for the author, comment body, creation time, subreddit, etc.:

>>> movies.take(1)

CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 113 ms

[{u'archived': False,

 u'author': u'kylionsfan',

 u'author_flair_css_class': None,

 u'author_flair_text': None,

 u'body': u'Goonies',

 u'controversiality': 0,

 u'created_utc': u'1420070402',

 u'distinguished': None,

 u'downs': 0,

 u'edited': False,

 u'gilded': 0,

 u'id': u'cnas90u',

 u'link_id': u't3_2qyjda',

 u'name': u't1_cnas90u',

 u'parent_id': u't3_2qyjda',

 u'retrieved_on': 1425124282,

 u'score': 1,

 u'score_hidden': False,

 u'subreddit': u'movies',

 u'subreddit_id': u't5_2qh3s',

 u'ups': 1}]

### Distributed Natural Language Processing

Now that we’ve filtered a subset of the data and loaded it into memory across the cluster, we can perform distributed natural language computations using Anaconda with PySpark.

First, we define a parse() function that imports the natural language toolkit (NLTK) from Anaconda and tags words in each comment with their corresponding part of speech. Then, we can map the parse() function to the movies RDD:

>>> def parse(record):

...    import nltk

...    tokens = nltk.word_tokenize(record["body"])

...    record["n_words"] = len(tokens)

...    record["pos"] = nltk.pos_tag(tokens)

...    return record

>>> movies2 = movies.map(parse)

Let’s take a look at the body of one of the comments:

>>> movies2.take(10)[6]['body']

u'Dawn of the Apes was such an incredible movie, it should be up there in my opinion.'

And the same comment with tagged parts of speech (e.g., nouns, verbs, prepositions):

>>> movies2.take(10)[6]['pos']

[(u'Dawn', 'NN'),

(u'of', 'IN'),

(u'the', 'DT'),

(u'Apes', 'NNP'),

(u'was', 'VBD'),

(u'such', 'JJ'),

(u'an', 'DT'),

(u'incredible', 'JJ'),

(u'movie', 'NN'),

(u',', ','),

(u'it', 'PRP'),

(u'should', 'MD'),

(u'be', 'VB'),

(u'up', 'RP'),

(u'there', 'RB'),

(u'in', 'IN'),

(u'my', 'PRP$'),

 (u'opinion', 'NN'),

 (u'.', '.')]

We can define a get_NN() function that extracts nouns from the records, filters stopwords, and removes non-words from the data set:

>>> def get_NN(record):

...    import re

...    from nltk.corpus import stopwords

...    all_pos = record["pos"]

...    ret = []

...    for pos in all_pos:

...        if pos[1] == "NN" \

...               and pos[0] not in stopwords.words('english') \

...               and re.search("^[0-9a-zA-Z]+$", pos[0]) is not None:

...            ret.append(pos[0])

...    return ret

>>> nouns = movies2.flatMap(get_NN)

We can then generate word counts for the nouns that we extracted from the dataset:

>>> counts = nouns.map(lambda word: (word, 1))

After we’ve done the heavy lifting, processing, filtering and cleaning on the text data using Anaconda and PySpark, we can collect the reduced word count results onto the head node.

>>> top_nouns = counts.countByKey()

>>> top_nouns = dict(top_nouns)
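Note that countByKey brings one (word, count) pair per distinct noun back to the driver, which works here because the noun vocabulary is modest; for a very large key space, a distributed reduceByKey would be safer. For intuition, the aggregation countByKey performs on our (word, 1) pairs is equivalent to this plain-Python reduction (a local model, not Spark code):

```python
from collections import defaultdict

def count_by_key(pairs):
    # Locally models RDD.countByKey() on (word, 1) pairs: sum the
    # values for each distinct key and return a plain dict.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

Spark does the same thing, except the per-key sums are computed across the executors before the final dict reaches the driver.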

In the next section, we’ll continue our analysis on the head node of the cluster while working with familiar libraries in Anaconda, all in the same interactive Jupyter notebook.

### Local analysis with Pandas and Bokeh

Now that we’ve done the heavy lifting using Anaconda and PySpark across the cluster, we can work with the results as a dataframe in Pandas, where we can query and inspect the data as usual:

>>> import pandas as pd

>>> df = pd.DataFrame(top_nouns.items(), columns=['Noun', 'Count'])

Let’s sort the resulting word counts, and view the top 10 nouns by frequency:

>>> df = df.sort_values('Count', ascending=False)

>>> df_top_10 = df.head(10)

>>> df_top_10

 Noun        Count
 movie      539698
 film       220366
 time       157595
 way        112752
 gt         105313
 http        92619
 something   87835
 lot         85573
 scene       82229
 thing       82101

Let’s generate a bar chart of the top 10 nouns using Pandas:

>>> %matplotlib inline

>>> df_top_10.plot(kind='bar', x='Noun', y='Count')


Finally, we can use Bokeh to generate an interactive plot of the data:

>>> from bokeh.charts import Bar, show

>>> from bokeh.io import output_notebook

>>> from bokeh.charts.attributes import cat

>>> output_notebook()

>>> p = Bar(df_top_10,

...         label=cat(columns='Noun', sort=False),

...         values='Count',

...         title='Top N nouns in r/movies subreddit')

>>> show(p)


### Conclusion

In this post, we used Anaconda with PySpark to perform distributed natural language processing and computations on data stored in HDFS. We configured Anaconda and the Jupyter Notebook to work with PySpark on various enterprise Hadoop distributions (including Cloudera CDH and Hortonworks HDP), which allowed us to work interactively with Anaconda and the Hadoop cluster. This made it convenient to work with Anaconda for the distributed processing with PySpark, while reducing the data to a size that we could work with on a single machine, all in the same interactive notebook environment. The complete notebook for this example with Anaconda, PySpark, and NLTK can be viewed on Anaconda Cloud.

You can get started with Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:

$ conda install anaconda-client

$ anaconda login

$ conda install anaconda-cluster -c anaconda-cluster

If you'd like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at sales@continuum.io. The enterprise features of Anaconda, including the cluster management functionality and on-premises repository, are certified for use with Cloudera CDH 5.

If you're running into memory errors, performance issues (related to JVM overhead or Python/Java serialization), problems translating your existing Python code to PySpark, or other limitations with PySpark, stay tuned for a future post about a parallel processing framework in pure Python that works with libraries in Anaconda and your existing Hadoop cluster, including HDFS and YARN.

### Matthieu Brucher

#### Analog modeling of a diode clipper (3a): Simulation

Now that we have a few methods, let's try to simulate them. For both circuits, I'll use the forward Euler, then backward Euler and trapezoidal approximations; then I will show the results of changing the start estimate, and finish with the Newton-Raphson optimization. I haven't checked (yet?) algorithms that don't use the derivative, like the bisection or Brent algorithms. All graphs are done with x4 oversampling (although I also tried x8, x16 and x32).

# First diode clipper

Let's start with the original equation:

$V_i - 2 R_1 I_s sinh(\frac{V_o}{nV_t}) - \int \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t}) dt - V_o = 0$

## Forward Euler

Let's now figure out what to do with the integral by differentiating the equation:

$\frac{dV_o}{dt} = \frac{\frac{dV_i}{dt} - \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t})}{1 + \frac{2 I_s R_1}{nV_t} cosh(\frac{V_o}{nV_t})}$

So now we have the standard form that can be used the usual way.
For the derivative of the input, I'll always use the trapezoidal approximation, and for the output, I'll use the forward Euler, which leads to the "simple" equation:

$V_{on+1} = V_{on} + \frac{V_{in+1} - V_{in} - \frac{4 h I_s}{C_1} sinh(\frac{V_{on}}{nV_t})}{1 + \frac{2 I_s R_1}{nV_t} cosh(\frac{V_{on}}{nV_t})}$

## Backward Euler

For the backward Euler, I'll start from the integral equation again and remove the time dependency:

$V_{in+1} - V_{in} - 2 R_1 I_s (sinh(\frac{V_{on+1}}{nV_t}) - sinh(\frac{V_{on}}{nV_t})) - \int^{t_{n+1}}_{t_n} \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t}) dt - V_{on+1} + V_{on} = 0$

Now the discretization becomes:

$V_{in+1} - V_{in} - 2 R_1 I_s (sinh(\frac{V_{on+1}}{nV_t}) - sinh(\frac{V_{on}}{nV_t})) - \frac{2 h I_s}{C_1} sinh(\frac{V_{on+1}}{nV_t}) - V_{on+1} + V_{on} = 0$

I didn't use the previous standard form for the backward Euler because I would have had a dependency in the sinh term, so I would still have required the numerical methods to solve the equation.

## Trapezoidal rule

Here, we just need to change the discretization to a trapezoidal one:

$V_{in+1} - V_{in} - I_s sinh(\frac{V_{on+1}}{nV_t}) (\frac{h}{C_1} + 2 R_1) - I_s sinh(\frac{V_{on}}{nV_t}) (\frac{h}{C_1} - 2 R_1) - V_{on+1} + V_{on} = 0$

## Starting estimates

Starting from the different rules, we need to replace sinh(x):

• for the pivotal rule, by $\frac{x}{x_0} sinh(x_0)$
• for the tangent rule, by $\frac{x}{nV_t} cosh(\frac{x_0}{nV_t}) + y_0 - \frac{x_0}{nV_t} cosh(\frac{x_0}{nV_t})$

## Graphs

Let's see now how all these optimizers compare (with the estimate of the next value being the last optimized value):

Numerical optimization comparison

Obviously, the forward Euler method is definitely not good. Although it is on average 4 times cheaper, its accuracy is definitely not good enough. On the other hand, the other two methods give similar results (probably because I aim for quite strong convergence, with less than 10e-8 difference between two iterations).
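To make the iteration concrete, here is roughly what a Newton-Raphson solve of the trapezoidal equation above looks like in Python - a simplified sketch with made-up component values, not the actual SimpleOverdriveFilter code:

```python
import math

# Hypothetical component values, for illustration only.
Is = 1e-14                # diode saturation current (A)
n_vt = 2 * 26e-3          # n * thermal voltage (V)
R1 = 2200.0               # series resistance (ohm)
C1 = 10e-9                # capacitance (F)
h = 1.0 / (4 * 44100)     # time step at x4 oversampling (s)

def solve_step(vi_prev, vi_next, vo_prev, tol=1e-8, max_iter=50):
    """One trapezoidal step, solving for vo_next with Newton-Raphson."""
    a = h / C1 + 2 * R1
    b = h / C1 - 2 * R1
    # Everything that doesn't depend on the unknown vo_next:
    const = vi_next - vi_prev - Is * math.sinh(vo_prev / n_vt) * b + vo_prev
    x = vo_prev  # start estimate: the last optimized value
    for _ in range(max_iter):
        f = const - Is * math.sinh(x / n_vt) * a - x
        fprime = -Is * math.cosh(x / n_vt) * a / n_vt - 1.0
        step = f / fprime
        x -= step
        if abs(step) < tol:
            break
    return x
```

The pivotal and tangent-rule starting estimates discussed above would simply replace the `x = vo_prev` initialization.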
Now, how does the original estimate impact the results? I tried the backward Euler to start with, and the results are identical:

Original estimates comparison

To get a better picture, let's turn the number of iterations down to 1 for all the estimates:

One step comparison

So all the estimates give a similar result. Comparing the number of iterations with the three estimates, the pivotal method gives the worst results, whereas the affine estimate lowers the number of iterations by one. Of course, there is a price to pay in the computation. So the obvious choice is to use the trapezoidal approximation with the affine starting point estimate, which is now my default choice in SimpleOverdriveFilter.

# To be continued

The post is getting longer than I thought, so let's keep it there for now; the next post on the subject will tackle the other diode clipper circuit.

## April 11, 2016

### Continuum Analytics news

#### Data Science with Python at ODSC East

Posted Tuesday, April 12, 2016

By: Sheamus McGovern, Open Data Science Conference Chair

At ODSC East, the most influential minds and institutions in data science will convene at the Boston Convention & Exhibition Center from May 20th to the 22nd to discuss and teach the newest and most exciting developments in data science.

As you know, the Python ecosystem is now one of the most important data science development environments available today. This is due, in large part, to the existence of a rich suite of user-facing data analysis libraries. Powerful Python machine learning libraries like Scikit-learn, XGBoost and others bring sophisticated predictive analytics to the masses. The NLTK and Gensim libraries enable deep analysis of textual information in Python, and the Topik library provides a high-level interface to these and other natural language libraries, adding a new layer of usability.
The Pandas library has brought data analysis in Python to a new level by providing expressive data structures for quick and intuitive data manipulation and analysis.

The notebook ecosystem in Python has also flourished with the development of the Jupyter, Rodeo and Beaker notebooks. The notebook interface is an increasingly popular way for data scientists to perform complex analyses and to convey and share those analyses and their results with colleagues and stakeholders.

Python is also host to a number of rich web development frameworks that are used not only for building data science dashboards, but also for full-scale data-science-powered web apps. Flask and Django lead the way in the Python web-app development landscape, but Bottle and Pyramid are also quite popular.

With Numba or Cython, code can approach speeds akin to those of C or C++, and new developments like the Dask package make computing on larger-than-memory datasets easy. Visualization libraries, like Plot.ly and Bokeh, have brought rich, interactive and impactful data visualization tools to the fingertips of data analysts everywhere.

Anaconda has streamlined the use of many of these wildly popular open source data science packages by providing an easy way to install, manage and use Python libraries. With Anaconda, users no longer need to worry about tedious incompatibilities and library management across their development environments.

Several of the most influential Python developers and data scientists will be talking and teaching at ODSC East. Indeed, Peter Wang will be speaking at ODSC East. Peter is the co-founder and CTO at Continuum Analytics, as well as the mastermind behind the popular Bokeh visualization library, the Blaze ecosystem (which simplifies the analysis of Big Data with Python) and Anaconda.
At ODSC East, there will be over 100 speakers, 20 workshops and 10 training sessions spanning seven conferences focused on Open Data Science, Disruptive Data Science, Big Data Science, Data Visualization, Data Science for Good, Open Data, and a Careers and Training conference. See below for a very small sampling of the powerful Python workshops and speakers we will have at ODSC East.

• Bayesian Statistics Made Simple - Allen Downey, Think Python
• Intro to Scikit-learn for Machine Learning - Andreas Mueller, NYU Center for Data Science
• Parallelizing Data Science in Python with Dask - Matthew Rocklin, Continuum Analytics
• Interactive Viz of a Billion Points with Bokeh Datashader - Peter Wang, Continuum Analytics

## April 07, 2016

### Titus Brown

#### Bashing on monstrous sequencing collections

So, there's this fairly large collection of about 700 RNAseq samples, from 300 species in 40 or so phyla. It's called the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP), and was funded by the Moore Foundation as a truly field-wide collaboration to improve our reference collection for genes (and more). Back When, it was sequenced and assembled by the National Center for Genome Resources, and published in PLOS Biology (Keeling et al., 2014).

Partly because we think assembly has improved in the last few years, partly as an educational exercise, partly as an infrastructure exercise, partly as a demo, and partly just because we can, Lisa Cohen in my lab is starting to reassemble all of the data - starting with about 10%. She has some of the basic evaluations (mostly via transrate) posted, and before we pull the trigger on the rest of the assemblies, we're pausing to reflect and to think about what metrics to use, and what kinds of resources we plan to produce. (We are not lacking in ideas, but we might be lacking in good ideas, if you know what I mean.)
In particular, this exercise raises some interesting questions that we hope to dig into:

• what does a good transcriptome look like, and how could having 700 assemblies help us figure that out? (hint: distributions)
• what is a good canonical set of analyses for characterizing transcriptome assemblies?
• what products should we be making available for each assembly?
• what kind of data formatting makes it easiest for other bioinformaticians to build off of the compute we're doing?
• how should we distribute the workflow components? (Lisa really likes shell scripts, but I've been lobbying for something more structured. 'make' doesn't really fit the bill here, though.)
• how do we "alert" the community if and when we come up with better assemblies? How do we merge assemblies between programs and efforts, and properly credit everyone involved?

Anyway, feedback welcome, here or on Lisa's post! We are happy to share methods, data, analyses, results, etc. etc.

--titus

p.s. Yes, that's right. I ask new grad students to start by assembling 700 transcriptomes. So? :)

### Martin Fitzpatrick

#### Why are numpy calculations not affected by the global interpreter lock?

Many numpy calculations are unaffected by the GIL, but not all. In code that does not require the Python interpreter (e.g. C libraries), it is possible to explicitly release the GIL, allowing other code that depends on the interpreter to continue running. In the NumPy C codebase, the macros NPY_BEGIN_THREADS and NPY_END_THREADS are used to delimit blocks of code that permit GIL release. You can see these in this search of the numpy source.

The NumPy C API documentation has more information on threading support. Note the additional macros NPY_BEGIN_THREADS_DESCR, NPY_END_THREADS_DESCR and NPY_BEGIN_THREADS_THRESHOLDED, which handle conditional GIL release, dependent on array dtypes and the size of loops.
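As a quick illustration of what GIL release enables (a sketch, not a benchmark; actual speedups depend on array sizes and on which functions release the GIL in your build), plain Python threads can run large numpy reductions concurrently:

```python
import threading
import numpy as np

def compute_sum(matrix, results, index):
    # ndarray.sum runs in compiled C code, which releases the GIL for
    # numeric dtypes, so several of these calls can progress concurrently.
    results[index] = matrix.sum()

matrices = [np.ones((500, 500)) for _ in range(4)]
results = [None] * 4

threads = [threading.Thread(target=compute_sum, args=(m, results, i))
           for i, m in enumerate(matrices)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # four sums, each 250000.0
```

With object arrays, or with code that keeps bouncing back into the interpreter, the threads would instead serialize on the GIL.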
Most core functions release the GIL - for example, Universal Functions (ufunc) do so as described:

as long as no object arrays are involved, the Python Global Interpreter Lock (GIL) is released prior to calling the loops. It is re-acquired if necessary to handle error conditions.

With regard to your own code, the source code for NumPy is available. Check the functions you use (and the functions they call) for the above macros. Note also that the performance benefit is heavily dependent on how long the GIL is released - if your code is constantly dropping in and out of Python, you won’t see much of an improvement.

The other option is just to test it. However, bear in mind that functions using the conditional GIL macros may exhibit different behaviour with small and large arrays. A test with a small dataset may therefore not be an accurate representation of performance for a larger task.

There is some additional information on parallel processing with numpy available on the official wiki, and a useful post about the Python GIL in general over on Programmers.SE.

## April 06, 2016

### Continuum Analytics news

#### Anaconda 4.0 Release

Posted Wednesday, April 6, 2016

We are happy to announce that Anaconda 4.0 has been released, which includes the new Anaconda Navigator.

Did you notice we skipped from release 2.5 to 4.0? Sharp eyes! The team decided to move up to the 4.0 release number to reduce confusion with common Python versions.

Anaconda Navigator is a desktop graphical user interface included in Anaconda that allows you to launch applications and easily manage conda packages, environments and channels without the need to use the command line. It is available for Windows, OS X and Linux. For those familiar with the Anaconda Launcher, Anaconda Navigator has replaced it.

If you are already using Anaconda Cloud to host private packages, you can access them easily by signing in with your Anaconda Cloud account.
(Screenshot: navigator-home.png)

There are four main components in Anaconda Navigator; each can be selected by clicking the corresponding tab in the left-hand column:

• Home, where you can install, upgrade and launch applications.
• Environments, where you can manage channels, environments and packages.
• Learning, which shows a long list of learning resources in several categories: webinars, documentation, videos and training.
• Community, where you can connect to other users through events, forums and social media.

If you already have Anaconda installed, update to Anaconda 4.0 by using conda:

conda update conda
conda install anaconda=4.0

The full list of changes, fixes and updates for Anaconda v4.0 can be found in the changelog. We’d very much appreciate your feedback on the latest release, especially the new Anaconda Navigator. Please submit comments or issues through our anaconda-issues GitHub repo.

## April 05, 2016

### Enthought

#### Just Released: PyXLL v 3.0 (Python in Excel). New Real Time Data Stream Capabilities, Excel Ribbon Integration, and More.

Download a free 30 day trial of PyXLL and try it with your own data. Since PyXLL was first released back in 2010, it has grown hugely in popularity and is used by businesses in many different sectors. The original motivation for PyXLL was to be able to use all the best bits of Excel […]

### Matthieu Brucher

#### Analog modeling of a diode clipper (2): Discretization

Let’s start with the two equations we got from the last post and see what we can do with usual/academic tools to solve them (I will tackle nodal and ZDF tools later in this series).

# Euler and trapezoidal approximation

The usual tools start with a specific form:

$\dot{y} = f(y)$

I’ll work with the second clipper whose equation is of this form:

$\frac{dV_o}{dt} = \frac{V_i - V_o}{R_1 C_1} - \frac{2 I_s}{C_1} \sinh(\frac{V_o}{nV_t}) = f(V_o)$

## Forward Euler

The simplest way of computing the derivative term is to use the following rule, with h, the inverse of the sampling frequency:

$V_{on+1} = V_{on} + h f(V_{on})$

The nice thing about this rule is that it is easy to compute. The main drawback is that the result may not be accurate and stable enough (let’s keep this for the next post, with actual derivations).
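As a minimal sketch of this rule applied to the clipper ODE above (the component values R1, C1, Is, n and Vt below are hypothetical placeholders, not the ones from this series):

```python
import math

# Hypothetical component values, for illustration only
R1 = 2400.0         # series resistance (ohms)
C1 = 10e-9          # capacitance (farads)
Is = 1e-14          # diode saturation current (A)
n, Vt = 1.0, 26e-3  # ideality factor and thermal voltage (V)
h = 1.0 / 96000.0   # step size: inverse of the sampling frequency

def f(vo, vi):
    # Right-hand side of the ODE: dVo/dt = (Vi - Vo)/(R1 C1) - (2 Is/C1) sinh(Vo/(n Vt))
    return (vi - vo) / (R1 * C1) - (2.0 * Is / C1) * math.sinh(vo / (n * Vt))

def forward_euler_step(vo, vi):
    # V_{o,n+1} = V_{o,n} + h f(V_{o,n})
    return vo + h * f(vo, vi)
```

Each audio sample then costs a single evaluation of f, which is what makes the rule so cheap.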

## Backward Euler

Instead of using the past to compute the new sample, we can use the future, which leads to

$V_{on+1} = V_{on} + h f(V_{on+1})$

As the result is present on both sides, solving the problem is not simple. In this case, we can even say that the equation has no closed-form solution (due to the sinh term), and thus no analytical solution. The only way is to use a numerical method like Brent or Newton-Raphson to solve the equation.

## Trapezoidal approximation

Another solution is to combine both rules to get a better approximation of the derivative term:

$V_{on+1} = V_{on} + \frac{h}{2}(f(V_{on}) + f(V_{on+1}))$

We still need numerical methods to solve the clipper equation with this rule, but like the Backward Euler method, this one is said to be A-stable, which is a mandatory condition when solving stiff systems (or systems that have a bad condition number). For a one-variable system, the condition number is of course 1…

## Other approximations

There are other ways of approximating this derivative term. The most used one is the trapezoidal method, but there are others, like all the linear multistep methods (which actually encompass the first three).

# Numerical methods

Let’s try to analyze a few numerical methods. If we use the trapezoidal approximation, then the following function needs to be considered:

$g(V_{on+1}) = V_{on+1} - V_{on} - \frac{h}{2}(f(V_{on}) + f(V_{on+1}))$

The goal is to find a value where the function is zero, called a root of the function.
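This residual can be sketched in code (again with hypothetical component values, not the ones from this series):

```python
import math

R1, C1 = 2400.0, 10e-9         # hypothetical resistor and capacitor values
Is, n, Vt = 1e-14, 1.0, 26e-3  # hypothetical diode parameters
h = 1.0 / 96000.0              # inverse of the sampling frequency

def f(vo, vi):
    # dVo/dt for the clipper, evaluated at output vo with input vi
    return (vi - vo) / (R1 * C1) - (2.0 * Is / C1) * math.sinh(vo / (n * Vt))

def g(vo_next, vo, vi, vi_next):
    # g(V_{o,n+1}) = V_{o,n+1} - V_{o,n} - h/2 (f(V_{o,n}) + f(V_{o,n+1}))
    return vo_next - vo - 0.5 * h * (f(vo, vi) + f(vo_next, vi_next))
```

Any root finder below can then be pointed at g with the previous sample held fixed.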

## Bisection method

This method is simple to implement: start from two points on either side of the root, take the middle of the interval, check the sign of the function there, and keep the endpoint whose sign differs from it; repeat until we get close enough to the root.

What is interesting with this method is that it can be vectorized easily by checking several values in the interval instead of just one.
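A minimal sketch of the scalar version (the vectorized variant would evaluate several interior points per iteration instead of just the midpoint):

```python
def bisection(g, a, b, tol=1e-8):
    """Find a root of g in [a, b], assuming g(a) and g(b) have opposite signs."""
    ga = g(a)
    while b - a > tol:
        m = 0.5 * (a + b)
        gm = g(m)
        # Keep the endpoint whose sign differs from g(m)
        if ga * gm <= 0:
            b = m
        else:
            a, ga = m, gm
    return 0.5 * (a + b)
```

The interval halves every iteration, so convergence is guaranteed but only linear.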

## Newton-Raphson

This numerical method requires the derivative function of g(). Then we can start from the original starting point and iterate this series:

$x_{n+1} = x_n - \frac{g(x_n)}{g'(x_n)}$

For those who are used to optimizing cost functions and know about the Newton method, it is exactly the same: to optimize a cost function, we need to find a zero of its derivative. So if g() is this derivative function, then we end up with the Newton method to minimize (or maximize) a cost function.

That being said, even though the rate of convergence of the Newton-Raphson method is quadratic, it may not converge at all. The only way to guarantee convergence is to be close enough to the root we are looking for and to satisfy some conditions that are usually quite complex to check (see the Wikipedia page).
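A sketch of the iteration for a generic g with known derivative (demonstrated on a toy function, not the clipper equation):

```python
def newton_raphson(g, dg, x0, tol=1e-8, max_iter=50):
    """Iterate x_{n+1} = x_n - g(x_n)/g'(x_n) starting from x0."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / dg(x)
        x -= step
        if abs(step) < tol:
            break
    return x
```

Note the max_iter cap: it is the usual safeguard for exactly the non-convergence cases discussed above.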

## Starting point

There are several ways of choosing a starting point for these methods.

The first one is easy enough: just use the result of the last optimization.

The second one is a little more complex: approximate all the complicated functions (like sinh) by their tangents and solve the resulting polynomial.

The third one is derived from the second one and is called the pivotal/mystran method. Instead of using the tangent, we use the linear function that passes through the origin and the last point. The idea is that it can be more stable than the tangent method (consider doing this for the hyperbolic tangent: the tangent estimate could end up quite far off).
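The difference between the two linearizations can be sketched for a plain sinh (x0 here is an arbitrary placeholder for the last operating point):

```python
import math

x0 = 2.0  # hypothetical last operating point

def tangent_estimate(x):
    # Tangent line to sinh at x0: sinh(x0) + cosh(x0) (x - x0)
    return math.sinh(x0) + math.cosh(x0) * (x - x0)

def pivotal_estimate(x):
    # Line through the origin and the last point (x0, sinh(x0))
    return (x / x0) * math.sinh(x0)
```

Both agree with sinh at x0, but the pivotal line also passes through the origin, which keeps its slope bounded for saturating functions.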

# Conclusion

Of course, there are other numerical methods that I haven’t spoken about. Any of them can be tried and used. Please do so and report your results!

Let’s see how the ones I’ve shown behave in the next post.

## April 04, 2016

### Continuum Analytics news

#### Anaconda Powers TaxBrain to Transform Washington Through Transparency, Access and Collaboration

Posted Monday, April 4, 2016

Continuum Analytics and Open Source Policy Center Leverage the Power of Open Source to Build Vital Policy Forecasting Models with Anaconda

AUSTIN, TX—April 4, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading open source analytics platform powered by Python, today announced that Anaconda is powering the American Enterprise Institute’s (AEI) Open Source Policy Center (OSPC) TaxBrain initiative. TaxBrain is a web-based application that lets users simulate and study the effect of tax policy reforms using open source economic models. TaxBrain provides transparent analysis for policy makers and the public, ultimately creating a more democratic and scientific platform to analyze economic policy.

“OSPC’s mission is to empower the public to contribute to government policymaking through open source methods and technology, making policy analysis more transparent, trustworthy and collaborative,” said Matt Jensen, founder and managing director of the Open Source Policy Center at the American Enterprise Institute. “TaxBrain is OSPC’s first product, and with Anaconda it is already improving tax policy by making the policy analysis process more democratic and scientific. By leveraging the power of open source, we are able to provide policy makers, journalists and the general public with the information they need to impact and change policy for the better.”

TaxBrain is made possible by a community of economists, data scientists, software developers, and policy experts who are motivated by a shared belief that public policy should be guided by open scientific inquiry, rather than proprietary analysis. The community also believes that the analysis of public policy should be freely available to everyone, rather than just to a select group of those in power.

“The TaxBrain initiative is only the beginning of a much larger movement to use open source approaches in policy and government,” said Travis Oliphant, CEO and co-founder of Continuum Analytics. “TaxBrain is the perfect example of how Anaconda can inspire people to harness the power of data science to enable positive changes. With Anaconda, the OSPC is able to empower a growing community with the superpowers necessary to promote change and democratic policy reform.”

Anaconda has allowed TaxBrain to tap a vast network of outside contributors in the PyData community to accelerate the total number of open source economic models. Contributions from the PyData community come quickly and—because of Anaconda—get large performance gains from Numba, the Python compiler included in Anaconda, and are easy to integrate into TaxBrain. These contributions can be used in other applications and are hosted on the Anaconda Cloud (anaconda.org/ospc).

AEI is a nonprofit, nonpartisan public policy research organization that works to expand liberty, increase individual opportunity, and strengthen free enterprise.

Continuum Analytics is the creator and driving force behind Anaconda, the leading, modern open source analytics platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 2.25M downloads annually and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team––data scientists, developers, DevOps, data engineers and business analysts––to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

###

## April 02, 2016

### NeuralEnsemble

#### EU Human Brain Project Releases Platforms to the Public

"Geneva, 30 March 2016 — The Human Brain Project (HBP) is pleased to announce the release of initial versions of its six Information and Communications Technology (ICT) Platforms to users outside the Project. These Platforms are designed to help the scientific community to accelerate progress in neuroscience, medicine, and computing.

[...]

The six HBP Platforms are:
• The Neuroinformatics Platform: registration, search, analysis of neuroscience data.
• The Brain Simulation Platform: reconstruction and simulation of the brain.
• The High Performance Computing Platform: computing and storage facilities to run complex simulations and analyse large data sets.
• The Medical Informatics Platform: searching of real patient data to understand similarities and differences among brain diseases.
• The Neuromorphic Computing Platform: access to computer systems that emulate brain microcircuits and apply principles similar to the way the brain learns.
• The Neurorobotics Platform: testing of virtual models of the brain by connecting them to simulated robot bodies and environments.
All the Platforms can be accessed via the HBP Collaboratory, a web portal where users can also find guidelines, tutorials and information on training seminars. Please note that users will need to register to access the Platforms and that some of the Platform resources have capacity limits."

... More in the official press release here.

The HBP held an online release event on 30 March:

Prof. Felix Schürmann (EPFL-BBP, Geneva), Dr. Eilif Muller (EPFL-BBP, Geneva), and Prof. Idan Segev (HUJI, Jerusalem) present an overview of the mission, tools, capabilities and science of the EU Human Brain Project (HBP) Brain Simulation Platform:

A publicly accessible forum for the BSP is here:
https://forum.humanbrainproject.eu/c/bsp
and for community models
https://forum.humanbrainproject.eu/c/community-models
and for community models of hippocampus in particular
https://forum.humanbrainproject.eu/c/community-models/hippocampus

## April 01, 2016

### Enthought

#### Virtual Core: CT, Photo, and Well Log Co-visualization

Enthought is pleased to announce Virtual Core 1.8.  Virtual Core automates aspects of core description for geologists, drastically reducing the time and effort required for core description, and its unified visualization interface displays cleansed whole-core CT data alongside core photographs and well logs.  It provides tools for geoscientists to analyze core data and extract features from […]

#### Canopy Geoscience: Python-Based Analysis Environment for Geoscience Data

Today we officially release Canopy Geoscience 0.10.0, our Python-based analysis environment for geoscience data. Canopy Geoscience integrates data I/O, visualization, and programming, in an easy-to-use environment. Canopy Geoscience is tightly integrated with Enthought Canopy’s Python distribution, giving you access to hundreds of high-performance scientific libraries to extract information from your data. The Canopy Geoscience environment […]

## March 31, 2016

### Continuum Analytics news

#### Why Every CEO Needs To Understand Data Science

Posted Thursday, March 31, 2016

Tech culture has perpetuated the myth that data science is a sort of magic; something that only those with exceptional math skills, deep technical know-how and industry knowledge can understand or act on. While it’s true that math skills and technical knowledge are required to effectively extract insights from data, it’s far from magic. Given a little time and effort, anyone can become familiar with the basic concepts.

As a CEO, you don’t need to understand every technical detail, but it’s very important to have a good grasp of the entire process behind extracting useful insights from your data. Click on the full article below to read the five big-picture steps you must take to ensure you understand data science (and to ensure your company is gaining actionable insights throughout the process).

## March 30, 2016

ec2-machine:~$ ipython                           # Start IPython console

In [1]: from distributed import Executor, s3, progress

In [2]: e = Executor('127.0.0.1:8786')

In [3]: e
Out[3]: <Executor: scheduler=127.0.0.1:8786 workers=64 threads=64>

### Notebooks

Alternatively we set up a globally visible Jupyter notebook server:

localmachine:~$ dec2 dask-distributed address    # Get location of head node

localmachine:~$ dec2 ssh                         # SSH into head node
ec2-machine:~$ jupyter notebook --ip="*"         # Start Jupyter Server


Then navigate to http://XXX.XXX.XXX.XXX:8888 from your local browser. Note: this method is not secure; see the Jupyter docs for a better solution.

### Public datasets

We repeat the experiments from our earlier blogpost on NYCTaxi data.

df = s3.read_csv('dask-data/nyc-taxi/2015', lazy=False)
progress(df)

VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2015-01-15 19:05:39 2015-01-15 19:23:42 1 1.59 -73.993896 40.750111 1 N -73.974785 40.750618 1 12.0 1.0 0.5 3.25 0 0.3 17.05
1 1 2015-01-10 20:33:38 2015-01-10 20:53:28 1 3.30 -74.001648 40.724243 1 N -73.994415 40.759109 1 14.5 0.5 0.5 2.00 0 0.3 17.80
2 1 2015-01-10 20:33:38 2015-01-10 20:43:41 1 1.80 -73.963341 40.802788 1 N -73.951820 40.824413 2 9.5 0.5 0.5 0.00 0 0.3 10.80
3 1 2015-01-10 20:33:39 2015-01-10 20:35:31 1 0.50 -74.009087 40.713818 1 N -74.004326 40.719986 2 3.5 0.5 0.5 0.00 0 0.3 4.80
4 1 2015-01-10 20:33:39 2015-01-10 20:52:58 1 3.00 -73.971176 40.762428 1 N -74.004181 40.742653 2 15.0 0.5 0.5 0.00 0 0.3 16.30

## Acknowledgments

The dec2 startup script is largely the work of Daniel Rodriguez. Daniel usually works on Anaconda for cluster management, which is normally part of an Anaconda subscription but is also free for moderate use (4 nodes). That product does things similar to dec2, but much more maturely.

DEC2 was inspired by the excellent spark-ec2 setup script, which is how most Spark users, myself included, were first able to try out the library. The spark-ec2 script empowered many new users to try out a distributed system for the first time.

The S3 work was largely done by Hussain Sultan (Capital One) and Martin Durant (Continuum).

## What didn’t work

• Originally we automatically started an IPython console on the local machine and connected it to the remote scheduler. This felt slick, but was error prone due to mismatches between the user’s environment and the remote cluster’s environment.
• It’s tricky to replicate functionality that’s part of a proprietary product (Anaconda for cluster management). Fortunately, Continuum management has been quite supportive.
• There aren’t many people in the data science community who know Salt, the system that backs dec2. I expect maintenance to be a bit tricky moving forward, especially during periods when Daniel and other developers are occupied.

## March 15, 2016

### Mark Fenner

#### Householder Bidiagonalization

We’re going to make use of Householder reflections to turn a generic matrix into a bidiagonal matrix. I’ve laid a lot of foundation for these topics in the posts I just linked. Check them out! Onward: reducing a generic matrix to a bidiagonal matrix using Householder reflections. Why would we want to do this? Well, […]

## March 12, 2016

### Mark Fenner

#### Givens Rotations and the Case of the Blemished Bidiagonal Matrix

Last time, we looked at using Givens rotations to perform a QR factorization of a matrix. Today, we’re going to be a little more selective and use Givens rotations to walk values off the edge of a special class of matrices. Old friends: here are a few friends that we introduced last time. I updated the […]

## March 10, 2016

### William Stein

#### Open source is now ready to compete with Mathematica for use in the classroom

When I think about what makes SageMath different, one of the most fundamental things is that it was created by people who use it every day.  It was created by people doing research math, by people teaching math at universities, and by computer programmers and engineers using it for research.  It was created by people who really understand computational problems because we live them.  We understand the needs of math research, teaching courses, and managing an open source project that users can contribute to and customize to work for their own unique needs.

The tools we were using, like Mathematica, are clunky, very expensive, and just don't do everything we need.  And worst of all, they are closed source software, meaning that you can't even see how they work, and can't modify them to do what you really need.  For teaching math, professors get bogged down scheduling computer labs and arranging for their students to buy and install expensive software.

So I started SageMath as an open source project at Harvard in 2004, to solve the problem that other math software is expensive, closed source, and limited in functionality, and to create a powerful tool for the students in my classes.  It wasn't a project that was intended initially as something to be used by hundreds of thousands of people.  But as I got into the project and as more professors and students started contributing to the project, I could clearly see that these weren't just problems that pissed me off, they were problems that made everyone angry.

The scope of SageMath rapidly expanded.  Our mission evolved to create a free, open source, serious competitor to Mathematica and similar closed software, on which the mathematics community was collectively spending hundreds of millions of dollars every year.  After a decade of work by over 500 contributors, we made huge progress.

But installing SageMath was more difficult than ever.  It was at that point that I decided I needed to do something so that this groundbreaking software that people desperately needed could be shared with the world.

So I created SageMathCloud, an extremely powerful, web-based, collaborative way for people to easily use SageMath and other open source software such as LaTeX, R, and Jupyter notebooks in their teaching and research.  I created SageMathCloud based on nearly two decades of experience using math software in the classroom and online, at Harvard, UC San Diego, and University of Washington.

SageMathCloud is commercial grade, hosted in Google's cloud, and very large classes are using it heavily right now.  It solves the installation problem by avoiding it altogether.  It is entirely open source.

Open source is now ready to directly compete with Mathematica for use in the classroom.  They told us we could never make something good enough for mass adoption, but we have made something even better.  For the first time, we're making it possible for you to easily use Python and R in your teaching instead of Mathematica; these are industry standard mainstream open source programming languages with strong support from Google, Microsoft and other industry leaders.   For the first time, we're making it possible for you to collaborate in real time and manage your course online using the same cutting edge software used by elite mathematicians at the best universities in the world.

A huge community in academia and in industry are all working together to make open source math software better at a breathtaking pace, and the traditional closed development model just can't keep up.

## March 08, 2016

### Matthieu Brucher

#### Announcement: Audio TK 1.1.0

This is mainly a bug fix release. A nasty bug, triggered when increasing processing sizes, would corrupt the input data and thus change the results. It is advised to upgrade to this release as soon as possible.

Changelog:

1.1.0
* Fix a really nasty bug when changing processing sizes
* Implemented a basic AllPassFilter (algorithmic reverb)

## March 07, 2016

### Continuum Analytics news

#### Continuum Analytics Launches Anaconda Skills Accelerator Program to Turn Professionals Into Data Scientists ASAP

Posted Monday, March 7, 2016

Professional Development Residency Program Propels Scientists, Physicists, Mathematicians and Engineers into Data Science Careers

AUSTIN, TX—March 7, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading modern open source analytics platform powered by Python, today launched its Anaconda Skills Accelerator Program (ASAP). The new professional development residency program, led by a team of world class experts, is highly selective and brings together scientists, physicists, mathematicians and engineers to tackle challenging real world projects. Participants will use advanced technology and pragmatic techniques across a broad set of technical areas, including data management, analytic modeling, visualization, high performance computing on multi-core CPUs and GPUs, vectorization and modern distributed computing techniques such as Spark and Dask.

“One of the greatest challenges affecting our industry is the disconnect between the data science needs and expectations of enterprises and the current skills of today’s scientists, mathematicians and engineers,” said Travis Oliphant, CEO and co-founder of Continuum Analytics. “Fostering true data science experts requires more than traditional training programs. Continuum Analytics’ ASAP empowers professionals with the necessary technology and hands on experience to tackle big data challenges and solve the data problems that are inside every organization today.”

Participants emerge from this residency program as Certified Anaconda Professionals (CAP), enabling them to manage challenging data science and data engineering problems in business, science, engineering, data analytics, data integration and machine learning.

Advantages to CAP Designation

• Consulting and placement opportunities with Continuum Analytics clients

• Membership to the CAP Alumni Association

• Individual subscription to premium Anaconda packages usually only available as part of Anaconda Workgroup or Anaconda Enterprise

• Complimentary one year subscription to priority Anaconda support (only available for companies who put their teams of three or more through the program)

• A 50 percent discounted pass to the annual Anaconda conference

For additional information about Continuum Analytics’ ASAP, qualifications and process, please visit: http://continuum.io/asap.

Continuum Analytics is the creator and driving force behind Anaconda, the leading, modern open source analytics platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 2.25M downloads annually and growing, Anaconda is trusted by the world’s leading businesses across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team – data scientists, developers, DevOps, data engineers, and business analysts – to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services. Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

###

## BluePyOpt

The BlueBrain Python Optimisation Library (BluePyOpt) is an extensible framework for data-driven model parameter optimisation that wraps and standardises several existing open-source tools. It simplifies the task of creating and sharing these optimisations, and the associated techniques and knowledge. This is achieved by abstracting the optimisation and evaluation tasks into various reusable and flexible discrete elements according to established best-practices. Further, BluePyOpt provides methods for setting up both small- and large-scale optimisations on a variety of platforms, ranging from laptops to Linux clusters and cloud-based compute infrastructures.

The code is available here:
https://github.com/BlueBrain/BluePyOpt
A preprint to the paper is available here:
http://arxiv.org/abs/1603.00500

## March 04, 2016

### Continuum Analytics news

#### Open Data Science Is the David Beckham of Your Data Team

Posted Friday, March 4, 2016


If data scientists were professional athletes, they’d be ditching their swim caps and donning cleats right about now.

Traditional data science teams function like swim teams. Although all members strive for one goal — winning the meet — each works individually, concentrating on his or her isolated lane or heat. As each swimmer finishes a heat, the team’s score is tallied. If enough team members hit their individual goals, then the team succeeds. But team members’ efforts are isolated and desynchronized: some might post incredible times, while others lag behind in their heats.

Today, the industry is moving toward open data science (ODS), which enables data scientists to function more like a soccer team: Team members aren’t restricted to their own lanes but are instead free to move about the field. Even goalies can score in soccer, and ODS developers are similarly encouraged to contribute wherever their skills intersect with development challenges. No team members are relegated to “second-string” or niche roles: In a given project, developers might contribute to model building, while domain experts like quants offer insights about code structure or visualization. As with soccer, the pace is fast, and the process is engaging and fun.

### Why ODS?

Data is the indisputable king of the modern economy, but too many data scientists still function like swimmers, each working with his own set of tools to manage heaps of data. To work as a true team, data scientists (and their tools) must function together.

ODS is the revolution rising to meet this challenge. Instead of forcing data scientists to settle on a single language and proprietary toolset, open data science champions inclusion. Just as ODS encourages data scientists to function as one, it also recognizes the potential of open source tools — for data, analytics and computation — to form a connected, collaborative ecosystem.

Developers, analytics and domain experts adhering to ODS principles gain these key advantages over traditional and proprietary approaches:

• Availability: Open source ecosystems offer many packages and approaches for solving problems, making it faster to build solutions.
• Innovation: The lack of vendor lock-in means the developer community is continuously updating tools, which is ideal for collaborative workflows and quick-to-fail projects.
• Interoperability: Rather than pigeonholing developers into a single language or set of tools, ODS embraces a rich ecosystem of compatible open source tools.
• Transparency: Open data science tools facilitate transparency between teams, emphasizing innovations like the Jupyter/IPython notebook that allows developers to easily share code and visualizations.

### ODS Is Making Analytics More Precise

Over the past few years — largely thanks to the growth of ODS — the pace of innovation has quickened in the analytics field. Now, researchers and academics immediately release their algorithms to the open source community, which has empowered data scientists using these tools to become ever more granular in their analyses.

Because open data science has provided more ways to build, deploy and consume analytics, businesses can know their customers more personally than ever before.

Previously, a sales analysis might have yielded insights like “Women over 50 living in middle-class households are most likely to buy this product.” Now, these analyses are shedding ever-greater light on buyer personas. Today, an analyst might tell marketers, “Stay-at-home moms in suburban areas who tend to shop online between 3 p.m. and 6 p.m. are more likely to buy the premium product than the entry-level one.”

Notice the difference? The more precise the analysis is, the greater its use to business professionals. Information about where, how and when buyers purchase a product is invaluable to marketers.

ODS has also broadened data scientists’ options for visualization. Let’s say a telecommunications company needs to illustrate rotational churn in a way that shows which customers are likely to leave over a period of time. Perhaps the company is using Salesforce, but Salesforce doesn’t offer the rotational churn analysis the business is looking for. If the business’ data scientist is using ODS tools, then that data scientist can create a model for the rotational churn and embed it into Salesforce. The analysis can include rich, contextual visualization that clearly illustrates customers’ rotational churn patterns.

### Data Science Isn’t One-Size-Fits-All

Just like there’s no one right way to run a business or market a product, there’s no one right way to do data science.

For instance, one data scientist might employ a visual component framework, while another accomplishes the same task with a command line interface or integrated language development environment. Business analysts and data engineers in the data science team may prefer spreadsheets and visual exploration.

Thanks to this flexibility, ODS has become the foundation of modern predictive analytics and one platform has emerged to manage its tools successfully: Anaconda.

Anaconda is the leading open source analytics ecosystem and full-stack ODS platform for Python, R, Scala and many other data science languages. It adeptly manages packages, dependencies and environments for multiple languages.

Data science teams using Anaconda can share models and results through Jupyter/IPython notebooks and can utilize an enormous trove of libraries for analytics and visualization. Furthermore, Anaconda supercharges Python to deliver high performance analytics for data science teams that want to scale their prototypes up and out without worrying about incurring delays when moving to production.

With Anaconda, data science teams decrease project times by better managing and illustrating data sets and offering clearer insights to business professionals. Get your data science team out of the pool by making the switch to ODS and Anaconda. Before you know it, your data scientists will be scoring goals and on their way to the World Cup.

If you are interested in learning more about this topic and are attending the Gartner BI & Analytics Summit in Grapevine, TX, join me in our session on Why Open Data Science Matters, Tuesday, March 15th from 10:45-11:30am CT.

## March 03, 2016

### Mark Fenner

#### Givens Rotations and QR

Today I want to talk about Givens rotations. Givens rotations are a generalization of the rotation matrix you might remember from high school trig class. Instead of rotating in the plane of a 2D matrix, we can rotate in any coordinate plane of a higher-dimensional matrix. We’ll use these rotations to selectively place zeros in […]
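As a preview, here is a minimal NumPy sketch (an illustration, not code from the full post) of the 2-D building block: choosing c and s so that the rotation annihilates the second component of a vector.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] == [r, 0]."""
    r = np.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r

a, b = 3.0, 4.0
c, s = givens(a, b)
G = np.array([[c, s], [-s, c]])
v = G @ np.array([a, b])
# v is now [5, 0]: the rotation has selectively placed a zero
```

In an n-dimensional problem the same 2x2 block is embedded in an identity matrix at rows and columns (i, j), which is how a QR factorization can zero out subdiagonal entries one at a time.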

## March 02, 2016

### Continuum Analytics news

#### Introducing Constructor 1.0

Posted Wednesday, March 2, 2016

I am excited to announce version 1.0 of the constructor project. Constructor combines a collection of conda packages into a standalone installer similar to the Anaconda installer. Using constructor you can create installers for all the platforms conda supports, namely Linux, Mac OS X and Windows.

• The Linux and Mac installers are self extracting bash scripts. They consist of a bash header, and a tar archive. Besides providing some help options, and an install dialog, the bash header extracts the tar archive, which in turn consists of conda packages.
• The Windows installer is an executable (.exe) that, when run, opens a dialog box walking the user through the install process and installing the conda packages contained within the executable. This is achieved by relying on NSIS (the Nullsoft Scriptable Install System) under the covers.

This is the first public release of this project, which was previously proprietary and known as "cas-installer".

As constructor relies on conda, it needs to be installed into the root conda environment:

conda install -n root constructor


Constructor builds an installer for the current platform by default, and can also build an installer for other platforms, although Windows installers must be created on Windows and all Unix installers must be created on some Unix platform.

The constructor command takes an installer specification directory as its argument. This directory needs to contain a file construct.yaml, which specifies information like the name and version of the installer, the conda channels to pull packages from, the conda packages included in the installer, and so on. The complete list of keys in this file can be found in the documentation. The directory may also contain some optional files such as a license file and image files for the Windows installer.
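For illustration, a minimal construct.yaml might look like the sketch below (the key names follow the constructor documentation; consult it for the authoritative list and syntax):

```yaml
name: MyApp                 # name of the installer
version: "1.0"
channels:                   # conda channels to pull packages from
  - http://repo.continuum.io/pkgs/free/
specs:                      # conda packages included in the installer
  - python 3.5*
  - conda
  - numpy
license_file: LICENSE.txt   # optional: shown during installation
```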

We created a documented example to demonstrate how to build installers for Linux, OS X and Windows that are similar to Anaconda installers, but significantly smaller.

Have fun!

## February 28, 2016

### Titus Brown

#### An e-mail from Seven Bridges Genomics re patents and patent applications

Preface from Titus: this is an e-mail written by Deniz Kural of Seven Bridges Genomics in response to concerns and accusations about their patents and patent applications on genomics workflow engines and graph genome analysis techniques. It was sent to a closed list initially, and I asked Deniz if we could publicize it; he agreed. Note that I removed the messages to which he was replying, given the closed nature of the original list, and Deniz lightly edited the e-mail to remove some personal details and fix discontinuity.

Here are some references to the Twitter conversation: Variant reference graphs patented!?, looks like Seven Bridges is trying to patent Galaxy or Taverna?

Please remember to abide by the commenting policy on this blog if you decide to post a comment here.

Dear All,

Firstly, thank you for the considered discussion and civility on this thread. In this email, I've tried to cover why SBG has a patent practice & next steps with GA4GH; and what the patents cover / prior art concerns.

Purpose of SBGs patents:

Developed economies have a complicated relationship to patents; as one of the few instruments underpinning the wealth these knowledge economies create. However, science requires unfettered freedom to build upon previous ideas, and open communication. So what to do?

Seven Bridges was founded in 2009, with our name a nod to graph theory (Euler). We've avoided patents until 2013, and having been made aware of various industrial and academic institutions pursuing bioinformatics and cloud patents, have decided to pursue our own patents as an active, operating company in a field with some publicly traded entities aggressively prosecuting their IP claims, as a protective measure. The patents listed in there have been submitted before the existence of the GA4GH mailing list. Some of the government funding we receive compels us to obtain IP and measures our success on this criteria. We would welcome patent reform, and would rather not spend our funds on patents. We are significantly larger than a lab, but tiny compared to our established software/biotech competitors, and ultimately can't match their legal funds. We'd also love to see the demise of patent trolls. Thus, it is not our intention to limit scientific discussion / publication.

As an example: We've released the source code of our pipeline editor (unrelated to graph genomes) more than a year ago under GPL, and last month decided to also start using an Apache license to make the patent rights of the user more explicit. We'd like to preserve an open space for common formats and standards - common workflow language, and graph genome exchange formats (more on this below). Open source / Apache is a start, but not a universal panacea.

Many of the suggestions on this thread regarding patents and licenses would be a positive step forward, and we would welcome a clarification of GA4GH policies, including overall governance. Although I had the opportunity to discuss our patents & answer some questions on previous phone calls, I also wish that we've had a wider conversation sooner.

What's patented re: Graph Genomes / SBG patent process + prior art

We have an obligation to submit prior art, and precisely for the reasons outlined in this thread. Likewise, it does us no good to obtain and maintain patents that can't realistically hold when challenged. I've tried to explain the content of our patents and a way to incorporate community-submitted prior art below.

None of our patent applications are on "genome graphs" per se - i.e. representing variants or sequence data as a graph genome -- we truly believe that having a common, open representation or specification of graph genomes is needed. This is also the basis of our interest and participation in this group. Indeed, in my first presentation to this group about two summers ago, I've presented an edge graph representation, and a coordinate system that goes along with it (which counts as a public disclosure) thus taking it off from what can be patented. SBG's tools must accept and write out formats commonly used and adopted by the community to gain traction and adoption & use open and free APIs for input and output.

We've pursued patents for improvements that go into our toolchain (aligners, variant callers, etc.) - a good portion while it may not be obvious at first, are tied to producing efficient / fast implementations, having a practical benefit, and another group of patents relate to application patents around not having restrictions to apply graph genomes to various specific domains. Please see below re: on prior art for a specific application or concern.

It presents an interesting challenge on how external counsel could become a graph genome expert and write patents on this area intelligently. We've realized after a full cycle of this, that external counsel often produces lower quality applications due to lack of context and limited time investment, resulting a lot of resources spent downstream. Thus we've been pursuing measures to improve the quality and substance of our process, including hiring in-house last year.

The way our patent process (essentially a chore) now works as follows: Our attorneys sit on our weekly R&D meetings, and then follow-up if anything novel is presented. Currently we have about 45 PhDs and 100 engineers, with 20+ R&D members working on graph-related issues alone, and the IP activity is thus comparable to institutions of similar size. We try not to spend engineering time on detailed patent work -- time better spent to build tools and products. Our scientists are not trained in patent novelty vs scientific novelty, and thus tend to be conservative with disclosures.

Thus, our researchers write a short disclosure to our attorney, who then writes the claims (as a lawyer would), and goes back-and-forth with external entities until a specific set of claims are finalized / approved. For reasons beyond my understanding, the patent system seems to encourage starting with a broad set of claims and working with the patent examiner to make them more specific to exactly what the engineer built.

Our attorneys are obliged to pay attention to prior art. We circulate papers - including pioneering graph genome work, but also from highly related fields of genome assembly and transcriptomics. It's worth noting that many of our submissions are in "application" stage, and indeed may be thrown out. The claims of a final submission, after the review steps, often look very different as outlined above. We also have the USPTO review and point out previous work. That said, more prior art is better, and welcome.

Thus, if a person trained in reading patent claims (or anyone really), feels like we've overlooked work from 2013 (or any of the submission dates) & before, we'd like to hear from you on this issue. Even better to name which claims and applications it relates to, and which sections of the paper. We'll happily submit the evidence to USPTO and remove or revise those claims. Please keep in mind sometimes we need to wait for the USPTO to throw it out depending on which stage we're at. Please do not hesitate to write to ip@sbgenomics.com which goes directly to our in house counsel.

Likewise, we'd like to have an open channel of communication in general, so if you have any other questions or concerns please email on related issues.

Best, Deniz

(Titus sez: Deniz is on Twitter at @denizkural.)

## February 26, 2016

### NeuralEnsemble

#### Stop plotting your data -- HoloViews 1.4 released!

We are pleased to announce the fifth public release of HoloViews, a Python package for exploring and visualizing scientific data:

http://holoviews.org

HoloViews provides composable, sliceable, declarative data structures for building even complex visualizations easily.  Instead of you having to explicitly and laboriously plot your data, HoloViews lets you simply annotate your data so that any part of it visualizes itself automatically.  You can now work with large datasets as easily as you work with simple datatypes at the Python prompt.

The new version can be installed using conda:

conda install -c ioam holoviews

Release 1.4 introduces major new features, incorporating over 1700 new commits and closing 142 issues:

- Now supports both Bokeh (bokeh.pydata.org) and matplotlib backends, with Bokeh providing extensive interactive features such as panning and zooming linked axes, and customizable callbacks

- DynamicMap: Allows exploring live streams from ongoing data collection or simulation, or parameter spaces too large to fit into your computer's or your browser's memory, from within a Jupyter notebook

- Columnar data support: Underlying data storage can now be in Pandas dataframes, NumPy arrays, or Python dictionaries, allowing you to define HoloViews objects without copying or reformatting your data

- New Element types: Area (area under or between curves), Spikes (sequence of lines, e.g. spectra, neural spikes, or rug plots), BoxWhisker (summary of a distribution), QuadMesh (nonuniform rasters), Trisurface (Delaunay-triangulated surface plots)

- New Container type: GridMatrix (grid of heterogeneous Elements)

- Improved layout handling, with better support for varying aspect ratios and plot sizes

- Improved help system, including recursively listing and searching the help for all the components of a composite object

- Improved Jupyter/IPython notebook support, including improved export using nbconvert, and standalone HTML output that supports dynamic widgets even without a Python server

- Significant performance improvements for large or highly nested data

And of course we have fixed a number of bugs found by our very dedicated users; please keep filing Github issues if you find any!

For the full list of changes, see:

https://github.com/ioam/holoviews/releases

HoloViews is now supported by Continuum Analytics, and is being used in a wide range of scientific and industrial projects.  HoloViews remains freely available under a BSD license, is Python 2 and 3 compatible, and has minimal external dependencies, making it easy to integrate into your workflow. Try out the extensive tutorials at holoviews.org today!

Jean-Luc R. Stevens
Philipp Rudiger
James A. Bednar

Continuum Analytics, Inc., Austin, TX, USA
School of Informatics, The University of Edinburgh, UK

### Matthew Rocklin

#### Distributed Dask Arrays

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

In this post we analyze weather data across a cluster using NumPy in parallel with dask.array. We focus on the following:

1. How to set up the distributed scheduler with a job scheduler like Sun GridEngine.
2. How to load NetCDF data from a network file system (NFS) into distributed RAM
3. How to manipulate data with dask.arrays
4. How to interact with distributed data using IPython widgets

This blogpost has an accompanying screencast which might be a bit more fun than this text version.

This is the third in a sequence of blogposts about dask.distributed:

1. Dask Bags on GitHub Data
2. Dask DataFrames on HDFS
3. Dask Arrays on NetCDF data

## Setup

We wanted to emulate the typical academic cluster setup using a job scheduler like SunGridEngine (similar to SLURM, Torque, PBS scripts and other technologies), a shared network file system, and typical binary stored arrays in NetCDF files (similar to HDF5).

To this end we used Starcluster, a quick way to set up such a cluster on EC2 with SGE and NFS, and we downloaded data from the European Centre for Medium-Range Weather Forecasts (ECMWF).

To deploy dask’s distributed scheduler with SGE we made a scheduler on the master node:

sgeadmin@master:~$ dscheduler
distributed.scheduler - INFO - Start Scheduler at: 172.31.7.88:8786

And then used the qsub command to start four dask workers, pointing to the scheduler address:

sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 1 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 2 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 3 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 4 ("dworker") has been submitted

After a few seconds these workers start on various nodes in the cluster and connect to the scheduler.

## Load sample data on a single machine

On the shared NFS drive we’ve downloaded several NetCDF3 files, each holding the global temperature every six hours for a single day:

>>> from glob import glob
>>> filenames = sorted(glob('*.nc3'))
>>> filenames[:5]
['2014-01-01.nc3', '2014-01-02.nc3', '2014-01-03.nc3',
 '2014-01-04.nc3', '2014-01-05.nc3']

We use conda to install the netCDF4 library and make a small function to read the t2m variable for “temperature at two meters elevation” from a single filename:

$ conda install netcdf4

import netCDF4

def load_temperature(fn):
    with netCDF4.Dataset(fn) as f:
        return f.variables['t2m'][:]


This converts a single file into a single numpy array in memory. We could call this on an individual file locally as follows:

>>> load_temperature(filenames[0])
array([[[ 253.96238624,  253.96238624,  253.96238624, ...,  253.96238624,
          253.96238624,  253.96238624],
        [ 252.80590921,  252.81070124,  252.81389593, ...,  252.79792249,
          252.80111718,  252.80271452],
        ...

>>> load_temperature(filenames[0]).shape
(4, 721, 1440)


Our dataset has dimensions of (time, latitude, longitude). Note above that each day has four time entries (measurements every six hours).

The NFS set up by Starcluster is unfortunately quite small. We were only able to fit around five months of data (136 days) on the shared disk.

## Load data across cluster

We want to call the load_temperature function on our list filenames on each of our four workers. We connect a dask Executor to our scheduler address and then map our function on our filenames:

>>> from distributed import Executor, progress
>>> e = Executor('172.31.7.88:8786')
>>> e
<Executor: scheduler=172.31.7.88:8786 workers=4 threads=32>

>>> futures = e.map(load_temperature, filenames)
>>> progress(futures)


After this completes we have several numpy arrays scattered about the memory of each of our four workers.

## Coordinate with dask.array

We coordinate these many numpy arrays into a single logical dask array as follows:

>>> from distributed.collections import futures_to_dask_arrays
>>> xs = futures_to_dask_arrays(futures)  # many small dask arrays

>>> import dask.array as da
>>> x = da.concatenate(xs, axis=0)        # one large dask array, joined by time
>>> x
dask.array<concate..., shape=(544, 721, 1440), dtype=float64, chunksize=(4, 721, 1440)>


This single logical dask array comprises 136 numpy arrays spread across our cluster. Operations on the single dask array will trigger many operations on each of our numpy arrays.
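To make the chunking concrete, here is a plain-NumPy sketch (a toy stand-in, not the distributed code) of how many small per-day chunks combine into one logical array, and how a reduction like the mean only ever needs small per-chunk results:

```python
import numpy as np

# Ten stand-in "files"; real chunks are (4, 721, 1440), tiny here.
chunks = [np.random.rand(4, 3, 5) for _ in range(10)]

# da.concatenate(xs, axis=0) plays the same role as this:
x_local = np.concatenate(chunks, axis=0)   # shape (40, 3, 5)

# A reduction like mean() never needs the whole array in one place:
# each worker reduces its own chunk to a small (sum, count) pair and
# the scheduler combines them.
total = sum(c.sum() for c in chunks)
count = sum(c.size for c in chunks)
mean = total / count                       # equals x_local.mean()
```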

## Interact with Distributed Data

We can now interact with our dataset using standard NumPy syntax and other PyData libraries. Below we pull out a single time slice and render it to the screen with matplotlib.

from matplotlib import pyplot as plt
plt.imshow(x[100, :, :].compute(), cmap='viridis')
plt.colorbar()


In the screencast version of this post we hook this up to an IPython slider widget and scroll around time, which is fun.

## Speed

We benchmark a few representative operations to look at the strengths and weaknesses of the distributed system.

### Single element

This single element computation accesses a single number from a single NumPy array of our dataset. It is bound by a network roundtrip from client to scheduler, to worker, and back.

>>> %time x[0, 0, 0].compute()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 9.72 ms


### Single time slice

This time slice computation pulls around 8 MB from a single NumPy array on a single worker. It is likely bound by network bandwidth.

>>> %time x[0].compute()
CPU times: user 24 ms, sys: 24 ms, total: 48 ms
Wall time: 274 ms


### Mean computation

This mean computation touches every number in every NumPy array across all of our workers. Computing means is quite fast, so this is likely bound by scheduler overhead.

>>> %time x.mean().compute()
CPU times: user 88 ms, sys: 0 ns, total: 88 ms
Wall time: 422 ms


## Interactive Widgets

To make these times feel more visceral we hook up these computations to IPython Widgets.

This first example looks fairly fluid. This only touches a single worker and returns a small result. It is cheap because it indexes in a way that is well aligned with how our NumPy arrays are split up by time.

@interact(time=[0, x.shape[0] - 1])
def f(time):
    return x[time, :, :].mean().compute()


This second example is less fluid because we index across our NumPy chunks. Each computation touches all of our data. It’s still not bad though and quite acceptable by today’s standards of interactive distributed data science.

@interact(lat=[0, x.shape[1] - 1])
def f(lat):
    return x[:, lat, :].mean().compute()


## Normalize Data

Until now we’ve only performed simple calculations on our data, usually grabbing out means. The image of the temperature above looks unsurprising. The image is dominated by the facts that land is warmer than oceans and that the equator is warmer than the poles. No surprises there.

To make things more interesting we subtract off the mean and divide by the standard deviation over time. This will tell us how unexpectedly hot or cold a particular point was, relative to all measurements of that point over time. This gives us something like a geo-located Z-Score.
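The same normalization can be checked on a small in-memory array with plain NumPy (a standalone sketch with made-up data):

```python
import numpy as np

# Made-up (time, lat, lon) data standing in for the temperature array.
arr = np.random.rand(100, 7, 11)

# Per-location z-score: subtract the time-mean and divide by the
# time-standard-deviation at every (lat, lon) grid point.
zscore = (arr - arr.mean(axis=0)) / arr.std(axis=0)
# Each grid point now has mean ~0 and standard deviation ~1 over time.
```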

z = (x - x.mean(axis=0)) / x.std(axis=0)
z = e.persist(z)
progress(z)


plt.imshow(z[100].compute(), cmap='RdBu_r')  # one time slice
plt.colorbar()


We can now see much more fine structure of the currents of the day. In the screencast version we hook this dataset up to a slider as well and inspect various times.

I’ve avoided displaying GIFs of full images changing in this post to keep the size down, however we can easily render a plot of average temperature by latitude changing over time here:

import numpy as np
xrange = 90 - np.arange(z.shape[1]) / 4

@interact(time=[0, z.shape[0] - 1])
def f(time):
    plt.figure(figsize=(10, 4))
    plt.plot(xrange, z[time].mean(axis=1).compute())
    plt.ylabel("Normalized Temperature")
    plt.xlabel("Latitude (degrees)")


## Conclusion

We showed how to use distributed dask.arrays on a typical academic cluster. I’ve had several conversations with different groups about this topic; it seems to be a common case. I hope that the instructions at the beginning of this post prove to be helpful to others.

It is really satisfying to me to couple interactive widgets with data on a cluster in an intuitive way. This sort of fluid interaction on larger datasets is a core problem in modern data science.

## What didn’t work

As always I’ll include a section like this on what didn’t work well or what I would have done with more time:

• No high-level read_netcdf function: We had to use the mid-level API of executor.map to construct our dask array. This is a bit of a pain for novice users. We should probably adapt existing high-level functions in dask.array to robustly handle the distributed data case.
• Need a larger problem: Our dataset could have fit into a Macbook Pro. A larger dataset that could not have been efficiently investigated from a single machine would have really cemented the need for this technology.
• Easier deployment: The solution above with qsub was straightforward but not always accessible to novice users. Additionally while SGE is common there are several other systems that are just as common. We need to think of nice ways to automate this for the user.
• XArray integration: Many people use dask.array on single machines through XArray, an excellent library for the analysis of labeled nd-arrays especially common in climate science. It would be good to integrate this new distributed work into the XArray project. I suspect that doing this mostly involves handling the data ingest problem described above.
• Reduction speed: The computation of normalized temperature, z, took a surprisingly long time. I’d like to look into what is holding up that computation.
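For concreteness, the executor.map construction mentioned in the first bullet can be sketched locally; here concurrent.futures and a made-up load_chunk function stand in for the distributed executor and the real NetCDF reader (both stand-ins are hypothetical, not the post's actual code):

```python
# Each "task" loads one chunk of the array; the chunks are then stacked
# into one logical (time, lat, lon) array, mimicking the mid-level
# construction described above.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def load_chunk(i):
    # Placeholder for reading one time-slab of data from a NetCDF file.
    return np.full((4, 8), float(i))

with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(load_chunk, range(10)))

z = np.stack(chunks)
print(z.shape)      # (10, 4, 8)
print(z[3].mean())  # 3.0
```

A hypothetical high-level read_netcdf function would hide exactly this map-then-assemble step from the user.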

## February 25, 2016

### Stefan van der Walt

#### Python & Matplotlib on OSX

One day, we will hopefully have a grand unified build and package management system for Python where everything is free & open and Just Works (TM). Until then, you have two options:

## brew Python + pip

1. brew install python3
2. pyvenv ~/envs/py3
3. source ~/envs/py3/bin/activate
4. pip install matplotlib

Pros/cons:

• Pip is the standard Python package management tool, and uses the official Python Package Index (PyPI) repository.
• Wheels on PyPI are built by authors themselves from open recipes.
• Binary wheels may not be available for all packages.
• Pip is not the best of package management tools.

## Conda

1. Install Miniconda
2. conda create -n py3 python=3.5 matplotlib
3. source activate py3

Pros/cons:

• Conda is a great package management tool.
• Conda environments are well tested; almost everything works out of the box. This includes fast linear algebra using MKL.
• Some of the conda build recipes are closed and binary wheels may not be available for all packages1.
• Conda and pip do not always play well together.
• Conda packages are supported almost exclusively by a single company.

1. Some members of the community maintain their own channels, but there are still some issues to be aware of when mixing those channels and the official ones.

### William Stein

#### "If you were new faculty, would you start something like SageMathCloud sooner?"

I was recently asked by a young academic: "If you were a new faculty member again, would you start something like SageMathCloud sooner or simply leave for industry?" The academic goes on to say "I am increasingly frustrated by continual evidence that it is more valuable to publish a litany of computational papers with no source code than to do the thankless task of developing a niche open source library; deep mathematical software is not appreciated by either mathematicians or the public."

I wanted to answer that "things have gotten better" since back in 2000 when I started as an academic who does computation. Unfortunately, I think they have gotten worse. I do not understand why. In fact, this evening I just received the most recent in a long string of rejections by the NSF.

Regarding a company versus taking a job in industry, for me personally there is no point in starting a company unless you have a goal that can only be accomplished via a company, since building a business from scratch is extremely hard and has little to do with math or research. I do have such a goal: "create a viable open source alternative to Mathematica, etc...". I was very clearly told by Michael Monagan (co-founder of Maplesoft) in 2006 that this goal could not be accomplished in academia, and I spent the last 10 years trying to prove him wrong.

On the other hand, leaving for a job in industry means that your focus will switch from "pure" research to solving concrete problems that make products better for customers. That said, many of the mathematicians who work on open source math software do so because they care so much about making the experience of using math software much better for the math community. What often drives Sage developers is exactly the sort of passionate care for "consumer focus" and products that also makes one successful in industry. I'm sure you know exactly what I mean, since it probably partly motivates your work.

It is sad that the math community turns its back on such people. If the community were to systematically embrace them, instead of losing all these $300K+/year engineers from mathematics entirely -- which is exactly what we do constantly -- the experience of doing mathematics could be massively improved into the future. But that is not what the community has chosen to do. We are shooting ourselves in the foot.

Now that I have seen how academia works from the inside over 15 years, I'm starting to understand a little why these things change very slowly, if ever. In the mathematics department I'm at, there are a small handful of research areas in pure math, and due to how hiring works (voting system, culture, etc.) we have spent the last 10 years hiring in those areas little by little (to replace people who die/retire/leave). I imagine most mathematics departments are very similar. "Open source software" is not one of those traditional areas. Nobody will win a Fields Medal in it.

Overall, the mathematical community does not value open source mathematical software in proportion to its real worth, and doesn't understand its importance to mathematical research and education. I would like to say that things have got a lot better over the last decade, but I don't think they have.
My personal experience is that much of the "next generation" of mathematicians who would have changed how the math community approaches open source software are now in industry, or soon will be, and hence they have no impact on academic mathematical culture. Every one of my Ph.D. students is now at Google/Facebook/etc.

We as a community overall would be better off if, when considering how we build departments, we put "mathematical software writers" on an equal footing with "algebraic geometers". We should systematically consider quality open source software contributions on a potentially equal footing with publications in journals.

To answer the original question: YES, knowing what I know now, I really wish I had started something like SageMathCloud sooner. In fact, here's the previously private discussion from eight years ago when I almost did.

There is a community generated followup ...

## February 24, 2016

### William Stein

#### Elliptic curves: Magma versus Sage

### Elliptic Curves

Elliptic curves are certain types of nonsingular plane cubic curves, e.g., y^2 = x^3 + ax + b, which are central to both number theory and cryptography (e.g., they are used for the digital signatures in Bitcoin).

### Magma and Sage

If you want to do a wide range of explicit computations with elliptic curves, for research purposes, you will very likely use SageMath or Magma. If you're really serious, you'll use both. Both Sage and Magma are far ahead of all other software (e.g., Mathematica, Maple and Matlab) for elliptic curves.

### A Little History

When I started contributing to Magma in 1999, I remember that Magma was way, way behind Pari. I remember having lunch with John Cannon (founder of Magma), and telling him I would no longer have to use Pari if only Magma would have dramatically faster code for computing point counts on elliptic curves. A few years later, John wisely hired Mark Watkins to work full-time on Magma, and Mark has been working there for over a decade.
Mark is definitely one of the top people in the world at implementing (and using) computational number theory algorithms, and he's ensured that Magma can do a lot. Some of that "do a lot" means catching up with (and surpassing!) what was in Pari and Sage for a long time (e.g., point counting, p-adic L-functions, etc.) However, in addition, many people have visited Sydney and added extremely deep functionality for doing higher descents to Magma, which is not available in any open source software. Search for Magma in this paper to see how, even today, there seems to be no open source way to compute the rank of the curve y^2 = x^3 + 169304x + 25788938. (The rank is 0.)

### Two Codebases

There are several elliptic curve algorithms available only in Magma (e.g., higher descents) ... and some available only in Sage (L-function rank bounds, some overconvergent modular symbols, zeros of L-functions, images of Galois representations). I could be wrong about functionality not being in Magma, since almost anything can get implemented in a year...

The code bases are almost completely separate, which is a very good thing. Any time something gets implemented in one, it gets (or should get) tested via a big run on elliptic curves up to some bound in the other. This sometimes results in bugs being found. I remember refereeing the "integral points" code in Sage by running it against all curves up to some bound and comparing to what Magma output, and getting many discrepancies, which showed that there were bugs in both Sage and Magma. Thus we would be way better off if Sage could do everything Magma does (and vice versa).

## February 23, 2016

### Mark Fenner

#### Householder Reflections and QR

My next two or three posts are going to deal with the QR and SVD techniques. QR can be done in several different ways. I'm going to work through two methods of computing QR that will shed some light on our SVD implementation. So, these two topics dovetail nicely. Onward!
QR via Householder Matrices

For […]

## February 22, 2016

### Titus Brown

#### An #openscienceprize entry: Integrating server-side annotation into the hypothes.is ecosystem

Over the last few months, I've been playing with hypothes.is and thinking about how to use it to further my scientific work. This resulted in some brainstorming with Jon Udell and Maryann Martone about, well, lots of things. And now we're putting in an open science prize entry!

tl;dr? Check out a draft of our proposal to the open science prize competition, "Annotation as a service: helping build a global knowledge layer with annotation of the dark literature".

## What is Hypothes.is?

If you're not familiar with hypothes.is, it's an awesome, game-changing system that lets you add (and view) annotations on virtually any Web page or PDF document. The key bit is that it's utterly permissionless - the hypothes.is service is completely client-side, so you can add and view annotations on any content to which you have access. For example, if you're reading this blog post in a hypothes.is compatible browser (try Firefox or Google Chrome), you should see this sentence highlighted; click on it and you should see an annotation window pop out with some text in it.

Hypothes.is lets me decorate papers (including closed access papers) manually with whatever I want. More specifically, hypothes.is lets me connect any two resources with an "anchor" in one, pointing to another, with some text and tags. This can be done manually. It's done without permission by either resource - as long as the resource has an ID (URL or PDF ID), it works.

## So what's the #openscienceprize entry about?

We want to expand this concept to the notion of automated annotation, where any source of textual analysis (including links to databases, reviews, post-publication-date updates, etc.) can be used to create annotations for any given piece of literature, linking it to relevant resources, metacommentary, etc.
I don't know of anything like this currently, and so it's not easy to describe how it's useful. The first use case, Wormbase, is one example - right now, Wormbase can't place links to their gene information on the literature, but with this they could do so quite easily. More generally, I see this as opening the floodgates for rampant creativity to be applied to marking up scientific papers.

We'd love your feedback on the idea, along with any brainstorming or critiques you might have. Please go check out the proposal at https://2016-aesir.readthedocs.org/en/latest/ and let us know what you think!

--titus

### Matthew Rocklin

#### Pandas on HDFS with Dask Dataframes

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project.

In this post we use Pandas in parallel across an HDFS cluster to read CSV data. We coordinate these computations with dask.dataframe. A screencast version of this blogpost is available here and the previous post in this series is available here.

To start, we connect to our scheduler, import the hdfs module from the distributed library, and read our CSV data from HDFS.

>>> from distributed import Executor, hdfs, progress
>>> e = Executor('127.0.0.1:8786')
>>> e
<Executor: scheduler=127.0.0.1:8786 workers=64 threads=64>

>>> nyc2014 = hdfs.read_csv('/nyctaxi/2014/*.csv',
...              parse_dates=['pickup_datetime', 'dropoff_datetime'],
...              skipinitialspace=True)

>>> nyc2015 = hdfs.read_csv('/nyctaxi/2015/*.csv',
...              parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

>>> nyc2014, nyc2015 = e.persist([nyc2014, nyc2015])
>>> progress(nyc2014, nyc2015)

Our data comes from the New York City Taxi and Limousine Commission which publishes all yellow cab taxi rides in NYC for various years. This is a nice model dataset for computational tabular data because it's large enough to be annoying while also deep enough to be broadly appealing. Each year is about 25GB on disk and about 60GB in memory as a Pandas DataFrame.
HDFS breaks up our CSV files into 128MB chunks on various hard drives spread throughout the cluster. The dask.distributed workers each read the chunks of bytes local to them and call the pandas.read_csv function on these bytes, producing 391 separate Pandas DataFrame objects spread throughout the memory of our eight worker nodes.

The returned objects, nyc2014 and nyc2015, are dask.dataframe objects which present a subset of the Pandas API to the user, but farm out all of the work to the many Pandas dataframes they control across the network.

## Play with Distributed Data

If we wait for the data to load fully into memory then we can perform pandas-style analysis at interactive speeds.

>>> nyc2015.head()
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         2  2015-01-15 19:05:39   2015-01-15 19:23:42                1
1         1  2015-01-10 20:33:38   2015-01-10 20:53:28                1
2         1  2015-01-10 20:33:38   2015-01-10 20:43:41                1
3         1  2015-01-10 20:33:39   2015-01-10 20:35:31                1
4         1  2015-01-10 20:33:39   2015-01-10 20:52:58                1

   trip_distance  pickup_longitude  pickup_latitude  RateCodeID  \
0           1.59        -73.993896        40.750111           1
1           3.30        -74.001648        40.724243           1
2           1.80        -73.963341        40.802788           1
3           0.50        -74.009087        40.713818           1
4           3.00        -73.971176        40.762428           1

  store_and_fwd_flag  dropoff_longitude  dropoff_latitude  payment_type  \
0                  N         -73.974785         40.750618             1
1                  N         -73.994415         40.759109             1
2                  N         -73.951820         40.824413             2
3                  N         -74.004326         40.719986             2
4                  N         -74.004181         40.742653             2

   fare_amount  extra  mta_tax  tip_amount  tolls_amount  \
0         12.0    1.0      0.5        3.25             0
1         14.5    0.5      0.5        2.00             0
2          9.5    0.5      0.5        0.00             0
3          3.5    0.5      0.5        0.00             0
4         15.0    0.5      0.5        0.00             0

   improvement_surcharge  total_amount
0                    0.3         17.05
1                    0.3         17.80
2                    0.3         10.80
3                    0.3          4.80
4                    0.3         16.30

>>> len(nyc2014)
165114373

>>> len(nyc2015)
146112989

Interestingly it appears that the NYC cab industry has contracted a bit in the last year. There are fewer cab rides in 2015 than in 2014.

When we ask for something like the length of the full dask.dataframe we actually ask for the length of all of the hundreds of Pandas dataframes and then sum them up.
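Conceptually, that length computation maps len over the many Pandas partitions and sums the results; a minimal local sketch (plain Python lists standing in for the partitions, all names hypothetical):

```python
# Each worker holds some partitions; len() on the distributed frame maps
# len over partitions and sums the results, just as described above.
partitions = [list(range(100)), list(range(250)), list(range(41))]

def distributed_len(parts):
    return sum(len(p) for p in parts)  # one tiny task per partition

print(distributed_len(partitions))  # 391
```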
This process of reaching out to all of the workers completes in around 200-300 ms, which is generally fast enough to feel snappy in an interactive session.

The dask.dataframe API looks just like the Pandas API, except that we call .compute() when we want an actual result.

>>> nyc2014.passenger_count.sum().compute()
279997507.0

>>> nyc2015.passenger_count.sum().compute()
245566747

Dask.dataframes build a plan to get your result and the distributed scheduler coordinates that plan on all of the little Pandas dataframes on the workers that make up our dataset.

## Pandas for Metadata

Let's appreciate for a moment all the work we didn't have to do around CSV handling because Pandas magically handled it for us.

>>> nyc2015.dtypes
VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
pickup_longitude                float64
pickup_latitude                 float64
RateCodeID                        int64
store_and_fwd_flag               object
dropoff_longitude               float64
dropoff_latitude                float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount\r                  float64
dtype: object

We didn't have to find columns or specify data-types. We didn't have to parse each value with an int or float function as appropriate. We didn't have to parse the datetimes, but instead just specified a parse_dates= keyword. The CSV parsing happened about as quickly as can be expected for this format, clocking in at a network total of a bit under 1 GB/s.

Pandas is well loved because it removes all of these little hurdles from the life of the analyst. If we tried to reinvent a new "Big-Data-Frame" we would have to reimplement all of the work already well done inside of Pandas. Instead, dask.dataframe just coordinates and reuses the code within the Pandas library.
It is successful largely due to work from core Pandas developers, notably Masaaki Horikoshi (@sinhrks), who has done tremendous work to align the API precisely with the Pandas core library.

## Analyze Tips and Payment Types

In an effort to demonstrate the abilities of dask.dataframe we ask a simple question of our data, "how do New Yorkers tip?". The 2015 NYCTaxi data is quite good about breaking down the total cost of each ride into the fare amount, tip amount, and various taxes and fees. In particular this lets us measure the percentage that each rider decided to pay in tip.

>>> nyc2015[['fare_amount', 'tip_amount', 'payment_type']].head()
   fare_amount  tip_amount  payment_type
0         12.0        3.25             1
1         14.5        2.00             1
2          9.5        0.00             2
3          3.5        0.00             2
4         15.0        0.00             2

In the first two lines we see evidence supporting the 15-20% tip standard common in the US. The following three lines interestingly show zero tip. Judging only by these first five lines (a very small sample) we see a strong correlation here with the payment type. We analyze this a bit more by counting occurrences in the payment_type column both for the full dataset, and filtered by zero tip:

>>> %time nyc2015.payment_type.value_counts().compute()
CPU times: user 132 ms, sys: 0 ns, total: 132 ms
Wall time: 558 ms

1    91574644
2    53864648
3      503070
4      170599
5          28
Name: payment_type, dtype: int64

>>> %time nyc2015[nyc2015.tip_amount == 0].payment_type.value_counts().compute()
CPU times: user 212 ms, sys: 4 ms, total: 216 ms
Wall time: 1.69 s

2    53862557
1     3365668
3      502025
4      170234
5          26
Name: payment_type, dtype: int64

We find that almost all zero-tip rides correspond to payment type 2, and that almost all payment type 2 rides don't tip. My un-scientific hypothesis here is that payment type 2 corresponds to cash fares and that we're observing a tendency of drivers not to record cash tips. However we would need more domain knowledge about our data to actually make this claim with any degree of authority.
## Analyze Tips Fractions

Let's make a new column, tip_fraction, and then look at the average of this column grouped by day of week and grouped by hour of day.

First, we need to filter out bad rows, both rows with this odd payment type and rows with zero fare (there are a surprising number of free cab rides in NYC.) Second, we create a new column equal to the ratio of tip_amount / fare_amount.

>>> df = nyc2015[(nyc2015.fare_amount > 0) & (nyc2015.payment_type != 2)]
>>> df = df.assign(tip_fraction=(df.tip_amount / df.fare_amount))

Next we choose to groupby the pickup datetime column in order to see how the average tip fraction changes by day of week and by hour. The groupby and datetime handling of Pandas makes these operations trivial.

>>> dayofweek = df.groupby(df.tpep_pickup_datetime.dt.dayofweek).tip_fraction.mean()
>>> hour = df.groupby(df.tpep_pickup_datetime.dt.hour).tip_fraction.mean()
>>> dayofweek, hour = e.persist([dayofweek, hour])
>>> progress(dayofweek, hour)

Grouping by day-of-week doesn't show anything too striking to my eye. However I would like to note how generous NYC cab riders seem to be.
A 23-25% tip can be quite nice:

>>> dayofweek.compute()
tpep_pickup_datetime
0    0.237510
1    0.236494
2    0.236073
3    0.246007
4    0.242081
5    0.232415
6    0.259974
Name: tip_fraction, dtype: float64

But grouping by hour shows that late night and early morning riders are more likely to tip extravagantly:

>>> hour.compute()
tpep_pickup_datetime
0     0.263602
1     0.278828
2     0.293536
3     0.276784
4     0.348649
5     0.248618
6     0.233257
7     0.216003
8     0.221508
9     0.217018
10    0.225618
11    0.231396
12    0.225186
13    0.235662
14    0.237636
15    0.228832
16    0.234086
17    0.240635
18    0.237488
19    0.272792
20    0.235866
21    0.242157
22    0.243244
23    0.244586
Name: tip_fraction, dtype: float64

We plot this with matplotlib and see a nice trough during business hours with a surge in the early morning with an astonishing peak of about 35% at 4am.

## Performance

Let's dive into a few operations that run at different time scales. This gives a good understanding of the strengths and limits of the scheduler.

>>> %time nyc2015.head()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 20.9 ms

This head computation is about as fast as a film projector. You could perform this roundtrip computation between every consecutive frame of a movie; to a human eye this appears fluid.

In the last post we asked about how low we could bring latency. In that post we were running computations from my laptop in California and so were bound by transcontinental latencies of 200ms. This time, because we're operating from the cluster, we can get down to 20ms. We're only able to be this fast because we touch only a single data element, the first partition. Things change when we need to touch the entire dataset.

>>> %time len(nyc2015)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 271 ms

The length computation takes 200-300 ms. This computation takes longer because we touch every individual partition of the data, of which there are 178.
The scheduler incurs about 1ms of overhead per task; add a bit of latency and you get the ~200ms total. This means that the scheduler will likely be the bottleneck whenever computations are very fast, such as is the case for computing len. Really, this is good news; it means that by improving the scheduler we can reduce these durations even further.

If you look at the groupby computations above you can add the numbers in the progress bars to show that we computed around 3000 tasks in around 7s. It looks like this computation is about half scheduler overhead and about half bound by actual computation.

## Conclusion

We used dask+distributed on a cluster to read CSV data from HDFS into a dask dataframe. We then used dask.dataframe, which looks identical to the Pandas dataframe, to manipulate our distributed dataset intuitively and efficiently. We looked a bit at the performance characteristics of simple computations.

## What doesn't work

As always I'll have a section like this that honestly says what doesn't work well and what I would have done with more time.

• Dask dataframe implements a commonly used subset of Pandas functionality, not all of it. It's surprisingly hard to communicate the exact bounds of this subset to users. Notably, in the distributed setting we don't have a shuffle algorithm, so groupby(...).apply(...) and some joins are not yet possible.
• If you want to use threads, you'll need Pandas 0.18.0 which, at the time of this writing, was still in release candidate stage. This Pandas release fixes some important GIL related issues.
• The 1ms overhead per task limit is significant. While we can still scale out to clusters far larger than what we have here, we probably won't be able to strongly accelerate very quick operations until we reduce this number.
• We use the hdfs3 library to read data from HDFS. This library seems to work great but is new and could use more active users to flush out bug reports.
## Setup and Data

You can obtain public data from the New York City Taxi and Limousine Commission here. I downloaded this onto the head node and dumped it into HDFS with commands like the following:

$ wget https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-{01..12}.csv
$ hdfs dfs -mkdir /nyctaxi
$ hdfs dfs -mkdir /nyctaxi/2015
$ hdfs dfs -put yellow*.csv /nyctaxi/2015/

The cluster was hosted on EC2 and was composed of nine m3.2xlarges with 8 cores and 30GB of RAM each. Eight of these nodes were used as workers; they used processes for parallelism, not threads.

## February 21, 2016

### Fabian Pedregosa

#### SAGA algorithm in the lightning library

Recently I've implemented, together with Arnaud Rachez, the SAGA[1] algorithm in the lightning machine learning library (which, by the way, has recently been moved to the new scikit-learn-contrib project). The lightning library uses the same API as scikit-learn but is particularly adapted to online learning. As for the SAGA algorithm, its performance is similar to other variance-reduced stochastic algorithms such as SAG[3] or SVRG[2], but it has the advantage with respect to SAG[3] that it allows non-smooth penalty terms (such as $\ell_1$ regularization). It is implemented in lightning as SAGAClassifier and SAGARegressor.

We have taken care to make this implementation as efficient as possible. As for most stochastic gradient algorithms, a naive implementation takes 3 lines of code and is straightforward to implement. However, there are many tricks that are time-consuming and error-prone to implement but make a huge difference in efficiency.
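To illustrate, here is a deliberately naive NumPy sketch of the SAGA iteration for a noiseless least-squares objective, without the ℓ1 proximal step or any of the efficiency tricks mentioned above (this is an illustration of the update rule, not lightning's implementation; all names are made up):

```python
# Naive SAGA for f(w) = (1/n) * sum_i 0.5*(x_i.w - y_i)^2: keep a table of
# past per-sample gradients, and correct each stochastic gradient with the
# table entry and the running average of the table.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true

def grad_i(w, i):
    # Gradient of the i-th term 0.5*(x_i.w - y_i)^2.
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
memory = np.zeros((n, d))           # table of stored gradients g_i (at w = 0)
avg = memory.mean(axis=0)           # running average of the table
L = max(np.sum(X**2, axis=1))       # per-sample Lipschitz constant
step = 1.0 / (3 * L)                # the step size used in the benchmarks below

for _ in range(30 * n):
    j = rng.integers(n)
    g_new = grad_i(w, j)
    w -= step * (g_new - memory[j] + avg)   # SAGA update
    avg += (g_new - memory[j]) / n          # keep the average in sync
    memory[j] = g_new

print(float(np.linalg.norm(w - w_true)) < 0.1)
```

The "tricks" referred to above (lazy updates for sparse data, just-in-time application of the penalty) change none of this math, only how cheaply each iteration can be executed.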

A small example, more as a sanity check than to claim anything. The following plot shows the suboptimality as a function of time for three similar methods: SAG, SAGA and SVRG. The dataset used is the RCV1 dataset (test set, obtained from the libsvm webpage), consisting of 677,399 samples and 47,236 features. Interestingly, all methods can solve this rather large-scale problem within a few seconds. Among them, SAG and SAGA have very similar performance and SVRG seems to be reasonably faster.

A note about the benchmarks: it is difficult to compare stochastic gradient methods fairly because in the end it usually boils down to how you choose the step size. In this plot I set the step size of all methods to 1/(3L), where L is the Lipschitz constant of the objective function, as I think this is a popular choice. I would have preferred 1/L, but SVRG was not converging for this step size. The code for the benchmarks can be found here.

1. A. Defazio, F. Bach & S. Lacoste-Julien. "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives" (2014).

2. Rie Johnson and Tong Zhang. "Accelerating stochastic gradient descent using predictive variance reduction." Advances in Neural Information Processing Systems. 2013.

3. Mark Schmidt, Nicolas Le Roux, and Francis Bach. "Minimizing finite sums with the stochastic average gradient." arXiv preprint arXiv:1309.2388 (2013).

### Titus Brown

#### Introducing undergraduates to code review

A few days ago, I gave an invited lecture on code review here at UC Davis. The class was the ECS capstone class, consisting of about 100 CS majors who were all working on some form of group project - all but a few were doing something with programming. I was told the class had little to no experience with code review, but did know C++ rather well.

Since I only had two hours AND I really hate lectures, I really wanted to do something somewhat hands-on that didn't involve programming or computer use. Asking my Twitter network, I got a few links -- Azalee Boestrom has a code review tutorial, and Maxime Boissonneault demoed exercism.io for me -- but neither fit what I had in mind.

The big thing was this: somewhere during my thought process, I realized that I wanted to show the students the basic connection between code review, testing, and refactoring. How do you show all of that to people who might have never done any of that?

I ended up with the plan that follows. It went better than expected!

In preparation for class, I extracted some code from our khmer package -- specifically, the KmerIterator code that takes a string of DNA and produces a succession of hashed values from it -- and defactored it into a horrid function. You can see the result here. I also built a simple Makefile.

In class, I started by asking the class what they thought the purpose of code review was, and polled them on it via a free-text entry Google Form. This led to a discussion about maintainability vs correctness, and my point that you couldn't approach the latter over a long period of time without the former.

Then, I moved on to the code:

First, I ran through the purpose of the (ugly) code with lots of hand-waving.

Then, I asked the question: how do we know if the code works?

To answer that, I live-coded a very simple version of the function that was slow, but so simple that you could guarantee it worked -- see it here. I saved the output from this "presumed-good" version, and compared it to the output of the ugly version and showed that it matched.
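A comparably simple "presumed-good" version might look like the following sketch; toy_hash and slow_kmer_hashes are hypothetical stand-ins for illustration, not khmer's actual hashing:

```python
# A deliberately slow-but-obvious k-mer hasher: slice every k-mer out of
# the DNA string and hash each one from scratch. Easy to trust, easy to
# compare against a faster, uglier implementation.
def toy_hash(kmer):
    # Treat the k-mer as a base-4 number: A=0, C=1, G=2, T=3.
    values = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    h = 0
    for base in kmer:
        h = h * 4 + values[base]
    return h

def slow_kmer_hashes(sequence, k):
    # The obviously-correct version: one slice and one fresh hash per k-mer.
    return [toy_hash(sequence[i:i + k]) for i in range(len(sequence) - k + 1)]

print(slow_kmer_hashes("ACGTA", 3))  # [6, 27, 44]
```

Saving this output and diffing it against the fast version's output is exactly the smoke test described next.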

For good measure, I also automated that as a smoke test, so that with a simple 'make test' we could check to see if our latest code changes produced presumed-good output.

Now I had code that probably worked.

Second, I polled the class (with a free-text entry approach using Google Forms) for their thoughts on what the code needed. Remember, it's ugly!

I went through the resulting comments, and pulled out a few suggestions.

One was "write down in detail what the biology background is." I explained that you probably didn't want that kind of comment in the code -- generally, anyone working on this code should already be somewhat versed in the biology.

Another suggestion was "Comment the code." So I started putting in comments like "set this variable to 0", which led to a discussion of how too many comments could be silly and that what we really wanted was important comments. Here, I brought in my confession that I'd spent 15 minutes trying to understand why the index i started at k, so we commented that plus a few other things.

Third, I was running a bit low on time, so I took a previous suggestion and refactored the code by making longer, better variable names. Here, I could show that the code not only compiled but passed tests on the first try (hence the commit comment).

Finally, I took the suggestion of one astute observer and split up my now too-long function by building it into a class. The result looks reasonably similar to the actual code in khmer, which I pointed out; even better, it passed my smoke test (hence the commit comment). I got applause!

We closed with some discussion about where to look for more information, and how to think about this stuff. I showed them some live examples, and referred them to David Soergel's paper, Rampant software errors may undermine scientific results as an entry point into some literature on error rates; I also showed them Python coverage output and our code review guidelines for the khmer project.

All in all, I managed to hold their attention (by live coding, by cracking jokes, and by being energetic). (I only got one negative comment about the jokes! :).

I think I demonstrated about 80% of what I wanted to show, and filled up the two hours nicely. I can do a better job next time, of course, and if I get invited back I'll try to provide a better outline of points to make and references; I winged it too much this time and so it's not a re-usable approach yet, I think.

--titus

### Filipe Saraiva

#### Call to Co-maintainers of Cantor

Cantor, the software for scientific programming in a worksheet-style interface, has had several developers working on different parts of the code over the years. Thanks to the plugin-based architecture of Cantor, a developer can create a new backend to communicate with some new programming language, an assistant, or some other piece of software, requiring just knowledge of the Cantor API.

Because of this, Cantor has 10 backends using different sets of technologies to provide communication between the software and some programming language. Citing some examples, the Python 2 backend uses the Python/C API, the R and Python 3 backends use the D-Bus protocol, and the Scilab backend uses KProcess. So, there is a mix of different technologies in the project.

You can imagine that maintaining Cantor would be a headache if the work were done by just one person. Since I received maintainer status, I have been trying to improve the coordination of the project, including distributing responsibilities among developers.

To that end, I made a request to the KDE sysadmins, and that amazing team created separate components in KDE Bugzilla for each backend in Cantor. So far I have asked some developers to take responsibility for managing the bugs reported against the backends they developed, but some components are still missing maintainers, especially the Maxima, Qalculate!, R, and Sage backends.

So, if you were a developer of one of those backends and would like to continue developing Cantor, please comment on this post or send me an e-mail at filipe at kde.org and I will list you as the maintainer of that component.

Of course, any developer can contribute to Cantor and fix its bugs, but having a maintainer for each component will help us manage and improve the project.

## February 18, 2016

### Pierre de Buyl

#### Code as a paper

Publishing scientific software is a nontrivial problem. Here I present, and ask for feedback on, the idea of uploading the software to arXiv with a presentation paper and using that as an evolving reference.

The situation is the following: I am developing a code that, while still under development, should provide scientific outputs in the coming months. I'd like to cite the software properly in my papers at that point, and not in a hypothetical future where my entire wishlist has been satisfied. I would like to cite it as it is used, and would like a "single point of citation" for the lifetime of the code.

Many people have already thought about this issue. See the recent Software Credit Workshop, the related blog post by Martin Fenner, Mick Watson's blog, Gaël Varoquaux's blog, Titus Brown's blog and probably many others. There are several old and new solutions to publish software:

• Paper: Publishing a software paper in a computing oriented journal. In physics, a well-known venue is Computer Physics Communications for instance.
• Web page: Setup a web page for your software and link to the source (a tarball, a version control system, etc).
• Crazy (in a good sense): Launching a new journal
• Zenodo/GitHub: Obtain a DOI for your GitHub-hosted code via Zenodo, as presented here

Here are my criteria:

• Open-access (by principle).
• Low or nonexistent author charges (pragmatic condition).
• Provides a single citable reference point for the software and its updates, while remaining explicit about the version.
• Persistent.
• Allows more than "just" the code. That is, I would like to document the first release by an extended note, similar to what would be found in a traditional software paper.

Let us review the propositions above:

• Paper: Only allows one version. Also, is probably paywalled or expensive through publication fees.
• Web page: Hard to cite. Likely not persistent enough.
• Crazy: Experience shows that it is not realistic, at least now.
• Zenodo/GitHub: One DOI per version. Also, the DOI points to an exact reproduction of the source repository. It is thus rather inflexible.

The arXiv identifier is, I believe, stable and citable. One can attach a tarball of code to the paper, and it is possible to update a paper with newer versions, allowing updates to the software in a simple form: v1, v2, etc. As far as journals are concerned, the Journal of Open Research Software has reasonable fees and allows authors to distribute the code via GitHub. I don't know how they handle software revisions, though.

## February 17, 2016

### Matthew Rocklin

#### Introducing Dask distributed

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We analyze JSON data on a cluster using pure Python projects.

Dask, a Python library for parallel computing, now works on clusters. During the past few months I and others have extended dask with a new distributed memory scheduler. This enables dask’s existing parallel algorithms to scale across 10s to 100s of nodes, and extends a subset of PyData to distributed computing. Over the next few weeks I and others will write about this system. Please note that dask+distributed is developing quickly and so the API is likely to shift around a bit.

Today we start simple with the typical cluster computing problem, parsing JSON records, filtering, and counting events using dask.bag and the new distributed scheduler. We’ll dive into more advanced problems in future posts.

A video version of this blogpost is available here.

## GitHub Archive Data on S3

GitHub releases data dumps of their public event stream as gzip-compressed, line-delimited JSON. This data is too large to fit comfortably into memory, even on a sizable workstation. We could stream it from disk but, due to the compression and JSON encoding, this takes a while and bogs down interactive use. For an interactive experience with data like this we need a distributed cluster.
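
To make the format concrete, here is a minimal local sketch using only the standard library. It writes a tiny gzip-compressed, line-delimited JSON file (a made-up stand-in for one hourly GitHub Archive dump; the filename and records are assumptions for illustration) and streams it back one record at a time:

```python
import gzip
import json

# Write a tiny gzip-compressed, line-delimited JSON file: one JSON
# object per line, the same shape as an hourly GitHub Archive dump.
events = [
    {"type": "PushEvent", "repo": {"name": "user/repo-a"}},
    {"type": "WatchEvent", "repo": {"name": "user/repo-b"}},
]
with gzip.open("events.json.gz", "wt", encoding="utf-8") as f:
    for e in events:
        f.write(json.dumps(e) + "\n")

def stream_records(path):
    # Decompress and parse lazily, one line per record, so the whole
    # file never has to fit in memory at once.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

types = [r["type"] for r in stream_records("events.json.gz")]
print(types)  # ['PushEvent', 'WatchEvent']
```

Streaming like this works, but over millions of records the decompress-and-parse cost is exactly what makes a single machine sluggish for interactive use.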

### Setup and Data

We provision nine m3.2xlarge nodes on EC2. These have eight cores and 30GB of RAM each. On this cluster we provision one scheduler and nine workers (see setup docs). (More on launching in later posts.) We have five months of data, from 2015-01-01 to 2015-05-31, in the githubarchive-data bucket on S3. This data is publicly available if you want to play with it on EC2. You can download the full dataset at https://www.githubarchive.org/ .

The first record looks like the following:

 {'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/9152315?',
   'gravatar_id': '',
   'id': 9152315,
   'url': 'https://api.github.com/users/davidjhulse'},
  'created_at': '2015-01-01T00:00:00Z',
  'id': '2489368070',
  'payload': {'commits': [{'author': {'email': 'david.hulse@live.com',
       'name': 'davidjhulse'},
      'distinct': True,
      'message': 'Altered BingBot.jar\n\nFixed issue with multiple account support',
      'sha': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
      'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'}],
   'distinct_size': 1,
   'push_id': 536740396,
   'size': 1},
  'public': True,
  'repo': {'id': 28635890,
   'name': 'davidjhulse/davesbingrewardsbot',
   'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'},
  'type': 'PushEvent'}


So we have a large dataset on S3 and a moderate sized play cluster on EC2, which has access to S3 data at about 100MB/s per node. We’re ready to play.

## Play

We start an ipython interpreter on our local laptop and connect to the dask scheduler running on the cluster. For the purposes of timing, the cluster is on the East Coast while the local machine is in California on commercial broadband internet.

>>> from distributed import Executor, s3
>>> e = Executor('54.173.84.107:8786')
>>> e
<Executor: scheduler=54.173.84.107:8786 workers=72 threads=72>


Our seventy-two worker processes come from nine workers with eight processes each. We chose processes rather than threads for this task because computations will be bound by the GIL. We will change this to threads in later examples.

We start by loading a single month of data into distributed memory.

import json
text = s3.read_text('githubarchive-data', '2015-01', compression='gzip')
records = text.map(json.loads)
records = e.persist(records)


The data lives in S3 in hourly files as gzip-encoded, line-delimited JSON. The s3.read_text and text.map functions produce dask.bag objects which track our operations in a lazily built task graph. When we ask the executor to persist this collection we ship those tasks off to the scheduler to run on all of the workers in parallel. The persist function gives us back another dask.bag pointing to these remotely running results. This persist call returns immediately, and the computation happens on the cluster in the background asynchronously. We regain control of our interpreter immediately while the cluster hums along.
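
The lazy-then-materialize pattern described above can be sketched in plain Python with generators, which likewise record a recipe and do no work until consumed. This is only a loose, single-machine stand-in for dask's task graph; the input lines are made up:

```python
import json

# Generator expressions build a "recipe" that does no work yet --
# loosely analogous to dask.bag recording a task graph.
lines = ['{"type": "PushEvent"}', '{"type": "ForkEvent"}']

records = (json.loads(line) for line in lines)            # nothing parsed yet
push_events = (r for r in records if r["type"] == "PushEvent")  # still lazy

# Forcing the pipeline (akin to .compute()) materializes the results.
result = list(push_events)
print(result)  # [{'type': 'PushEvent'}]
```

The difference is that persist materializes results in distributed memory on the cluster rather than locally, and hands back a handle immediately.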

The cluster takes around 40 seconds to download, decompress, and parse this data. If you watch the video embedded above you’ll see fancy progress-bars.

We ask for a single record. This returns in around 200ms, which is fast enough that it feels instantaneous to a human.

>>> records.take(1)
({'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/9152315?',
   'gravatar_id': '',
   'id': 9152315,
   'url': 'https://api.github.com/users/davidjhulse'},
  'created_at': '2015-01-01T00:00:00Z',
  'id': '2489368070',
  'payload': {'commits': [{'author': {'email': 'david.hulse@live.com',
       'name': 'davidjhulse'},
      'distinct': True,
      'message': 'Altered BingBot.jar\n\nFixed issue with multiple account support',
      'sha': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
      'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'}],
   'distinct_size': 1,
   'push_id': 536740396,
   'size': 1},
  'public': True,
  'repo': {'id': 28635890,
   'name': 'davidjhulse/davesbingrewardsbot',
   'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'},
  'type': 'PushEvent'},)


This particular event is a 'PushEvent'. Let’s quickly see all the kinds of events. For fun, we’ll also time the interaction:

>>> %time records.pluck('type').frequencies().compute()
CPU times: user 112 ms, sys: 0 ns, total: 112 ms
Wall time: 2.41 s

[('ReleaseEvent', 44312),
('MemberEvent', 69757),
('IssuesEvent', 693363),
('PublicEvent', 14614),
('CreateEvent', 1651300),
('PullRequestReviewCommentEvent', 214288),
('PullRequestEvent', 680879),
('ForkEvent', 491256),
('DeleteEvent', 256987),
('PushEvent', 7028566),
('IssueCommentEvent', 1322509),
('GollumEvent', 150861),
('CommitCommentEvent', 96468),
('WatchEvent', 1321546)]
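
Conceptually, pluck('type').frequencies() is a map followed by a counting reduction. A pure-Python sketch with collections.Counter over a few made-up records shows the same shape of computation:

```python
from collections import Counter

# A hypothetical mini-batch of events; the real month has ~14 million.
records = [
    {"type": "PushEvent"}, {"type": "PushEvent"},
    {"type": "WatchEvent"}, {"type": "IssuesEvent"},
]

# pluck('type') is the generator expression; frequencies() is the Counter.
freqs = Counter(r["type"] for r in records)
print(freqs.most_common())
```

On the cluster, dask runs this counting reduction per partition and then merges the partial counts, which is why the whole thing finishes in a couple of seconds.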


And we compute the total count of all commits for this month.

>>> %time records.count().compute()
CPU times: user 134 ms, sys: 133 µs, total: 134 ms
Wall time: 1.49 s

14036706


We see that it takes a few seconds to walk through the data (and incur all the scheduling overhead). The scheduler adds about a millisecond of overhead per task, and there are about 1000 partitions/files here (the GitHub data is split by hour, and there are about 730 hours in a month), so most of the cost here is overhead.
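
That estimate is easy to sanity-check with a bit of arithmetic; the numbers below are the rough figures quoted above, not measurements:

```python
# Back-of-envelope check of scheduling overhead for one month of data:
# roughly one task per hourly file, at about a millisecond of scheduler
# overhead per task.
hours_in_month = 730          # GitHub data is split by hour
overhead_ms_per_task = 1      # ~1 ms of scheduler overhead per task
total_overhead_ms = hours_in_month * overhead_ms_per_task
print(total_overhead_ms)  # 730 ms, i.e. ~0.73 s of the ~1.5 s wall time
```

So roughly half of the measured wall time is scheduling rather than computation, which is consistent with "most of the cost here is overhead".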

## Investigate Jupyter

We investigate the activities of Project Jupyter. We chose this project because it’s sizable and because we understand the players involved and so can check our accuracy. This will require us to filter our data to a much smaller subset, then find popular repositories and members.

>>> jupyter = (records.filter(lambda d: d['repo']['name'].startswith('jupyter/'))
.repartition(10))
>>> jupyter = e.persist(jupyter)


All records, regardless of event type, have a repository which has a name like 'organization/repository' in typical GitHub fashion. We filter all records that start with 'jupyter/'. Additionally, because this dataset is likely much smaller, we push all of these records into just ten partitions. This dramatically reduces scheduling overhead. The persist call hands this computation off to the scheduler and then gives us back our collection that points to that computing result. Filtering this month for Jupyter events takes about 7.5 seconds. Afterwards computations on this subset feel snappy.

>>> %time jupyter.count().compute()
CPU times: user 5.19 ms, sys: 97 µs, total: 5.28 ms
Wall time: 199 ms

747

>>> %time jupyter.take(1)
CPU times: user 7.01 ms, sys: 259 µs, total: 7.27 ms
Wall time: 182 ms

({'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/26679?',
   'gravatar_id': '',
   'id': 26679,
   'url': 'https://api.github.com/users/marksteve'},
  'created_at': '2015-01-01T13:25:44Z',
  'id': '2489612400',
  'org': {'avatar_url': 'https://avatars.githubusercontent.com/u/7388996?',
   'gravatar_id': '',
   'id': 7388996,
   'url': 'https://api.github.com/orgs/jupyter'},
  'public': True,
  'repo': {'id': 5303123,
   'name': 'jupyter/nbviewer',
   'url': 'https://api.github.com/repos/jupyter/nbviewer'},
  'type': 'WatchEvent'},)


So the first event of the year was by 'marksteve' who decided to watch the 'nbviewer' repository on new year’s day.

Notice that these computations take around 200ms. I can’t get below this from my local machine, so we’re likely bound by communicating to such a remote location. A 200ms latency is not great if you’re playing a video game, but it’s decent for interactive computing.

Here are all of the Jupyter repositories touched in the month of January,

>>> %time jupyter.pluck('repo').pluck('name').distinct().compute()
CPU times: user 2.84 ms, sys: 4.03 ms, total: 6.86 ms
Wall time: 204 ms

['jupyter/dockerspawner',
'jupyter/design',
'jupyter/docker-demo-images',
'jupyter/jupyterhub',
'jupyter/configurable-http-proxy',
'jupyter/nbshot',
'jupyter/sudospawner',
'jupyter/colaboratory',
'jupyter/strata-sv-2015-tutorial',
'jupyter/tmpnb-deploy',
'jupyter/nature-demo',
'jupyter/nbcache',
'jupyter/jupyter.github.io',
'jupyter/try.jupyter.org',
'jupyter/jupyter-drive',
'jupyter/tmpnb',
'jupyter/tmpnb-redirector',
'jupyter/nbindex',
'jupyter/nbviewer',
'jupyter/oauthenticator']


And the top ten most active people on GitHub.

>>> %time (jupyter.pluck('actor')
.frequencies()
.topk(10, lambda kv: kv[1])
.compute())
CPU times: user 8.03 ms, sys: 90 µs, total: 8.12 ms
Wall time: 226 ms

[('rgbkrk', 156),
('minrk', 87),
('Carreau', 87),
('KesterTong', 74),
('jhamrick', 70),
('bollwyvl', 25),
('pkt', 18),
('ssanderson', 13),
('smashwilson', 13),
('ellisonbg', 13)]


Nothing too surprising here if you know these folks.

## Full Dataset

The full five months of data is too large to fit in memory, even for this cluster. When we represent semi-structured data like this with dynamic data structures like lists and dictionaries there is quite a bit of memory bloat. Some careful attention to efficient semi-structured storage here could save us from having to switch to such a large cluster, but that will have to be the topic of another post.

Instead, we operate efficiently on this dataset by flowing it through memory, persisting only the records we care about. The distributed dask scheduler descends from the single-machine dask scheduler, which was quite good at flowing through a computation and intelligently removing intermediate results.

From a user API perspective, we call persist only on the jupyter dataset, and not the full records dataset.

>>> full = (s3.read_text('githubarchive-data', '2015', compression='gzip')
            .map(json.loads))

>>> jupyter = (full.filter(lambda d: d['repo']['name'].startswith('jupyter/'))
.repartition(10))

>>> jupyter = e.persist(jupyter)


It takes 2m36s to download, decompress, and parse the five months of publicly available GitHub events for all Jupyter events on nine m3.2xlarges.

There were seven thousand such events.

>>> jupyter.count().compute()
7065


We find which repositories saw the most activity during that time:

>>> %time (jupyter.pluck('repo')
.pluck('name')
.frequencies()
.topk(20, lambda kv: kv[1])
.compute())
CPU times: user 6.98 ms, sys: 474 µs, total: 7.46 ms
Wall time: 219 ms

[('jupyter/jupyterhub', 1262),
('jupyter/nbviewer', 846),
('jupyter/jupyter_notebook', 507),
('jupyter/jupyter-drive', 505),
('jupyter/notebook', 451),
('jupyter/docker-demo-images', 363),
('jupyter/tmpnb', 284),
('jupyter/jupyter_client', 162),
('jupyter/dockerspawner', 149),
('jupyter/colaboratory', 134),
('jupyter/jupyter_core', 127),
('jupyter/strata-sv-2015-tutorial', 108),
('jupyter/jupyter_nbconvert', 103),
('jupyter/configurable-http-proxy', 89),
('jupyter/hubpress.io', 85),
('jupyter/jupyter.github.io', 84),
('jupyter/tmpnb-deploy', 76),
('jupyter/nbconvert', 66),
('jupyter/jupyter_qtconsole', 59)]


We see that projects like jupyterhub were quite active during that time while, surprisingly, nbconvert saw relatively little action.

## Local Data

The Jupyter data is quite small and easily fits in a single machine. Let’s bring the data to our local machine so that we can compare times:

>>> %time L = jupyter.compute()
CPU times: user 4.74 s, sys: 10.9 s, total: 15.7 s
Wall time: 30.2 s


It takes surprisingly long to download the data, but once it's here, we can iterate far more quickly with basic Python.

>>> from toolz.curried import pluck, frequencies, topk, pipe
>>> %time pipe(L, pluck('repo'), pluck('name'), frequencies,
dict.items, topk(20, key=lambda kv: kv[1]), list)
CPU times: user 11.8 ms, sys: 0 ns, total: 11.8 ms
Wall time: 11.5 ms

[('jupyter/jupyterhub', 1262),
('jupyter/nbviewer', 846),
('jupyter/jupyter_notebook', 507),
('jupyter/jupyter-drive', 505),
('jupyter/notebook', 451),
('jupyter/docker-demo-images', 363),
('jupyter/tmpnb', 284),
('jupyter/jupyter_client', 162),
('jupyter/dockerspawner', 149),
('jupyter/colaboratory', 134),
('jupyter/jupyter_core', 127),
('jupyter/strata-sv-2015-tutorial', 108),
('jupyter/jupyter_nbconvert', 103),
('jupyter/configurable-http-proxy', 89),
('jupyter/hubpress.io', 85),
('jupyter/jupyter.github.io', 84),
('jupyter/tmpnb-deploy', 76),
('jupyter/nbconvert', 66),
('jupyter/jupyter_qtconsole', 59)]


The difference here is 20x, which is a good reminder that, once you no longer have a large problem you should probably eschew distributed systems and act locally.
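
The same pluck → frequencies → topk pipeline can also be written with nothing but the standard library: Counter plays the role of frequencies and heapq.nlargest plays topk. The three-record L below is a made-up stand-in for the downloaded data:

```python
import heapq
from collections import Counter

# Hypothetical stand-in for the downloaded list L of event dicts.
L = [
    {"repo": {"name": "jupyter/jupyterhub"}},
    {"repo": {"name": "jupyter/jupyterhub"}},
    {"repo": {"name": "jupyter/nbviewer"}},
]

# pluck -> pluck -> frequencies -> topk, stdlib edition.
counts = Counter(d["repo"]["name"] for d in L)
top = heapq.nlargest(2, counts.items(), key=lambda kv: kv[1])
print(top)  # [('jupyter/jupyterhub', 2), ('jupyter/nbviewer', 1)]
```

Once the data fits in local memory, this kind of plain-Python pipeline is exactly what beats the cluster by 20x.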

## Conclusion

Downloading, decompressing, parsing, filtering, and counting JSON records is the new wordcount. It’s the first problem anyone sees. Fortunately it’s both easy to solve and the common case. Woo hoo!

Here we saw that dask+distributed handle the common case decently well, and with a pure Python stack. Typically Python users rely on a JVM technology like Hadoop/Spark/Storm to distribute their computations. Here we have Python distributing Python; there are some usability gains to be had here, like nicer stack traces, a bit less serialization overhead, and attention to other Pythonic style choices.

Over the next few posts I intend to deviate from this common case. Most “Big Data” technologies were designed to solve typical data munging problems found in web companies or with simple database operations in mind. Python users care about these things too, but they also reach out to a wide variety of fields. In dask+distributed development we care about the common case, but also support less traditional workflows that are commonly found in the life, physical, and algorithmic sciences.

By designing to support these more extreme cases we’ve nailed some common pain points in current distributed systems. Today we’ve seen low latency and remote control; in the future we’ll see others.

## What doesn’t work

I’ll have an honest section like this at the end of each upcoming post describing what doesn’t work, what still feels broken, or what I would have done differently with more time.

• The imports for dask and distributed are still strange. They’re two separate codebases that play very nicely together. Unfortunately the functionality you need is sometimes in one or in the other and it’s not immediately clear to the novice user where to go. For example dask.bag, the collection we’re using for records, jupyter, etc. is in dask but the s3 module is within the distributed library. We’ll have to merge things at some point in the near-to-moderate future. Ditto for the API: there are compute methods both on the dask collections (records.compute()) and on the distributed executor (e.compute(records)) that behave slightly differently.

• We lack an efficient distributed shuffle algorithm. This is very important if you want to use operations like .groupby (which you should avoid anyway). The user API here doesn’t even cleanly warn users that this is missing in the distributed case which is kind of a mess. (It works fine on a single machine.) Efficient alternatives like foldby are available.

• I would have liked to run this experiment directly on the cluster to see how low we could have gone below the 200ms barrier we ran into here.

## Links

• dask, the original project
• distributed, the distributed memory scheduler powering the cluster computing
• dask.bag, the user API we've used in this post
• This post largely repeats work by Blake Griffith in a similar post last year with an older iteration of the dask distributed scheduler

## February 16, 2016

### Titus Brown

#### Is mybinder 95% of the way to next-gen computational science publishing, or only 90%?

If you haven't seen mybinder.org, you should go check it out. It's a site that runs IPython/Jupyter Notebooks from GitHub for free, and I think it's a solution to publishing reproducible computational work.

For a really basic example, take a look at my demo Software Carpentry lesson. Clicking on that link brings up a Python notebook that will load and plot some data; it's interactive and you can go ahead and execute the cells.

There are a couple of really appealing things about this.

First, it's simple -- the source repository is just a directory that contains a Jupyter notebook and some data. It's pretty much what you'd have if you were actually, you know, doing data analysis with data in a repo.

Second, it's built on top of solid technology. mybinder's foundation is Docker and Jupyter Notebook, both of which are robust and have large, supportive communities. Sure, behind the scenes, mybinder is using some bleeding edge tech (Kubernetes running on Google Compute Engine) to provision and load balance the compute. But I bet it would take me no more than 30 minutes to cobble together something that would let me run any mybinder-compatible repository on my laptop, or on any compute server that can run Docker.

Third, it's super configurable -- because it's based on Docker, you can install pretty much anything you want. I tested this out myself by installing R and an R kernel for the Jupyter notebook -- see this mybinder link and repo. The Dockerfile that installs things is pretty straightforward (given that installing all that stuff isn't straightforward in the first place ;).

Fourth, it's open source. Go ahead, do what you want with it!

I believe mybinder is the hitherto missing link for publishing reproducible computation. Before this, we could give people our data and our data analysis scripts, but not our compute environment - or at least not in a way that was easy to run. We can now provide readers with a link that they can use to instantly go from my paper to a running instantiation of my paper's data analysis.

## What's missing and easy?

Authentication and Cost: right now, mybinder is running on someone else's dime -- specifically, the Freeman Lab's dime. This can't continue forever, especially since it's totes open for abuse. I expect that sooner or later someone will gin up a way to run this on many cloud compute services; once that works, it should be pretty easy to build a site that redirects you to your favorite compute service and lets you enter your own credentials, then boom there you go. Insta-compute.

(I think current mybinder limitations are in the several GB of RAM range, with a few dozen MB of disk space.)

Executing anything other than GitHub public repos: mybinder only runs off of public GitHub repositories right now, but that's not going to be too tricky to fix, either. BitBucket and GitLab and private repos can be served almost as easily.

RStudio and RMarkdown: currently, it's all Project Jupyter, but it should be pretty easy to get RStudio Server or some sort of RMarkdown application (Shiny?) working on it.

Pre-built Docker images: you need to derive your Dockerfile from the andrewosh/binder-base image, which means you can't use pre-built images from the Docker Hub (even if they were built from binder-base). I predict that will go away soon enough, since it's not a fundamental requirement of the way they're doing things.

## What's missing and not easy?

Big files and big/long compute: Coordinating the delivery of big files, big compute, and big memory (potentially for a long period of time) is something that's out of easy reach, I think. Docker isn't ready for really data-intensive stuff, from what I can tell. More, I'm not sure when (if ever) people will completely automate the heavyweight data analysis - surely it's further away than most people using Jupyter or RStudio for data analysis.

The split that my lab has made here is to use a workflow engine (e.g. make, pydoit, or snakemake) for the compute & data intensive stuff, and then feed those intermediate results (assembly and mapping stats, quantification, etc.) into analysis notebooks. For mybinder purposes, there should be no problem saving those intermediate results into a github repo for us and everyone else to analyze and reanalyze.

## What are the next steps?

I bet this will evolve pretty fast, because it's an easy way for scientists to deliver compute to friends, colleagues, and students.

In terms of learning how to make use of mybinder, Software Carpentry already teaches scientists how to do much of the necessary stuff -- scripting, git, and Jupyter Notebooks are all part of the standard curriculum. It's not a big leap to show students some simple Docker files. There's virtually no unnecessary configuration required at the moment, which is a big win; we can talk about why everything is actually important for the end result, which is <ahem> not always the case in computing.

Beyond that, I'm not sure where the gaps are - as usual we're missing incentives for adopting this, so I'd be interested in thinking about how to structure things there. But I look forward to seeing how this evolves!

--titus

p.s. Everware does something similar to mybinder. I'm aware of it, I just haven't had the chance to try it out - I'd love to hear from people who have used it!

p.p.s. For a real, and super cool, example of mybinder, see: Min Ragan-Kelley's LIGO/gravitational waves tutorial.

#### Three remote workshops: regex in Python, scipy.optimize, and building Web sites

We're running three half-day workshops for your remote viewing pleasure! All three will be live-streamed via YouTube (well, Hangouts on Air).

Today 2/17, at 9:15am PT, Tiffany Timbers (Simon Fraser U.) is going to be going through regular expressions in Python - see the workshop description.

On Friday, 2/19, same time, same place, Ariel Rokem (U. Washington) will be introducing scipy.optimize.

And, in just under two weeks, on Monday 2/29, Adelaide Rhodes (Oregon State University) will be showing us how to build static Web sites with Sphinx.

All three workshops will be live-streamed as well as available for archival viewing. We'll post the YouTube links on the workshop pages above, but you can always tweet us at @ctitusbrown or @jessicamizzi if you want the hook-up. (We'll also do live-stream support as possible.)

--titus

## February 14, 2016

### Titus Brown

#### The likely challenges of (post) publication peer review

CORRECTION: I mistakenly linked to Geoff Bilder, Jennifer Lin, and Cameron's piece on infrastructure in the first posted version, rather than Cameron's post on culture in data sharing. Both are worth reading but the latter is more relevant to this post, and I also wanted to make sure I correctly attributed the former to Geoff, Jennifer, and Cameron all.

I am a huge fan of the idea of changing up scientific peer review.

Anyone paying attention has seen a nigh-daily drumbeat of blog posts, tweets, and op-eds pointing out the shortcomings of the current approach, which is dominated by anonymous pre-publication peer review.

For those who don't know how the current system works, it goes something like this:

1. Author submits paper to journal
2. Journal finds editor for paper.
3. Editor decides whether paper is important enough to review, finds reviewers for paper.
4. Reviewers review paper, send reviews to editor.
5. Editor collates reviews, anonymizes them, and sends them to authors along with a decision (reject, major revisions, minor revisions, accept).
6. Go to #1 (if it's a reject), or go to #4 (if revisions required), or go to...
7. Publish!
8. Garner academic fame, reap grant money, write another paper, go to #1.

This is how much of scientific communication works, and peer-reviewed pubs are how reputation is garnered and credit is assigned.

There are many concerns with this approach in the modern day world, with all the possibilities enabled by the Intertubes - papers are often not available prior to publication, which delays scientific progress; journals are often unnecessarily expensive, with high profit margins; papers have to find big effects to be considered significant enough to even review; reviewers are often inexpert, usually uncompensated, rarely timely, and sometimes quite obnoxious; the reviews are rarely published, which means the reviewer perspective is rarely heard; reviews are not re-used, so when a paper transitions from journal to journal, the reviews must be redone; editors make incomprehensible decisions; the process is slow; the process is opaque; there's little accountability; and probably many other things that I am missing.

A précis might be that, so far, publishing and peer review have really failed to take deep advantage of the opportunities provided by quick, easy, instantaneous worldwide communication.

In recognition of this idea that there may be better ways to communicate science, I've indulged in lots of experiments in alt-publishing since becoming an Assistant Professor - I blog and tweet a lot, I sign my reviews, I (used to) post many of my reviews on my blog, I've been trying out F1000Research's open review system, most of our papers are written openly & blogged & preprinted, I've indulged in authorship experiments, and we've tried pre- and post-publication peer review in journal clubs and on my blog.

As a result of all of these experiments, I've learned a lot about how the system works and reacts to attempts to change. It's mostly been positive (and I certainly have no complaints about my career trajectory). But the main conclusion I've reached is a tentative and disappointing one - the whole system is complicated, and is deeply rooted in the surprisingly conservative culture of science. And, because of this, I don't think there's a simple way to cut the Gordian Knot and move quickly to a new publication system.

In recent weeks, Michael Eisen has been very vocal about changing to a system of post-publication peer review (PPPR). (Also see this more detailed blog post.) This morning, I objected to some snark directed at Michael and other PPPR advocates, and this led to a decent amount of back and forth on Twitter (see this tweet and descendants). But, of course, Twitter isn't great for conveying complicated positions, so I thought I'd take to the blogs.

Below, I tried to distill my opinions down to three main points. Let's see how it goes.

## 1. Practice in this area is still pretty shallow in science.

I don't think we have enough experience to make decisions yet. We need more experiments.

There's (IMO) legitimate confusion over questions like anonymity and platform. The question of metrics (qualitative or quantitative) rears its ugly head - when is something adequately peer reviewed, and when does something "count" in whatever way it needs to count? And how are we supposed to evaluate a deeply technical paper that's outside of our field? What if it has a mixture of good and bad reviews? How do we focus our attention on papers outside of our immediate sphere? Anyone who thinks hard about these things and reaches a simple conclusion is kidding themselves.

(Of course, the fun bit is that if you think hard about these things you quickly reach the conclusion that our current practice is horrible, too. But that doesn't mean new practices are automatically better.)

## 2. The Internet is not a friendly place, and it takes a lot of work to create well-behaving communities.

As my corner of science has struggled to embrace "online", I've watched scientists recapitulate the same mistakes that many open source and social media communities made over the last two decades.

The following is a good generalization, in my experience:

Any worthwhile community these days has active (if light) moderation, a code of conduct laying out the rules, and a number of active participants who set the tone and invite new people into the community. These communities take time to grow and cannot be created by fiat.

I think any proposal for PPPR needs to explicitly address the questions of methods of moderation, selection of moderators, expected conduct, and what to do with bad actors. It also needs to address community growth and sustainability. The latter questions are where most PPPR commenters focus, but the former questions are critical as well.

## 3. Scholarship in this area is still pretty young.

There are a lot of knee-jerk reactions (PPPR good! PPPR bad! I like blue!), but there isn't a lot of scholarship and discussion in this area, and most of what exists happens on blogs (which are largely invisible to 99% of scientists). There are only a few people thinking hard about the whole picture; I really like some of Cameron Neylon's posts, for example, but the fact that his work stands out to me so much is (to me) a bad sign. (I think that's a compliment to you and a diss on the field as a whole, Cameron ;)

Worse, I've been entirely spoiled by reading Gabriella Coleman's book on Anonymous, "Hacker, Hoaxer, Whistleblower, Spy". This kind of deep, immersive research and reporting on how technological communities operate is hard to find, and most of what I have found is ignorant of the new uses and opportunities of technology. I've found nothing equivalent for science. (Pointers VERY welcome!)

## Concluding thoughts

At the end of the day, it's not clear to me that we will ever have an answer to Goodhart's Law -- "when a measure becomes a target, it ceases to be a good measure." There are tremendous pressures (livelihoods, reputation, funding) that will be placed on any system of peer review and publishing. I worry that any system we come up with will be even more easily perverted than the current system has been, and science (and scientists, and scientific progress) will suffer as a result.

Me? I'm going to continue experimenting, and talking with people, and seeing if I can identify and promulgate good practice from the bottom up. 'cause that's how I roll.

--titus

p.s. The reason I'm not posting reviews on my blog anymore has to do with time and energy - I've been overwhelmed for the last year or two. I think I need a better workflow for posting them that takes less of my focus.