April 29, 2016

Continuum Analytics news

Open Data Science: Bringing “Magic” to Modern Analytics

Posted Friday, April 29, 2016

Science fiction author Arthur C. Clarke once wrote, “any sufficiently advanced technology is indistinguishable from magic.”

We’re nearer than ever to that incomprehensible, magical future. Our gadgets understand our speech, driverless cars have made their debut and we’ll soon be viewing virtual worlds at home.

These “magical” technologies spring from a 21st-century spirit of innovation—but not only from big companies. Thanks to the Internet—and to the open source movement—companies of all sizes are able to spur advancements in science and technology.

It’s no different for advanced analytics. And it’s about time.

In the past, our analytics tools were proprietary, product-oriented solutions. These were necessarily limited in flexibility and they locked customers into the slow innovation cycles and whims of vendors. These closed-source solutions forced a “one size fits all” approach to analytics with monolithic tools that did not offer easy customization for different needs.

Open Data Science has changed that. It offers innovative software—free of proprietary restrictions and tailorable for all varieties of data science teams—created in the transparent collaboration that is driving today’s tech boom.

The Magic 8-Ball of Automated Modeling

One of Open Data Science's most visible points of innovation is in the sphere of data science modeling.

Initially, models were created exclusively by statisticians and analysts for business professionals, but demand from the business sector for software that could do this job gave rise to automatic model fitting—often called “black box” analytics—in which analysts let software algorithmically generate predictive models that fit the data.

Such a system creates models, but much like a magic 8-ball, it offers its users answers without business explanations. Mysteries are fun for toys, but no business will bet on them. Quite understandably, no marketing manager or product manager wants to approach the CEO with predictions, only to be stumped when asked how those predictions were reached. As Clarke knew, it’s not really magic creating the models; it’s advanced technology, and it too operates under assumptions that might or might not make sense for the business.

App Starters Mean More Transparent Modeling

Today’s business professionals want faster time-to-value and are dazzled by advanced technologies like automated model fitting, but they also want to understand exactly how and why they work.

That’s why Continuum Analytics is hard at work on Open Data Science solutions, including Anaconda App Starters, expected to debut later this year. App Starters are solution “templates” designed to deliver 60-80 percent of a data science solution, giving businesses an easy starting point. They serve the same purpose as the “black box”—faster time-to-value—but are not a “black box”: they allow analysts to see exactly how a model was created and to tweak models as desired.

Because App Starters are based on Open Data Science, they don’t include proprietary restrictions that keep business professionals or data scientists in the dark regarding the analytics pipeline, including the algorithms. They still provide the value of “automagically” creating models, but the details of how they do so are transparent and accessible to the team. With App Starters, business professionals will finally have confidence in the models they’re using to formulate business strategies, while getting faster time-to-value from their growing data.

Over time, App Starters will become more sophisticated and will include recommendations—much as Netflix recommends movies and TV shows for your watching pleasure—that learn and suggest the algorithms and visualizations that best fit the data. Unlike “black boxes,” the entire narrative explaining why a recommendation is offered will be available, so the business analyst can understand it and gain confidence in it. The analyst can then use a recommendation as-is, tweak it, use the template without recommendations, or tune the suggested models to find a perfect fit. This type of innovation will further the advancement of sophisticated data science solutions that realize more business value, while instilling confidence in the solution.

Casting Spells with Anaconda

Although App Starters are about to shake up automated modeling, businesses need to meld new ideas with tried-and-true solutions. In business analytics, for instance, tools like Microsoft Excel are a staple of the field, and being able to integrate them with newer “magic” is highly desirable.

Fortunately, interoperability is one of the keystones of the Open Data Science philosophy and Anaconda provides a way to bridge the reliable old world with the magical new one. With Anaconda, analysts who are comfortable using Excel have an entry point into the world of predictive analytics from the comfort of their spreadsheets. By using the same familiar interface, analysts can access powerful Python libraries to apply cutting-edge analytics to their data. Anaconda recognizes that business analysts want to improve—not disrupt—a proven workflow.

Because Anaconda leverages the Python ecosystem, analysts using Anaconda will achieve powerful results. They might apply a formula to an Excel sheet with a million data rows to predict repeat customers or they may create beautiful, informative visualizations to show how sales have shifted to a new demographic after the company’s newest marketing campaign kicked off. With Anaconda, business analysts can continue using Excel as their main interface, while harnessing the newest “magic” available in the open source community.

Open Data Science for Wizards…and Apprentices

Open Data Science is an inclusive movement. Although open source languages like Python and R dominate data science and allow for the most advanced—and therefore “magical”—analytics technology available, the community is open to all levels of expertise.

Anaconda is a great way for business analysts, for example, to embark on the road toward advanced analytics. But solutions like App Starters give advanced wizards the algorithmic visibility to alter and improve models as they see fit.

Open Data Science gives us the “sufficiently advanced technology” that Arthur C. Clarke mentioned—but it puts the power of that magic in our hands.

by pcudia at April 29, 2016 01:05 PM

April 26, 2016

Matthieu Brucher

Analog modeling of a diode clipper (3b): Simulation

Let’s dive directly into the second diode clipper and follow exactly the same pattern.

Second diode clipper

So first let’s remember the equation:

\frac{dV_o}{dt} = \frac{V_i - V_o}{R_1 C_1} - \frac{2 I_s}{C_1} \sinh\left(\frac{V_o}{n V_t}\right)

Forward Euler

The forward Euler approximation is then:

V_{on+1} = V_{on} + h\left(\frac{V_{in+1} - V_{on}}{R_1 C_1} - \frac{2 I_s}{C_1} \sinh\left(\frac{V_{on}}{n V_t}\right)\right)

Backward Euler

Backward Euler approximation is now:

V_{on+1} - V_{on} = h\left(\frac{V_{in+1} - V_{on+1}}{R_1 C_1} - \frac{2 I_s}{C_1} \sinh\left(\frac{V_{on+1}}{n V_t}\right)\right)

(The equations are definitely easier to derive…)

Trapezoidal rule

And finally trapezoidal rule gives:

V_{on+1} - V_{on} = h\left(\frac{V_{in+1}}{R_1 C_1} - \frac{V_{on+1} + V_{on}}{2 R_1 C_1} - \frac{I_s}{C_1}\left(\sinh\left(\frac{V_{on+1}}{n V_t}\right) + \sinh\left(\frac{V_{on}}{n V_t}\right)\right)\right)

Starting estimates

For the estimates, we use exactly the same methods as for the previous clipper, so I won’t repeat them here.
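To make the solver concrete, here is a minimal Python sketch (not the post’s actual implementation) of the Newton-Raphson iteration used to solve the backward Euler update above; the component values, tolerance and function name are illustrative assumptions:

import numpy as np

# Illustrative component values, not taken from the post
R1, C1 = 10e3, 10e-9           # series resistor and capacitor
Is, n, Vt = 1e-12, 1.0, 26e-3  # diode saturation current, ideality factor, thermal voltage

def backward_euler_step(vin1, von, h, x0, tol=1e-8, max_iter=50):
    """Solve the implicit backward Euler equation for V_{on+1} with Newton-Raphson."""
    x = x0  # starting estimate (default/linear/affine, as discussed above)
    for _ in range(max_iter):
        # Residual of: x - von - h * ((vin1 - x)/(R1*C1) - (2*Is/C1)*sinh(x/(n*Vt))) = 0
        f = x - von - h * ((vin1 - x) / (R1 * C1)
                           - (2 * Is / C1) * np.sinh(x / (n * Vt)))
        # Derivative of the residual with respect to x
        fprime = 1 + h / (R1 * C1) + h * (2 * Is / (C1 * n * Vt)) * np.cosh(x / (n * Vt))
        dx = f / fprime
        x -= dx
        if abs(dx) < tol:
            break
    return x

The trapezoidal update is solved the same way; only the residual and its derivative change.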

Graphs

Let’s start with the comparison of the three different methods:

Numerical optimization comparison

The first obvious change is that forward Euler can give pretty good results here. This makes me think I may have made a mistake in the previous circuit, but since I had to derive the equation there before applying the approximation, that may be the reason.

For the original estimates, just like last time, the results are identical:

Original estimates comparison

OK, let’s compare the result of the first iteration with different original estimates:
One step comparison

All estimates give a similar result, but the affine estimate is better than the linear one, which in turn is far better than the default/copying estimate.

Conclusion

Just for fun, let’s display the difference between the two clippers:

Diode clippers comparison

Obviously, the second clipper is more symmetric than the first one and thus creates fewer harmonics (which is confirmed by a spectrogram); it is also easier to optimize (the second clipper uses at least one fewer iteration than the first one).

All things considered, the Newton-Raphson algorithm is always efficient, with around three or fewer iterations for these circuits. Trying bisection or something else may not be that interesting, except if you are heavily using SIMD instructions; in that case, the optimization may be faster because you have a similar number of iterations.

Original estimates based on the last optimized value always work well, although affine estimates are usually faster. The tricky part is deriving the equations, and more often than not, you make mistakes when implementing them!

Next step: DK method…


by Matt at April 26, 2016 07:40 AM

April 25, 2016

Continuum Analytics news

Accelerate 2.2 Released!

Posted Monday, April 25, 2016

We're happy to announce the latest update to Accelerate with the release of version 2.2. This version of Accelerate adds compatibility with the recently released Numba 0.25, and also expands the Anaconda Platform in two new directions:

  • Data profiling
  • MKL-accelerated ufuncs

 I'll discuss each of these in detail below.

Data Profiling

We've built up quite a bit of experience over the years optimizing numerical Python code for our customers, and these projects follow some common patterns. First, the most important step in the optimization process is profiling a realistic test case. You can't improve what you can't measure, and profiling is critical to identify the true bottlenecks in an application. Even experienced developers are often surprised by profiling results when they see which functions are consuming the most time. Ensuring the test case is realistic (but not necessarily long) is also very important, as unit and functional tests for applications tend to use smaller, or differently shaped, input data sets. The scaling behavior of many algorithms is non-linear, so profiling with a very small input can give misleading results.

The second step in optimization is to consider alternative implementations for the critical functions identified in the first step, possibly adopting a different algorithm, parallelizing the calculation to make use of multiple cores or a GPU, or moving up a level to eliminate or batch unnecessary calls to the function. In this step of the process, we often found ourselves lacking a critical piece of information: what data types and sizes were being passed to this function? The best approach often depends on this information. Are these NumPy arrays or custom classes? Are the arrays large or small? 32-bit or 64-bit float? What dimensionality? Large arrays might benefit from GPU acceleration, but small arrays often require moving up the call stack in order to see if calculations can be batched.

Rather than having to manually modify the code to collect this data type information in an ad-hoc way, we've added a new profiling tool to Accelerate that can record this type information as a part of normal profiling. For lack of a better term, we're calling this "data profiling."

We collect this extra information using a modified version of the built-in Python profiling mechanism, and can display it using the standard pstats-style table:

ncalls  tottime percall cumtime percall filename:lineno(function)
300/100 0.01313 0.0001313 0.03036 0.0003036  linalg.py:532(cholesky(a:ndarray(dtype=float64, shape=(3, 3))))
200/100 0.004237 4.237e-05 0.007189 7.189e-05 linalg.py:139(_commonType())
200/100 0.003431 3.431e-05 0.005312 5.312e-05 linalg.py:106(_makearray(a:ndarray(dtype=float64, shape=(3, 3))))
400/200 0.002663 1.332e-05 0.002663 1.332e-05 linalg.py:111(isComplexType(t:type))
300/100 0.002185 2.185e-05 0.002185 2.185e-05 linalg.py:209(_assertNdSquareness())
200/100 0.001592 1.592e-05 0.001592 1.592e-05 linalg.py:124(_realType(t:type, default:NoneType))
200/100 0.00107 1.07e-05 0.00107 1.07e-05 linalg.py:198(_assertRankAtLeast2())
100 0.000162 1.62e-06 0.000162 1.62e-06 linalg.py:101(get_linalg_error_extobj(callback:function))

The recorded function signatures now include data types, and NumPy arrays also have dtype and shape information. In the above example, we've selected only the linear algebra calls from the execution of a PyMC model. Here we can clearly see the Cholesky decomposition is being done on 3x3 matrices, which would dictate our optimization strategy if cholesky was the bottleneck in the code (in this case, it is not).

We've also integrated the SnakeViz profile visualization tool into the Accelerate profiler, so you can easily collect and view profile information right inside your Jupyter notebooks:

All it takes to profile a function and view it in a notebook is a few lines:

from accelerate import profiler

p = profiler.Profile()
p.run('my_function_to_profile()')
profiler.plot(p)

MKL-Accelerated Ufuncs

MKL is perhaps best known for high performance, multi-threaded linear algebra functionality, but MKL also provides highly optimized math functions, like sin() and cos() for arrays. Anaconda already ships with the numexpr library, which is linked against MKL to provide fast array math support. However, we have future plans for Accelerate that go beyond what numexpr can provide, so in the latest release of Accelerate, we've exposed the MKL array math functions as NumPy ufuncs you can call directly.

For code that makes extensive use of special math functions on arrays with many thousands of elements, the performance speedup is quite amazing:

import numpy as np
from accelerate.mkl import ufuncs as mkl_ufuncs

def spherical_to_cartesian_numpy(r, theta, phi):
    cos_theta = np.cos(theta)
    sin_theta = np.sin(theta)
    cos_phi = np.cos(phi)
    sin_phi = np.sin(phi)

    x = r * sin_theta * cos_phi
    y = r * sin_theta * sin_phi
    z = r * cos_theta
    return x, y, z

def spherical_to_cartesian_mkl(r, theta, phi):
    cos_theta = mkl_ufuncs.cos(theta)
    sin_theta = mkl_ufuncs.sin(theta)
    cos_phi = mkl_ufuncs.cos(phi)
    sin_phi = mkl_ufuncs.sin(phi)

    x = r * sin_theta * cos_phi
    y = r * sin_theta * sin_phi
    z = r * cos_theta
    return x, y, z

n = 100000
r, theta, phi = np.random.uniform(1, 10, n), np.random.uniform(0, np.pi, n), np.random.uniform(-np.pi, np.pi, n)

%timeit spherical_to_cartesian_numpy(r, theta, phi)
%timeit spherical_to_cartesian_mkl(r, theta, phi)

    100 loops, best of 3: 7.01 ms per loop
    1000 loops, best of 3: 978 µs per loop

A speedup of 7x is not bad for a 2.3 GHz quad core laptop CPU from 2012. In future releases, we are looking to expand and integrate this functionality further into the Anaconda Platform, so stay tuned!

Summary

For more information about these new features, take a look at the Accelerate manual.

You can install Accelerate with conda and use it free for 30 days:

conda install accelerate

Try it out, and let us know what you think. Academic users can get a free subscription to Anaconda (including several useful tools, like Accelerate) by following these instructions. Contact sales@continuum.io to find out how to get a subscription to Anaconda at your organization.

 

by swebster at April 25, 2016 02:27 PM

April 20, 2016

Matthew Rocklin

Ad Hoc Distributed Random Forests

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

A screencast version of this post is available here: https://www.youtube.com/watch?v=FkPlEqB8AnE

TL;DR.

Dask.distributed lets you submit individual tasks to the cluster. We use this ability combined with Scikit Learn to train and run a distributed random forest on distributed tabular NYC Taxi data.

Our machine learning model does not perform well, but we do learn how to execute ad-hoc computations easily.

Motivation

In the past few posts we analyzed data on a cluster with Dask collections:

  1. Dask.bag on JSON records
  2. Dask.dataframe on CSV data
  3. Dask.array on HDF5 data

Often our computations don’t fit neatly into the bag, dataframe, or array abstractions. In these cases we want the flexibility of normal code with for loops, but still with the computational power of a cluster. With the dask.distributed task interface, we achieve something close to this.

Application: Naive Distributed Random Forest Algorithm

As a motivating application we build a random forest algorithm from the ground up using the single-machine Scikit Learn library, and dask.distributed’s ability to quickly submit individual tasks to run on the cluster. Our algorithm will look like the following:

  1. Pull data from some external source (S3) into several dataframes on the cluster
  2. For each dataframe, create and train one RandomForestClassifier
  3. Scatter single testing dataframe to all machines
  4. For each RandomForestClassifier predict output on test dataframe
  5. Aggregate independent predictions from each classifier together by a majority vote. To avoid bringing too much data to any one machine, perform this majority vote as a tree reduction.

Data: NYC Taxi 2015

As in our blogpost on distributed dataframes we use the data on all NYC Taxi rides in 2015. This is around 20GB on disk and 60GB in RAM.

We predict the number of passengers in each cab given the other numeric columns like pickup and destination location, fare breakdown, distance, etc.

We do this first on a small bit of data on a single machine and then on the entire dataset on the cluster. Our cluster is composed of twelve m4.xlarges (4 cores, 15GB RAM each).

Disclaimer and Spoiler Alert: I am not an expert in machine learning. Our algorithm will perform very poorly. If you’re excited about machine learning you can stop reading here. However, if you’re interested in how to build distributed algorithms with Dask then you may want to read on, especially if you happen to know enough machine learning to improve upon my naive solution.

API: submit, map, gather

We use a small number of dask.distributed functions to build our computation:

futures = executor.scatter(data)                     # scatter data
future = executor.submit(function, *args, **kwargs)  # submit single task
futures = executor.map(function, sequence)           # submit many tasks
results = executor.gather(futures)                   # gather results
executor.replicate(futures, n=number_of_replications)

In particular, functions like executor.submit(function, *args) let us send individual functions out to our cluster thousands of times a second. Because these functions consume their own results we can create complex workflows that stay entirely on the cluster and trust the distributed scheduler to move data around intelligently.

Load Pandas from S3

First we load data from Amazon S3. We use the s3.read_csv(..., collection=False) function to load 178 Pandas DataFrames on our cluster from CSV data on S3. We get back a list of Future objects that refer to these remote dataframes. The use of collection=False gives us this list of futures rather than a single cohesive Dask.dataframe object.

from distributed import Executor, s3
e = Executor('52.91.1.177:8786')

dfs = s3.read_csv('dask-data/nyc-taxi/2015',
                  parse_dates=['tpep_pickup_datetime',
                               'tpep_dropoff_datetime'],
                  collection=False)
dfs = e.compute(dfs)

Each of these is a lightweight Future pointing to a pandas.DataFrame on the cluster.

>>> dfs[:5]
[<Future: status: finished, type: DataFrame, key: finalize-a06c3dd25769f434978fa27d5a4cf24b>,
 <Future: status: finished, type: DataFrame, key: finalize-7dcb27364a8701f45cb02d2fe034728a>,
 <Future: status: finished, type: DataFrame, key: finalize-b0dfe075000bd59c3a90bfdf89a990da>,
 <Future: status: finished, type: DataFrame, key: finalize-1c9bb25cefa1b892fac9b48c0aef7e04>,
 <Future: status: finished, type: DataFrame, key: finalize-c8254256b09ae287badca3cf6d9e3142>]

If we’re willing to wait a bit then we can pull data from any future back to our local process using the .result() method. We don’t want to do this too much though, data transfer can be expensive and we can’t hold the entire dataset in the memory of a single machine. Here we just bring back one of the dataframes:

>>> df = dfs[0].result()
>>> df.head()
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2015-01-15 19:05:39 2015-01-15 19:23:42 1 1.59 -73.993896 40.750111 1 N -73.974785 40.750618 1 12.0 1.0 0.5 3.25 0 0.3 17.05
1 1 2015-01-10 20:33:38 2015-01-10 20:53:28 1 3.30 -74.001648 40.724243 1 N -73.994415 40.759109 1 14.5 0.5 0.5 2.00 0 0.3 17.80
2 1 2015-01-10 20:33:38 2015-01-10 20:43:41 1 1.80 -73.963341 40.802788 1 N -73.951820 40.824413 2 9.5 0.5 0.5 0.00 0 0.3 10.80
3 1 2015-01-10 20:33:39 2015-01-10 20:35:31 1 0.50 -74.009087 40.713818 1 N -74.004326 40.719986 2 3.5 0.5 0.5 0.00 0 0.3 4.80
4 1 2015-01-10 20:33:39 2015-01-10 20:52:58 1 3.00 -73.971176 40.762428 1 N -74.004181 40.742653 2 15.0 0.5 0.5 0.00 0 0.3 16.30

Train on a single machine

To start, let’s go through the standard Scikit Learn fit/predict/score cycle with this small bit of data on a single machine.

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split

df_train, df_test = train_test_split(df)

columns = ['trip_distance', 'pickup_longitude', 'pickup_latitude',
           'dropoff_longitude', 'dropoff_latitude', 'payment_type',
           'fare_amount', 'mta_tax', 'tip_amount', 'tolls_amount']

est = RandomForestClassifier(n_estimators=4)
est.fit(df_train[columns], df_train.passenger_count)

This builds a RandomForestClassifier with four decision trees and then trains it against the numeric columns in the data, trying to predict the passenger_count column. It takes around 10 seconds to train on a single core. We now see how well we do on the holdout testing data:

>>> est.score(df_test[columns], df_test.passenger_count)
0.65808188654721012

This 65% accuracy is actually pretty poor. About 70% of the rides in NYC have a single passenger, so the model of “always guess one” would out-perform our fancy random forest.

>>> from sklearn.metrics import accuracy_score
>>> import numpy as np
>>> accuracy_score(df_test.passenger_count,
...                np.ones_like(df_test.passenger_count))
0.70669390028780987

This is where my ignorance in machine learning really kills us. There is likely a simple way to improve this. However, because I’m more interested in showing how to build distributed computations with Dask than in actually doing machine learning I’m going to go ahead with this naive approach. Spoiler alert: we’re going to do a lot of computation and still not beat the “always guess one” strategy.

Fit across the cluster with executor.map

First we build a function that does just what we did before, builds a random forest and then trains it on a dataframe.

def fit(df):
    est = RandomForestClassifier(n_estimators=4)
    est.fit(df[columns], df.passenger_count)
    return est

Second we call this function on all of our training dataframes on the cluster using the standard e.map(function, sequence) function. This sends out many small tasks for the cluster to run. We use all but the last dataframe for training data and hold out the last dataframe for testing. There are more principled ways to do this, but again we’re going to charge ahead here.

train = dfs[:-1]
test = dfs[-1]

estimators = e.map(fit, train)

This takes around two minutes to train on all of the 177 dataframes and now we have 177 independent estimators, each capable of guessing how many passengers a particular ride had. There is relatively little overhead in this computation.

Predict on testing data

Recall that we kept separate a future, test, that points to a Pandas dataframe on the cluster that was not used to train any of our 177 estimators. We’re going to replicate this dataframe across all workers on the cluster and then ask each estimator to predict the number of passengers for each ride in this dataset.

e.replicate([test], n=48)

def predict(est, X):
    return est.predict(X[columns])

predictions = [e.submit(predict, est, test) for est in estimators]

Here we used the executor.submit(function, *args, **kwargs) function in a list comprehension to individually launch many tasks. The scheduler determines when and where to run these tasks for optimal computation time and minimal data transfer. As with all functions, this returns futures that we can use to collect data if we want in the future.

Developers note: we explicitly replicate here in order to take advantage of efficient tree-broadcasting algorithms. This is purely a performance consideration, everything would have worked fine without this, but the explicit broadcast turns a 30s communication+computation into a 2s communication+computation.

Aggregate predictions by majority vote

For each estimator we now have an independent prediction of the passenger counts for all of the rides in our test data. In other words for each ride we have 177 different opinions on how many passengers were in the cab. By averaging these opinions together we hope to achieve a more accurate consensus opinion.

For example, consider the first four prediction arrays:

>>> a_few_predictions = e.gather(predictions[:4])  # remote futures -> local arrays
>>> a_few_predictions
[array([1, 2, 1, ..., 2, 2, 1]),
 array([1, 1, 1, ..., 1, 1, 1]),
 array([2, 1, 1, ..., 1, 1, 1]),
 array([1, 1, 1, ..., 1, 1, 1])]

For the first ride/column we see that three of the four predictions are for a single passenger while one prediction disagrees and is for two passengers. We create a consensus opinion by taking the mode of the stacked arrays:

from scipy.stats import mode
import numpy as np

def mymode(*arrays):
    array = np.stack(arrays, axis=0)
    return mode(array)[0][0]

>>> mymode(*a_few_predictions)
array([1, 1, 1, ..., 1, 1, 1])

And so when we average these four prediction arrays together we see that the majority opinion of one passenger dominates for all of the six rides visible here.

Tree Reduction

We could call our mymode function on all of our predictions like this:

>>> mode_prediction = e.submit(mymode, *predictions)  # this doesn't scale well

Unfortunately this would move all of our results to a single machine to compute the mode there. This might swamp that single machine.

Instead we batch our predictions into groups of size 10, take the mode of each group, and then repeat the process with the smaller set of predictions until we have only one left. This sort of multi-step reduction is called a tree reduction. We can write it up with a couple nested loops and executor.submit. This is only an approximation of the mode, but it’s a much more scalable computation. This finishes in about 1.5 seconds.

from toolz import partition_all

while len(predictions) > 1:
    predictions = [e.submit(mymode, *chunk)
                   for chunk in partition_all(10, predictions)]

result = e.gather(predictions)[0]

>>> result
array([1, 1, 1, ..., 1, 1, 1])

Final Score

Finally, after completing all of our work on our cluster we can see how well our distributed random forest algorithm does.

>>> accuracy_score(result, test.result().passenger_count)
0.67061974451423045

Still worse than the naive “always guess one” strategy. This just goes to show that, no matter how sophisticated your Big Data solution is, there is no substitute for common sense and a little bit of domain expertise.

What didn’t work

As always I’ll have a section like this that honestly says what doesn’t work well and what I would have done with more time.

  • Clearly this would have benefited from more machine learning knowledge. What would have been a good approach for this problem?
  • I’ve been thinking a bit about memory management of replicated data on the cluster. In this exercise we specifically replicated out the test data. Everything would have worked fine without this step but it would have been much slower as every worker gathered data from the single worker that originally had the test dataframe. Replicating data is great until you start filling up distributed RAM. It will be interesting to think of policies about when to start cleaning up redundant data and when to keep it around.
  • Several people from both open source users and Continuum customers have asked about a general Dask library for machine learning, something akin to Spark’s MLlib. Ideally a future Dask.learn module would leverage Scikit-Learn in the same way that Dask.dataframe leverages Pandas. It’s not clear how to cleanly break up and parallelize Scikit-Learn algorithms.

Conclusion

This blogpost gives a concrete example using basic task submission with executor.map and executor.submit to build a non-trivial computation. This approach is straightforward and not restrictive. Personally this interface excites me more than collections like Dask.dataframe; there is a lot of freedom in arbitrary task submission.

April 20, 2016 12:00 AM

April 19, 2016

Matthieu Brucher

Book review: The Culture Map: Decoding How People Think and Get Things Done in a Global World

I work in an international company, and there are lots of people from different cultures around me with whom I need to interact. At first glance, it feels easy to work with all of them; I mean, how difficult could it be?

Actually, it is easy, but sometimes interactions are intriguing and people do not react the way you expect. And why is that? Lots of reasons, of course, but one of them is that they have a different culture and do not expect you to explicitly tell them what they did wrong (which is something I do. A lot).

Content and opinions

Enter Erin Meyer. She has had to navigate between these cultures herself, being American, married to a Frenchman and living in France. In her book, she presents 8 scales, and each culture is placed differently on each scale.

I won’t go into all the details of the different scales, but they cover the different ways people interact with one another. Whether it’s about scheduling, feedback or decision-making, all cultures are different. And sometimes, even if two cultures are geographically close, they can be quite different on some scales. After all, they are all influenced by their short or long history, their philosophers…

All the scales are illustrated with stories from Meyer’s experience teaching them and stories from her students, and they are always spot on.

Conclusion

Of course, the book only tells you the differences, what to look for. It doesn’t teach you to do the right thing. That takes practice, and it requires work.

It also doesn’t solve all interaction problems. Everyone is different within their own culture (not even talking about people with several cultures…), sitting to the left or to the right of their culture’s average on each scale. So you can’t reduce someone to a culture. But if you want to learn more about interacting with people, you already know that.

by Matt at April 19, 2016 07:16 AM

April 18, 2016

Continuum Analytics news

Conda + Spark

Posted Tuesday, April 19, 2016

In my previous post, I described different scenarios for bootstrapping Python on a multi-node cluster. I offered a general solution using Anaconda for cluster management and a solution using a custom conda env deployed with Knit.

In a follow-up to that post, I was asked if the machinery in Knit would also work for Spark. Sure--of course! In fact, much of Knit's design comes from Spark's deploy codebase. Here, I am going to demonstrate how we can ship a Python environment, complete with desired dependencies, as part of a Spark job without installing Python on every node.

Spark YARN Deploy

First, I want to briefly describe key points in Spark's YARN deploy methodology. After negotiating which resources to provision with YARN's Resource Manager, Spark asks for a directory to be constructed on HDFS: /user/ubuntu/.sparkStaging/application_1460665326796_0065/. The directory will always be in the user's home, and the application ID issued by YARN is appended to the directory name (thinking about this now, perhaps this is obvious and straightforward to Java/JVM folks, where bundling Uber JARs has long been the practice in traditional Map-Reduce jobs). In any case, Spark then uploads itself to the stagingDirectory, and when YARN provisions a container, the contents of the directory are pulled down and the spark-assembly jar is executed. If you are using PySpark or SparkR, a corresponding pyspark.zip and sparkr.zip will be found in the staging directory as well.

Occasionally, users see FileNotFoundException errors -- this can be caused by a few things: incorrect Spark contexts, an incorrect SPARK_HOME, and I have a faint recollection that there was once a packaging problem where pyspark.zip or sparkr.zip was missing, or could not be created due to permissions. Anyway -- below is the output you will see when Spark works cleanly.

16/04/15 13:01:03 INFO Client: Uploading resource file:/opt/anaconda/share/spark-1.6.0/lib/spark-assembly-1.6.0-hadoop2.6.0.jar -> hdfs://ip-172-31-50-60:9000/user/ubuntu/.sparkStaging/application_1460665326796_0065/spark-assembly-1.6.0-hadoop2.6.0.jar

16/04/15 13:01:07 INFO Client: Uploading resource file:/opt/anaconda/share/spark-1.6.0/python/lib/pyspark.zip -> hdfs://ip-172-31-50-60:9000/user/ubuntu/.sparkStaging/application_1460665326796_0065/pyspark.zip

Not terribly exciting, but positive confirmation that Spark is uploading local files to HDFS.

Bootstrap-Fu Redux

Most of what I described above is what the YARN framework allows developers to do -- it's more that Spark implements a YARN application than Spark doing magical things (and Knit as well!). If I were using Scala/Java, I would package up everything in a jar and use spark-submit -- Done!

Unfortunately, there's a little more work to be done for an Uber Python jar equivalent.

Hard-Links Won't Travel

One of the killer features of conda is environment management. When conda creates a new environment, it uses hard-links when possible. Generally, this greatly reduces disk usage. But, if we move the directory to another machine, we're probably just moving a handful of hard-links and not the files themselves. Fortunately, we can tell conda: "No! Copy the files!"

For example:

conda create -p /home/ubuntu/dev --copy -y -q python=3 pandas scikit-learn

By using --copy, we "install all packages using copies instead of hard or soft-linking." The headers of various files in the bin/ directory may have lines like #!/home/ubuntu/dev/bin/python. But we don't need to be concerned about that -- we're not going to be using 2to3, idle, pip, etc. If we zipped up the environment, we could move it onto another machine of a similar OS type, execute Python, and we'd be able to load any library in the lib/python3.5/site-packages directory.

We're very close to our Uber Python jar -- now with a zipped conda directory in mind, let's proceed.

zip -r dev.zip dev

Death by ENV Vars

We are going to need a handful of specific command line options and environment variables: Spark Yarn Configuration and Spark Environment Variables. We'll be using:

  • PYSPARK_PYTHON: The Python binary Spark should use
  • spark.yarn.appMasterEnv.PYSPARK_PYTHON (though this one could be wrong/unnecessary/only used for --master yarn-cluster)
  • --archives: include local tgz/jar/zip in .sparkStaging directory and pull down into temporary YARN container

We'll also need a test script. The following is a reasonable test to prove which Python Spark is using -- we're writing a no-op function which returns the various paths Python is using to find libraries.

# test_spark.py
import os
import sys

from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf()
conf.setAppName("get-hosts")

sc = SparkContext(conf=conf)

def noop(x):
    import socket
    import sys
    return socket.gethostname() + ' '.join(sys.path) + ' '.join(os.environ)

rdd = sc.parallelize(range(1000), 100)
hosts = rdd.map(noop).distinct().collect()
print(hosts)

And executing everything together:

PYSPARK_PYTHON=./ANACONDA/dev/bin/python spark-submit \
 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ANACONDA/dev/bin/python \
 --master yarn-cluster \
 --archives /home/ubuntu/dev.zip#ANACONDA \
 /home/ubuntu/test_spark.py

We'll get the following output in the yarn logs:

'ip-172-31-50-61 . /var/lib/hadoop- yarn/data/1/yarn/local/usercache/ubuntu/filecache/207/spark-assembly-1.6.0- hadoop2.6.0.jar /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/appcach e/application_1460665326796_0070/container_1460665326796_0070_01_000003/{{PWD}} /pyspark.zip{{PWD}}/py4j-0.9-src.zip /var/lib/hadoop-yarn/data/1/yarn/loca l/usercache/ubuntu/appcache/application_1460665326796_0070/container_1460665326 796_0070_01_000003/pyspark.zip /var/lib/hadoop-yarn/data/1/yarn/local/usercache /ubuntu/appcache/application_1460665326796_0070/container_1460665326796_0070_01 _000003/py4j-0.9-src.zip /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubunt u/appcache/application_1460665326796_0070/container_1460665326796_0070_01_00000 3/ANACONDA/dev/lib/python35.zip /var/lib/hadoop-yarn/data/1/yarn/local/usercach e/ubuntu/appcache/application_1460665326796_0070/container_1460665326796_0070_0 1_000003/ANACONDA/dev/lib/python3.5 /var/lib/hadoop-yarn/data/1/yarn/local/user cache/ubuntu/appcache/application_1460665326796_0070/container_1460665326796_00 70_01_000003/ANACONDA/dev/lib/python3.5/plat-linux /var/lib/hadoop-yarn/data/1/ yarn/local/usercache/ubuntu/appcache/application_1460665326796_0070/container_1 460665326796_0070_01_000003/ANACONDA/dev/lib/python3.5/lib-dynload /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/208/dev. zip/dev/lib/python3.5/site-packages/setuptools-20.6.7-py3.5.egg /var/lib/hadoop -yarn/data/1/yarn/local/usercache/ubuntu/appcache/application_1460665326796_007 0/container_1460665326796_0070_01_000003/ANACONDA/dev/lib/python3.5/site- packages ...', 'ip-172-31-50-62 . /var/lib/hadoop- yarn/data/1/yarn/local/usercache/ubuntu/filecache/209/spark-assembly-1.6.0- hadoop2.6.0.jar /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/appcach e/application_1460665326796_0070/container_1460665326796_0070_01_000002/{{PWD}} /pyspark.zip{{PWD}}/py4j-0.9-src.zip /var/lib/hadoop-yarn/data/1/yarn/loca l/usercache/ubuntu/appcache/application_1460665326796_0070/container_1460665326 796_0070_01_000002/pyspark.zip /var/lib/hadoop-yarn/data/1/yarn/local/usercache /ubuntu/appcache/application_1460665326796_0070/container_1460665326796_0070_01 _000002/py4j-0.9-src.zip /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubunt u/appcache/application_1460665326796_0070/container_1460665326796_0070_01_00000 2/ANACONDA/dev/lib/python35.zip /var/lib/hadoop-yarn/data/1/yarn/local/usercach e/ubuntu/appcache/application_1460665326796_0070/container_1460665326796_0070_0 1_000002/ANACONDA/dev/lib/python3.5 /var/lib/hadoop-yarn/data/1/yarn/local/user cache/ubuntu/appcache/application_1460665326796_0070/container_1460665326796_00 70_01_000002/ANACONDA/dev/lib/python3.5/plat-linux /var/lib/hadoop-yarn/data/1/ yarn/local/usercache/ubuntu/appcache/application_1460665326796_0070/container_1 460665326796_0070_01_000002/ANACONDA/dev/lib/python3.5/lib-dynload /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/211/dev. zip/dev/lib/python3.5/site-packages/setuptools-20.6.7-py3.5.egg /var/lib/hadoop -yarn/data/1/yarn/local/usercache/ubuntu/appcache/application_1460665326796_007 0/container_1460665326796_0070_01_000002/ANACONDA/dev/lib/python3.5/site- packages ...'

It's a little hard to parse -- what should be noted are file paths like:

.../container_1460665326796_0070_01_000002/ANACONDA/dev/lib/python3.5/site-packages

This is demonstrating that Spark is using the unzipped directory in the YARN container. Ta-da!

Thoughts

Okay, perhaps that's not super exciting, so let's zoom out again:

  1. We create a zipped conda environment with dependencies: pandas, python=3,...
  2. We successfully launched a Python Spark job without any Python binaries or libraries previously installed on the nodes.

There is an open JIRA ticket discussing the option of having Spark ingest a requirements.txt and building the Python environment as a preamble to a Spark job. This is also a fairly novel approach to the same end -- using Spark to bootstrap a runtime environment. It's even a bit more general, since the method described above relies on YARN. I first saw this strategy in use with streamparse. Similarly to the implementation in the JIRA ticket, streamparse can ship a Python requirements.txt and construct a Python environment as part of a Streamparse Storm job!

Rrrrrrrrrrrrrr

Oh, and R conda environments work as well...but it's more involved.

Create/Munge R Env

First, it's pretty cool that conda can install and manage R environments. Again, we create a conda environment with R binaries and libraries:

conda create -p /home/ubuntu/r_env --copy -y -q r-essentials -c r

R is not exactly relocatable so we need to munge a bit:

sed -i "s/home\/ubuntu/.r_env.zip/g" /home/ubuntu/r_env/bin/R

zip -r r_env.zip r_env

My R skills are at a below-novice level, so the following test script could probably be improved:

# /home/ubuntu/test_spark.R
library(SparkR)
sc <- sparkR.init(appName="get-hosts-R")

noop <- function(x) {
  path <- toString(.libPaths())
  host <- toString(Sys.info()['nodename'])
  host_path <- toString(cbind(host,path))
  host_path
}

rdd <- SparkR:::parallelize(sc, 1:1000, 100)
hosts <- SparkR:::map(rdd, noop)
d_hosts <- SparkR:::distinct(hosts)
out <- SparkR:::collect(d_hosts)
print(out)

 

Execute (and the real death by options):

SPARKR_DRIVER_R=./r_env.zip/r_env/lib/R spark-submit --master yarn-cluster \
--conf spark.yarn.appMasterEnv.R_HOME=./r_env.zip/r_env/lib64/R \
--conf spark.yarn.appMasterEnv.RHOME=./r_env.zip/r_env \
--conf spark.yarn.appMasterEnv.R_SHARE_DIR=./r_env.zip/r_env/lib/R/share \
--conf spark.yarn.appMasterEnv.R_INCLUDE_DIR=./r_env.zip/r_env/lib/R/include \
--conf spark.executorEnv.R_HOME=./r_env.zip/r_env/lib64/R \
--conf spark.executorEnv.RHOME=./r_env.zip/r_env \
--conf spark.executorEnv.R_SHARE_DIR=./r_env.zip/r_env/lib/R/share \
--conf spark.executorEnv.R_INCLUDE_DIR=./r_env.zip/r_env/lib/R/include \
--conf spark.r.command=./r_env.zip/r_env/bin/Rscript \
--archives r_env.zip \
/home/ubuntu/test_spark.R

 

Example output:

[1] "ip-172-31-50-59, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/230/sparkr.zip, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/229/r_env.zip/r_env/lib64/R/library"

[1] "ip-172-31-50-61, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/183/sparkr.zip, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/182/r_env.zip/r_env/lib64/R/library"

This post is also published on Ben's website here. 

by swebster at April 18, 2016 05:44 PM

High Performance Hadoop with Anaconda and Dask on Your Cluster

Posted Monday, April 18, 2016

Overview

Dask is a flexible open source parallel computation framework that lets you comfortably scale up and scale out your analytics. If you’re running into memory issues, storage limitations, or CPU boundaries on a single machine when using Pandas, NumPy, or other computations with Python, Dask can help you scale up on all of the cores on a single machine, or scale out on all of the cores and memory across your cluster.

Dask enables distributed computing in pure Python and complements the existing numerical and scientific computing capability within Anaconda. Dask works well on a single machine to make use of all of the cores on your laptop and process larger-than-memory data, and it scales up resiliently and elastically on clusters with hundreds of nodes.

Dask works natively from Python with data in different formats and storage systems, including the Hadoop Distributed File System (HDFS) and Amazon S3. Anaconda and Dask can work with your existing enterprise Hadoop distribution, including Cloudera CDH and Hortonworks HDP.

In this post, we’ll show you how you can use Anaconda with Dask for distributed computations and workflows, including distributed dataframes, arrays, text processing, and custom parallel workflows that can help you make the most of Anaconda and Dask on your cluster. We’ll work with Anaconda and Dask interactively from the Jupyter Notebook while the heavy computations are running on the cluster.

Installing Anaconda and Dask on your Hadoop or High Performance Computing (HPC) Cluster

There are many different ways to get started with Anaconda and Dask on your Hadoop or HPC cluster, including manual setup via SSH; by integrating with resource managers such as YARN, SGE, or Slurm; launching instances on Amazon EC2; or by using the enterprise-ready Anaconda for cluster management.

Anaconda for cluster management makes it easy to install familiar packages from Anaconda (including NumPy, SciPy, Pandas, NLTK, scikit-learn, scikit-image, and access to 720+ more packages in Anaconda) and the Dask parallel processing framework on all of your bare-metal or cloud-based cluster nodes. You can provision centrally managed installations of Anaconda, Dask and the Jupyter notebook using two simple commands with Anaconda for cluster management:

$ acluster create dask-cluster -p dask-cluster

$ acluster install dask notebook

Additional features of Anaconda for cluster management include:

  • Easily install Python and R packages across multiple cluster nodes
  • Manage multiple conda environments across a cluster
  • Push local conda environments to all cluster nodes
  • Works on cloud-based and bare-metal clusters with existing Hadoop installations
  • Remotely SSH and upload/download files to and from cluster nodes

Once you’ve installed Anaconda and Dask on your cluster, you can perform many types of distributed computations, including text processing (similar to Spark), distributed dataframes, distributed arrays, and custom parallel workflows. We’ll show some examples in the following sections.

Distributed Text and Language Processing (Dask Bag)

Dask works well with standard computations such as text processing and natural language processing and with data in different formats and storage systems (e.g., HDFS, Amazon S3, local files). The Dask Bag collection is similar to other parallel frameworks and supports operations like filter, count, fold, frequencies, pluck, and take, which are useful for working with a collection of Python objects such as text.

For example, we can use the natural language processing toolkit (NLTK) in Anaconda to perform distributed language processing on a Hadoop cluster, all while working interactively in a Jupyter notebook.

In this example, we'll use a subset of the data set that contains comments from the reddit website from January 2015 to August 2015, which is about 242 GB on disk. This data set was made available in July 2015 in a reddit post. The data set is in JSON format (one comment per line) and consists of the comment body, author, subreddit, timestamp of creation and other fields.

First, we import libraries from Dask and connect to the Dask distributed scheduler:

>>> import dask
>>> from distributed import Executor, hdfs, progress
>>> e = Executor('54.164.41.213:8786')

Next, we load 242 GB of JSON data from HDFS using pure Python:

>>> import json
>>> lines = hdfs.read_text('/user/ubuntu/RC_2015-*.json')
>>> js = lines.map(json.loads)

We can filter and load the data into distributed memory across the cluster:

>>> movies = js.filter(lambda d: 'movies' in d['subreddit'])
>>> movies = e.persist(movies)

Once we’ve loaded the data into distributed memory, we can import the NLTK library from Anaconda and construct stacked expressions to tokenize words, tag parts of speech, and filter out non-words from the dataset.

>>> import nltk
>>> pos = e.persist(movies.pluck('body')
...                       .map(nltk.word_tokenize)
...                       .map(nltk.pos_tag)
...                       .concat()
...                       .filter(lambda (word, pos): word.isalpha()))

In this example, we’ll generate a list of the top 10 proper nouns from the movies subreddit.

>>> f = e.compute(pos.filter(lambda (word, type): type == 'NNP')
...                  .pluck(0)
...                  .frequencies()
...                  .topk(10, lambda (word, count): count))

>>> f.result()
[(u'Marvel', 35452),
 (u'Star', 34849),
 (u'Batman', 31749),
 (u'Wars', 28875),
 (u'Man', 26423),
 (u'John', 25304),
 (u'Superman', 22476),
 (u'Hollywood', 19840),
 (u'Max', 19558),
 (u'CGI', 19304)]

Finally, we can use Bokeh to generate an interactive plot of the resulting data:

View the full notebook for this distributed language processing example on Anaconda Cloud.

Analysis with Distributed Dataframes (Dask DataFrame)

Dask allows you to work with familiar Pandas dataframe syntax on a single machine or on many nodes on a Hadoop or HPC cluster. You can work with data stored in different formats and storage systems (e.g., HDFS, Amazon S3, local files). The Dask DataFrame collection mimics the Pandas API, uses Pandas under the hood, and supports operations like head, groupby, value_counts, merge, and set_index.

For example, we can use Dask to perform computations with dataframes on a Hadoop cluster with data stored in HDFS, all while working interactively in a Jupyter notebook.

First, we import libraries from Dask and connect to the Dask distributed scheduler:

>>> import dask
>>> from distributed import Executor, hdfs, progress, wait, s3
>>> e = Executor('54.164.41.213:8786')

Next, we’ll load the NYC taxi data in CSV format from HDFS using pure Python and persist the data in memory:

>>> df = hdfs.read_csv('/user/ubuntu/nyc/yellow_tripdata_2015-*.csv',
...                    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
...                    header='infer')
>>> df = e.persist(df)

We can perform familiar operations such as computing value counts on columns and statistical correlations:

>>> df.payment_type.value_counts().compute()
1    91574644
2    53864648
3      503070
4      170599
5          28
Name: payment_type, dtype: int64

 

>>> df2 = df.assign(payment_2=(df.payment_type == 2),
...                 no_tip=(df.tip_amount == 0))

>>> df2.astype(int).corr().compute()
             no_tip  payment_2
no_tip     1.000000   0.943123
payment_2  0.943123   1.000000

Dask runs entirely asynchronously, leaving us free to explore other cells in the notebook while computations happen in the background. Dask also handles all of the messy CSV schema handling for us automatically.

Finally, we can use Bokeh to generate an interactive plot of the resulting data:

View the full notebook for this distributed dataframe example on Anaconda Cloud.

Numerical, Statistical and Scientific Computations with Distributed Arrays (Dask Array)

Dask works well with numerical and scientific computations on n-dimensional array data. The Dask Array collection mimics a subset of the NumPy API, uses NumPy under the hood, and supports operations like dot, flatten, max, mean, and std.

For example, we can use Dask to perform computations with arrays on a cluster with global temperature/weather data stored in NetCDF format (like HDF5), all while working interactively in a Jupyter notebook. The data files contain measurements that were taken every six hours at every quarter degree latitude and longitude.

First, we import the netCDF4 library and point to the data files stored on disk:

>>> import netCDF4
>>> from glob import glob
>>> filenames = sorted(glob('2014-*.nc3'))
>>> t2m = [netCDF4.Dataset(fn).variables['t2m'] for fn in filenames]
>>> t2m[0]
<class 'netCDF4._netCDF4.Variable'>
int16 t2m(time, latitude, longitude)
    scale_factor: 0.00159734395579
    add_offset: 268.172358066
    _FillValue: -32767
    missing_value: -32767
    units: K
    long_name: 2 metre temperature
unlimited dimensions:
current shape = (4, 721, 1440)
filling off

We then import Dask and read in the data from the NumPy arrays:

>>> import dask.array as da
>>> xs = [da.from_array(t, chunks=t.shape) for t in t2m]
>>> x = da.concatenate(xs, axis=0)

We can then perform distributed computations on the cluster, such as computing the mean temperature, variance of the temperature over time, and normalized temperature. We can view the progress of the computations as they run on the cluster nodes and continue to work in other cells in the notebook:

>>> avg, std = da.compute(x.mean(axis=0), x.std(axis=0))
>>> z = (x - avg) / std
>>> progress(z)

We can plot the resulting normalized temperature using matplotlib:

We can also create interactive widgets in the notebook to interact with and visualize the data in real-time while the computations are running across the cluster:

View the full notebook for this distributed array example on Anaconda Cloud.

Creating Custom Parallel Workflows

When one of the standard Dask collections isn’t a good fit for your workflow, Dask gives you the flexibility to work with different file formats and custom parallel workflows. The Dask Imperative collection lets you wrap functions in existing Python code and run the computations on a single machine or across a cluster.

In this example, we have multiple files stored hierarchically in a custom file format (Feather for reading and writing Python and R dataframes on disk). We can build a custom workflow by wrapping the code with Dask Imperative and making use of the Feather library:

>>> import feather
>>> import pandas as pd
>>> from dask import delayed
>>> from glob import glob
>>> import os

>>> lazy_dataframes = []
>>> for directory in glob('2016-*'):
...     for symbol in os.listdir(directory):
...         filename = os.path.join(directory, symbol)
...         df = delayed(feather.read_dataframe)(filename)
...         df = delayed(pd.DataFrame.assign)(df,
...                                           date=pd.Timestamp(directory),
...                                           symbol=symbol)
...         lazy_dataframes.append(df)

View the full notebook for this custom parallel workflow example on Anaconda Cloud.

Additional Resources

View more examples and documentation in the Dask documentation. For more information about using Anaconda and Dask to scale out Python on your cluster, check out our recent webinar on High Performance Hadoop with Python.

You can get started with Anaconda and Dask using Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:

$ conda install anaconda-client -n root
$ anaconda login
$ conda install anaconda-cluster -c anaconda-cluster

In addition to Anaconda subscriptions, there are many different ways that Continuum can help you get started with Anaconda and Dask to construct parallel workflows, parallelize your existing code, or integrate with your existing Hadoop or HPC cluster, including:

  • Architecture consulting and review
  • Manage Python packages and environments on a cluster
  • Develop custom package management solutions on existing clusters
  • Migrate and parallelize existing code with Python and Dask
  • Architect parallel workflows and data pipelines with Dask
  • Build proof of concepts and interactive applications with Dask
  • Custom product/OSS core development
  • Training on parallel development with Dask

For more information about the above solutions, or if you’d like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at sales@continuum.io.

 

by swebster at April 18, 2016 02:42 PM

April 17, 2016

Titus Brown

MinHash signatures as ways to find samples, and collaborators?

As I wrote last week, my latest enthusiasm is MinHash sketches, applied (for the moment) to RNAseq data sets. Briefly, these are small "signatures" of data sets that can be used to compare data sets quickly. In the previous blog post, I talked a bit about their effectiveness and showed that (at least in my hands, and on a small data set of ~200 samples) I could use them to cluster RNAseq data sets by species.

What I didn't highlight in that blog post is that they could potentially be used to find samples of interest as well as (maybe) collaborators.

Finding samples of interest

The "samples of interest" idea is pretty clear - supposed we had a collection of signatures from all the RNAseq in the the Sequence Read Archive? Then we could search the entire SRA for data sets that were "close" to ours, and then just use those to do transcriptome studies. It's not yet clear how well this might work for finding RNAseq data sets with similar expression patterns, but if you're working with non-model species, then it might be a good way to pick out all the data sets that you should use to generate a de novo assembly.

More generally, as we get more and more data, finding relevant samples may get harder and harder. This kind of approach lets you search on sequence content, not annotations or metadata, which may be incomplete or inaccurate for all sorts of reasons.

In support of this general idea, I have defined a provisional file format (in YAML) that can be used to transport around these signatures. It's rather minimal and fairly human readable - we would need to augment it with additional metadata fields for any serious use in databases (but see below for more discussion on that). Each record (and there can currently only be one record per signature file) can contain multiple different sketches, corresponding to different k-mer sizes used in generating the sketch. (For different sized sketches with the same k-mers, you just store the biggest one, because we're using bottom sketches, so the bigger sketches properly include the smaller sketches.)
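
To make that concrete, a record in such a file might look roughly like the following; the field names here are invented for illustration and are not the actual format:

# Hypothetical signature record - field names invented for illustration only
class: rnaseq_signature
name: SRR0000000
email: someone@example.com
sketches:
  - ksize: 32
    num: 500
    mins: [1173243, 2384929, 5102931]  # truncated bottom sketch of hashed k-mers
  - ksize: 21
    num: 500
    mins: [998213, 1039484, 4458112]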

If you want to play with some signatures, you can -- here's an executable binder with some examples of generating distance matrices between signatures, and plotting them. Note that by far the most time is spent in loading the signatures - the comparisons are super quick, and in any case could be sped up a lot by moving them from pure Python over to C.

I've got a pile of all echinoderm SRA signatures already built, for those who are interested in looking at a collection -- look here.

Finding collaborators

Searching public databases is all well and good, and is a pretty cool application to enable with a few dozen lines of code. But I'm also interested in enabling the search of pre-publication data and doing matchmaking between potential collaborators. How could this work?

Well, the interesting thing about these signatures is that they are irreversible signatures with a one-sided error (a match means something; no match means very little). This means that you can't learn much of anything about the original sample from the signature unless you have a matching sample, and even then all you know is the species and maybe something about the tissue/stage being sequenced.

In turn, this means that it might be possible to convince people to publicly post signatures of pre-publication mRNAseq data sets.

Why would they do this??

An underappreciated challenge in the non-model organism world is that building reference transcriptomes requires a lot of samples. Sure, you can go sequence just the tissues you're interested in, but you have to sequence deeply and broadly in order to generate good enough data to produce a good reference transcriptome so that you can interpret your own mRNAseq. In part because of this (as well as many other reasons), people are slow to publish on their mRNAseq - and, generally, data isn't made available pre-publication.

What if you could go fishing for collaborators on building a reference transcriptome? Very few people are excited about just publishing a transcriptome (with some reason, when you see papers that publish 300), but those are really valuable building blocks for the field as a whole.

So, suppose you had some RNAseq, and you wanted to find other people with RNAseq from the same organism, and there was this service where you could post your RNAseq signature and get notified when similar signatures were posted? You wouldn't need to do anything more than supply an e-mail address along with your signature, and if you're worried about leaking information about who you are, it's easy enough to make new e-mail addresses.

I dunno. Seems interesting. Could work. Right?

One fun point is that this could be a distributed service. The signatures are small enough (~1-2 kb) that you can post them on places like github, and then have aggregators that collect them. The only "centralized" service involved would be in searching all of them, and that's pretty lightweight in practice.

Another fun point is that we already have a good way to communicate RNAseq for the limited purpose of transcriptome assembly -- diginorm. Abundance-normalized RNAseq is useless for doing expression analysis, and if you normalize a bunch of samples together you can't even figure out what the original tissue was. So, if you're worried about other people having access to your expression levels, you can simply normalize the data all together before handing it over.

Further thoughts

As I said in the first post, this was all nucleated by reading the mash and MetaPalette papers. In my review for MetaPalette, I suggested that they look at mash to see if MinHash signatures could be used to dramatically reduce their database size, and now that I actually understand MinHash a bit more, I think the answer is clearly yes.

Which leads to another question - the Mash folk are clearly planning to use MinHash & mash to search assembled genomes, with a side helping of unassembled short and long reads. If we can all agree on an interchange format or three, why couldn't we just start generating public signatures of all the things, mRNAseq and genomic and metagenomic all? I see many, many uses, all somewhat dimly... (Lest anyone think I believe this to be a novel observation, clearly the Mash folk are well ahead of me here -- they undersold it in their paper, so I didn't notice until I re-read it with this in mind, but it's there :).

Anyway, it seems like a great idea and we should totally do it. Who's in? What are the use cases? What do we need to do? Where is it going to break?

--titus

p.s. Thanks to Luiz Irber for some helpful discussion about YAML formats!

by C. Titus Brown at April 17, 2016 10:00 PM

April 14, 2016

Matthew Rocklin

Fast Message Serialization

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Very high performance isn’t about doing one thing well, it’s about doing nothing poorly.

This week I optimized the inter-node communication protocol used by dask.distributed. It was a fun exercise in optimization that involved several different and unexpected components. I separately had to deal with Pickle, NumPy, Tornado, MsgPack, and compression libraries.

This blogpost is not advertising any particular functionality, rather it’s a story of the problems I ran into when designing and optimizing a protocol to quickly send both very small and very large numeric data between machines on the Python stack.

We care very strongly about both the many small messages case (thousands of 100 byte messages per second) and the very large messages case (100-1000 MB). This spans an interesting range of performance space. We end up with a protocol that costs around 5 microseconds in the small case and operates at 1-1.5 GB/s in the large case.

Identify a Problem

This came about as I was preparing a demo using dask.array on a distributed cluster for a Continuum webinar. I noticed that my computations were taking much longer than expected. The Web UI quickly pointed me to the fact that my machines were spending 10-20 seconds moving 30 MB chunks of numpy array data between them. This is very strange, because I was on a 100 MB/s network, and so I expected these transfers to happen in more like 0.3s than 15s.

The Web UI made this glaringly apparent, so my first lesson was just how valuable visual profiling tools can be when they make performance issues obvious. Thanks here go to the Bokeh developers, who helped with the development of the Dask real-time Web UI.

Problem 1: Tornado’s sentinels

Dask’s networking is built off of Tornado’s TCP IOStreams.

There are two common ways to delineate messages on a socket: sentinel values that signal the end of a message, and prefixing a length before every message. Early on we tried both in Dask but found that prefixing a length before every message was slow. It turns out that this was because TCP sockets try to batch small messages to increase bandwidth. Turning this optimization off ended up being an effective and easy solution; see the TCP_NODELAY parameter.
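
For reference, turning off that batching on a plain Python socket is a one-liner; this is just a sketch of the socket option itself, not the Tornado-specific code used in dask.distributed:

import socket

# Sketch: disable Nagle's algorithm so small, length-prefixed messages
# are sent immediately instead of being batched by the kernel.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)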

However, before we figured that out we used sentinels for a long time. Unfortunately Tornado does not handle sentinels well for large messages. At the receipt of every new message it reads through all buffered data to see if it can find the sentinel. This makes lots and lots of copies and reads through lots and lots of bytes. This isn’t a problem if your messages are a few kilobytes, as is common in web development, but it’s terrible if your messages are millions or billions of bytes long.

Switching back to prefixing messages with lengths and turning off the no-delay optimization moved our bandwidth up from 3MB/s to 20MB/s per node. Thanks goes to Ben Darnell (main Tornado developer) for helping us to track this down.

Problem 2: Memory Copies

A nice machine can copy memory at 5 GB/s. If your network is only 100 MB/s then you can easily suffer several memory copies in your system without caring. This leads to code that looks like the following:

socket.send(header + payload)

This code concatenates two bytestrings, header and payload before sending the result down a socket. If we cared deeply about avoiding memory copies then we might instead send these two separately:

socket.send(header)
socket.send(payload)

But who cares, right? At 5 GB/s copying memory is cheap!

Unfortunately this breaks down under either of the following conditions

  1. You are sloppy enough to do this multiple times
  2. You find yourself on a machine with surprisingly low memory bandwidth, like 10 times slower, as is the case on some EC2 machines.

Both of these were true for me but fortunately it’s usually straightforward to reduce the number of copies down to a small number (we got down to three), with moderate effort.
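
One way to avoid the concatenation copy, at least on POSIX systems, is scatter/gather I/O, where the kernel is handed both buffers at once. A sketch using the standard library, not necessarily what dask.distributed does internally:

# Sketch: send header and payload without building a concatenated bytestring.
# socket.sendmsg performs scatter/gather I/O over a list of buffers.
sock.sendmsg([header, payload])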

Problem 3: Unwanted Compression

Dask compresses all large messages with LZ4 or Snappy if they're available. Unfortunately, if your data isn't very compressible, then this is mostly lost time. Doubly unfortunate is that you also have to decompress the data on the recipient side. Decompressing not-very-compressible data was surprisingly slow.

Now we compress with the following policy:

  1. If the message is less than 10kB, don’t bother
  2. Pick out five 10kB samples of the data and compress those. If the result isn’t well compressed then don’t bother compressing the full payload.
  3. Compress the full payload, if it doesn’t compress well then just send along the original to spare the receiver’s side from compressing.

In this case we use cheap checks to guard against unwanted compression. We also avoid any cost at all for small messages, which we care about deeply.
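
A condensed sketch of this kind of policy is below. It is not the actual dask.distributed code; the thresholds and the compressibility cutoff are illustrative, and compress is assumed to be an LZ4- or Snappy-style callable passed in by the caller:

import random

def maybe_compress(payload, compress, min_size=10000, sample_size=10000, n_samples=5):
    """Compress payload only if it is large and looks compressible."""
    if len(payload) < min_size:
        return False, payload  # 1. small message: don't bother
    # 2. compress a few small samples as a cheap compressibility check
    starts = [random.randint(0, len(payload) - sample_size) for _ in range(n_samples)]
    sample = b''.join(payload[i:i + sample_size] for i in starts)
    if len(compress(sample)) > 0.9 * len(sample):
        return False, payload  # the samples didn't compress well
    # 3. compress the full payload, but fall back if it didn't actually help
    compressed = compress(payload)
    if len(compressed) > 0.9 * len(payload):
        return False, payload
    return True, compressed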

Problem 4: Cloudpickle is not as fast as Pickle

This was surprising, because cloudpickle mostly defers to Pickle for the easy stuff, like NumPy arrays.

In [1]: import numpy as np

In [2]: data = np.random.randint(0, 255, dtype='u1', size=10000000)

In [3]: import pickle, cloudpickle

In [4]: %time len(pickle.dumps(data, protocol=-1))
CPU times: user 8.65 ms, sys: 8.42 ms, total: 17.1 ms
Wall time: 16.9 ms
Out[4]: 10000161

In [5]: %time len(cloudpickle.dumps(data, protocol=-1))
CPU times: user 20.6 ms, sys: 24.5 ms, total: 45.1 ms
Wall time: 44.4 ms
Out[5]: 10000161

But it turns out that cloudpickle is using the Python implementation, while pickle itself (or cPickle in Python 2) is using the compiled C implementation. Fortunately this is easy to correct, and a quick typecheck on common large data formats in Python (NumPy and Pandas) gets us this speed boost.
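
Such a typecheck might look roughly like this (a sketch of the idea, not the exact code in dask.distributed):

import pickle
import cloudpickle
import numpy as np
import pandas as pd

def dumps(x):
    """Use the fast C pickle for common large data types, cloudpickle otherwise."""
    if isinstance(x, (np.ndarray, pd.DataFrame, pd.Series)):
        return pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
    return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)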

Problem 5: Pickle is still slower than you’d expect

Pickle runs at about half the speed of memcopy, which is what you’d expect from a protocol that is mostly just “serialize the dtype, strides, then tack on the data bytes”. There must be an extraneous memory copy in there.

See issue 7544

Problem 6: MsgPack is bad at large bytestrings

Dask serializes most messages with MsgPack, which is ordinarily very fast. Unfortunately the MsgPack spec doesn’t support bytestrings greater than 4GB (which do come up for us) and the Python implementations don’t pass through large bytestrings very efficiently. So we had to handle large bytestrings separately. Any message that contains bytestrings over 1MB in size will have them stripped out and sent along in a separate frame. This both avoids the MsgPack overhead and avoids a memory copy (we can send the bytes directly to the socket).
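
Conceptually, the trick looks like the following sketch: pull the big bytestrings out of the message before the MsgPack step and ship them as separate frames. The placeholder layout here is made up; the real wire format differs:

import msgpack

BIG = 2 ** 20  # 1 MB threshold

def dumps_with_frames(msg):
    """Strip large bytestrings out of a flat dict and return them as extra frames."""
    frames = []
    small = {}
    for key, value in msg.items():
        if isinstance(value, bytes) and len(value) > BIG:
            small[key] = {'__frame__': len(frames)}  # placeholder index into frames
            frames.append(value)  # sent as-is, never copied through msgpack
        else:
            small[key] = value
    header = msgpack.dumps(small, use_bin_type=True)
    return [header] + frames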

Problem 7: Tornado makes a copy

Sockets on Windows don’t accept payloads greater than 128kB in size. As a result, Tornado chops up large messages into many small ones. On Linux this memory copy is extraneous. It can be removed with a bit of logic within Tornado. I might do this in the moderate future.

Results

We serialize small messages in about 5 microseconds (thanks msgpack!) and move large bytes around in the cost of three memory copies (about 1-1.5 GB/s) which is generally faster than most networks in use.

Here is a profile of sending and receiving a gigabyte-sized NumPy array of random values through to the same process over localhost (500 MB/s on my machine.)

         381360 function calls (381323 primitive calls) in 1.451 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.366    0.366    0.366    0.366 {built-in method dumps}
        8    0.289    0.036    0.291    0.036 iostream.py:360(write)
    15353    0.228    0.000    0.228    0.000 {method 'join' of 'bytes' objects}
    15355    0.166    0.000    0.166    0.000 {method 'recv' of '_socket.socket' objects}
    15362    0.156    0.000    0.398    0.000 iostream.py:1510(_merge_prefix)
     7759    0.101    0.000    0.101    0.000 {method 'send' of '_socket.socket' objects}
    17/14    0.026    0.002    0.686    0.049 gen.py:990(run)
    15355    0.021    0.000    0.198    0.000 iostream.py:721(_read_to_buffer)
        8    0.018    0.002    0.203    0.025 iostream.py:876(_consume)
       91    0.017    0.000    0.335    0.004 iostream.py:827(_handle_write)
       89    0.015    0.000    0.217    0.002 iostream.py:585(_read_to_buffer_loop)
   122567    0.009    0.000    0.009    0.000 {built-in method len}
    15355    0.008    0.000    0.173    0.000 iostream.py:1010(read_from_fd)
    38369    0.004    0.000    0.004    0.000 {method 'append' of 'list' objects}
     7759    0.004    0.000    0.104    0.000 iostream.py:1023(write_to_fd)
        1    0.003    0.003    1.451    1.451 ioloop.py:746(start)

Dominant unwanted costs include the following:

  1. 400ms: Pickling the NumPy array
  2. 400ms: Bytestring handling within Tornado

After this we’re just bound by pushing bytes down a wire.

Conclusion

Writing fast code isn’t about writing any one thing particularly well, it’s about mitigating everything that can get in your way. As you approach peak performance, previously minor flaws suddenly become your dominant bottleneck. Success here depends on frequent profiling and keeping your mind open to unexpected and surprising costs.

April 14, 2016 12:00 AM

April 13, 2016

Titus Brown

Applying MinHash to cluster RNAseq samples

(I gave a talk on this on Monday, April 11th - you can see the slides here, on figshare.

This is a Reproducible Blog Post. You can regenerate all the figures and play with this software yourself on binder.)


So, my latest enthusiasm is MinHash sketches.

A few weeks back, I had the luck to be asked to review both the mash paper (preprint here) and the MetaPalette paper (preprint here). The mash paper made me learn about MinHash sketches, while the MetaPalette paper made some very nice points about shared k-mers and species identification.

After reading, I got to thinking.

I wondered to myself, hey, could I use MinHash signatures to cluster unassembled Illumina RNAseq samples? While the mash folk showed that MinHash could be applied to raw reads nicely, I guessed that the greater dynamic range of gene expression would cause problems - mainly because high-abundance transcripts would yield many, many erroneous k-mers. Conveniently, however, my lab has some not-so-secret sauce for dealing with this problem - would it work, here? I thought it might.

Combined with all of this, my former grad student, Dr. Qingpeng Zhang (first author on the not-so-secret sauce, above) has some other still-unpublished work showing that the first ~1m reads of metagenome samples can be used to cluster samples together.

So, I reasoned, perhaps it would work well to stream the first million or so reads from the beginning of RNAseq samples through our error trimming approach, compute a MinHash signature, and then use that signature to identify the species from which the RNAseq was isolated (and perhaps even closely related samples).

tl; dr? It seems to work, with some modifications.

For everything below, I used a k-mer hash size of 32 and only chose read data sets with reads of length 72 or higher.

(Here's a nice presentation on MinHash, via Luiz Irber.)

MinHash is super easy to implement

I implemented MinHash in only a few lines of Python; see the repository at https://github.com/dib-lab/sourmash/. The most relevant code is sourmash_lib.py. Here, I'm using a bottom sketch, and at the moment I'm building some of it on top of khmer, although I will probably remove that requirement soon.

After lots of trial and error (some of it reported below), I settled on using a k-mer size of k=32, and a sketch size of 500. (You can go down to a sketch size of 100, but you lose resolution. Lower k-mer sizes have the expected effect of slowly decreasing resolution; odd k-mer sizes effectively halve the sketch size.)
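
To give a flavor of how little code a bottom sketch needs, here is a minimal standalone version - not the sourmash code itself, and using Python's built-in hash() rather than a proper MurmurHash - that keeps the smallest hash values seen and estimates similarity from two sketches:

import heapq

def kmers(seq, k=32):
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

class BottomSketch(object):
    """Keep the `num` smallest k-mer hash values seen: a MinHash bottom sketch."""
    def __init__(self, num=500, k=32):
        self.num = num
        self.k = k
        self.heap = []   # negated hashes, so -heap[0] is the largest retained hash
        self.mins = set()

    def add_sequence(self, seq):
        for kmer in kmers(seq, self.k):
            h = hash(kmer)  # toy hash function, for illustration only
            if h in self.mins:
                continue
            if len(self.heap) < self.num:
                heapq.heappush(self.heap, -h)
                self.mins.add(h)
            elif h < -self.heap[0]:
                evicted = -heapq.heappushpop(self.heap, -h)
                self.mins.discard(evicted)
                self.mins.add(h)

    def similarity(self, other):
        # Crude estimate; mash/sourmash compute Jaccard from a merged sketch.
        return len(self.mins & other.mins) / float(min(len(self.mins), len(other.mins)))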

How fast is it, and how much memory does it use, and how big are the sketches?

I haven't bothered benchmarking it, but

  • everything but the hash function itself is in Python;
  • on my 3 year old laptop it takes about 5 minutes to add 1m reads;
  • the memory usage of sourmash itself is negligible - error trimming the reads requires about 1 GB of RAM;
  • the sketches are tiny - less than a few kb - and the program is dominated by the Python overhead.

So it's super fast and super lightweight.

Do you need to error trim the reads?

The figure below shows a dendrogram next to a distance matrix of 8 samples - four mouse samples, untrimmed, and the same four mouse samples, trimmed at low-abundance k-mers. (You can see the trimming command here, using khmer's trim-low-abund command.)

The two house mouse samples are replicates, and they always cluster together. However, they are much further apart without trimming.

The effect of trimming on the disease mouse samples (which are independent biological samples, I believe) is much less; it rearranges the tree a bit, but the effect is not as convincing as it is for the house mouse replicates.

./images/sourmash-trimming.png

So you seem to get better resolution when you error trim the reads, which is expected. The signal isn't as strong as I thought it'd be, though. Have to think about that; I'm surprised MinHash is that robust to errors!

Species group together pretty robustly with only 1m reads

How many reads do you need to use? If you're looking for species groupings, not that many -- 1m reads is enough to cluster mouse vs yeast separately. (Which is good, right? If that didn't work...)

./images/sourmash-dist-1m.png

Approximately 1m reads turns out to work equally well for 200 echinoderm (sea urchin and sea star) samples, too.

./images/sourmash-urchin.png

Here, I downloaded all 204 echinoderm HiSeq mRNAseq data sets from SRA, trimmed them as above, and computed the MinHash signatures, and then compared them all to each other. The blocks of similarity are all specific species, and all the species groups cluster properly, and none of them (with one exception) cluster with other species.

This is also an impressive demonstration of the speed of MinHash - you can do all 204 samples against each other in about 10 seconds. Most of that time is spent loading my YAML format into memory; the actual comparison takes < 1s!

(The whole notebook for making all of these figures takes less than 30 seconds to run, since the signatures are already there; check it out!)

Species that do group together may actually belong together

In the urchin clustering above, there's only one "confused" species grouping where one cluster contains more than one species - that's Patiria miniata and Patiria pectinifera, which are both bat stars.

./images/sourmash-patiria.png

I posted this figure on Facebook and noted the grouping, and Dan Rokhsar pointed out that on Wikipedia, Patiria has been identified as a complex of three closely related species in the Pacific.

So that's good - it seems like the only group that has cross-species clustering is, indeed, truly multi-species.

You can sample any 1m reads and get pretty similar results

In theory, FASTQ files from shotgun sequencing are perfectly random, so you should be able to pick any 1m reads you want - including the first 1m. In practice, of course, this is not true. How similar are different subsamples?

./images/sourmash-samples.png

Answer: quite similar. All seven 1m read subsamples (5 random, one from the middle, one from the end) are above 70% in similarity.

(Very) widely divergent species don't cross-compare at all

If you look at (say) yeast and mouse, there's simply no similarity there at all. 32-mer signatures are apparently very specific.

(The graph below is kind of stupid. It's just looking at similarity between mouse and yeast data sets as you walk through the two data streams. It's 0.2% all the way.)

./images/sourmash-yeast-vs-mouse.png

Species samples get more similar (or stay the same) as you go through the stream

What happens when you look at more than 1m reads? Do the streams get more or less similar?

If you walk through two streams and update the MinHash signature regularly, you see either constant similarity or a general increase in similarity; in the mouse replicates, it's constant and high, and between disease mouse and house mouse, it grows as you step through the stream.

(The inflection points are probably due to how we rearrange the reads during the abundance trimming. More investigation needed.)

./images/sourmash-mouse-replicates.png

Yeast replicates also maintain high similarity through the data stream.

./images/sourmash-yeast-replicates.png

What we're actually doing is mostly picking k-mers from the transcriptome

(This is pretty much what we expected, but as my dad always said, "trust but verify.")

The next question is, what are we actually seeing signatures of?

For example, in the above mouse example, we see growing similarity between two mouse data sets as we step through the data stream. Is this because we're counting more sequencing artifacts as we look at more data, or is this because we're seeing true signal?

To investigate, I calculated the MinHash signature of the mouse RNA RefSeq file, and then asked if the streams were getting closer to that as we walked through them. They are:

./images/sourmash-mouse-reads-v-ref.png

So, it seems like part of what's happening here is that we are looking at the True Signature of the mouse transcriptome. Good to know.

And that's it for today, folks.

What can this be used for?

So, it all seems to work pretty well - the mash folk are dead-on right, and this is a pretty awesome and simple way to look at sequences.

Right now, my approach above seems like it's most useful for identifying what species some RNAseq is from. If we can do that, then we can start thinking about other uses. If we can't do that pretty robustly, then that's a problem ;). So that's where I started.

It might be fun to run against portions of the SRA to identify mislabeled samples. Once we have the SRA digested, we can make that available to people who are looking for more samples from their species of interest; whether this is useful will depend. I'm guessing that it's not immediately useful, since the SRA species identifications seem pretty decent.

One simple idea is to simply run this on each new sample you get back from a sequencing facility. "Hey, this looks like Drosophila. ...did you intend to sequence Drosophila?" It won't work for identifying low-lying contamination that well, but it could identify mis-labeled samples pretty quickly.

Tracy Teal suggested that this could be used in-house in large labs to find out if others in the lab have samples of interest to you. Hmm. More on that idea later.

Some big remaining questions

  • Do samples actually cluster by expression similarity? Maybe - more work needed.
  • Can this be used to compare different metagenomes using raw reads? No, probably not very well. At least, the metagenomes I care about are too diverse; you will probably need a different strategy. I'm thinking about it.

One last shoutout

I pretty much reimplemented parts of mash; there's nothing particularly novel here, other than exploring it in my own code on public data :). So, thanks, mash authors!

--titus

by C. Titus Brown at April 13, 2016 10:00 PM

April 12, 2016

Continuum Analytics news

Using Anaconda with PySpark for Distributed Language Processing on a Hadoop Cluster

Posted Tuesday, April 12, 2016

Overview

Working with your favorite Python packages along with distributed PySpark jobs across a Hadoop cluster can be difficult due to tedious manual setup and configuration issues, which is a problem that becomes more painful as the number of nodes in your cluster increases.

Anaconda makes it easy to manage packages (including Python, R and Scala) and their dependencies on an existing Hadoop cluster with PySpark, including data processing, machine learning, image processing and natural language processing.

 

 

In a previous post, we’ve demonstrated how you can use libraries in Anaconda to query and visualize 1.7 billion comments on a Hadoop cluster.

In this post, we’ll use Anaconda to perform distributed natural language processing with PySpark using a subset of the same data set. We’ll configure different enterprise Hadoop distributions, including Cloudera CDH and Hortonworks HDP,  to work interactively on your Hadoop cluster with PySpark, Anaconda and a Jupyter Notebook.

 

In the remainder of this post, we'll:

  1. Install Anaconda and the Jupyter Notebook on an existing Hadoop cluster.

  2. Load the text/language data into HDFS on the cluster.

  3. Configure PySpark to work with Anaconda and the Jupyter Notebook with different enterprise Hadoop distributions.

  4. Perform distributed natural language processing on the data with the NLTK library from Anaconda.

  5. Work locally with a subset of the data using Pandas and Bokeh for data analysis and interactive visualization.

Provisioning Anaconda on a cluster

Because we’re installing Anaconda on an existing Hadoop cluster, we can follow the bare-metal cluster setup instructions in Anaconda for cluster management from a Windows, Mac, or Linux machine. We can install and configure conda on each node of the existing Hadoop cluster with a single command:

$ acluster create cluster-hadoop --profile cluster-hadoop

After a few minutes, we’ll have a centrally managed installation of conda across our Hadoop cluster in the default location of /opt/anaconda.

Installing Anaconda packages on the cluster

Once we’ve provisioned conda on the cluster, we can install the packages from Anaconda that we’ll need for this example to perform language processing, data analysis and visualization:

$ acluster conda install nltk pandas bokeh

We’ll need to download the NLTK data on each node of the cluster. For convenience, we can do this using the distributed shell functionality in Anaconda for cluster management:

$ acluster cmd 'sudo /opt/anaconda/bin/python -m nltk.downloader -d /usr/share/nltk_data all'

Loading the data into HDFS

In this post, we'll use a subset of the data set that contains comments from the reddit website from January 2015 to August 2015, which is about 242 GB on disk. This data set was made available in July 2015 in a reddit post. The data set is in JSON format (one comment per line) and consists of the comment body, author, subreddit, timestamp of creation and other fields.

Note that we could convert the data into different formats or load it into various query engines; however, since the focus of this blog post is using libraries with Anaconda, we will be working with the raw JSON data in PySpark.

We’ll load the reddit comment data into HDFS from the head node. You can SSH into the head node by running the following command from the client machine:

$ acluster ssh

The remaining commands in this section will be executed on the head node. If it doesn’t already exist, we’ll need to create a user directory in HDFS and assign the appropriate permissions:

$ sudo -u hdfs hadoop fs -mkdir /user/ubuntu

$ sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu

We can then move the data by running the following command with valid AWS credentials, which will transfer the reddit comment data from the year 2015 (242 GB of JSON data) from a public Amazon S3 bucket into HDFS on the cluster:

$ hadoop distcp s3n://AWS_KEY:AWS_SECRET@blaze-data/reddit/json/2015/*.json /user/ubuntu/

Replace AWS_KEY and AWS_SECRET in the above command with valid Amazon AWS credentials.

Configuring the spark-submit command with your Hadoop Cluster

To use Python from Anaconda along with PySpark, you can set the PYSPARK_PYTHON environment variable on a per-job basis along with the spark-submit command. If you’re using the Anaconda parcel for CDH, you can run a PySpark script (e.g., spark-job.py) using the following command:

$ PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit spark-job.py

If you’re using Anaconda for cluster management with Cloudera CDH or Hortonworks HDP, you can run the PySpark script using the following command (note the different path to Python):

$ PYSPARK_PYTHON=/opt/anaconda/bin/python spark-submit spark-job.py

Installing and Configuring the Notebook with your Hadoop Cluster

Using the spark-submit command is a quick and easy way to verify that our PySpark script works in batch mode. However, it can be tedious to work with our analysis in a non-interactive manner as Java and Python logs scroll by.

Instead, we can use the Jupyter Notebook on our Hadoop cluster to work interactively with our data via Anaconda and PySpark.

Using Anaconda for cluster management, we can install Jupyter Notebook on the head node of the cluster with a single command, then open the notebook interface in our local web browser:

$ acluster install notebook

$ acluster open notebook

Once we’ve opened a new notebook, we’ll need to configure some environment variables for PySpark to work with Anaconda. The following sections include details on how to configure the environment variables for Anaconda to work with PySpark on Cloudera CDH and Hortonworks HDP.

Using the Anaconda Parcel with Cloudera CDH

If you’re using the Anaconda parcel with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and the Anaconda 4.0 parcel.

>>> import os

>>> import sys

>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"

>>> os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

Using Anaconda for cluster management with Cloudera CDH

If you’re using Anaconda for cluster management with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and Anaconda for cluster management 1.4.0.

>>> import os

>>> import sys

>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"

>>> os.environ["SPARK_HOME"] = "/opt/anaconda/parcels/CDH/lib/spark"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

Using Anaconda for cluster management with Hortonworks HDP

If you’re using Anaconda for cluster management with Hortonworks HDP, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Hortonworks HDP running Spark 1.6.0 and Anaconda for cluster management 1.4.0.

>>> import os

>>> import sys

>>> os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

Initializing the SparkContext

After we’ve configured Anaconda to work with PySpark on our Hadoop cluster, we can initialize a SparkContext that we’ll use for distributed computations. In this example, we’ll be using the YARN resource manager in client mode:

>>> from pyspark import SparkConf

>>> from pyspark import SparkContext

>>> conf = SparkConf()

>>> conf.setMaster('yarn-client')

>>> conf.setAppName('anaconda-pyspark-language')

>>> sc = SparkContext(conf=conf)

Loading the data into memory

Now that we’ve created a SparkContext, we can load the JSON reddit comment data into a Resilient Distributed Dataset (RDD) from PySpark:

>>> lines = sc.textFile("/user/ubuntu/*.json")

Next, we decode the JSON data and decide that we want to filter comments from the movies subreddit:

>>> import json

>>> data = lines.map(json.loads)

>>> movies = data.filter(lambda x: x['subreddit'] == 'movies')

We can then persist the RDD in distributed memory across the cluster so that future computations and queries will be computed quickly from memory. Note that this operation only marks the RDD to be persisted; the data will be persisted in memory after the first computation is triggered:

>>> movies.persist()

We can count the total number of comments in the movies subreddit (about 2.9 million comments):

>>> movies.count()

2905085

We can inspect the first comment in the dataset, which shows fields for the author, comment body, creation time, subreddit, etc.:

>>> movies.take(1)

CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 113 ms

[{u'archived': False,

 u'author': u'kylionsfan',

 u'author_flair_css_class': None,

 u'author_flair_text': None,

 u'body': u'Goonies',

 u'controversiality': 0,

 u'created_utc': u'1420070402',

 u'distinguished': None,

 u'downs': 0,

 u'edited': False,

 u'gilded': 0,

 u'id': u'cnas90u',

 u'link_id': u't3_2qyjda',

 u'name': u't1_cnas90u',

 u'parent_id': u't3_2qyjda',

 u'retrieved_on': 1425124282,

 u'score': 1,

 u'score_hidden': False,

 u'subreddit': u'movies',

 u'subreddit_id': u't5_2qh3s',

 u'ups': 1}]

Distributed Natural Language Processing

Now that we’ve filtered a subset of the data and loaded it into memory across the cluster, we can perform distributed natural language computations using Anaconda with PySpark.

First, we define a parse() function that imports the natural language toolkit (NLTK) from Anaconda and tags words in each comment with their corresponding part of speech. Then, we can map the parse() function to the movies RDD:

>>> def parse(record):

...    import nltk

...    tokens = nltk.word_tokenize(record["body"])

...    record["n_words"] = len(tokens)

...    record["pos"] = nltk.pos_tag(tokens)

...    return record

 

>>> movies2 = movies.map(parse)

Let’s take a look at the body of one of the comments:

>>> movies2.take(10)[6]['body']

u'Dawn of the Apes was such an incredible movie, it should be up there in my opinion.'

And the same comment with tagged parts of speech (e.g., nouns, verbs, prepositions):

>>> movies2.take(10)[6]['pos']

[(u'Dawn', 'NN'),

(u'of', 'IN'),

(u'the', 'DT'),

(u'Apes', 'NNP'),

(u'was', 'VBD'),

(u'such', 'JJ'),

(u'an', 'DT'),

(u'incredible', 'JJ'),

(u'movie', 'NN'),

(u',', ','),

(u'it', 'PRP'),

(u'should', 'MD'),

(u'be', 'VB'),

(u'up', 'RP'),

(u'there', 'RB'),

(u'in', 'IN'),

(u'my', 'PRP$'),

(u'opinion', 'NN'),

(u'.', '.')] 

We can define a get_NN() function that extracts nouns from the records, filters stopwords, and removes non-words from the data set:

>>> def get_NN(record):

...    import re

...    from nltk.corpus import stopwords

...    all_pos = record["pos"]

...    ret = []

...    for pos in all_pos:

...        if pos[1] == "NN" \

...        and pos[0] not in stopwords.words('english') \

...        and re.search("^[0-9a-zA-Z]+$", pos[0]) is not None:

...            ret.append(pos[0])

...    return ret

>>> nouns = movies2.flatMap(get_NN)

 We can then generate word counts for the nouns that we extracted from the dataset:

>>> counts = nouns.map(lambda word: (word, 1))

After we’ve done the heavy lifting, processing, filtering and cleaning on the text data using Anaconda and PySpark, we can collect the reduced word count results onto the head node.

>>> top_nouns = counts.countByKey()

>>> top_nouns = dict(top_nouns)

In the next section, we’ll continue our analysis on the head node of the cluster while working with familiar libraries in Anaconda, all in the same interactive Jupyter notebook.

Local analysis with Pandas and Bokeh

Now that we’ve done the heavy lifting using Anaconda and PySpark across the cluster, we can work with the results as a dataframe in Pandas, where we can query and inspect the data as usual:

>>> import pandas as pd

>>> df = pd.DataFrame(top_nouns.items(), columns=['Noun', 'Count'])

Let’s sort the resulting word counts, and view the top 10 nouns by frequency:

>>> df = df.sort_values('Count', ascending=False)

>>> df_top_10 = df.head(10)

>>> df_top_10

Noun         Count
movie       539698
film        220366
time        157595
way         112752
gt          105313
http         92619
something    87835
lot          85573
scene        82229
thing        82101

Let’s generate a bar chart of the top 10 nouns using Pandas:

>>> %matplotlib inline

>>> df_top_10.plot(kind='bar', x=df_top_10['Noun'])

Finally, we can use Bokeh to generate an interactive plot of the data:

>>> from bokeh.charts import Bar, show

>>> from bokeh.io import output_notebook

>>> from bokeh.charts.attributes import cat

>>> output_notebook()

 

>>> p = Bar(df_top_10,

...         label=cat(columns='Noun', sort=False),

...         values='Count',

...         title='Top N nouns in r/movies subreddit')

>>> show(p)

 

Conclusion

In this post, we used Anaconda with PySpark to perform distributed natural language processing and computations on data stored in HDFS. We configured Anaconda and the Jupyter Notebook to work with PySpark on various enterprise Hadoop distributions (including Cloudera CDH and Hortonworks HDP), which allowed us to work interactively with Anaconda and the Hadoop cluster. This made it convenient to work with Anaconda for the distributed processing with PySpark, while reducing the data to a size that we could work with on a single machine, all in the same interactive notebook environment. The complete notebook for this example with Anaconda, PySpark, and NLTK can be viewed on Anaconda Cloud.

You can get started with Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:

$ conda install anaconda-client

$ anaconda login

$ conda install anaconda-cluster -c anaconda-cluster

If you’d like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at sales@continuum.io. The enterprise features of Anaconda, including the cluster management functionality and on-premises repository, are certified for use with Cloudera CDH 5.

If you’re running into memory errors, performance issues (related to JVM overhead or Python/Java serialization), problems translating your existing Python code to PySpark, or other limitations with PySpark, stay tuned for a future post about a parallel processing framework in pure Python that works with libraries in Anaconda and your existing Hadoop cluster, including HDFS and YARN.

by swebster at April 12, 2016 02:30 PM

Matthieu Brucher

Analog modeling of a diode clipper (3a): Simulation

Now that we have a few methods, let's try to simulate them. For both circuits, I'll use the forward Euler, backward Euler and trapezoidal approximations, then I will show the results of changing the starting estimate, and finish with the Newton-Raphson optimization. I haven't checked (yet?) algorithms that don't use the derivative, like the bisection or Brent algorithms.

All graphs are done with a x4 oversampling (although I also tried x8, x16 and x32).

First diode clipper

Let’s start with the original equation:

V_i - 2 R_1 I_s sinh(V_o/nV_t) - \int \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t}) - V_o = 0

Forward Euler

Let’s now figure out what to do with the integral by deriving the equation:

\frac{dV_o}{dt} = \frac{\frac{dV_i}{dt} - \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t})}{1 + \frac{2 I_s R_1}{nV_t} cosh(\frac{V_o}{nV_t})}

So now we have the standard form that can be used the usual way. For the derivative of the input, I'll always use the trapezoidal approximation, and then for the output one, I'll use the forward Euler, which leads to the "simple" equation:

V_{on+1} = V_{on} + \frac{V_{in+1} - V_{in} - \frac{4 h I_s}{C_1} sinh(\frac{V_{on}}{nV_t})}{1 + \frac{2 I_s R_1}{nV_t} cosh(\frac{V_{on}}{nV_t})}
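
To make the update rule concrete, here is a small Python sketch of that Forward Euler step; the component values are illustrative and are not taken from the post:

import numpy as np

# Illustrative component values (not from this post).
Is = 1e-14       # diode saturation current
n = 1.836        # ideality factor
Vt = 26e-3       # thermal voltage
R1 = 2.2e3
C1 = 10e-9
h = 1.0 / (44100 * 4)   # sampling period with x4 oversampling

def forward_euler(vin):
    """Apply the Forward Euler update of the first diode clipper to an input signal."""
    vout = np.zeros_like(vin)
    for i in range(1, len(vin)):
        vo = vout[i - 1]
        num = vin[i] - vin[i - 1] - 4 * h * Is / C1 * np.sinh(vo / (n * Vt))
        den = 1 + 2 * Is * R1 / (n * Vt) * np.cosh(vo / (n * Vt))
        vout[i] = vo + num / den
    return vout

t = np.arange(0, 0.01, h)
vout = forward_euler(0.5 * np.sin(2 * np.pi * 1000 * t))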

Backward Euler

For the backward Euler, I’ll start from the integral equation again and remove the time dependency:

V_{in+1} - V_{in} - 2 R_1 I_s (sinh(\frac{V_{on+1}}{nV_t}) - sinh(\frac{V_{on}}{nV_t})) - \int^{t_{n+1}}_{t_n} \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t}) - V_{on+1} + V_{on} = 0

Now the discretization becomes:
V_{in+1} - V_{in} - 2 R_1 I_s (sinh(\frac{V_{on+1}}{nV_t}) - sinh(\frac{V_{on}}{nV_t})) - \frac{2 h I_s}{C_1} sinh(\frac{V_{on+1}}{nV_t}) - V_{on+1} + V_{on} = 0

I didn't start from the differentiated (standard form) equation here, because I would have had a dependency in the sinh term anyway, so I would still have required the numerical methods to solve the equation.

Trapezoidal rule

Here, we just need to change the discretization for a trapezoidal one:

V_{in+1} - V_{in} - I_s sinh(\frac{V_{on+1}}{nV_t}) (\frac{h}{C_1} + 2 R_1) - I_s sinh(\frac{V_{on}}{nV_t}) (\frac{h}{C_1} - 2 R_1) - V_{on+1} + V_{on} = 0

Starting estimates

Starting from the different rules, we need to replace sinh(x):

  • for the pivotal by \frac{x}{x_0} sinh(x_0)
  • for the tangent rule by \frac{x}{nV_t} cosh(\frac{x_0}{nV_t}) + y_0 - \frac{x_0}{nV_t} cosh(\frac{x_0}{nV_t})

Graphs

Let’s see now how all these optimizers compare (with an estimate of the next element being the last optimized value):

Numerical optimization comparison

Obviously, the Forward Euler method is definitely not good: although it is on average 4 times lower, the accuracy is definitely not good enough. On the other hand, the other two methods give similar results (probably because I try to achieve quite a strong convergence, with less than 10e-8 difference between two iterations).

Now, how does the original estimate impact the results? I tried the Backward Euler to start, and the results are identical:

Original estimates comparison

To have a better picture, let’s turn down the number of iterations to 1 for all the estimates:

One step comparison

So all the estimates give a similar result. Comparing the number of iterations with the three estimates, the pivotal method gives the worst results, whereas the affine estimate lowers the number of iterations by one. Of course, there is a price to pay in the computation.

So the obvious choice is to use trapezoidal approximation with affine starting point estimate, which is not my default choice in SimpleOverdriveFilter.

To be continued

The post is getting longer than I thought, so let’s keep it there for now and the next post on the subject will tackle the other diode clipper circuit.


by Matt at April 12, 2016 07:58 AM

April 11, 2016

Continuum Analytics news

Data Science with Python at ODSC East

Posted Tuesday, April 12, 2016

By: Sheamus McGovern, Open Data Science Conference Chair

At ODSC East, the most influential minds and institutions in data science will convene at the Boston Convention & Exhibition Center from May 20th to the 22nd to discuss and teach the newest and most exciting developments in data science.

As you know, the Python ecosystem is now one of the most important data science development environments available today. This is due, in large part, to the existence of a rich suite of user-facing data analysis libraries.

Powerful Python machine learning libraries like Scikit-learn, XGBoost and others bring sophisticated predictive analytics to the masses. The NLTK and Gensim libraries enable deep analysis of textual information in Python and the Topik library provides a high-level interface to these and other, natural language libraries, adding a new layer of usability. The Pandas library has brought data analysis in Python to a new level by providing expressive data structures for quick and intuitive data manipulation and analysis.

The notebook ecosystem in Python has also flourished with the development of the Jupyter, Rodeo and Beaker notebooks. The notebook interface is an increasingly popular way for data scientists to perform complex analyses that serve the purpose of conveying and sharing analyses and their results to colleagues and to stakeholders. Python is also host to a number of rich web-development frameworks that are used not only for building data science dash boards, but also for full-scale data science powered web-apps. Flask and Django lead the way in terms of the Python web-app development landscape, but Bottle and Pyramid are also quite popular.

With Numba or Cython, code can approach speeds akin to that of C or C++ and new developments, like the Dask package, to make computing on larger-than-memory datasets very easy. Visualization libraries, like Plot.ly and Bokeh, have brought rich, interactive and impactful data visualization tools to the fingertips of data analysts everywhere.

Anaconda has streamlined the use of many of these wildly popular open source data science packages by providing an easy way to install, manage and use Python libraries. With Anaconda, users no longer need to worry about tedious incompatibilities and library management across their development environments.

Several of the most influential Python developers and data scientists will be talking and teaching at ODSC East. Indeed, Peter Wang will be speaking at ODSC East. Peter is the co-founder and CTO at Continuum Analytics, as well as the mastermind behind the popular Bokeh visualization library and the Blaze ecosystem, which simplifies the analysis of Big Data with Python and Anaconda. At ODSC East, there will be over 100 speakers, 20 workshops and 10 training sessions spanning seven conferences focused on Open Data Science, Disruptive Data Science, Big Data Science, Data Visualization, Data Science for Good, Open Data, and Careers and Training. See below for a very small sampling of some of the powerful Python workshops and speakers we will have at ODSC East.

● Bayesian Statistics Made Simple - Allen Downey, Think Python

● Intro to Scikit-learn for Machine Learning - Andreas Mueller, NYU Center for Data Science

● Parallelizing Data Science in Python with Dask - Matthew Rocklin, Continuum Analytics

● Interactive Viz of a Billion Points with Bokeh Datashader - Peter Wang, Continuum Analytics

 

 

by pcudia at April 11, 2016 09:07 PM

April 07, 2016

Titus Brown

Bashing on monstrous sequencing collections

So, there's this fairly large collection of about 700 RNAseq samples, from 300 species in 40 or so phyla. It's called the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP), and was funded by the Moore Foundation as a truly field-wide collaboration to improve our reference collection for genes (and more). Back When, it was sequenced and assembled by the National Center for Genome Resources, and published in PLOS Biology (Keeling et al., 2014).

Partly because we think assembly has improved in the last few years, partly as an educational exercise, partly as an infrastructure exercise, partly as a demo, and partly just because we can, Lisa Cohen in my lab is starting to reassemble all of the data - starting with about 10%. She has some of the basic evaluations (mostly via transrate) posted, and before we pull the trigger on the rest of the assemblies, we're pausing to reflect and to think about what metrics to use, and what kinds of resources we plan to produce. (We are not lacking in ideas, but we might be lacking in good ideas, if you know what I mean.)

In particular, this exercise raises some interesting questions that we hope to dig into:

  • what does a good transcriptome look like, and how could having 700 assemblies help us figure that out? (hint: distributions)
  • what is a good canonical set of analyses for characterizing transcriptome assemblies?
  • what products should we be making available for each assembly?
  • what kind of data formatting makes it easiest for other bioinformaticians to build off of the compute we're doing?
  • how should we distribute the workflow components? (Lisa really likes shell scripts but I've been lobbying for something more structured. 'make' doesn't really fit the bill here, though.)
  • how do we "alert" the community if and when we come up with better assemblies? How do we merge assemblies between programs and efforts, and properly credit everyone involved?

Anyway, feedback welcome, here or on Lisa's post! We are happy to share methods, data, analyses, results, etc. etc.

--titus

p.s. Yes, that's right. I ask new grad students to start by assembling 700 transcriptomes. So? :)

by C. Titus Brown at April 07, 2016 10:00 PM

Martin Fitzpatrick

Why are numpy calculations not affected by the global interpreter lock?

Many numpy calculations are unaffected by the GIL, but not all.

In code that does not require the Python interpreter (e.g. C libraries) it is possible to explicitly release the GIL, allowing other code that depends on the interpreter to continue running. In the Numpy C codebase the macros NPY_BEGIN_THREADS and NPY_END_THREADS are used to delimit blocks of code that permit GIL release. You can see these in this search of the numpy source.

The NumPy C API documentation has more information on threading support. Note the additional macros NPY_BEGIN_THREADS_DESCR, NPY_END_THREADS_DESCR and NPY_BEGIN_THREADS_THRESHOLDED which handle conditional GIL release, dependent on array dtypes and the size of loops.

Most core functions release the GIL - for example Universal Functions (ufunc) do so as described:

as long as no object arrays are involved, the Python Global Interpreter Lock (GIL) is released prior to calling the loops. It is re-acquired if necessary to handle error conditions.

With regard to your own code, the source code for NumPy is available. Check the functions you use (and the functions they call) for the above macros. Note also that the performance benefit is heavily dependent on how long the GIL is released - if your code is constantly dropping in/out of Python you won’t see much of an improvement.

The other option is to just test it. However, bear in mind that functions using the conditional GIL macros may exhibit different behaviour with small and large arrays. A test with a small dataset may therefore not be an accurate representation of performance for a larger task.
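
For instance, a crude way to test is to time an operation run serially versus in several threads; if the threaded version scales, the GIL was released for most of the work. Keep in mind that a multithreaded BLAS can blur this particular comparison:

import threading
import time
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

def work():
    np.dot(a, b)  # BLAS-backed; releases the GIL for the bulk of the computation

# Serial: run the operation four times in a row.
start = time.time()
for _ in range(4):
    work()
serial = time.time() - start

# Threaded: run the same four operations in parallel threads.
threads = [threading.Thread(target=work) for _ in range(4)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start

print("serial: %.2fs, threaded: %.2fs" % (serial, threaded))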

There is some additional information on parallel processing with numpy available on the official wiki and a useful post about the Python GIL in general over on Programmers.SE.

by Martin Fitzpatrick at April 07, 2016 03:37 PM

April 06, 2016

Continuum Analytics news

Anaconda 4.0 Release

Posted Wednesday, April 6, 2016

We are happy to announce that Anaconda 4.0 has been released, which includes the new Anaconda Navigator.

Did you notice we skipped from release 2.5 to 4.0? Sharp eyes! The team decided to move up to the 4.0 release number to reduce confusion with common Python versions.

Anaconda Navigator is a desktop graphical user interface included in Anaconda that allows you to launch applications and easily manage conda packages, environments and channels without the need to use command line commands. It is available for Windows, OS X and Linux.  For those familiar with the Anaconda Launcher, Anaconda Navigator has replaced Launcher. If you are already using Anaconda Cloud to host private packages, you can access them easily by signing in with your Anaconda Cloud account.

 

There are four main components in Anaconda Navigator, each one can be selected by clicking the corresponding tab on the left-hand column:

  • Home where you can install, upgrade, and launch applications

  • Environments allows you to manage channels, environments and packages.

  • Learning shows a long list of learning resources in several categories: webinars, documentation, videos and training.

  • Community where you can connect to other users through events, forums and social media.

If you already have Anaconda installed, update to Anaconda 4.0 by using conda:

conda update conda

conda install anaconda=4.0

The full list of changes, fixes and updates for Anaconda v4.0 can be found in the changelog.

We’d very much appreciate your feedback on the latest release, especially the new Anaconda Navigator. Please submit comments or issues through our anaconda-issues GitHub repo.

Go download Anaconda today!

 

 

by swebster at April 06, 2016 03:24 PM

April 05, 2016

Enthought

Just Released: PyXLL v 3.0 (Python in Excel). New Real Time Data Stream Capabilities, Excel Ribbon Integration, and More.

Download a free 30 day trial of PyXLL and try it with your own data. Since PyXLL was first released back in 2010 it has grown hugely in popularity and is used by businesses in many different sectors. The original motivation for PyXLL was to be able to use all the best bits of Excel […]

by admin at April 05, 2016 04:50 PM

Matthieu Brucher

Analog modeling of a diode clipper (2): Discretization

Let’s start with the two equations we got from the last post and see what we can do with usual/academic tools to solve them (I will tackle nodal and ZDF tools later in this series).

Euler and trapezoidal approximation

The usual tools start with a specific form: \dot{y} = f(y)

I’ll work with the second clipper whose equation is of this form:

\frac{dV_o}{dt} = \frac{V_i - V_o}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t}) = f(V_o)

Forward Euler

The simplest way of computing the derivative term is to use the following rule, with h, the inverse of the sampling frequency:

V_{on+1} = V_{on} + h f(V_{on})

The nice thing about this rule is that it is easy to compute. The main drawback is that the result may not be accurate and stable enough (let’s keep this for the next post, with actual derivations).

Backward Euler

Instead of using the past to compute the new sample, we can use the future, which leads to

V_{on+1} = V_{on} + h f(V_{on+1})

As the result is present on both sides, solving the problem is not simple. In this case, we can even say that the equation has no closed-form solution (due to the sinh term), and thus no analytical solution. The only way is to use numerical methods such as Brent's method or Newton-Raphson to solve the equation.

Trapezoidal approximation

Another solution is to combine both rules to get a better approximation of the derivative term:

V_{on+1} = V_{on} + \frac{h}{2}(f(V_{on}) + f(V_{on+1}))

We still need numerical methods to solve the clipper equation with this scheme, but, like the backward Euler method, this one is said to be A-stable, which is a mandatory condition when solving stiff systems (or systems that have a bad condition number). For a one-variable system, the condition number is of course 1…

Other approximations

There are other ways of approximating this derivative term. The most commonly used is the trapezoidal method, but there are others, such as the linear multistep methods (which actually encompass the first three).

Numerical methods

Let’s try to analyse a few numerical methods. If we use the trapezoidal approximation, then the following function needs to be considered:

g(V_{on+1}) = V_{on+1} - V_{on} - \frac{h}{2}(f(V_{on}) + f(V_{on+1}))

The goal is to find a value where the function is zero, called a root of the function.
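
As a sketch, the same residual can be written in Python; f is the right-hand side of the ODE (for instance the one from the forward Euler sketch above), h the time step, vo_n the previous output sample, and vi_n, vi_n1 the previous and current input samples:

def make_g(f, h, vo_n, vi_n, vi_n1):
    # Residual of the trapezoidal update; its root is the new output sample
    f_n = f(vo_n, vi_n)
    def g(vo_n1):
        return vo_n1 - vo_n - 0.5 * h * (f_n + f(vo_n1, vi_n1))
    return g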

Bisection method

This method is simple to implement: start from two points, one on either side of the root. Then take the middle of the interval, check the sign of the function there, keep the original point whose sign is different, and continue until we get close enough to the root.

What is interesting with this method is that it can be vectorized easily by checking several values in the interval instead of just one.
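
A minimal sketch of the scalar version, assuming the interval [a, b] brackets the root (g(a) and g(b) have opposite signs):

def bisect(g, a, b, tol=1e-9, max_iter=100):
    ga = g(a)
    for _ in range(max_iter):
        m = 0.5 * (a + b)
        gm = g(m)
        if abs(gm) < tol or (b - a) < tol:
            return m
        if ga * gm < 0:
            b = m            # the root lies in [a, m]
        else:
            a, ga = m, gm    # the root lies in [m, b]
    return 0.5 * (a + b)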

Newton-Raphson

This numerical method requires the derivative function of g(). Then we can start from the original starting point and iterate this series:

x_{n+1} = x_n - \frac{g(x_n)}{g'(x_n)}

For those who are used to optimizing cost functions and know about the Newton method, this is exactly the same thing. To optimize a cost function, we need to find a zero of its derivative; so if g() is that derivative, then we end up with the Newton method for minimizing (or maximizing) a cost function.

That being said, although the rate of convergence is quadratic for the Newton-Raphson method, it may not converge at all. Convergence is only guaranteed when we start close enough to the root we are looking for and when some conditions that are usually quite complex to check are satisfied (see the Wikipedia page).
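
A minimal sketch of the iteration; g_prime is the derivative of g() with respect to the unknown (both supplied by the caller, as is the starting point x0 discussed below):

def newton_raphson(g, g_prime, x0, tol=1e-9, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = g(x) / g_prime(x)
        x -= step
        if abs(step) < tol:
            return x
    return x  # may not have converged if x0 was too far from the root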

Starting point

There are several ways of choosing a starting point for these methods.

The first one is easy enough: just reuse the result of the last optimization.

The second one is a little bit more complex: approximate all the complex functions (like sinh) by their tangent and solve the resulting polynomial.

The third one is derived from the second and is called the pivotal or mystran method. Instead of using the tangent, we use the linear function that passes through the origin and the last point. The idea here is that it can be more stable than the tangent method (consider doing this for the hyperbolic tangent: the tangent-based estimate could end up quite far off).

Conclusion

Of course, there are other numerical methods that I haven’t spoken about. Any can be tried and used. Please do so and report your results!

Let’s see how the ones I’ve shown behave in the next post.

by Matt at April 05, 2016 07:10 AM

April 04, 2016

Continuum Analytics news

Anaconda Powers TaxBrain to Transform Washington Through Transparency, Access and Collaboration

Posted Monday, April 4, 2016

Continuum Analytics and Open Source Policy Center Leverage the Power of Open Source to Build Vital Policy Forecasting Models with Anaconda 

AUSTIN, TX—April 4, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading open source analytics platform powered by Python, today announced that Anaconda is powering the American Enterprise Institute’s (AEI) Open Source Policy Center (OSPC) TaxBrain initiative. TaxBrain is a web-based application that lets users simulate and study the effect of tax policy reforms using open source economic models. TaxBrain provides transparent analysis for policy makers and the public, ultimately creating a more democratic and scientific platform to analyze economic policy.  

“OSPC’s mission is to empower the public to contribute to government policymaking through open source methods and technology, making policy analysis more transparent, trustworthy and collaborative,” said Matt Jensen, founder and managing director of the Open Source Policy Center at the American Enterprise Institute. “TaxBrain is OSPC’s first product, and with Anaconda it is already improving tax policy by making the policy analysis process more democratic and scientific. By leveraging the power of open source, we are able to provide policy makers, journalists and the general public with the information they need to impact and change policy for the better.” 

TaxBrain is made possible by a community of economists, data scientists, software developers, and policy experts who are motivated by a shared belief that public policy should be guided by open scientific inquiry, rather than proprietary analysis. The community also believes that the analysis of public policy should be freely available to everyone, rather than just to a select group of those in power.  

“The TaxBrain initiative is only the beginning of a much larger movement to use open source approaches in policy and government,” said Travis Oliphant, CEO and co-founder of Continuum Analytics. “TaxBrain is the perfect example of how Anaconda can inspire people to harness the power of data science to enable positive changes. With Anaconda, the OSPC is able to empower a growing community with the superpowers necessary to promote change and democratic policy reform.” 

Anaconda has allowed TaxBrain to tap a vast network of outside contributors in the PyData community to accelerate the growth of the total number of open source economic models. Contributions from the PyData community come quickly and—because of Anaconda—get large performance gains from Numba, the Python compiler included in Anaconda, and are easy to integrate into TaxBrain. These contributions can be used in other applications and are hosted on the Anaconda Cloud (anaconda.org/ospc).

To learn more about AEI’s OSPC TaxBrain initiative, please visit http://www.ospc.org/taxbrain/.  

About AEI

AEI is a nonprofit, nonpartisan public policy research organization that works to expand liberty, increase individual opportunity, and strengthen free enterprise. 

About Continuum Analytics

Continuum Analytics is the creator and driving force behind Anaconda, the leading, modern open source analytics platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 2.25M downloads annually and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team––data scientists, developers, DevOps, data engineers and business analysts––to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

To learn more about Continuum Analytics, visit www.continuum.io.

###

by pcudia at April 04, 2016 12:30 PM

April 02, 2016

NeuralEnsemble

EU Human Brain Project Releases Platforms to the Public

"Geneva, 30 March 2016 — The Human Brain Project (HBP) is pleased to announce the release of initial versions of its six Information and Communications Technology (ICT) Platforms to users outside the Project. These Platforms are designed to help the scientific community to accelerate progress in neuroscience, medicine, and computing.

[...]


The six HBP Platforms are:
  • The Neuroinformatics Platform: registration, search, analysis of neuroscience data.
  • The Brain Simulation Platform: reconstruction and simulation of the brain.
  • The High Performance Computing Platform: computing and storage facilities to run complex simulations and analyse large data sets.
  • The Medical Informatics Platform: searching of real patient data to understand similarities and differences among brain diseases.
  • The Neuromorphic Computing Platform: access to computer systems that emulate brain microcircuits and apply principles similar to the way the brain learns.
  • The Neurorobotics Platform: testing of virtual models of the brain by connecting them to simulated robot bodies and environments.
All the Platforms can be accessed via the HBP Collaboratory, a web portal where users can also find guidelines, tutorials and information on training seminars. Please note that users will need to register to access the Platforms and that some of the Platform resources have capacity limits."

   ... More in the official press release here.

 The HBP held an online release event on 30 March:


Prof. Felix Schürmann (EPFL-BBP, Geneva), Dr. Eilif Muller (EPFL-BBP, Geneva), and Prof. Idan Segev (HUJI, Jerusalem) present an overview of the mission, tools, capabilities and science of the EU Human Brain Project (HBP) Brain Simulation Platform:


A publicly accessible forum for the BSP is here:
https://forum.humanbrainproject.eu/c/bsp
and for community models
https://forum.humanbrainproject.eu/c/community-models
and for community models of hippocampus in particular
https://forum.humanbrainproject.eu/c/community-models/hippocampus

by eilif (noreply@blogger.com) at April 02, 2016 12:55 AM

April 01, 2016

Enthought

Virtual Core: CT, Photo, and Well Log Co-visualization

Enthought is pleased to announce Virtual Core 1.8.  Virtual Core automates aspects of core description for geologists, drastically reducing the time and effort required for core description, and its unified visualization interface displays cleansed whole-core CT data alongside core photographs and well logs.  It provides tools for geoscientists to analyze core data and extract features from […]

by admin at April 01, 2016 09:58 PM

Canopy Geoscience: Python-Based Analysis Environment for Geoscience Data

Today we officially release Canopy Geoscience 0.10.0, our Python-based analysis environment for geoscience data. Canopy Geoscience integrates data I/O, visualization, and programming, in an easy-to-use environment. Canopy Geoscience is tightly integrated with Enthought Canopy’s Python distribution, giving you access to hundreds of high-performance scientific libraries to extract information from your data. The Canopy Geoscience environment […]

by admin at April 01, 2016 09:48 PM

March 31, 2016

Continuum Analytics news

Why Every CEO Needs To Understand Data Science

Posted Thursday, March 31, 2016

Tech culture has perpetuated the myth that data science is a sort of magic; something that only those with exceptional math skills, deep technical know-how and industry knowledge can understand or act on. While it’s true that math skills and technical knowledge are required to effectively extract insights from data, it’s far from magic. Given a little time and effort, anyone can become familiar with the basic concepts.

As a CEO, you don’t need to understand every technical detail, but it’s very important to have a good grasp of the entire process behind extracting useful insights from your data. Click on the full article below to read the five big-picture steps you must take to ensure you understand data science (and to ensure your company is gaining actionable insights throughout the process).

Read the full article here.

by swebster at March 31, 2016 03:43 PM

March 30, 2016

Titus Brown

A grant proposal: A workshop on dockerized notebook computing

I'm writing a proposal to the Sloan Foundation for about $20k to support a workshop to hack on mybinder. Comments solicited. Note, it's, umm, due today ;).

(I know the section on "major related work" is weak. I could use some help there.)

If you're interested in participating and don't mind being named in the proposal, drop me an e-mail or leave me a note in a comment.


Summary:

We propose to host a hackfest (cooperative hackathon) workshop to enhance and extend the functionality of the mybinder notebook computing platform. Mybinder provides a fully hosted solution for executing Jupyter notebooks based in GitHub repositories; it isolates execution and provides configurability through the use of Docker containers. We would like to extend mybinder to support a broader range of data science tools - specifically, RStudio and other tools in the R ecosystem - as well as additional cloud hosting infrastructure and version control systems. A key aspect of this proposal is to brainstorm and prototype support for credentials so that private resources can be used to source and execute binders (on e.g. AWS accounts and private repositories). We believe this will broaden interest and use of mybinder and similar resources, and ultimately drive increased adoption of fully specified and executable data narratives.

What is the subject, and why is it important?

Fully specified and perfectly repeatable computational data analyses have long been a goal of open scientists - after all, if we can eliminate the dull parts of reproducibility, we can get on with the more exciting bits of arguing about significance, interpretation and meaning. There is also the hope that fully open and repeatable computational methods can spur reuse and remixing of these methods, and accelerate the process of computational science. We have been asymptotically approaching this in science for decades, helped by configuration and versioning tools used in open source development and devops. The emergence of fully open source virtual machine hosting, cloud computing, and (most recently) container computing means that a detailed host configuration can be shared and instantiated on demand; together with flexible interactive Web notebooks such as RStudio and Jupyter Notebook, and public hosting of source code, we can now provide code, specify the execution environment, execute a workflow, and give scientists (and others) an interactive data analysis environment in which to explore the results.

Fully specified data narratives are already impacting science and journalism, by providing a common platform for data driven discussion. We and others are using them in education, as well, for teaching and training in data science, statistics, and other fields. Jupyter Notebooks and RMarkdown are increasingly used to write books and communicate data- and compute-based analyses. And, while the technology is still young, the interactive widgets in Jupyter make it possible to communicate these analyses to communities that are not coders. At the moment, we can’t say where this will all lead, but it is heading towards an exciting transformation of how we publish, work with, collaborate around, and explore computing.

mybinder and similar projects provide a low barrier-to-entry way to publish and then execute Jupyter notebooks in a fully customizable environment, where dependencies and software requirements can be specified in a simple, standard way tied directly to the project. For example, we have been able to use mybinder to provision notebooks for a classroom of 30 people in under 30 seconds, with only 5 minutes of setup and preparation. Inadvertent experiments on Twitter have shown us that the current infrastructure can expand to handle hundreds of simultaneous users. In effect, mybinder is a major step closer to helping us realize the promise of ubiquitous and frictionless computational narratives.

What is the major related work in this field?

I am interested in (a) potentially anonymous execution of full workflows, (b) within a customizable compute environment, (c) with a robust, open source software infrastructure supporting the layer between the execution platform and the interactive environment.

There are a variety of hosting platforms for Jupyter Notebooks, but I am only aware of one that offers completely anonymous execution of notebooks - this is the tmpnb service provided by Rackspace, used to support a Nature demo of the notebook. Even more than mybinder, tmpnb enables an ecosystem of services because it lets users frictionlessly “spin up” an execution environment - for a powerful demo of this being used to support static execution, see https://betatim.github.io/posts/really-interactive-posts/. However, tmpnb doesn’t allow flexible configuration of the underlying execution environment, which prevents it from being used for more complex interactions in the same way as mybinder.

More generally, there are many projects that seek to make deployment and execution of Jupyter Notebooks straightforward. This includes JupyterHub, everware, thebe, Wakari, SageMathCloud, and Google Drive. Apart from everware (which has significant overlap with mybinder in design) these other platforms are primarily designed around delivery of notebooks and collaboration within them, and do not provide the author-specified customization of the compute environment provided by mybinder via Docker containers. That having been said, all of these are robust players within the Jupyter ecosystem and are building out tools and approaches that mybinder can take advantage of (and vice versa).

We have already discussed the proposed workshop informally with people from Jupyter, everware, and thebe, and anticipate inviting at least one team member from each project (as well as tmpnb).

Outside of the Jupyter ecosystem, the R world has a collection of software that I’m mostly unfamiliar with. This includes RStudio Server, an interactive execution environment for R that is accessed remotely over a Web interface, and Shiny, which allows users to serve R-based analyses to the Web. These compare well with Jupyter Notebook in features and functionality. One goal of this workshop is to provide a common execution system to support these in the same way that mybinder supports Jupyter now, and to find out which R applications are the most appropriate ones to target. We will invite one or more members of the rOpenSci team to the workshop for exactly this purpose.

Why is the proposer qualified?

My major qualifications for hosting the workshop are as follows:

  • Known (and “good”) actor in the open source world, with decades of dedication to open source and open science principles and community interaction.
  • Official Jupyter evangelist, on good terms with everyone (so far as I know).
  • Neutral player with respect to all of these projects.
  • Teacher and trainer and designer of workshop materials for Jupyter notebooks, Docker, reproducible science, version control, and cloud computing.
  • Affiliated with Software Carpentry and Data Carpentry, to help with delivery of training materials.
  • Faculty at a major US university, with access to admin support and workshop space.
  • Interest and some experience in fostering diverse communities (primarily from connections with Python Software Foundation and Software Carpentry).
  • Technically capable of programming my way out of a paper bag.
  • Located in California, to which people will enthusiastically travel in the winter months.
  • One of the members and discussants of the everpub #openscienceprize proposal, which explored related topics of executable publications in some detail (but was then rejected).

What is the approach being taken?

The approach is to run a cooperative hackathon/hackfest/workshop targeting a few specific objectives, but with flexibility to expand or change focus as determined by the participants. The objectives are essentially as listed in my mybinder blog post (http://ivory.idyll.org/blog/2016-mybinder.html):

  • hack on mybinder and develop APIs and tools to connect mybinder to other hosting platforms, both commercial (AWS, Azure, etc.) and academic (e.g. XSEDE/TACC);
  • connect mybinder to other versioning sites, including bitbucket and gitlab;
  • brainstorm and hack on ways to connect credentials to mybinder to support private repositories and for-pay compute;
  • identify missing links and technologies that are needed to more fully realize the promise of mybinder and Jupyter notebook;
  • identify overlaps and complementarity with existing projects that we can make use of;
  • build more integrated support for docker hub (and private hub) based images;
  • brainstorm around blockers that prevent mybinder from being used for more data-intensive workflows.

About half of the invitations will be to projects that are involved in this area already (listed above, together with at least two people from the Freeman Lab, who develop mybinder). I also anticipate inviting at least one librarian (to bring the archivist and data librarian perspectives in) and one journalist (perhaps Jeffrey Perkel, who has written on these topics several times). The other half will be opened to applications from the open source and open science communities.

All outputs from the workshop will be made available under an Open Source license through github or another hosting platform (which is where the mybinder source is currently hosted). We will also provide a livestream from workshop presentations and discussions so that the larger community can participate.

What will be the output from the project?

In addition to source code, demonstrations, proofs of concept, etc., I anticipate several blog posts from different perspectives. If we can identify an enthusiastic journalist we could also get an article out targeted at a broader audience. I also expect to develop (and deliver) training materials around any new functionality that emerges from this workshop.

What is the justification for the amount of money requested?

We anticipate fully supporting travel, room, and board for 15 people from this grant, although this number may increase or decrease depending on costs. We will also provide snacks and one restaurant dinner. No compute costs or anything else is requested - we can support the remainder of the workshop hosting entirely out of our existing resources.

What are the other sources of support?

I expect to supplement travel and provide compute as needed out of my existing Moore DDD Investigator funding and my startup funds. However, no other support explicitly for this project is being requested.

by C. Titus Brown at March 30, 2016 10:00 PM

March 29, 2016

Travis Oliphant

Anaconda and Hadoop --- a story of the journey and where we are now.

Early Experience with Clusters

My first real experience with cluster computing came in 1999 during my graduate school days at the Mayo Clinic.  These were wonderful times.   My advisor was Dr. James Greenleaf.   He was very patient with allowing me to pester a bunch of IT professionals throughout the hospital to collect their aging Mac Performa machines and build my own home-grown cluster.   He also let me use a bunch of space in his ultrasound lab to host the cluster for about 6 months.

Building my own cluster

The form-factor for those Mac machines really made it easy to stack them.   I ended up with 28 machines in two stacks with 14 machines in each stack (all plugged into a few power strips and a standard lab-quality outlet).  With the recent release of Yellow-Dog Linux, I wiped the standard OS from all the machines  and installed Linux on all those Macs to create a beautiful cluster of UNIX goodness I could really get excited about.   I called my system "The Orchard" and thought it would be difficult to come up with 28 different kinds of apple varieties to name each machine after.  It wasn't difficult. It turns out there are over 7,500 varieties of apples grown throughout the world.

The reason I put this cluster together was to simulate Magnetic Resonance Elastography (MRE) which is a technique to visualize motion using Magnetic Resonance Imaging (MRI).  I wanted to simulate the Bloch equations with a classical model for how MRI images are produced.  The goal was to create a simulation model for the MRE experiment that I could then use to both understand the data and perhaps eventually use this model to determine material properties directly from the measurements using Bayesian inversion (ambitiously bypassing the standard sequential steps of inverse FFT and local-frequency estimation).

Now I just had to get all these machines to talk to each other, and then I would be poised to do anything.  I read up a bit on MPI, PVM, and anything else I could find about getting computers to talk to each other.  My unfamiliarity with the field left me puzzled as I tried to learn these frameworks in addition to figuring out how to solve my immediate problem.  Eventually, I just settled down with a trusted UNIX book by the late W. Richard Stevens.    This book explained how the internet works.   I learned enough about TCP/IP and sockets so that I could write my own C++ classes representing the model.  These classes communicated directly with each other over raw sockets.   While using sockets directly was perhaps not the best approach, it did work and helped me understand the internet so much better.  It also makes me appreciate projects like tornado and zmq that much more.

Lessons Learned

I ended up with a system that worked reasonably well, and I could simulate MRE to some manner of fidelity with about 2-6 hours of computation. This little project didn't end up being critical to my graduation path and so it was abandoned after about 6 months.  I still value what I learned about C++, how abstractions can ruin performance, how to guard against that, and how to get machines to communicate with each other.

Using Numeric, Python, and my recently-linked ODE library (early SciPy), I built a simpler version of the simulator that was actually faster on one machine than my cluster-version was in C++ on 20+ machines.  I certainly could have optimized the C++ code, but I could have also optimized the Python code.   The Python code took me about 4 days to write, the C++ code took me about 4 weeks.  This experience has markedly influenced my thinking for many years about both pre-mature parallelization and pre-mature use of C++ and other compiled languages.

Fast forward over a decade. My computer efforts until 2012 were spent on sequential array-oriented programming, creating SciPy, writing NumPy, solving inverse problems, and watching a few parallel computing paradigms emerge while I worked on projects to provide for my family. I didn't personally get to work on parallel computing problems during that time, though I always dreamed of going back and implementing this MRE simulator using a parallel construct with NumPy and SciPy directly. When I needed to do the occasional parallel computing example during this intermediate period, I would either use IPython parallel or multi-processing.

Parallel Plans at Continuum

In 2012, Peter Wang and I started Continuum, created PyData, and released Anaconda.   We also worked closely with members of the community to establish NumFOCUS as an independent organization.  In order to give NumFOCUS the attention it deserved, we hired the indefatigable Leah Silen and donated her time entirely to the non-profit so she could work with the community to grow PyData and the Open Data Science community and ecosystem.  It has been amazing to watch the community-based, organic, and independent growth of NumFOCUS.    It took effort and resources to jump-start,  but now it is moving along with a diverse community driving it.   It is a great organization to join and contribute effort to.

A huge reason we started Continuum was to bring the NumPy stack to parallel computing --- for both scale-up (many cores) and scale-out (many nodes).   We knew that we could not do this alone and it would require creating a company and rallying a community to pull it off.   We worked hard to establish PyData as a conference and concept and then transitioned the effort to the community through NumFOCUS to rally the community behind the long-term mission of enabling data-, quantitative-, and computational-scientists with open-source software.  To ensure everyone in the community could get the software they needed to do data science with Python quickly and painlessly, we also created Anaconda and made it freely available.

In addition to important community work, we knew that we would need to work alone on specific, hard problems to also move things forward.   As part of our goals in starting Continuum we wanted to significantly improve the status of Python in the JVM-centric Hadoop world.   Conda, Bokeh, Numba, and Blaze were the four technologies we started specifically related to our goals as a company beginning in 2012.   Each had a relationship to parallel computing including Hadoop.

Conda enables easy creation and replication of environments built around deep and complex software dependencies that often exist in the data-scientist workflow.   This is a problem on a single node --- it's an even bigger problem when you want that environment easily updated and replicated across a cluster.

Bokeh  allows visualization-centric applications backed by quantitative-science to be built easily in the browser --- by non web-developers.   With the release of Bokeh 0.11 it is extremely simple to create visualization-centric-web-applications and dashboards with simple Python scripts (or also R-scripts thanks to rBokeh).

With Bokeh, Python data scientists now have the power of both d3 and Shiny, all in one package. One of the driving use-cases of Bokeh was also easy visualization of large data. Connecting the visualization pipeline with large-scale cluster processing was always a goal of the project. With datashader, this goal is now being realized: billions of points can be rendered in seconds and displayed in the browser.

Our scale-up computing efforts centered on the open-source Numba project as well as our Accelerate product. Numba has made tremendous progress in the past couple of years, and is in production use in multiple places. Many are taking advantage of numba.vectorize to create array-oriented solutions and program the GPU with ease. The CUDA Python support in Numba makes it the easiest way to program the GPU that I'm aware of. The CUDA simulator provided in Numba makes it much simpler to debug in Python the logic of CUDA-based GPU programming. The addition of parallel contexts to numba.vectorize means that any many-core architecture can now be exploited in Python easily. Early HSA support is also in Numba now, meaning that Numba can be used to program novel hardware devices from many vendors.
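
For example, a minimal numba.vectorize sketch (the kernel and array sizes are made up for illustration) turns a scalar function into a ufunc whose element-wise loop runs across all cores; switching the target to 'cuda' moves the same kernel to the GPU:

import numpy as np
from numba import vectorize

@vectorize(['float64(float64, float64)'], target='parallel')
def rel_diff(x, y):
    # Scalar kernel; Numba compiles and parallelizes the element-wise loop
    return 2.0 * (x - y) / (x + y + 1e-12)

a = np.random.rand(10 ** 6)
b = np.random.rand(10 ** 6)
print(rel_diff(a, b)[:5])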

Summarizing Blaze 

The ambitious Blaze project will require another blog-post to explain its history and progress well. I will only try to summarize the project and where it's heading. Blaze came out of a combination of deep experience with industry problems in finance, oil & gas, and other quantitative domains that would benefit from a large-scale logical array solution that was easy to use and connected with the Python ecosystem. We observed that the MapReduce engine of Hadoop was definitely not what was needed. We were also aware of Spark and RDDs but felt that they, too, were not general enough (nor flexible enough) for the demands of distributed array computing we encountered in those fields.

DyND, Datashape, and a vision for the future of Array-computing 

After early work trying to extend the NumPy code itself led to struggles, because of both the organic complexity of the code base and the stability needs of a mature project, the Blaze project started with an effort to re-build the core functionality of NumPy and Pandas to fix some major warts of NumPy that had been on my mind for some time. With Continuum support, Mark Wiebe decided to continue to develop a C++ library that could then be used by Python and any other data-science language (DyND). This necessitated defining a new data-description language (datashape) that generalizes NumPy's dtype to structures of arrays (column-oriented layout) as well as variable-length strings and categorical types. This work continues today and is making rapid progress, which I will leave to others to describe in more detail. I do want to say, however, that DyND is implementing my "Pluribus" vision for the future of array-oriented computing in Python. We are factoring the core capability into 3 distinct parts: the type-system (or data-declaration system), a generalized function mechanism that can interact with any "typed" memory-view or "typed" buffer, and finally the container itself. We are nearing release of a separate type-library and are working on a separate C-API to the generalized function mechanism. This is where we are heading and it will allow maximum flexibility and re-use in the dynamic and growing world of Python and data-analysis. The DyND project is worth checking out right now (if you have desires to contribute) as it has made rapid progress in the past 6 months.

As we worked on the distributed aspects of Blaze, it became clear that to scale array computing to many machines you fundamentally have to move code and not data. To do this well, how the computer actually sees and makes decisions about the data must be exposed. This information is usually part of the type system that is hidden either inside the compiler, in the specifics of the data-base schema, or implied as part of the runtime. To fundamentally solve the problem of moving code to data in a general way, a first-class and widespread data-description language must be created and made available. Python users will recognize that a subset of this kind of information is contained in the struct module (the struct "format" strings), in the Python 3 extended buffer protocol definition (PEP 3118), and in NumPy's dtype system. Extending these concepts to any language is the purpose of datashape.
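
As a small illustration of the kind of information involved, the same 16-byte record layout can be described with a struct format string and with a NumPy dtype (the field names are made up):

import struct

import numpy as np

fmt = '<i8sf'  # little-endian: int32, 8-byte string, float32
dt = np.dtype([('id', '<i4'), ('name', 'S8'), ('score', '<f4')])

packed = struct.pack(fmt, 7, b'alice', 0.5)
print(struct.calcsize(fmt), dt.itemsize)   # both describe 16 bytes
print(np.frombuffer(packed, dtype=dt))     # the same bytes, seen through the dtype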

In addition, run-times that understand this information and can execute instructions on variables that expose this information must be adapted or created for every system.  This is part of the motivation for DyND and why very soon the datashape system and its C++ type library will be released independently from the rest of DyND and Blaze.   This is fundamentally why DyND and datashape are such important projects to me.  I see in them the long-term path to massive code-reuse, the breaking down of data-silos that currently cause so much analytics algorithm duplication and lack of cooperation.

Simple algorithms from data-munging scripts to complex machine-learning solutions must currently be re-built for every-kind of data-silo unless there is a common way to actually functionally bring code to data.  Datashape and the type-library runtime from DyND (ndt) will allow this future to exist.   I am eager to see the Apache Arrow project succeed as well because it has related goals (though more narrowly defined).

The next step in this direction is an on-disk and in-memory data-fabric that allows data to exist in a distributed file-system or a shared-memory across a cluster with a pointer to the head of that data along with a data-shape description of how to interpret that pointer so that any language that can understand the bytes in that layout can be used to execute analytics on those bytes.  The C++ type run-time stands ready to support any language that wants to parse and understand data-shape-described pointers in this future data-fabric.

From one point of view, this DyND and data-fabric effort are a natural evolution of the efforts I started in 1998 that led to the creation of SciPy and NumPy.  We built a system that allows existing algorithms in C/C++ and Fortran to be applied to any data in Python.   The evolution of that effort will allow algorithms from many other languages to be applied to any data in memory across a cluster.

Blaze Expressions and Server

The key part of Blaze that is also important to mention is the notion of the Blaze server and user-facing Blaze expressions and functions.   This is now what Blaze the project actually entails --- while other aspects of Blaze have been pushed into their respective projects.  Functionally, the Blaze server allows the data-fabric concept on a machine or a cluster of machines to be exposed to the rest of the internet as a data-url (e.g. http://mydomain.com/catalog/datasource/slice).   This data-url can then be consumed as a variable in a Blaze expression --- first across entire organizations and then across the world.

This is the truly exciting part of Blaze that would enable all the data in the world to be as accessible as an already-loaded data-frame or array. The logical expressions and transformations you can then write on those data to be your "logical computer" will then be translated at compute time to the actual run-time instructions as determined by the Blaze server, which is mediating communication with various backends depending on where the data is actually located. We are realizing this vision on many data-sets and a certain set of expressions already with a growing collection of backends. It is allowing true "write-once and run anywhere" to be applied to data-transformations and queries and eventually data-analytics. Currently, the data-scientist finds herself in a situation similar to the assembly programmer in the 1960s who had to know what machine the code would run on before writing the code. Before beginning a data analytics task, you have to determine which data-silo the data is located in before tackling the task. SQL has provided a database-agnostic layer for years, but it is too limiting for advanced analytics --- and user-defined functions are still database specific.

Continuum's support of blaze development is currently taking place as defined by our consulting customers as well as by the demands of our Anaconda platform and the feature-set of an exciting new product for the Anaconda Platform that will be discussed in the coming weeks and months. This new product will provide a simplified graphical user-experience on top of Blaze expressions, and Bokeh visualizations for rapidly connecting quantitative analysts to their data and allowing explorations that retain provenance and governance.  General availability is currently planned for August.

Blaze also spawned additional efforts around fast compressed storage of data (blz which formed the inspiration and initial basis for bcolz) and experiments with castra as well as a popular and straight-forward tool for quickly copying data from one data-silo kind to another (odo).

Developing dask the library and Dask the project

The most important development to come out of Blaze, however, will have tremendous impact in the short term well before the full Blaze vision is completed.  This project is Dask and I'm excited for what Dask will bring to the community in 2016.   It is helping us finally deliver on scaled-out NumPy / Pandas and making Anaconda a first-class citizen in Hadoop.

In 2014, Matthew Rocklin started working at Continuum on the Blaze team.   Matthew is the well-known author of many functional tools for Python.  He has a great blog you should read regularly.   His first contribution to Blaze was to adapt a multiple-dispatch system he had built which formed the foundation of both odo and Blaze.  He also worked with Andy Terrel and Phillip Cloud to clarify the Blaze library as a front-end to multiple backends like Spark, Impala, Mongo, and NumPy/Pandas.

With these steps taken, it was clear that the Blaze project needed its own first-class backend as well, something that the community could rally around to ensure that Python remained a first-class participant in the scale-out conversation --- especially where systems that connected with Hadoop were being promoted. Python should not ultimately be relegated to being a mere front-end system that scripts Spark or Hadoop --- unable to talk directly to the underlying data. This is not how Python achieved its place as a de-facto data-science language. Python should be able to access and execute on the data directly inside Hadoop.

Getting there took time.  The first version of dask was released in early 2015 and while distributed work-flows were envisioned, the first versions were focused on out-of-core work-flows --- allowing problem-sizes that were too big to fit in memory to be explored with simple pandas-like and numpy-like APIs.

When Matthew showed me his first version of dask, I was excited.  I loved three things about it:  1) It was simple and could, therefore, be used as a foundation for parallel PyData.  2) It leveraged already existing code and infrastructure in NumPy and Pandas.  3) It had very clean separation between collections like arrays and data-frames, the directed graph representation, and the schedulers that executed those graphs.   This was the missing piece we needed in the Blaze ecosystem.   I immediately directed people on the Blaze team to work with Matt Rocklin on Dask and asked Matt to work full-time on it.

He and the team made great progress and by summer of 2015 had a very nice out-of-core system working with two functioning parallel-schedulers (multi-processing and multi-threaded).  There was also a "synchronous" scheduler that could be used for debugging the graph and the system showed well enough throughout 2015 to start to be adopted by other projects (scikit-image and xarray).

In the summer of 2015, Matt began working on the distributed scheduler. By fall of 2015, he had a very nice core system leveraging the hard work of the Python community. He built the API around the concepts of asynchronous computing already being promoted in Python 3 (futures) and built dask.distributed on top of tornado. The next several months were spent improving the scheduler by exposing it to as many work-flows as possible from computational science and quantitative science. By February of 2016, the system was ready to be used by a variety of people interested in distributed computing with Python. This process continues today.

Using dask.dataframe and dask.array you can quickly build table- and array-based work-flows with a Pandas-like and NumPy-like syntax, respectively, that work on data sitting across a cluster.
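
A minimal sketch of what those APIs look like (the file paths and column names below are hypothetical):

import dask.array as da
import dask.dataframe as dd

# Many CSV files become one logical, lazily evaluated dataframe
df = dd.read_csv('data/2016-*.csv')
daily_mean = df.groupby('day')['amount'].mean()
print(daily_mean.compute())   # the scheduler executes the task graph here

# A chunked array with a NumPy-like API
x = da.random.random((100000, 1000), chunks=(10000, 1000))
print(x.mean(axis=0).compute())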

Anaconda and the PyData ecosystem now had another solution for the scale-out problem --- one whose design and implementation was something I felt could be a default run-time backend for Blaze.  As a result, I could get motivated to support, market, and seek additional funding for this effort.  Continuum has received some DARPA funding under the XDATA program.  However, this money was now spread pretty thin among Bokeh, Numba, Blaze, and now Dask.

Connecting to Hadoop

With the distributed scheduler basically working and beginning to improve, two problems remained with respect to Hadoop interoperability: 1) direct access to the data sitting in HDFS and 2) interaction with the resource schedulers running most Hadoop clusters (YARN or mesos).

To see how important the next developments are, it is useful to describe an anecdote from early on in our XDATA experience.  In the summer of 2013, when the DARPA XDATA program first kicked-off, the program organizers had reserved a large Hadoop cluster (which even had GPUs on some of the nodes).  They loaded many data sets onto the cluster and communicated about its existence to all of the teams who had gathered to collaborate on getting insights out of "Big Data."    However, a large number of the people collaborating were using Python, R, or C++.  To them the Hadoop cluster was inaccessible as there was very little they could use to interact with the data stored in HDFS (beyond some high-latency and low-bandwidth streaming approaches) and nothing they could do to interact with the scheduler directly (without writing Scala or Java code). The Hadoop cluster sat idle for most of the summer while teams scrambled to get their own hardware to run their code on and deliver their results.

This same situation we encountered in 2013 exists in many organizations today.  People have large Hadoop infrastructures, but are not connecting that infrastructure effectively to their data-scientists who are more comfortable in Python, R, or some-other high-level (non JVM language).

With dask working reasonably well, tackling this data-connection problem head on became an important part of our Anaconda for Hadoop story and so in December of 2015 we began two initiatives to connect Anaconda directly to Hadoop.   Getting data from HDFS turned out to be much easier than we had initially expected because of the hard-work of many others.    There had been quite a bit of work building a C++ interface to Hadoop at Pivotal that had culminated in a library called libhdfs3.   Continuum wrote a Python interface to that library quickly, and it now exists as the hdfs3 library under the Dask organization on Github.
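
As a rough sketch of what that interface looks like (the host, port and paths here are placeholders, and the exact API may differ between versions):

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host='namenode', port=8020)   # placeholder connection details
print(hdfs.ls('/user/data'))                      # list files sitting in HDFS
with hdfs.open('/user/data/part-00000.csv', 'rb') as f:
    header = f.readline()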

The second project was a little more involved as we needed to integrate with YARN directly.   Continuum developers worked on this and produced a Python library that communicates directly to the YARN classes (using Scala) in order to allow the Python developer to control computing resources as well as spread files to the Hadoop cluster.   This project is called knit, and we expect to connect it to mesos and other cluster resource managers in the near future (if you would like to sponsor this effort, please get in touch with me).

Early releases of hdfs3 and knit were available by the end of February 2016. At that time, these projects were joined with dask.distributed and the dask code-base into a new GitHub organization called Dask. The graduation of Dask into its own organization signified an important milestone: dask was now ready for rapid improvement and growth alongside Spark as a first-class execution engine in the Hadoop ecosystem.

Our initial goals for Dask are to build enough examples, capability, and awareness so that every PySpark user tries Dask to see if it helps them.    We also want Dask to be a compatible and respected member of the growing Hadoop execution-framework community.   We are also seeking to enable Dask to be used by scientists of all kinds who have both array and table data stored on central file-systems and distributed file-systems outside of the Hadoop ecosystem.

Anaconda as a first-class execution ecosystem for Hadoop

With Dask (including hdfs3 and knit), Anaconda is now able to participate on an equal footing with every other execution framework for Hadoop.  Because of the vast reaches of Anaconda Python and Anaconda R communities, this means that a lot of native code can now be integrated to Hadoop much more easily, and any company that has stored their data in HDFS or other distributed file system (like s3fs or gpfs) can now connect that data easily to the entire Python and/or R computing stack.

This is exciting news!    While we are cautious because these integrative technologies are still young, they are connected to and leveraging the very mature PyData ecosystem.    While benchmarks can be misleading, we have a few benchmarks that I believe accurately reflect the reality of what parallel and distributed Anaconda can do and how it relates to other Hadoop systems.  For array-based and table-based computing workflows, Dask will be 10x to 100x faster than an equivalent PySpark solution.   For applications where you are not using arrays or tables (i.e. word-count using a dask.bag), Dask is a little bit slower than a similar PySpark solution.  However, I would argue that Dask is much more Pythonic and easier to understand for someone who has learned Python.

It will be very interesting to see what the next year brings as more and more people realize what is now available to them in Anaconda.  The PyData crowd will now have instant access to cluster computing at a scale that has previously been accessible only by learning complicated new systems based on the JVM or paying an unfortunate performance penalty.   The Hadoop crowd will now have direct and optimized access to entire classes of algorithms from Python (and R) that they have not previously been used to.

It will take time for this news and these new capabilities to percolate, be tested, and find use-cases that resonate with the particular problems people actually encounter in practice.  I look forward to helping many of you take the leap into using Anaconda at scale in 2016.

We will be showing off aspects of the new technology at Strata in San Jose in the Continuum booth #1336 (look for the Anaconda logo and mark). We have already announced some of the capabilities at a high level. Peter and I will both be at Strata along with several of the talented people at Continuum. If you are attending, drop by and say hello.

We first came to Strata on behalf of Continuum in 2012 in Santa Clara. We announced that we were going to bring you scaled-out NumPy. We are now beginning to deliver on this promise with Dask. We brought you scaled-up NumPy with Numba. Blaze and Bokeh will continue to bring them together along with the rest of the larger data community to provide real insight on data --- wherever it is stored. Try out Dask and join the new scaled-out PyData story, which is richer than ever before, has a larger community than ever before, and has a brighter future than ever before.

by Travis Oliphant (noreply@blogger.com) at March 29, 2016 01:54 PM

Matthieu Brucher

Announcement: Audio ToolKit moves to its own website

I’ve decided to create a real space for Audio ToolKit. The idea is to make it more visible, with a consistent message to the users.

In addition to this move, this blog has moved to a subdomain there (as you may have noticed), and so has the Audio ToolKit documentation.

I’ve updated Audio ToolKit to version 1.2.0:
* Added SecondOrderSVF filters from cytomic with Python wrappers
* Implemented a LowPassReverbFilter with Python wrappers
* Added Python wrappers to AllPassReverbFilter
* Optimized the distortion filters
* A bunch of fixes (Linux compilation, calls…)

Download link: ATK 1.2.0

I’ve also updated the SD1 emulation to version 2.0.0. The sound is now closer to the actual pedal, thanks to a better modeling of the circuit. It also means that you won’t get exactly the same sound with this new release, so pay attention when you update!

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATK SD1.

The files, as well as the previous plugins and the source code, can be downloaded from SourceForge.

by Matt at March 29, 2016 07:30 AM

March 27, 2016

Titus Brown

Reproducibility, repeatability, and a workshop in Norway

This March, Andreas Hejnol invited me to give a talk in Bergen, Norway, and as part of the trip I arranged to also give a trial workshop on "computational reproducibility" at the University in Oslo, where my friend & colleague Lex Nederbragt works.

(The workshop materials are here, under CC0.)

The basic idea of this workshop followed from recent conversations with Jonah Duckles (Executive Director, Software Carpentry) and Tracy Teal (Executive Director, Data Carpentry), where we'd discussed the observation that many experienced computational biologists had chosen a fairly restricted set of tools for building highly reproducible scientific workflows. Since Software Carpentry and Data Carpentry taught most of these tools, I thought that we might start putting together a "next steps" lesson showing how to efficiently and effectively combine these tools for fun and profit and better science.

The tool categories are as follows: version control (git or hg), a build system to coordinate running a bunch of scripts (make or snakemake, typically), a literate data analysis and graphing system to make figures (R and RMarkdown, or Python and Jupyter), and some form of execution environment (virtualenv, virtual machines, cloud machines, and/or Docker, typically). There are many different choices here - I've only listed the ones I run across regularly - and everyone has their own favorites, but those are the categories I see.

https://raw.githubusercontent.com/ngs-docs/2016-oslo-repeatability/master/images/overview-1.png

These tools generally work together in some fashion like this:

https://raw.githubusercontent.com/ngs-docs/2016-oslo-repeatability/master/images/overview-2.png

In the end, the 1-day workshop description I sent Lex was:

Reproducibility of computational research can be achieved by connecting
together tools that many people already know.  This workshop will walk
attendees through combining RMarkdown/Jupyter, git (version control),
and make (workflow automation) for building highly reproducible analyses
of biological data.  We will also introduce Docker as a method for
specifying your computational environment, and demonstrate mybinder.org
for posting computational analyses.

The workshop will be interactive and hands-on, with many opportunities for
questions and discussion.

We suggest that attendees have some experience with the UNIX shell, either
Python or R, and git - a Software or Data Carpentry workshop will suffice.

People will need an installation of Docker on their laptop (or access to the
Amazon Cloud).

From this, you can probably infer that what I was planning on teaching was how to use a local Docker container to combine git, make, and Jupyter Notebook to do a simple analysis.

As the time approached (well, on the way to Oslo :) I started to think concretely about this, and I found myself blocking on two issues:

First, I had never tried to get a whole classroom of people to run Docker before. I've run several Docker workshops, but always in circumstances where I had more backup (either AWS, or experienced users willing to fail a lot). How, in a single day, could I get a whole classroom of people through Docker, git, make, and Jupyter Notebook? Especially when the Jupyter docker container required a large download and a decent machine and would probably break on many laptops?

Second, I realized I was developing a distaste for the term 'reproducibility', because of the confusion around what it meant. Computational people tend to talk about reproducibility in one sense - 'is the bug reproducible?' - while scientists tend to talk about reproducibility in the larger sense of scientific reproducibility - 'do other labs see the same thing I saw?' You can't really teach scientific reproducibility in a hands-on way, but it is what scientists are interested in; while computational reproducibility is useful for several reasons, but doesn't have an obvious connection to scientific reproducibility.

Luckily, I'd already come across a solution to the first issue the previous week, when I ran a workshop at UC Davis on Jupyter Notebook that relied 100% on the mybinder service - literally, no local install needed! We just ran everything on Google Compute Engine, on the Freeman Lab's dime. It worked pretty well, and I thought it would work here, too. So I resolved to do the first 80% or more of the workshop in the mybinder container, making use of Jupyter's built-in Web editor and terminal to build Makefiles and post things to git. Then, given time, I could segue into Docker and show how to build a Docker container that could run the full git repository, including both make and the Jupyter notebook we wrote as part of the analysis pipeline.

The second issue was harder to resolve, because I wanted to bring things down to a really concrete level and then discuss them from there. What I ended up doing was writing a moderately lengthy introduction on my perspective, in which I further confused the issue by using the term 'repeatability' for completely automated analyses that could be run exactly as written by anyone. You can read more about it here. (Tip o' the hat to Victoria Stodden for several of the arguments I wrote up, as well as a pointer to the term 'repeatability'.)

At the actual workshop, we started with a discussion about the goals and techniques of repeatability in computational work. This occupied us for about an hour, and involved about half of the class; there were some very experienced (and very passionate) scientists in the room, which made it a great learning experience for everyone involved, including me! We discussed how different scientific domains thought differently about repeatability, reproducibility, and publishing methods, and tentatively reached the solid conclusion that this was a very complicated area of science ;). However, the discussion served its purpose, I think: no one was under any illusions that I was trying to solve the reproducibility problem with the workshop, and everyone understood that I was simply showing how to combine tools to build a perfectly repeatable workflow.

We then moved on to a walkthrough of Jupyter Notebook, followed by the first two parts of the make tutorial. We took our resulting Makefile, our scripts, and our data, and committed them to git and pushed them to github (see my repo). Then we took a lunch break.

In the afternoon, we built a Jupyter Notebook that did some silly graphing. (Note to self: word clouds are cute but probably not the most interesting thing to show to scientists! If I run this again, I'll probably do something like analyze Zipf's law graphically and then do a log-log fit.) We added that to the git repo, and then pushed that to github, and then I showed how to use mybinder to spin the repo up in its own execution environment.
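
For the curious, here is a rough sketch of the kind of Zipf's law cell I have in mind; it is not what we ran in the workshop, just an illustration with numpy and matplotlib on an arbitrary text file:

import collections
import numpy as np
import matplotlib.pyplot as plt

# Count word frequencies in some text (the path is illustrative).
words = open('book.txt').read().lower().split()
counts = sorted(collections.Counter(words).values(), reverse=True)

# Zipf's law predicts frequency ~ C / rank**s, a straight line in
# log-log space, so fit the slope with a simple least-squares fit.
ranks = np.arange(1, len(counts) + 1)
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)

plt.loglog(ranks, counts, '.', label='observed')
plt.loglog(ranks, np.exp(intercept) * ranks**slope, label='fit, slope=%.2f' % slope)
plt.xlabel('rank')
plt.ylabel('frequency')
plt.legend()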

Finally, for the last ~hour, I sped ahead and demoed how to use docker-machine from my laptop to spin up a docker host on AWS, construct a basic Dockerfile starting from a base of jupyter/notebook, and then run the repo on a new container using that Dockerfile.

Throughout, we had a lot of discussion and (up until the Docker bit) I think everyone followed along pretty well.

In the end, I think the workshop went pretty well - so far, at least, 5/5 survey responders (of about 15 attendees) said it was a valuable use of their time.

After I left at 3pm to fly up to Bergen for my talk, Tracy Teal went through RMarkdown and knitr, starting from the work-in-progress Data Carpentry reproducibility lesson. (I didn't see that so I'll leave Lex or Tracy to talk about it.)

What would I change next time?

  • I'm not sure if the Jupyter Notebook walkthrough was important. It seemed a bit tedious to me, but maybe that was because it was the second time in two weeks I was teaching it?
  • I shortchanged make a bit, but still got the essential bits across (the dependency graph, and the basic Makefile format).
  • I would definitely have liked to get people more hands-on experience with Docker.
  • I would change the Jupyter notebook analysis to be a bit more science-y, with some graphing and fitting. It doesn't really matter if it's a bit more complicated, since we're copy/pasting, but I think it would be more relevant to the scientists.
  • I would try to more organically introduce RMarkdown as a substitute for the Jupyter bit.

Overall, I'm quite happy with the whole thing, and mybinder continues to work astonishingly well for me.

-titus

by C. Titus Brown at March 27, 2016 10:00 PM

Filipe Saraiva

Workshop de Software Livre 2016 – call for papers and tools

wsl2016-banner

The call for papers and call for tools for the WSL – Workshop de Software Livre (Workshop on Free Software), the academic conference held together with FISL – Fórum Internacional de Software Livre (International Free Software Forum) is open!

WSL publishes scientific papers on several topics of interest for free and open source software communities, like social dynamics, management, development processes, motivations of contributors communities, adoption and case studies, legal and economic aspects, social and historical studies, and more.

This edition of WSL has a specific call for tools to publish papers describing software. This dedicated call ensures that the software described has been peer-reviewed, is consistent with the principles of FLOSS, and that its source code will be preserved and remain accessible for a long time.

All accepted and presented papers will be published in the WSL open access repository. This year we are working hard to provide an ISSN and DOIs for the publications.

The deadline is April 10. Papers can be submitted in Portuguese, English or Spanish.

See the official WSL page to get more information.

by Filipe Saraiva at March 27, 2016 08:34 PM

March 24, 2016

Fabian Pedregosa

Lightning v0.1

Announcing the first public release of lightning, a library for large-scale linear classification, regression and ranking in Python. The library was started a couple of years ago by Mathieu Blondel, who also contributed the vast majority of the source code. I recently joined its development and decided it was about time for a v0.1!

Prebuilt conda packages are available for all operating systems (thank goodness for AppVeyor). More information is available on lightning's website.
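
If you want to give it a quick spin, a minimal sketch might look like the following; I'm assuming the scikit-learn-style CDClassifier estimator and its parameters here, so check lightning's documentation for the definitive API:

from sklearn.datasets import load_digits
from lightning.classification import CDClassifier  # assumed import path

# Fit a coordinate-descent linear classifier on a small toy dataset.
digits = load_digits()
X, y = digits.data, digits.target

clf = CDClassifier(loss="squared_hinge", penalty="l2", max_iter=20)
clf.fit(X, y)
print(clf.score(X, y))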

by Fabian Pedregosa at March 24, 2016 11:00 PM

Continuum Analytics news

DyND Callables: Speed and Flexibility

Posted Thursday, March 24, 2016

Introduction

We've been working hard to improve DyND in a wide variety of ways over the past few months. While there is still a lot of churn in our codebase, now is a good time to show a few basic examples of the great functionality that's already there. The library is available on GitHub at https://github.com/libdynd/libdynd.

Today I want to focus on DyND's callable objects. Much of the code in this post is still experimental and subject to change. Keep that in mind when considering where to use it.

All the examples here will be in C++14 unless otherwise noted. The build configuration should be set up to indicate that C++14, the DyND headers, and the DyND shared libraries should be used. Output from a line that prints something will be shown directly in the source files as comments. DyND also has a Python interface, so several examples in Python will also be included.

Getting Started

DyND's callables are, at the most basic level, functions that operate on arrays. At the very lowest level, a callable can access all of the data, type-defined metadata (e.g. stride information), and metadata for the arrays passed to it as arguments. This makes it possible to use callables for functionality like multiple dispatch, views based on stride manipulation, reductions, and broadcasting. The simplest case is using a callable to wrap a non-broadcasting non-dispatched function call so that it can be used on scalar arrays.

Here's an example of how to do that:

#include <iostream>
// Main header for DyND arrays:
#include <dynd/array.hpp>
// Main callable object header:
#include <dynd/callable.hpp>

using namespace dynd;

// Write a function to turn into a DyND callable.
double f(double a, double b) {
  return a * (a - b);
}

// Make the callable.
nd::callable f_callable = nd::callable(f);

// Main function to show how this works:
int main() {
  // Initialize two arrays containing scalar values.
  nd::array a = 1.;
  nd::array b = 2.;

  // Print the dynamic type signature of the callable f_callable.
  std::cout << f_callable << std::endl;
  // <callable <(float64, float64) -> float64> at 000001879424CF60>

  // Call the callable and print its output.
  std::cout << f_callable(a, b) << std::endl;
  //array(-1,
  //      type="float64")
}


The constructor for dynd::nd::callable does most of the work here. Using some interesting templating mechanisms internally, it is able to infer the argument types and return type for the function, select the corresponding DyND types, and form a DyND type that represents an analogous function call. The result is a callable object that wraps a pointer to the function f and knows all of the type information about the pointer it is wrapping. This callable can only be used with input arrays that have types that match the types for the original function's arguments.

The extra type information contained in this callable is "(float64, float64) -> float64", as can be seen when the callable is printed. The syntax here comes from the datashape data description system—the same type system used by Blaze, Odo, and several other libraries.

One key thing to notice here is that the callable created now does its type checking dynamically rather than at compile time. DyND has its own system of types that is used to represent data and the functions that operate on it at runtime. While this does have some runtime cost, dynamic type checking removes the requirement that a C++ compiler verify the types for every operation. The dynamic nature of the DyND type system makes it possible to write code that operates in a generic way on both builtin and user-defined types in both static and dynamic languages. I'll leave discussion of the finer details of the DyND type system for another day though.

Broadcasting

DyND has other functions that make it possible to add additional semantics to a callable. These are higher-order functions (functions that operate on other functions), and they are used on existing callables rather than function pointers. The types for these functions are patterns that can be matched against a variety of different argument types.

Things like array broadcasting, reductions, and multiple dispatch are all currently available. In the case of broadcasting and reductions, the new callable calls the wrapped function many times and handles the desired iteration structure over the arrays itself. In the case of multiple dispatch, different implementations of a function can be called based on the types of the inputs. DyND's multiple dispatch semantics are currently under revision, so I'll just show broadcasting and reductions here.

DyND provides broadcasting through the function dynd::nd::functional::elwise. It follows broadcasting semantics similar to those followed by NumPy's generalized universal functions—though it is, in many ways, more general. The following example shows how to use elwise to create a callable that follows broadcasting semantics:

// Include <cmath> to get std::exp.
#include <cmath>
#include <iostream>

#include <dynd/array.hpp>
#include <dynd/callable.hpp>
// Header containing dynd::nd::functional::elwise.
#include <dynd/func/elwise.hpp>

using namespace dynd;

double myfunc_core(double a, double b) {
  return a * (a - b);
}

nd::callable myfunc = nd::functional::elwise(nd::callable(myfunc_core));

int main() {
  // Initialize some arrays to demonstrate broadcasting semantics.
  // Use brace initialization from C++11.
  nd::array a{{1., 2.}, {3., 4.}};
  nd::array b{5., 6.};
  // Create an additional array with a ragged second dimension as well.
  nd::array c{{9., 10.}, {11.}};

  // Print the dynamic type signature of the callable.
  std::cout << myfunc << std::endl;
  // <callable <(Dims... * float64, Dims... * float64) -> Dims... * float64>
  //  at 000001C223FC5BE0>

  // Call the callable and print its output.
  // Broadcast along the rows of a.
  std::cout << myfunc(a, b) << std::endl;
  // array([[-4, -8], [-6, -8]],
  //       type="2 * 2 * float64")

  // Broadcast the second row of c across the second row of a.
  std::cout << myfunc(a, c) << std::endl;
  // array([[ -8, -16], [-24, -28]],
  //       type="2 * 2 * float64")

}


A similar function can be constructed in Python using DyND's Python bindings and Python 3's function type annotations. If Numba is installed, it is used to get JIT-compiled code that has performance relatively close to the speed of the code generated by the C++ compiler.

from dynd import nd, ndt

@nd.functional.elwise
def myfunc(a: ndt.float64, b: ndt.float64) -> ndt.float64:
    return a * (a - b)
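
As a rough usage sketch (assuming nd.array accepts nested Python lists, mirroring the C++ brace initialization above), the broadcasting behavior is the same as in the C++ example:

# Hypothetical usage of the elwise-wrapped callable defined above.
a = nd.array([[1., 2.], [3., 4.]])
b = nd.array([5., 6.])
print(myfunc(a, b))   # b is broadcast along the rows of a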


Reductions

Reductions are formed from functions that take two inputs and produce a single output. Examples of reductions include taking the sum, max, min, and product of the items in an array. Here we'll work with a reduction that takes the maximum of the absolute values of the items in an array. In DyND this can be implemented by using nd::functional::reduction on a callable that takes two floating point inputs and returns the maximum of their absolute values. Here's an example:

// Include <algorithm> to get std::max.
#include <algorithm>
// Include <cmath> to get std::abs.
#include <cmath>
#include <iostream>

#include <dynd/array.hpp>
#include <dynd/callable.hpp>
// Header containing dynd::nd::functional::reduction.
#include <dynd/func/reduction.hpp>

using namespace dynd;

// Wrap the function as a callable.
// Then use dynd::nd::functional::reduction to make a reduction from it.
// This time just wrap a C++ lambda function rather than a pointer
// to a different function.
nd::callable inf_norm = nd::functional::reduction(nd::callable(
        [](double a, double b) { return std::max(std::abs(a), std::abs(b));}));

// Demonstrate the reduction working along both axes simultaneously.
int main() {
  nd::array a{{1., 2.}, {3., 4.}};

  // Take the maximum absolute value over the whole array.
  std::cout << inf_norm(a) << std::endl;
  // array(4,
  //       type="float64")
}


Again, in Python, it is relatively easy to create a similar callable.

from dynd import nd, ndt

@nd.functional.reduction
def inf_norm(a: ndt.float64, b: ndt.float64) -> ndt.float64:
    return max(abs(a), abs(b))


The type for the reduction callable inf_norm is a bit longer. It is (Dims... * float64, axes: ?Fixed * int32, identity: ?float64, keepdims: ?bool) -> Dims... * float64. This signature represents a callable that accepts a single input array and has several optional keyword arguments. In Python, passing keyword arguments to callables works the same as it would for any other function. Currently, in C++, initializer lists mapping strings to values are used since the names of the keyword arguments are not necessarily known at compile time.
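
As an illustration, the Python version might be called like this; the keyword names come from the printed signature above, so treat the exact accepted values as an assumption rather than documented behavior:

# Reduce over the whole array, then only along the first axis,
# keeping the reduced dimension (keyword names from the signature above).
a = nd.array([[1., 2.], [3., 4.]])
print(inf_norm(a))
print(inf_norm(a, axes=[0], keepdims=True))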

Exporting Callables to Python

The fact that DyND callables are C++ objects with a single C++ type makes it easy to wrap them for use in Python. This is done using the wrappers for the callable class already built in to DyND's Python bindings. Using DyND in Cython merits a discussion of its own, so I'll only include a minimal example here.

This Cython code in particular is still using experimental interfaces. The import structure and function names here are very likely to change.

The first thing needed is a header that creates the desired callable. Since this will only be included once in a single Cython based module, additional guards to make sure the header is only applied once are not needed.

// inf_norm_reduction.hpp
#include <algorithm>
#include <cmath>

#include <dynd/array.hpp>
#include <dynd/callable.hpp>
#include <dynd/func/reduction.hpp>

static dynd::nd::callable inf_norm =
    dynd::nd::functional::reduction(dynd::nd::callable(
        [](double a, double b) { return std::max(std::abs(a), std::abs(b));}));


The callable can now be exposed to Python through Cython. Some work still needs to be done in DyND's Python bindings to simplify the system-specific configuration for linking extensions like this to the DyND libraries. For simplicity, I'll just show the commented Cython distutils directives that can be used to build this file on 64 bit Windows with a libdynd built and installed from source in the default location. Similar configurations can be put together for other systems.

# py_inf_norm.pyx
# distutils: include_dirs = "c:/Program Files/libdynd/include"
# distutils: library_dirs = "c:/Program Files/libdynd/lib"
# distutils: libraries = ["libdynd", "libdyndt"]

from dynd import nd, ndt

from dynd.cpp.callable cimport callable as cpp_callable
from dynd.nd.callable cimport dynd_nd_callable_from_cpp

cdef extern from "inf_norm_reduction.hpp" nogil:
    # Have Cython call this "cpp_inf_norm", but use "inf_norm" in
    # the generated C++ source.
    cpp_callable cpp_inf_norm "inf_norm"

py_inf_norm = dynd_nd_callable_from_cpp(cpp_inf_norm)


To build the extension I used the following setup file and ran it with the command python setup.py build_ext --inplace.

# setup.py
# This is a fairly standard setup script for a minimal Cython module.
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules=cythonize("py_inf_norm.pyx", language='c++'))


A Short Benchmark

elwise and reduction are not heavily optimized yet, but, using IPython's timeit magic, it's clear that DyND is already doing well:

In [1]: import numpy as np

In [2]: from py_inf_norm import py_inf_norm

In [3]: a = np.random.rand(10000, 10000)

In [4]: %timeit py_inf_norm(a)
1 loop, best of 3: 231 ms per loop

In [5]: %timeit np.linalg.norm(a.ravel(), ord=np.inf)
1 loop, best of 3: 393 ms per loop


These are just a few examples of the myriad of things you can do with DyND's callables. For more information take a look at our github repo as well as libdynd.org. We'll be adding a lot more functionality, documentation, and examples in the coming months.

by swebster at March 24, 2016 02:28 PM

March 23, 2016

Mark Fenner

SVD Computation Capstone

At long last, I’ve gotten to my original goal: writing a relatively easy to understand version of your plain, vanilla SVD computation. I’m going to spend just a few minutes glossing (literally) over the theory behind the code and then dive into some implementation that builds on the previous posts. Thanks for reading! Words of […]

by Mark Fenner at March 23, 2016 03:03 PM

March 22, 2016

Matthieu Brucher

Analog modeling of a diode clipper (1): Circuits

A few years ago I published an emulation of the SD1 pedal, but I haven’t touched analog modeling since. There are lots of different methods to model a circuit, and they all have different advantages and drawbacks. So I’ve decided to start from scratch again, using two different diode clippers, going from the continuous equations to different numerical solutions in a series of blog posts here.

First clipper

Let’s start with the first circuit, which I originally implemented in Audio Toolkit.

Diode clipper 1

It consists of a resistor, a capacitor and antiparallel diodes. What is interesting about this circuit is that for a constant input, the output is actually null.

V_i - 2 R_1 I_s \sinh\left(\frac{V_o}{n V_t}\right) - \int \frac{2 I_s}{C_1} \sinh\left(\frac{V_o}{n V_t}\right) \, dt - V_o = 0

Second clipper

The second circuit is a variation of the first one:

Diode Clipper 2

More or less, it’s a first-order low-pass filter that is clipped with antiparallel diodes. The first result is that for a constant input, there is a non-null output.

\frac{dV_o}{dt} = \frac{V_i - V_o}{R_1 C_1} - \frac{2 I_s}{C_1} \sinh\left(\frac{V_o}{n V_t}\right)

The two equations are quite different: where the first one has an integral, the second one uses a derivative. This should make them interesting to discretize and compare.
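
To make this concrete, here is a minimal sketch in Python of the simplest possible scheme, a forward Euler discretization of the second equation; the component and diode values are placeholders I picked for illustration, not values from these circuits:

# Forward Euler integration of
#   dVo/dt = (Vi - Vo)/(R1*C1) - (2*Is/C1)*sinh(Vo/(n*Vt)).
import numpy as np

R1, C1 = 10e3, 10e-9           # resistor (ohms), capacitor (farads) - placeholders
Is, n, Vt = 1e-12, 1.0, 26e-3  # diode saturation current, ideality factor, thermal voltage
fs = 96000.0                   # sample rate (Hz)
dt = 1.0 / fs

def clipper2_euler(vi):
    vo = np.zeros_like(vi)
    for i in range(1, len(vi)):
        dvo = (vi[i-1] - vo[i-1]) / (R1 * C1) \
              - (2 * Is / C1) * np.sinh(vo[i-1] / (n * Vt))
        vo[i] = vo[i-1] + dt * dvo
    return vo

t = np.arange(0, 0.01, dt)
out = clipper2_euler(0.5 * np.sin(2 * np.pi * 1000 * t))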

Conclusion

The equations are simple enough so that we can try different numerical methods on them. They are still too complex to get an analytical solution (no closed-form solution), so we have to use more or less complex numerical algorithms to get an approximation of the result.

And we will start working on this in a future post.


by Matt at March 22, 2016 08:08 AM

March 16, 2016

Matthew Rocklin

Dask EC2 Startup Script

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

A screencast version of this post is available here: https://youtu.be/KGlhU9kSfVk

Summary

Copy-pasting the following commands gives you a Dask cluster on EC2.

pip install dec2
dec2 up --keyname YOUR-AWS-KEY-NAME \
        --keypair ~/.ssh/YOUR-AWS-KEY-FILE.pem \
        --count 9 \
        --nprocs 8  # nine nodes, eight separate worker processes per node

dec2 ssh            # SSH into head node
ipython             # Start IPython console on head node
from distributed import Executor, s3, progress
e = Executor('127.0.0.1:8786')
df = s3.read_csv('dask-data/nyc-taxi/2015', lazy=False)
progress(df)
df.head()

You will have to use your own AWS credentials, but you’ll get fast distributed Pandas access on the NYCTaxi data across a cluster, loaded from S3.

Motivation

Reducing barriers to entry enables curious play.

Curiosity drives us to play with new tools. We love the idea that previously difficult tasks will suddenly become easy, expanding our abilities and opening up a range of newly solvable problems.

However, as our problems grow more complex our tools grow more cumbersome and setup costs increase. This cost stops us from playing around, which is a shame, because playing is good both for the education of the user and for the development of the tool. Tool makers who want feedback are strongly incentivized to decrease setup costs, especially for the play case.

In February we introduced dask.distributed, a lightweight distributed computing framework for Python. We focused on processing data with high level abstractions like dataframes and arrays in the following blogposts:

  1. Analyze GitHub JSON record data in S3
  2. Use Dask DataFrames on CSV data in HDFS
  3. Process NetCDF data with Dask arrays on a traditional cluster

Today we present a simple setup script to launch dask.distributed on EC2, enabling any user with AWS credentials to repeat these experiments easily.

dec2

Devops tooling and EC2 to the rescue

DEC2 does the following:

  1. Provisions nodes on EC2 using your AWS credentials
  2. Installs Anaconda on those nodes
  3. Deploys a dask.distributed Scheduler on the head node and Workers on the rest of the nodes
  4. Helps you to SSH into the head node or connect from your local machine
$ pip install dec2
$ dec2 up --help
Usage: dec2 up [OPTIONS]

Options:
  --keyname TEXT                Keyname on EC2 console  [required]
  --keypair PATH                Path to the keypair that matches the keyname [required]
  --name TEXT                   Tag name on EC2
  --region-name TEXT            AWS region  [default: us-east-1]
  --ami TEXT                    EC2 AMI  [default: ami-d05e75b8]
  --username TEXT               User to SSH to the AMI  [default: ubuntu]
  --type TEXT                   EC2 Instance Type  [default: m3.2xlarge]
  --count INTEGER               Number of nodes  [default: 4]
  --security-group TEXT         Security Group Name  [default: dec2-default]
  --volume-type TEXT            Root volume type  [default: gp2]
  --volume-size INTEGER         Root volume size (GB)  [default: 500]
  --file PATH                   File to save the metadata  [default: cluster.yaml]
  --provision / --no-provision  Provision salt on the nodes  [default: True]
  --dask / --no-dask            Install Dask.Distributed in the cluster [default: True]
  --nprocs INTEGER              Number of processes per worker  [default: 1]
  -h, --help                    Show this message and exit.

Note: dec2 was largely built by Daniel Rodriguez

Run

As an example we use dec2 to create a new cluster of nine nodes. Each worker will run with eight processes, rather than using threads.

dec2 up --keyname my-key-name \
        --keypair ~/.ssh/my-key-file.pem \
        --count 9 \
        --nprocs 8      # nine nodes, eight separate worker processes per node

Connect

We ssh into the head node and start playing in an IPython terminal:

localmachine:~$ dec2 ssh                          # SSH into head node
ec2-machine:~$ ipython                            # Start IPython console
In [1]: from distributed import Executor, s3, progress
In [2]: e = Executor('127.0.0.1:8786')
In [3]: e
Out[3]: <Executor: scheduler=127.0.0.1:8786 workers=64 threads=64>

Notebooks

Alternatively we set up a globally visible Jupyter notebook server:

localmachine:~$ dec2 dask-distributed address    # Get location of head node
Scheduler Address: XXX:XXX:XXX:XXX:8786

localmachine:~$ dec2 ssh                          # SSH into head node
ec2-machine:~$ jupyter notebook --ip="*"          # Start Jupyter Server

Then navigate to http://XXX:XXX:XXX:XXX:8888 from your local browser. Note, this method is not secure, see Jupyter docs for a better solution.

Public datasets

We repeat the experiments from our earlier blogpost on NYCTaxi data.

df = s3.read_csv('dask-data/nyc-taxi/2015', lazy=False)
progress(df)
df.head()
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2015-01-15 19:05:39 2015-01-15 19:23:42 1 1.59 -73.993896 40.750111 1 N -73.974785 40.750618 1 12.0 1.0 0.5 3.25 0 0.3 17.05
1 1 2015-01-10 20:33:38 2015-01-10 20:53:28 1 3.30 -74.001648 40.724243 1 N -73.994415 40.759109 1 14.5 0.5 0.5 2.00 0 0.3 17.80
2 1 2015-01-10 20:33:38 2015-01-10 20:43:41 1 1.80 -73.963341 40.802788 1 N -73.951820 40.824413 2 9.5 0.5 0.5 0.00 0 0.3 10.80
3 1 2015-01-10 20:33:39 2015-01-10 20:35:31 1 0.50 -74.009087 40.713818 1 N -74.004326 40.719986 2 3.5 0.5 0.5 0.00 0 0.3 4.80
4 1 2015-01-10 20:33:39 2015-01-10 20:52:58 1 3.00 -73.971176 40.762428 1 N -74.004181 40.742653 2 15.0 0.5 0.5 0.00 0 0.3 16.30
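
From here any follow-up computation runs across the cluster. As a rough sketch (assuming the early-2016 dask.distributed API, where Executor.compute submits a lazy dask expression and returns a future, and assuming payment_type 1 denotes credit card in this dataset):

# Estimate the overall tip fraction for credit-card trips (illustrative only).
card = df[df.payment_type == 1]
tips = card.tip_amount.sum()
fares = card.fare_amount.sum()
future = e.compute(tips / fares)   # submit the lazy expression to the cluster
progress(future)
future.result()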

Acknowledgments

The dec2 startup script is largely the work of Daniel Rodriguez. Daniel usually works on Anaconda for cluster management, which is normally part of an Anaconda subscription but is also free for moderate use (4 nodes). It does things similar to dec2, but much more maturely.

DEC2 was inspired by the excellent spark-ec2 setup script, which is how most Spark users, myself included, were first able to try out the library. The spark-ec2 script empowered many new users to try out a distributed system for the first time.

The S3 work was largely done by Hussain Sultan (Capital One) and Martin Durant (Continuum).

What didn’t work

  • Originally we automatically started an IPython console on the local machine and connected it to the remote scheduler. This felt slick, but was error prone due to mismatches between the user’s environment and the remote cluster’s environment.
  • It’s tricky to replicate functionality that’s part of a proprietary product (Anaconda for cluster management). Fortunately, Continuum management has been quite supportive.
  • There aren’t many people in the data science community who know Salt, the system that backs dec2. I expect maintenance to be a bit tricky moving forward, especially during periods when Daniel and other developers are occupied.

March 16, 2016 12:00 AM

March 15, 2016

Mark Fenner

Householder Bidiagonalization

We’re going to make use of Householder reflections to turn a generic matrix into a bidiagonal matrix. I’ve laid a lot of foundation for these topics in the posts I just linked. Check them out! Onward. Reducing a Generic Matrix to a Bidiagonal Matrix Using Householder Reflections Why would we want to do this? Well, […]

by Mark Fenner at March 15, 2016 02:03 PM

March 12, 2016

Mark Fenner

Givens Rotations and the Case of the Blemished Bidiagonal Matrix

Last time, we looked at using Givens rotations to perform a QR factorization of a matrix. Today, we’re going to be a little more selective and use Givens rotations to walk values off the edge of a special class of matrices Old Friends Here are few friends that we introduced last time. I updated the […]

by Mark Fenner at March 12, 2016 04:27 PM

March 10, 2016

William Stein

Open source is now ready to compete with Mathematica for use in the classroom



When I think about what makes SageMath different, one of the most fundamental things is that it was created by people who use it every day.  It was created by people doing research math, by people teaching math at universities, and by computer programmers and engineers using it for research.  It was created by people who really understand computational problems because we live them.  We understand the needs of math research, teaching courses, and managing an open source project that users can contribute to and customize to work for their own unique needs.

The tools we were using, like Mathematica, are clunky, very expensive, and just don't do everything we need.  And worst of all, they are closed source software, meaning that you can't even see how they work, and can't modify them to do what you really need.  For teaching math, professors get bogged down scheduling computer labs and arranging for their students to buy and install expensive software.

So I started SageMath as an open source project at Harvard in 2004, to solve the problem that other math software is expensive, closed source, and limited in functionality, and to create a powerful tool for the students in my classes.  It wasn't initially intended as something to be used by hundreds of thousands of people.  But as I got into the project and as more professors and students started contributing to the project, I could clearly see that these weren't just problems that pissed me off, they were problems that made everyone angry.

The scope of SageMath rapidly expanded.  Our mission evolved to create a free, open source, serious competitor to Mathematica and similar closed software that the mathematics community was collectively spending hundreds of millions of dollars on every year. After a decade of work by over 500 contributors, we made huge progress.

But installing SageMath was more difficult than ever.  It was at that point that I decided I needed to do something so that this groundbreaking software that people desperately needed could be shared with the world.

So I created SageMathCloud, which is an extremely powerful web-based collaborative way for people to easily use SageMath and other open source software such as LaTeX, R, and Jupyter notebooks in their teaching and research.  I created SageMathCloud based on nearly two decades of experience using math software in the classroom and online, at Harvard, UC San Diego, and the University of Washington.

SageMathCloud is commercial grade, hosted in Google's cloud, and very large classes are using it heavily right now.  It solves the installation problem by avoiding it altogether.  It is entirely open source.

Open source is now ready to directly compete with Mathematica for use in the classroom.  They told us we could never make something good enough for mass adoption, but we have made something even better.  For the first time, we're making it possible for you to easily use Python and R in your teaching instead of Mathematica; these are industry standard mainstream open source programming languages with strong support from Google, Microsoft and other industry leaders.   For the first time, we're making it possible for you to collaborate in real time and manage your course online using the same cutting edge software used by elite mathematicians at the best universities in the world.

A huge community in academia and in industry are all working together to make open source math software better at a breathtaking pace, and the traditional closed development model just can't keep up.

by William Stein (noreply@blogger.com) at March 10, 2016 08:02 AM

March 08, 2016

Matthieu Brucher

Announcement: Audio TK 1.1.0

This is mainly a bug fix release. A nasty bug triggered when increasing processing sizes would corrupt the input data and thus change the results. It is advised to upgrade to this release as soon as possible.

Download link: ATK 1.1.0

Changelog:

1.1.0
* Fix a really nasty bug when changing processing sizes
* Implemented a basic AllPassFilter (algorithmic reverb)


by Matt at March 08, 2016 08:38 AM

March 07, 2016

Continuum Analytics news

Continuum Analytics Launches Anaconda Skills Accelerator Program to Turn Professionals Into Data Scientists ASAP

Posted Monday, March 7, 2016

Professional Development Residency Program Propels Scientists, Physicists, Mathematicians and Engineers into Data Science Careers

AUSTIN, TX—March 7, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading modern open source analytics platform powered by Python, today launched its Anaconda Skills Accelerator Program (ASAP). The new professional development residency program, led by a team of world-class experts, is highly selective and brings together scientists, physicists, mathematicians and engineers to tackle challenging real-world projects. Participants will use advanced technology and pragmatic techniques across a broad set of technical areas, including data management, analytic modeling, visualization, high performance computing on multi-core CPUs and GPUs, vectorization and modern distributed computing techniques such as Spark and Dask.

“One of the greatest challenges affecting our industry is the disconnect between the data science needs and expectations of enterprises and the current skills of today’s scientists, mathematicians and engineers,” said Travis Oliphant, CEO and co-founder of Continuum Analytics. “Fostering true data science experts requires more than traditional training programs. Continuum Analytics’ ASAP empowers professionals with the necessary technology and hands-on experience to tackle big data challenges and solve the data problems that are inside every organization today.”

Participants emerge from this residency program as Certified Anaconda Professionals (CAP), enabling them to manage challenging data science and data engineering problems in business, science, engineering, data analytics, data integration and machine learning.

Advantages to CAP Designation

  • Consulting and placement opportunities with Continuum Analytics clients

  • Membership to the CAP Alumni Association

  • Individual subscription to premium Anaconda packages usually only available as part of Anaconda Workgroup or Anaconda Enterprise

  • Complimentary one year subscription to priority Anaconda support (only available for companies who put their teams of three or more through the program)

  • A 50 percent discounted pass to the annual Anaconda conference

For additional information about the Continuum Analytics’ ASAP, qualifications and process, please visit: http://continuum.io/asap.

About Continuum Analytics

Continuum Analytics is the creator and driving force behind Anaconda, the leading, modern open source analytics platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 2.25M downloads annually and growing, Anaconda is trusted by the world’s leading businesses across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team – data scientists, developers, DevOps, data engineers, and business analysts – to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services. Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

To learn more about Continuum Analytics, visit www.continuum.io.

###

by pcudia at March 07, 2016 01:38 PM

NeuralEnsemble

Released - BluePyOpt 0.2 : Leveraging OSS and the cloud to optimise models to neuroscience data

BluePyOpt


The BlueBrain Python Optimisation Library (BluePyOpt) is an extensible framework for data-driven model parameter optimisation that wraps and standardises several existing open-source tools. It simplifies the task of creating and sharing these optimisations, and the associated techniques and knowledge. This is achieved by abstracting the optimisation and evaluation tasks into various reusable and flexible discrete elements according to established best-practices. Further, BluePyOpt provides methods for setting up both small- and large-scale optimisations on a variety of platforms, ranging from laptops to Linux clusters and cloud-based compute infrastructures.

The code is available here:
https://github.com/BlueBrain/BluePyOpt
A preprint to the paper is available here:
http://arxiv.org/abs/1603.00500

by eilif (noreply@blogger.com) at March 07, 2016 11:47 AM

March 05, 2016

Fabian Pedregosa

March 04, 2016

Continuum Analytics news

Open Data Science Is the David Beckham of Your Data Team

Posted Friday, March 4, 2016

If data scientists were professional athletes, they’d be ditching their swim caps and donning cleats right about now.

Traditional data science teams function like swim teams. Although all members strive for one goal — winning the meet — each works individually, concentrating on his or her isolated lane or heat. As each swimmer finishes his or her heat, the team’s score is tallied. If enough team members hit their individual goals, then the team succeeds. But team members’ efforts are isolated and desynchronized, and some members might pull off incredible times, while others lag behind in their heats.

Today, the industry is moving toward open data science (ODS), which enables data scientists to function more like a soccer team: Team members aren’t restricted to their own lanes but are instead free to move about the field. Even goalies can score in soccer, and ODS developers are similarly encouraged to contribute wherever their skills intersect with development challenges. No team members are relegated to “second-string” or niche roles: In a given project, developers might contribute to model building, while domain experts like quants offer insights about code structure or visualization. As with soccer, the pace is fast, and the process is engaging and fun.

Why ODS?

Data is the indisputable king of the modern economy, but too many data scientists still function like swimmers, each working with his own set of tools to manage heaps of data. To work as a true team, data scientists (and their tools) must function together.

ODS is the revolution rising to meet this challenge. Instead of forcing data scientists to settle on a single language and proprietary toolset, open data science champions inclusion. Just as ODS encourages data scientists to function as one, it also recognizes the potential of open source tools — for data, analytics and computation — to form a connected, collaborative ecosystem.

Developers, analytics and domain experts adhering to ODS principles gain these key advantages over traditional and proprietary approaches:

  • Availability: Open source tools offer many packages and approaches within an ecosystem to solve problems, making it faster to build solutions.
  • Innovation: The lack of vendor lock-in means the developer community is continuously updating tools, which is ideal for collaborative workflows and quick-to-fail projects.
  • Interoperability: Rather than pigeonholing developers into a single language or set of tools, ODS embraces a rich ecosystem of compatible open source tools.
  • Transparency: Open data science tools facilitate transparency between teams, emphasizing innovations like the Jupyter/IPython notebook that allows developers to easily share code and visualizations.

ODS Is Making Analytics More Precise

Over the past few years — largely thanks to the growth of ODS — the pace of innovation has quickened in the analytics field. Now, researchers and academics immediately release their algorithms to the open source community, which has empowered data scientists using these tools to become ever more granular in their analyses.

Because open data science has provided more ways to build, deploy and consume analytics, businesses can know their customers more personally than ever before.

Previously, a sales analysis might have yielded insights like “Women over 50 living in middle-class households are most likely to buy this product.” Now, these analyses are shedding ever-greater light on buyer personas. Today, an analyst might tell marketers, “Stay-at-home moms in suburban areas who tend to shop online between 3 p.m. and 6 p.m. are more likely to buy the premium product than the entry-level one.”

Notice the difference? The more precise the analysis is, the greater its use to business professionals. Information about where, how and when buyers purchase a product are invaluable insights for marketers.

ODS has also broadened data scientists’ options for visualization. Let’s say a telecommunications company needs to illustrate rotational churn in a way that shows which customers are likely to leave over a period of time. Perhaps the company is using Salesforce, but Salesforce doesn’t offer the rotational churn analysis the business is looking for. If the business’ data scientist is using ODS tools, then that data scientist can create a model for the rotational churn and embed it into Salesforce. The analysis can include rich, contextual visualization that clearly illustrates customers’ rotational churn patterns.

Data Science Isn’t One-Size-Fits-All

Just like there’s no one right way to run a business or market a product, there’s no one right way to do data science.

For instance, one data scientist might employ a visual component framework, while another accomplishes the same task with a command line interface or integrated language development environment. Business analysts and data engineers in the data science team may prefer spreadsheets and visual exploration.

Thanks to this flexibility, ODS has become the foundation of modern predictive analytics and one platform has emerged to manage its tools successfully: Anaconda.

Anaconda is the leading open source analytics ecosystem and full-stack ODS platform for Python, R, Scala and many other data science languages. It adeptly manages packages, dependencies and environments for multiple languages.

Data science teams using Anaconda can share models and results through Jupyter/IPython notebooks and can utilize an enormous trove of libraries for analytics and visualization. Furthermore, Anaconda supercharges Python to deliver high performance analytics for data science teams that want to scale their prototypes up and out without worrying about incurring delays when moving to production.

With Anaconda, data science teams decrease project times by better managing and illustrating data sets and offering clearer insights to business professionals. Get your data science team out of the pool by making the switch to ODS and Anaconda. Before you know it, your data scientists will be scoring goals and on their way to the World Cup.

If you are interested in learning more about this topic and are attending the Gartner BI & Analytics Summit in Grapevine, TX, join me in our session on Why Open Data Science Matters, Tuesday, March 15th from 10:45-11:30am CT.

by pcudia at March 04, 2016 05:16 PM

March 03, 2016

Mark Fenner

Givens Rotations and QR

Today I want to talk about Givens rotations. Givens rotations are a generalization of the rotation matrix you might remember from high school trig class. Instead of rotating in the plane of a 2D matrix, we can rotated in any plane of a larger dimension matrix. We’ll use these rotations to selectively place zeros in […]

by Mark Fenner at March 03, 2016 01:06 PM

March 02, 2016

Continuum Analytics news

Introducing Constructor 1.0

Posted Wednesday, March 2, 2016

I am excited to announce version 1.0 of the constructor project. Constructor combines a collection of conda packages into a standalone installer similar to the Anaconda installer. Using constructor you can create installers for all the platforms conda supports, namely Linux, Mac OS X and Windows.

  • The Linux and Mac installers are self extracting bash scripts. They consist of a bash header, and a tar archive. Besides providing some help options, and an install dialog, the bash header extracts the tar archive, which in turn consists of conda packages.
  • The Windows installer is an executable (.exe), which when executed opens up a dialog box, which walks the user through the install process and installs the conda packages contained within the executable. This is all achieved by relying on the NSIS (Nullsoft Scriptable Install System) under the cover.

This is the first public release of this project, which was previously proprietary and known as "cas-installer".

As constructor relies on conda, it needs to be installed into the root conda environment:

conda install -n root constructor

Constructor builds an installer for the current platform by default, and can also build an installer for other platforms, although Windows installers must be created on Windows and all Unix installers must be created on some Unix platform.

The constructor command takes an installer specification directory as its argument. This directory needs to contain a file construct.yaml, which specifies information like the name and version of the installer, the conda channels to pull packages from, the conda packages included in the installer, and so on. The complete list of keys in this file can be found in the documentation. The directory may also contain some optional files such as a license file and image files for the Windows installer.

We created a documented example to demonstrate how to build installers for Linux, OS X and Windows that are similar to Anaconda installers, but significantly smaller.

Have fun!

by ryanwh at March 02, 2016 06:39 PM

February 28, 2016

Titus Brown

An e-mail from Seven Bridges Genomics re patents and patent applications

Preface from Titus: this is an e-mail written by Deniz Kural of Seven Bridges Genomics in response to concerns and accusations about their patents and patent applications on genomics workflow engines and graph genome analysis techniques. It was sent to a closed list initially, and I asked Deniz if we could publicize it; he agreed. Note that I removed the messages to which he was replying, given the closed nature of the original list, and Deniz lightly edited the e-mail to remove some personal details and fix discontinuity.

Here are some references to the Twitter conversation: Variant reference graphs patented!?, looks like Seven Bridges is trying to patent Galaxy or Taverna?

Please remember to abide by the commenting policy on this blog if you decide to post a comment here.


Dear All,

Firstly, thank you for the considered discussion and civility on this thread. In this email, I've tried to cover why SBG has a patent practice & next steps with GA4GH; and what the patents cover / prior art concerns.

Purpose of SBGs patents:

Developed economies have a complicated relationship to patents; as one of the few instruments underpinning the wealth these knowledge economies create. However, science requires unfettered freedom to build upon previous ideas, and open communication. So what to do?

Seven Bridges was founded in 2009, with our name a nod to graph theory (Euler). We've avoided patents until 2013, and having been made aware of various industrial and academic institutions pursuing bioinformatics and cloud patents, have decided to pursue our own patents as an active, operating company in a field with some publicly traded entities aggressively prosecuting their IP claims, as a protective measure. The patents listed in there have been submitted before the existence of the GA4GH mailing list. Some of the government funding we receive compels us to obtain IP and measures our success on this criteria. We would welcome patent reform, and would rather not spend our funds on patents. We are significantly larger than a lab, but tiny compared to our established software/biotech competitors, and ultimately can't match their legal funds. We'd also love to see the demise of patent trolls. Thus, it is not our intention to limit scientific discussion / publication.

As an example: We've released the source code of our pipeline editor (unrelated to graph genomes) more than a year ago under GPL, and last month decided to also start using an Apache license to make the patent rights of the user more explicit. We'd like to preserve an open space for common formats and standards - common workflow language, and graph genome exchange formats (more on this below). Open source / Apache is a start, but not a universal panacea.

Many of the suggestions on this thread regarding patents and licenses would be a positive step forward, and we would welcome a clarification of GA4GH policies, including overall governance. Although I had the opportunity to discuss our patents & answer some questions on previous phone calls, I also wish that we've had a wider conversation sooner.

What's patented re: Graph Genomes / SBG patent process + prior art

We have an obligation to submit prior art, and precisely for the reasons outlined in this thread. Likewise, it does us no good to obtain and maintain patents that can't realistically hold when challenged. I've tried to explain the content of our patents and a way to incorporate community-submitted prior art below.

None of our patent applications are on "genome graphs" per se - i.e. representing variants or sequence data as a graph genome -- we truly believe that having a common, open representation or specification of graph genomes is needed. This is also the basis of our interest and participation in this group. Indeed, in my first presentation to this group about two summers ago, I've presented an edge graph representation, and a coordinate system that goes along with it (which counts as a public disclosure) thus taking it off from what can be patented. SBG's tools must accept and write out formats commonly used and adopted by the community to gain traction and adoption & use open and free APIs for input and output.

We've pursued patents for improvements that go into our toolchain (aligners, variant callers, etc.) - a good portion while it may not be obvious at first, are tied to producing efficient / fast implementations, having a practical benefit, and another group of patents relate to application patents around not having restrictions to apply graph genomes to various specific domains. Please see below re: on prior art for a specific application or concern.

It presents an interesting challenge on how external counsel could become a graph genome expert and write patents on this area intelligently. We've realized after a full cycle of this, that external counsel often produces lower quality applications due to lack of context and limited time investment, resulting a lot of resources spent downstream. Thus we've been pursuing measures to improve the quality and substance of our process, including hiring in-house last year.

The way our patent process (essentially a chore) now works as follows: Our attorneys sit on our weekly R&D meetings, and then follow-up if anything novel is presented. Currently we have about 45 PhDs and 100 engineers, with 20+ R&D members working on graph-related issues alone, and the IP activity is thus comparable to institutions of similar size. We try not to spend engineering time on detailed patent work -- time better spent to build tools and products. Our scientists are not trained in patent novelty vs scientific novelty, and thus tend to be conservative with disclosures.

Thus, our researchers write a short disclosure to our attorney, who then writes the claims (as a lawyer would), and goes back-and-forth with external entities until a specific set of claims are finalized / approved. For reasons beyond my understanding, the patent system seems to encourage starting with a broad set of claims and working with the patent examiner to make them more specific to exactly what the engineer built.

Our attorneys are obliged to pay attention to prior art. We circulate papers - including pioneering graph genome work, but also from highly related fields of genome assembly and transcriptomics. It's worth noting that many of our submissions are in "application" stage, and indeed may be thrown out. The claims of a final submission, after the review steps, often look very different as outlined above. We also have the USPTO review and point out previous work. That said, more prior art is better, and welcome.

Thus, if a person trained in reading patent claims (or anyone really), feels like we've overlooked work from 2013 (or any of the submission dates) & before, we'd like to hear from you on this issue. Even better to name which claims and applications it relates to, and which sections of the paper. We'll happily submit the evidence to USPTO and remove or revise those claims. Please keep in mind sometimes we need to wait for the USPTO to throw it out depending on which stage we're at. Please do not hesitate to write to ip@sbgenomics.com which goes directly to our in house counsel.

Likewise, we'd like to have an open channel of communication in general, so if you have any other questions or concerns please email on related issues.

Best, Deniz

(Titus sez: Deniz is on Twitter at @denizkural.)

by Deniz Kural at February 28, 2016 11:00 PM

February 26, 2016

NeuralEnsemble

Stop plotting your data -- HoloViews 1.4 released!

We are pleased to announce the fifth public release of HoloViews, a Python package for exploring and visualizing scientific data:

   http://holoviews.org

HoloViews provides composable, sliceable, declarative data structures for building even complex visualizations easily.  Instead of you having to explicitly and laboriously plot your data, HoloViews lets you simply annotate your data so that any part of it visualizes itself automatically.  You can now work with large datasets as easily as you work with simple datatypes at the Python prompt.
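
For a taste of what that looks like in practice, here is a minimal sketch for a Jupyter notebook, assuming the 1.4-era API (hv.notebook_extension and the kdims/vdims keywords); see the tutorials at holoviews.org for the definitive syntax:

import numpy as np
import holoviews as hv
hv.notebook_extension('bokeh')   # enable the new Bokeh backend

xs = np.linspace(0, 10, 200)
# Annotate the data with its dimensions; displaying the object renders it.
curve = hv.Curve((xs, np.sin(xs)), kdims=['x'], vdims=['sin(x)'])
curve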

The new version can be installed using conda:

   conda install -c ioam holoviews

Release 1.4 introduces major new features, incorporating over 1700 new commits and closing 142 issues:

- Now supports both Bokeh (bokeh.pydata.org) and matplotlib backends, with Bokeh providing extensive interactive features such as panning and zooming linked axes, and customizable callbacks

- DynamicMap: Allows exploring live streams from ongoing data collection or simulation, or parameter spaces too large to fit into your computer's or your browser's memory, from within a Jupyter notebook

- Columnar data support: Underlying data storage can now be in Pandas dataframes, NumPy arrays, or Python dictionaries, allowing you to define HoloViews objects without copying or reformatting your data

- New Element types: Area (area under or between curves), Spikes (sequence of lines, e.g. spectra, neural spikes, or rug plots), BoxWhisker (summary of a distribution), QuadMesh (nonuniform rasters), Trisurface (Delaunay-triangulated surface plots)

- New Container type: GridMatrix (grid of heterogeneous Elements)

- Improved layout handling, with better support for varying aspect ratios and plot sizes

- Improved help system, including recursively listing and searching the help for all the components of a composite object

- Improved Jupyter/IPython notebook support, including improved export using nbconvert, and standalone HTML output that supports dynamic widgets even without a Python server

- Significant performance improvements for large or highly nested data

And of course we have fixed a number of bugs found by our very dedicated users; please keep filing Github issues if you find any!

For the full list of changes, see:

   https://github.com/ioam/holoviews/releases

HoloViews is now supported by Continuum Analytics, and is being used in a wide range of scientific and industrial projects.  HoloViews remains freely available under a BSD license, is Python 2 and 3 compatible, and has minimal external dependencies, making it easy to integrate into your workflow. Try out the extensive tutorials at holoviews.org today!

Jean-Luc R. Stevens
Philipp Rudiger
James A. Bednar

Continuum Analytics, Inc., Austin, TX, USA
School of Informatics, The University of Edinburgh, UK

by Jim Bednar (noreply@blogger.com) at February 26, 2016 09:03 PM

Matthew Rocklin

Distributed Dask Arrays

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

In this post we analyze weather data across a cluster using NumPy in parallel with dask.array. We focus on the following:

  1. How to set up the distributed scheduler with a job scheduler like Sun GridEngine.
  2. How to load NetCDF data from a network file system (NFS) into distributed RAM
  3. How to manipulate data with dask.arrays
  4. How to interact with distributed data using IPython widgets

This blogpost has an accompanying screencast which might be a bit more fun than this text version.

This is the third in a sequence of blogposts about dask.distributed:

  1. Dask Bags on GitHub Data
  2. Dask DataFrames on HDFS
  3. Dask Arrays on NetCDF data

Setup

We wanted to emulate the typical academic cluster setup using a job scheduler like SunGridEngine (similar to SLURM, Torque, PBS scripts and other technologies), a shared network file system, and typical binary stored arrays in NetCDF files (similar to HDF5).

To this end we used StarCluster, a quick way to set up such a cluster on EC2 with SGE and NFS, and we downloaded data from the European Centre for Medium-Range Weather Forecasts (ECMWF).

To deploy dask’s distributed scheduler with SGE we made a scheduler on the master node:

sgeadmin@master:~$ dscheduler
distributed.scheduler - INFO - Start Scheduler at:  172.31.7.88:8786

And then used the qsub command to start four dask workers, pointing to the scheduler address:

sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 1 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 2 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 3 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 4 ("dworker") has been submitted

After a few seconds these workers start on various nodes in the cluster and connect to the scheduler.

Load sample data on a single machine

On the shared NFS drive we’ve downloaded several NetCDF3 files, each holding the global temperature every six hours for a single day:

>>> from glob import glob
>>> filenames = sorted(glob('*.nc3'))
>>> filenames[:5]
['2014-01-01.nc3',
 '2014-01-02.nc3',
 '2014-01-03.nc3',
 '2014-01-04.nc3',
 '2014-01-05.nc3']

We use conda to install the netCDF4 library and make a small function to read the t2m variable for “temperature at two meters elevation” from a single filename:

$ conda install netcdf4

import netCDF4

def load_temperature(fn):
    """Read the 't2m' (temperature at two meters) variable from one NetCDF file into a NumPy array."""
    with netCDF4.Dataset(fn) as f:
        return f.variables['t2m'][:]

This converts a single file into a single numpy array in memory. We could call this on an individual file locally as follows:

>>> load_temperature(filenames[0])
array([[[ 253.96238624,  253.96238624,  253.96238624, ...,  253.96238624,
          253.96238624,  253.96238624],
        [ 252.80590921,  252.81070124,  252.81389593, ...,  252.79792249,
          252.80111718,  252.80271452],
          ...
>>> load_temperature(filenames[0]).shape
(4, 721, 1440)

Our dataset has dimensions of (time, latitude, longitude). Note above that each day has four time entries (measurements every six hours).

The NFS drive set up by StarCluster is unfortunately quite small. We were only able to fit around five months of data (136 days) on the shared disk.

Load data across cluster

We want to call the load_temperature function on each entry of our filenames list, spreading the work across our four workers. We connect a dask Executor to our scheduler address and then map our function over the filenames:

>>> from distributed import Executor, progress
>>> e = Executor('172.31.7.88:8786')
>>> e
<Executor: scheduler=172.31.7.88:8786 workers=4 threads=32>

>>> futures = e.map(load_temperature, filenames)
>>> progress(futures)

After this completes we have several numpy arrays scattered about the memory of each of our four workers.

Coordinate with dask.array

We coordinate these many numpy arrays into a single logical dask array as follows:

>>> from distributed.collections import futures_to_dask_arrays
>>> xs = futures_to_dask_arrays(futures)  # many small dask arrays

>>> import dask.array as da
>>> x = da.concatenate(xs, axis=0)        # one large dask array, joined by time
>>> x
dask.array<concate..., shape=(544, 721, 1440), dtype=float64, chunksize=(4, 721, 1440)>

This single logical dask array is comprised of 136 numpy arrays spread across our cluster. Operations on the single dask array will trigger many operations on each of our numpy arrays.

Interact with Distributed Data

We can now interact with our dataset using standard NumPy syntax and other PyData libraries. Below we pull out a single time slice and render it to the screen with matplotlib.

from matplotlib import pyplot as plt
plt.imshow(x[100, :, :].compute(), cmap='viridis')
plt.colorbar()

In the screencast version of this post we hook this up to an IPython slider widget and scroll around time, which is fun.

Speed

We benchmark a few representative operations to look at the strengths and weaknesses of the distributed system.

Single element

This single element computation accesses a single number from a single NumPy array of our dataset. It is bound by a network roundtrip from client to scheduler, to worker, and back.

>>> %time x[0, 0, 0].compute()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 9.72 ms

Single time slice

This time slice computation pulls around 8 MB from a single NumPy array on a single worker. It is likely bound by network bandwidth.

>>> %time x[0].compute()
CPU times: user 24 ms, sys: 24 ms, total: 48 ms
Wall time: 274 ms

Mean computation

This mean computation touches every number in every NumPy array across all of our workers. Computing means is quite fast, so this is likely bound by scheduler overhead.

>>> %time x.mean().compute()
CPU times: user 88 ms, sys: 0 ns, total: 88 ms
Wall time: 422 ms

Interactive Widgets

To make these times feel more visceral we hook up these computations to IPython Widgets.

This first example looks fairly fluid. This only touches a single worker and returns a small result. It is cheap because it indexes in a way that is well aligned with how our NumPy arrays are split up by time.

@interact(time=[0, x.shape[0] - 1])
def f(time):
    return x[time, :, :].mean().compute()

This second example is less fluid because we index across our NumPy chunks. Each computation touches all of our data. It’s still not bad though and quite acceptable by today’s standards of interactive distributed data science.

@interact(lat=[0, x.shape[1] - 1])
def f(lat):
    return x[:, lat, :].mean().compute()

Normalize Data

Until now we’ve only performed simple calculations on our data, usually grabbing out means. The image of the temperature above looks unsurprising. The image is dominated by the facts that land is warmer than oceans and that the equator is warmer than the poles. No surprises there.

To make things more interesting we subtract off the mean and divide by the standard deviation over time. This will tell us how unexpectedly hot or cold a particular point was, relative to all measurements of that point over time. This gives us something like a geo-located Z-Score.

z = (x - x.mean(axis=0)) / x.std(axis=0)
z = e.persist(z)
progress(z)

plt.imshow(z[100].compute(), cmap='RdBu_r')   # index 100 picks one representative time slice
plt.colorbar()

We can now see much more fine structure of the currents of the day. In the screencast version we hook this dataset up to a slider as well and inspect various times.

I’ve avoided displaying GIFs of full images changing in this post to keep the size down, however we can easily render a plot of average temperature by latitude changing over time here:

import numpy as np
latitudes = 90 - np.arange(z.shape[1]) / 4.0   # 721 grid points from 90 to -90 degrees at 0.25 degree spacing

@interact(time=[0, z.shape[0] - 1])
def f(time):
    plt.figure(figsize=(10, 4))
    plt.plot(latitudes, z[time].mean(axis=1).compute())
    plt.ylabel("Normalized Temperature")
    plt.xlabel("Latitude (degrees)")

Conclusion

We showed how to use distributed dask.arrays on a typical academic cluster. I’ve had several conversations with different groups about this topic; it seems to be a common case. I hope that the instructions at the beginning of this post prove to be helpful to others.

It is really satisfying to me to couple interactive widgets with data on a cluster in an intuitive way. This sort of fluid interaction on larger datasets is a core problem in modern data science.

What didn’t work

As always I’ll include a section like this on what didn’t work well or what I would have done with more time:

  • No high-level read_netcdf function: We had to use the mid-level API of executor.map to construct our dask array. This is a bit of a pain for novice users. We should probably adapt existing high-level functions in dask.array to robustly handle the distributed data case. A rough sketch of what such a helper might look like follows this list.
  • Need a larger problem: Our dataset could have fit into a Macbook Pro. A larger dataset that could not have been efficiently investigated from a single machine would have really cemented the need for this technology.
  • Easier deployment: The solution above with qsub was straightforward but not always accessible to novice users. Additionally while SGE is common there are several other systems that are just as common. We need to think of nice ways to automate this for the user.
  • XArray integration: Many people use dask.array on single machines through XArray, an excellent library for the analysis of labeled nd-arrays especially common in climate science. It would be good to integrate this new distributed work into the XArray project. I suspect that doing this mostly involves handling the data ingest problem described above.
  • Reduction speed: The computation of normalized temperature, z, took a surprisingly long time. I’d like to look into what is holding up that computation.
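
For reference, here is a hedged sketch of such a helper, built only from the pieces already used above; the function name and signature are hypothetical and not part of dask:

import dask.array as da
from distributed.collections import futures_to_dask_arrays

def read_netcdf_distributed(executor, filenames, load, axis=0):
    """Hypothetical convenience wrapper: load one array per file on the
    workers with executor.map, then lazily concatenate along `axis`."""
    futures = executor.map(load, filenames)       # one remote NumPy array per file
    arrays = futures_to_dask_arrays(futures)      # many small dask arrays
    return da.concatenate(arrays, axis=axis)      # one logical dask array

# x = read_netcdf_distributed(e, filenames, load_temperature)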

February 26, 2016 12:00 AM

February 25, 2016

Stefan van der Walt

Python & Matplotlib on OSX

One day, we will hopefully have a grand unified build and package management system for Python where everything is free & open and Just Works (TM). Until then, you have two options:

brew Python + pip

  1. brew install python3
  2. pyvenv -v ~/envs/py3
  3. source ~/envs/py3/bin/activate
  4. pip install matplotlib

Pros/cons:

  • Pip is the standard Python package management tool, and uses the official Python Package Index (PyPI) repository.
  • Wheels on PyPI are built by the authors themselves from open recipes.
  • Binary wheels may not be available for all packages.
  • Pip is not the best of package management tools.

Conda

  1. Download and install miniconda
  2. conda create -n py3 python=3.5 matplotlib
  3. source activate py3

Pros/cons:

  • Conda is a great package management tool.
  • Conda environments are well tested; almost everything works out of the box. This includes fast linear algebra using MKL.
  • Some of the conda build recipes are closed, and binaries may not be available for all packages [1].
  • Conda and pip do not always play well together.
  • Conda packages are supported almost exclusively by a single company.

  [1] Some members of the community maintain their own channels, but there are still some issues to be aware of when mixing those channels with the official ones.
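
Whichever route you choose, a quick way to confirm that the install works is something like the following (a minimal sketch; the Agg backend sidesteps OSX GUI framework issues):

import matplotlib
matplotlib.use('Agg')            # render to files without needing an OSX GUI framework
import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4])
plt.savefig('check.png')         # if this writes a PNG, the stack is functional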

by Stefan van der Walt at February 25, 2016 11:00 PM

William Stein

"If you were new faculty, would you start something like SageMathCloud sooner?"

I was recently asked by a young academic: "If you were a new faculty member again, would you start something like SageMathCloud sooner or simply leave for industry?" The academic goes on to say "I am increasingly frustrated by continual evidence that it is more valuable to publish a litany of computational papers with no source code than to do the thankless task of developing a niche open source library; deep mathematical software is not appreciated by either mathematicians or the public."

I wanted to answer that "things have gotten better" since back in 2000 when I started as an academic who does computation. Unfortunately, I think they have gotten worse. I do not understand why. In fact, this evening I just received the most recent in a long string of rejections by the NSF.

Regarding a company versus taking a job in industry, for me personally there is no point in starting a company unless you have a goal that can only be accomplished via a company, since building a business from scratch is extremely hard and has little to do with math or research. I do have such a goal: "create a viable open source alternative to Mathematica, etc...". I was very clearly told by Michael Monagan (co-founder of Maplesoft) in 2006 that this goal could not be accomplished in academia, and I spent the last 10 years trying to prove him wrong.

On the other hand, leaving for a job in industry means that your focus will switch from "pure" research to solving concrete problems that make products better for customers. That said, many of the mathematicians who work on open source math software do so because they care so much about making the experience of using math software much better for the math community. What often drives Sage developers is exactly the sort of passionate care for "consumer focus" and products that also makes one successful in industry. I'm sure you know exactly what I mean, since it probably partly motivates your work. It is sad that the math community turns its back on such people. If the community were to systematically embrace them, instead of losing all these $300K+/year engineers from mathematics entirely -- which is exactly what we do constantly -- the experience of doing mathematics could be massively improved into the future. But that is not what the community has chosen to do. We are shooting ourselves in the foot.

Now that I have seen how academia works from the inside over 15 years I'm starting to understand a little why these things change very slowly, if ever. In the mathematics department I'm at, there are a small handful of research areas in pure math, and due to how hiring works (voting system, culture, etc.) we have spent the last 10 years hiring in those areas little by little (to replace people who die/retire/leave). I imagine most mathematics departments are very similar. "Open source software" is not one of those traditional areas. Nobody will win a Fields Medal in it.

Overall, the mathematical community does not value open source mathematical software in proportion to its worth, and doesn't understand its importance to mathematical research and education. I would like to say that things have got a lot better over the last decade, but I don't think they have. My personal experience is that much of the "next generation" of mathematicians who would have changed how the math community approaches open source software are now in industry, or soon will be, and hence they have no impact on academic mathematical culture. Every one of my Ph.D. students is now at Google/Facebook/etc.

We as a community overall would be better off if, when considering how we build departments, we put "mathematical software writers" on an equal footing with "algebraic geometers". We should systematically consider quality open source software contributions on a potentially equal footing with publications in journals.

To answer the original question, YES, knowing what I know now, I really wish I had started something like SageMathCloud sooner. In fact, here's the previously private discussion from eight years ago when I almost did.

--

- There is a community-generated follow-up ...

by William Stein (noreply@blogger.com) at February 25, 2016 03:14 PM

February 24, 2016

William Stein

Elliptic curves: Magma versus Sage

Elliptic Curves

Elliptic curves are certain types of nonsingular plane cubic curves, e.g., y^2 = x^3 + ax + b, which are central to both number theory and cryptography (e.g., they are used for the digital signatures in Bitcoin).
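
As a tiny illustration of the kind of object involved (a hedged sketch assuming a Sage environment; this particular curve is not one discussed below):

from sage.all import EllipticCurve   # run inside Sage, or a Python session with Sage installed

E = EllipticCurve([-1, 0])           # the curve y^2 = x^3 - x
print(E.conductor())                 # 32
print(E.rank())                      # 0
print(E.torsion_order())             # 4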


Magma and Sage

If you want to do a wide range of explicit computations with elliptic curves, for research purposes, you will very likely use SageMath or Magma. If you're really serious, you'll use both.

Both Sage and Magma are far ahead of all other software (e.g., Mathematica, Maple and Matlab) for elliptic curves.

A Little History

When I started contributing to Magma in 1999, I remember that Magma was way, way behind Pari. I remember having lunch with John Cannon (founder of Magma), and telling him I would no longer have to use Pari if only Magma would have dramatically faster code for computing point counts on elliptic curves.

A few years later, John wisely hired Mark Watkins to work fulltime on Magma, and Mark has been working there for over a decade. Mark is definitely one of the top people in the world at implementing (and using) computational number theory algorithms, and he's ensured that Magma can do a lot. Some of that "do a lot" means catching up with (and surpassing!) what was in Pari and Sage for a long time (e.g., point counting, p-adic L-functions, etc.)

However, in addition, many people have visited Sydney and added extremely deep functionality for doing higher descents to Magma, which is not available in any open source software. Search for Magma in this paper to see how, even today, there seems to be no open source way to compute the rank of the curve y^2 = x^3 + 169304x + 25788938. (The rank is 0.)

Two Codebases

There are several elliptic curve algorithms available only in Magma (e.g., higher descents) ... and some available only in Sage (L-function rank bounds, some overconvergent modular symbols, zeros of L-functions, images of Galois representations). I could be wrong about functionality not being in Magma, since almost anything can get implemented in a year...

The code bases are almost completely separate, which is a very good thing. Any time something gets implemented in one, it gets (or should get) tested via a big run on elliptic curves up to some bound in the other. This sometimes results in bugs being found. I remember refereeing the "integral points" code in Sage by running it against all curves up to some bound and comparing to what Magma output, and getting many discrepancies, which showed that there were bugs in both Sage and Magma. Thus we would be way better off if Sage could do everything Magma does (and vice versa).



by William Stein (noreply@blogger.com) at February 24, 2016 12:07 PM

February 23, 2016

Mark Fenner

Householder Reflections and QR

My next two or three posts are going to deal with the QR and SVD techniques. QR can be done in several different ways. I’m going to work through two methods of computing QR that will shed some light on our SVD implementation. So, these two topics dovetail nicely. Onward! QR via Householder Matrices For […]

by Mark Fenner at February 23, 2016 02:50 PM

February 22, 2016

Titus Brown

An #openscienceprize entry: Integrating server-side annotation into the hypothes.is ecosystem

Over the last few months, I've been playing with hypothes.is and thinking about how to use it to further my scientific work. This resulted in some brainstorming with Jon Udell and Maryann Martone about, well, lots of things. And now we're putting in an open science prize entry!

tl; dr? Check out a draft of our proposal to the open science prize competition, "Annotation as a service: helping build a global knowledge layer with annotation of the dark literature".

What is Hypothes.is?

If you're not familiar with hypothes.is, it's an awesome, game-changing system that lets you add (and view) annotations to virtually any Web page or PDF document. The key bit is that it's utterly permissionless - the hypothes.is service is completely client-side, so you can add and view annotations on any content to which you have access. For example, if you're reading this blog post in a hypothes.is compatible browser (try Firefox or Google Chrome), you should see this sentence highlighted; click on it and you should see an annotation window pop out with some text in it.

Hypothes.is lets me decorate papers (including closed access papers) manually with whatever I want.

More specifically, hypothes.is lets me connect any two resources with an "anchor" in one, pointing to another, with some text and tags. This can be done manually. It's done without permission by either resource - as long as the resource has an ID (URL or PDF ID), it works.

So what's the #openscienceprize entry about?

We want to expand this concept to the notion of automated annotation, where any source of textual analysis (including links to databases, reviews, post-publication-date updates, etc.) can be used to create annotations for any given piece of literature, linking it to relevant resources, metacommentary, etc.

I don't know of anything like this currently and so it's not easy to describe how it's useful. The first use case, Wormbase, is one example - right now, Wormbase can't place links to their gene information on the literature, but with this they could do so quite easily. More generally, I see this as opening the floodgates for rampant creativity to be applied to marking up scientific papers.
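
As a rough illustration of what such automated annotation could look like, here is a hedged sketch; post_annotation stands in for whatever call actually creates a hypothes.is annotation, and the gene table and URL pattern are made up for illustration:

GENE_PAGE = 'http://www.wormbase.org/species/c_elegans/gene/{gene_id}'   # hypothetical URL pattern

def annotate_paper(paper_url, genes_mentioned, post_annotation):
    """For each gene that text mining found in the paper, attach a link to
    its database entry as an annotation anchored on the paper's URL."""
    for gene_name, gene_id in genes_mentioned:
        post_annotation(
            uri=paper_url,
            text='WormBase entry for {}: {}'.format(gene_name,
                                                    GENE_PAGE.format(gene_id=gene_id)),
            tags=['wormbase', 'automated'],
        )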

We'd love your feedback on the idea, along with any brainstorming or critiques you might have. Please go check out the proposal at https://2016-aesir.readthedocs.org/en/latest/ and let us know what you think!

--titus

by C. Titus Brown at February 22, 2016 11:00 PM

Matthew Rocklin

Pandas on HDFS with Dask Dataframes

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

In this post we use Pandas in parallel across an HDFS cluster to read CSV data. We coordinate these computations with dask.dataframe. A screencast version of this blogpost is available here and the previous post in this series is available here.

To start, we connect to our scheduler, import the hdfs module from the distributed library, and read our CSV data from HDFS.

>>> from distributed import Executor, hdfs, progress
>>> e = Executor('127.0.0.1:8786')
>>> e
<Executor: scheduler=127.0.0.1:8786 workers=64 threads=64>

>>> nyc2014 = hdfs.read_csv('/nyctaxi/2014/*.csv',
...               parse_dates=['pickup_datetime', 'dropoff_datetime'],
...               skipinitialspace=True)

>>> nyc2015 = hdfs.read_csv('/nyctaxi/2015/*.csv',
...               parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

>>> nyc2014, nyc2015 = e.persist([nyc2014, nyc2015])
>>> progress(nyc2014, nyc2015)

Our data comes from the New York City Taxi and Limousine Commission which publishes all yellow cab taxi rides in NYC for various years. This is a nice model dataset for computational tabular data because it’s large enough to be annoying while also deep enough to be broadly appealing. Each year is about 25GB on disk and about 60GB in memory as a Pandas DataFrame.

HDFS breaks up our CSV files into 128MB chunks on various hard drives spread throughout the cluster. The dask.distributed workers each read the chunks of bytes local to them and call the pandas.read_csv function on these bytes, producing 391 separate Pandas DataFrame objects spread throughout the memory of our eight worker nodes. The returned objects, nyc2014 and nyc2015, are dask.dataframe objects which present a subset of the Pandas API to the user, but farm out all of the work to the many Pandas dataframes they control across the network.

Play with Distributed Data

If we wait for the data to load fully into memory then we can perform pandas-style analysis at interactive speeds.

>>> nyc2015.head()
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2015-01-15 19:05:39 2015-01-15 19:23:42 1 1.59 -73.993896 40.750111 1 N -73.974785 40.750618 1 12.0 1.0 0.5 3.25 0 0.3 17.05
1 1 2015-01-10 20:33:38 2015-01-10 20:53:28 1 3.30 -74.001648 40.724243 1 N -73.994415 40.759109 1 14.5 0.5 0.5 2.00 0 0.3 17.80
2 1 2015-01-10 20:33:38 2015-01-10 20:43:41 1 1.80 -73.963341 40.802788 1 N -73.951820 40.824413 2 9.5 0.5 0.5 0.00 0 0.3 10.80
3 1 2015-01-10 20:33:39 2015-01-10 20:35:31 1 0.50 -74.009087 40.713818 1 N -74.004326 40.719986 2 3.5 0.5 0.5 0.00 0 0.3 4.80
4 1 2015-01-10 20:33:39 2015-01-10 20:52:58 1 3.00 -73.971176 40.762428 1 N -74.004181 40.742653 2 15.0 0.5 0.5 0.00 0 0.3 16.30
>>> len(nyc2014)
165114373

>>> len(nyc2015)
146112989

Interestingly it appears that the NYC cab industry has contracted a bit in the last year. There are fewer cab rides in 2015 than in 2014.

When we ask for something like the length of the full dask.dataframe we actually ask for the length of all of the hundreds of Pandas dataframes and then sum them up. This process of reaching out to all of the workers completes in around 200-300 ms, which is generally fast enough to feel snappy in an interactive session.

The dask.dataframe API looks just like the Pandas API, except that we call .compute() when we want an actual result.

>>> nyc2014.passenger_count.sum().compute()
279997507.0

>>> nyc2015.passenger_count.sum().compute()
245566747

Dask.dataframes build a plan to get your result and the distributed scheduler coordinates that plan on all of the little Pandas dataframes on the workers that make up our dataset.

Pandas for Metadata

Let’s appreciate for a moment all the work we didn’t have to do around CSV handling because Pandas magically handled it for us.

>>> nyc2015.dtypes
VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
pickup_longitude                float64
pickup_latitude                 float64
RateCodeID                        int64
store_and_fwd_flag               object
dropoff_longitude               float64
dropoff_latitude                float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount\r                  float64
dtype: object

We didn’t have to find columns or specify data-types. We didn’t have to parse each value with an int or float function as appropriate. We didn’t have to parse the datetimes, but instead just specified a parse_dates= keyword. The CSV parsing happened about as quickly as can be expected for this format, clocking in at a network total of a bit under 1 GB/s.

Pandas is well loved because it removes all of these little hurdles from the life of the analyst. If we tried to reinvent a new “Big-Data-Frame” we would have to reimplement all of the work already well done inside of Pandas. Instead, dask.dataframe just coordinates and reuses the code within the Pandas library. It is successful largely due to work from core Pandas developers, notably Masaaki Horikoshi (@sinhrks), who have done tremendous work to align the API precisely with the Pandas core library.

Analyze Tips and Payment Types

In an effort to demonstrate the abilities of dask.dataframe we ask a simple question of our data, “how do New Yorkers tip?”. The 2015 NYCTaxi data is quite good about breaking down the total cost of each ride into the fare amount, tip amount, and various taxes and fees. In particular this lets us measure the percentage that each rider decided to pay in tip.

>>> nyc2015[['fare_amount', 'tip_amount', 'payment_type']].head()
fare_amount tip_amount payment_type
0 12.0 3.25 1
1 14.5 2.00 1
2 9.5 0.00 2
3 3.5 0.00 2
4 15.0 0.00 2

In the first two lines we see evidence supporting the 15-20% tip standard common in the US. The following three lines interestingly show zero tip. Judging only by these first five lines (a very small sample) we see a strong correlation here with the payment type. We analyze this a bit more by counting occurrences in the payment_type column both for the full dataset, and filtered by zero tip:

>>> %time nyc2015.payment_type.value_counts().compute()
CPU times: user 132 ms, sys: 0 ns, total: 132 ms
Wall time: 558 ms

1    91574644
2    53864648
3      503070
4      170599
5          28
Name: payment_type, dtype: int64

>>> %time nyc2015[nyc2015.tip_amount == 0].payment_type.value_counts().compute()
CPU times: user 212 ms, sys: 4 ms, total: 216 ms
Wall time: 1.69 s

2    53862557
1     3365668
3      502025
4      170234
5          26
Name: payment_type, dtype: int64

We find that almost all zero-tip rides correspond to payment type 2, and that almost all payment type 2 rides don’t tip. My unscientific hypothesis here is that payment type 2 corresponds to cash fares and that we’re observing a tendency of drivers not to record cash tips. However, we would need more domain knowledge about our data to actually make this claim with any degree of authority.

Analyze Tips Fractions

Let's make a new column, tip_fraction, and then look at the average of this column grouped by day of week and by hour of day.

First, we need to filter out bad rows: both rows with this odd payment type and rows with zero fare (there are a surprising number of free cab rides in NYC). Second, we create a new column equal to the ratio tip_amount / fare_amount.

>>> df = nyc2015[(nyc2015.fare_amount > 0) & (nyc2015.payment_type != 2)]
>>> df = df.assign(tip_fraction=(df.tip_amount / df.fare_amount))

Next we choose to groupby the pickup datetime column in order to see how the average tip fraction changes by day of week and by hour. The groupby and datetime handling of Pandas makes these operations trivial.

>>> dayofweek = df.groupby(df.tpep_pickup_datetime.dt.dayofweek).tip_fraction.mean()
>>> hour = df.groupby(df.tpep_pickup_datetime.dt.hour).tip_fraction.mean()

>>> dayofweek, hour = e.persist([dayofweek, hour])
>>> progress(dayofweek, hour)

Grouping by day-of-week doesn’t show anything too striking to my eye. However, I would like to note how generous NYC cab riders seem to be. A 23-25% tip can be quite nice:

>>> dayofweek.compute()
tpep_pickup_datetime
0    0.237510
1    0.236494
2    0.236073
3    0.246007
4    0.242081
5    0.232415
6    0.259974
Name: tip_fraction, dtype: float64

But grouping by hour shows that late night and early morning riders are more likely to tip extravagantly:

>>> hour.compute()
tpep_pickup_datetime
0     0.263602
1     0.278828
2     0.293536
3     0.276784
4     0.348649
5     0.248618
6     0.233257
7     0.216003
8     0.221508
9     0.217018
10    0.225618
11    0.231396
12    0.225186
13    0.235662
14    0.237636
15    0.228832
16    0.234086
17    0.240635
18    0.237488
19    0.272792
20    0.235866
21    0.242157
22    0.243244
23    0.244586
Name: tip_fraction, dtype: float64

We plot this with matplotlib and see a nice trough during business hours, with a surge in the early morning and an astonishing peak of 34% at 4am:
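
Since the figure itself isn't reproduced here, the plotting step is roughly the following (a hedged sketch; the exact styling is my own):

from matplotlib import pyplot as plt

hourly = hour.compute()                    # a small Pandas Series: mean tip fraction per pickup hour
plt.figure(figsize=(10, 4))
plt.plot(hourly.index, hourly.values, marker='o')
plt.xlabel('Hour of day')
plt.ylabel('Mean tip fraction')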

Performance

Let's dive into a few operations that run at different time scales. This gives a good understanding of the strengths and limits of the scheduler.

>>> %time nyc2015.head()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 20.9 ms

This head computation is about as fast as a film projector. You could perform this roundtrip computation between every consecutive frame of a movie; to a human eye this appears fluid. In the last post we asked about how low we could bring latency. In that post we were running computations from my laptop in California and so were bound by transcontinental latencies of 200ms. This time, because we’re operating from the cluster, we can get down to 20ms. We’re only able to be this fast because we touch only a single data element, the first partition. Things change when we need to touch the entire dataset.

>>> %time len(nyc2015)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 271 ms

The length computation takes 200-300 ms. This computation takes longer because we touch every individual partition of the data, of which there are 178. The scheduler incurs about 1 ms of overhead per task; add a bit of latency and you get the ~200 ms total. This means that the scheduler will likely be the bottleneck whenever computations are very fast, such as is the case for computing len. Really, this is good news; it means that by improving the scheduler we can reduce these durations even further.

If you look at the groupby computations above you can add the numbers in the progress bars to show that we computed around 3000 tasks in around 7s. It looks like this computation is about half scheduler overhead and about half bound by actual computation.

Conclusion

We used dask+distributed on a cluster to read CSV data from HDFS into a dask dataframe. We then used dask.dataframe, which looks identical to the Pandas dataframe, to manipulate our distributed dataset intuitively and efficiently.

We looked a bit at the performance characteristics of simple computations.

What doesn’t work

As always I’ll have a section like this that honestly says what doesn’t work well and what I would have done with more time.

  • Dask dataframe implements a commonly used subset of Pandas functionality, not all of it. It’s surprisingly hard to communicate the exact bounds of this subset to users. Notably, in the distributed setting we don’t have a shuffle algorithm, so groupby(...).apply(...) and some joins are not yet possible.

  • If you want to use threads, you’ll need Pandas 0.18.0 which, at the time of this writing, was still in release candidate stage. This Pandas release fixes some important GIL related issues.

  • The 1ms overhead per task limit is significant. While we can still scale out to clusters far larger than what we have here, we probably won’t be able to strongly accelerate very quick operations until we reduce this number.

  • We use the hdfs3 library to read data from HDFS. This library seems to work great but is new and could use more active users to flush out bug reports.

Setup and Data

You can obtain public data from the New York City Taxi and Limousine Commission here. I downloaded this onto the head node and dumped it into HDFS with commands like the following:

$ wget https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-{01..12}.csv
$ hdfs dfs -mkdir /nyctaxi
$ hdfs dfs -mkdir /nyctaxi/2015
$ hdfs dfs -put yellow*.csv /nyctaxi/2015/

The cluster was hosted on EC2 and was comprised of nine m3.2xlarges with 8 cores and 30GB of RAM each. Eight of these nodes were used as workers; they used processes for parallelism, not threads.

February 22, 2016 12:00 AM

February 21, 2016

Fabian Pedregosa

SAGA algorithm in the lightning library

Recently I've implemented, together with Arnaud Rachez, the SAGA[1] algorithm in the lightning machine learning library (which, by the way, has recently been moved to the new scikit-learn-contrib project). The lightning library uses the same API as scikit-learn but is particularly adapted to online learning. As for the SAGA algorithm, its performance is similar to that of other variance-reduced stochastic algorithms such as SAG[3] or SVRG[2], but it has the advantage over SAG[3] that it allows non-smooth penalty terms (such as $\ell_1$ regularization). It is implemented in lightning as SAGAClassifier and SAGARegressor.
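
For orientation, here is a minimal usage sketch; SAGAClassifier is named above, but the particular parameters and the synthetic data are assumptions of mine rather than the benchmark setup described below:

from sklearn.datasets import make_classification
from lightning.classification import SAGAClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# An l1 penalty is exactly the kind of non-smooth term that SAGA supports and SAG does not.
clf = SAGAClassifier(loss='log', penalty='l1', alpha=1e-4)
clf.fit(X, y)
print(clf.score(X, y))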

We have taken care to make this implementation as efficient as possible. As for most stochastic gradient algorithms, a naive implementation takes 3 lines of code and is straightforward to implement. However, there are many tricks that are time-consuming and error-prone to implement but make a huge difference in efficiency.

A small example, more as a sanity check than to claim anything. The following plot shows the suboptimality as a function of time for three similar methods: SAG, SAGA and SVRG. The dataset used is the RCV1 dataset (test set, obtained from the libsvm webpage), consisting of 677,399 samples and 47,236 features. Interestingly, all methods can solve this rather large-scale problem within a few seconds. Among them, SAG and SAGA have very similar performance, while SVRG appears to be noticeably faster.

A note about the benchmarks: it is difficult to compare stochastic gradient methods fairly because in the end it usually boils down to how you choose the step size. In this plot I set the step size of all methods to 1/(3L), where L is the Lipschitz constant of the objective function, as I think this is a popular choice. I would have preferred 1/L, but SVRG was not converging for this step size. The code for the benchmarks can be found here.


  1. A. Defazio, F. Bach & S. Lacoste-Julien. "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives" (2014). 

  2. Rie Johnson and Tong Zhang. "Accelerating stochastic gradient descent using predictive variance reduction." Advances in Neural Information Processing Systems. 2013. 

  3. Mark Schmidt, Nicolas Le Roux, and Francis Bach. "Minimizing finite sums with the stochastic average gradient." arXiv preprint arXiv:1309.2388 (2013). 

by Fabian Pedregosa at February 21, 2016 11:00 PM

Titus Brown

Introducing undergraduates to code review

A few days ago, I gave an invited lecture on code review here at UC Davis. The class was the ECS capstone class, consisting of about 100 CS majors who were all working on some form of group project - all but a few were doing something with programming. I was told the class had little to no experience with code review, but did know C++ rather well.

Since I only had two hours AND I really hate lectures, I really wanted to do something that was somewhat hands-on, but didn't involve programming or computer use. Asking my Twitter network, I got a few links -- Azalee Boestrom has a code review tutorial, and Maxime Boissonneault demoed exercism.io for me -- but neither fit what I had in mind.

The big thing was this: somewhere during my thought process, I realized that I wanted to show the students the basic connection between code review, testing, and refactoring. How do you show all of that to people who might have never done any of that?

I ended up with the plan that follows. It went better than expected!

In preparation for class, I extracted some code from our khmer package -- specifically, the KmerIterator code that takes a string of DNA and produces a succession of hashed values from it -- and defactored it into a horrid function. You can see the result here. I also built a simple Makefile.

In class, I started by asking the class what they thought the purpose of code review was, and polled them on it via a free-text entry Google Form. This led to a discussion about maintainability vs correctness, and my point that you couldn't approach the latter over a long period of time without the former.

Then, I moved on to the code:

First, I ran through the purpose of the (ugly) code with lots of hand-waving.

Then, I asked the question: how do we know if the code works?

To answer that, I live-coded a very simple version of the function that was slow, but so simple that you could guarantee it worked -- see it here. I saved the output from this "presumed-good" version, and compared it to the output of the ugly version and showed that it matched.

For good measure, I also automated that as a smoke test, so that with a simple 'make test' we could check to see if our latest code changes produced presumed-good output.
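
The live-coded reference was C++ against khmer's hash function, but the "obviously correct version plus smoke test" pattern fits in a few lines of any language; here is a hedged Python sketch of the idea:

def kmers(sequence, k):
    """Slow but obviously correct: yield every k-length substring, in order."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

# Smoke test: any cleverer implementation must reproduce the presumed-good output.
assert list(kmers('ATGGACC', 3)) == ['ATG', 'TGG', 'GGA', 'GAC', 'ACC']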

Now I had code that probably worked.

Second, I polled the class (with a free-text entry approach using Google Forms) for their thoughts on what the code needed. Remember, it's ugly!

I went through the resulting comments, and pulled out a few suggestions.

One was "write down in detail what the biology background is." I explained that you probably didn't want that kind of comment in the code -- generally, anyone working on this code should already be somewhat versed in the biology.

Another suggestion was "Comment the code." So I started putting in comments like "set this variable to 0", which led to a discussion of how too many comments could be silly and what we really wanted was important comments. Here, I brought in my confession that I'd spent 15 minutes trying to understand why i started at k, so we commented that plus a few other things.

Third, I was running a bit low on time, so I took a previous suggestion and refactored the code by making longer, better variable names. Here, I could show that the code not only compiled but passed tests on the first try (hence the commit comment).

Finally, I took the suggestion of one astute observer and split up my now too-long function by building it into a class. The result looks reasonably similar to the actual code in khmer, which I pointed out; even better, it passed my smoke test (hence the commit comment). I got applause!

We closed with some discussion about where to look for more information, and how to think about this stuff. I showed them some live examples, and referred them to David Soergel's paper, Rampant software errors may undermine scientific results as an entry point into some literature on error rates; I also showed them Python coverage output and our code review guidelines for the khmer project.


All in all, I managed to hold their attention (by live coding, by cracking jokes, and by being energetic). (I only got one negative comment about the jokes! :).

I think I demonstrated about 80% of what I wanted to show, and filled up the two hours nicely. I can do a better job next time, of course, and if I get invited back I'll try to provide a better outline of points to make and references; I winged it too much this time and so it's not a re-usable approach yet, I think.

--titus

by C. Titus Brown at February 21, 2016 11:00 PM

Filipe Saraiva

Call to Co-maintainers of Cantor

Cantor, the software for scientific programming in a worksheet-style interface, has had (and has!) several developers working on different parts of the code over the years. Thanks to Cantor's plugin-based architecture, a developer can create a new backend to communicate with a new programming language, an assistant, or some other piece of software, requiring only knowledge of the Cantor API.

Because of this, Cantor has 10 backends using different sets of technologies to communicate between the software and a given programming language. To cite some examples, the Python 2 backend uses the Python/C API, the R and Python 3 backends use the D-Bus protocol, and the Scilab backend uses KProcess. So there is a mix of different technologies in the project.

You can imagine that maintaining Cantor would be a headache if the work were done by just one person. Since I received maintainer status, I have been trying to improve the coordination of the project, including distributing responsibilities among developers.

To that end, I made a request to the KDE sysadmins, and that amazing team created separate components in KDE Bugzilla for each backend in Cantor. For now I have asked some developers to take responsibility for, and manage the bugs reported against, the backends they developed, but some components still lack maintainers, especially the Maxima, Qalculate, R, and Sage backends.

So, if you are a developer of one of those backends and would like to continue developing Cantor, please comment on this post or send an e-mail to me at filipe at kde.org, and I will list you as the maintainer of the component.

Of course, all developers can contribute to Cantor and fix its bugs, but having a maintainer for each component will help us manage and improve the project.

by Filipe Saraiva at February 21, 2016 12:30 AM

February 18, 2016

Pierre de Buyl

Code as a paper

Publishing scientific software is a nontrivial problem. Here I present, and ask for feedback on, the idea of uploading the software to arXiv with a presentation paper and using it as an evolving reference.

The situation is the following: I am developing a code that, while still under development, should provide scientific outputs in the coming months. I'd like to cite the software properly in my papers at that point, and not in a hypothetical future where my entire wishlist has been satisfied. I would like to cite it as it is used, and would like a "single point of citation" for the lifetime of the code.

Many people have already thought about this issue. See the recent Software Credit Workshop, the related blog post by Martin Fenner, Mick Watson's blog, Gaël Varoquaux's blog, Titus Brown's blog and probably many others. There are several old and new solutions to publish software:

  • Paper: Publishing a software paper in a computing oriented journal. In physics, a well-known venue is Computer Physics Communications for instance.
  • Web page: Set up a web page for your software and link to the source (a tarball, a version control system, etc.).
  • Crazy (in a good sense): Launching a new journal
  • Zenodo/GitHub: Obtain a DOI for your GitHub-hosted code via Zenodo, as presented here

Here are my criteria:

  • Open-access (by principle).
  • Low or nonexistent author charges (pragmatic condition).
  • Provides a single citable reference point for the software and its updates. Be explicit on the version, though.
  • Persistent.
  • Allows more than "just" the code. That is, I would like to document the first release by an extended note, similar to what would be found in a traditional software paper.

Let us review the propositions above:

  • Paper: Only allows one version. Also, it is probably paywalled or expensive due to publication fees.
  • Web page: Hard to cite. Likely not persistent enough.
  • Crazy: Experience shows that it is not realistic, at least now.
  • Zenodo/GitHub: One DOI per version. Also, the DOI points to an exact reproduction of the source repository. It is thus rather inflexible.

The arXiv identifier is, I believe, stable and citable. One can attach a tarball of code to the paper, and it is possible to update a paper with newer versions, allowing updates to the software in a simple form: v1, v2, etc. As far as journals are concerned, the Journal of Open Research Software has reasonable fees and allows distributing the code via GitHub. I don't know how they handle software revisions, though.

Advice and comments welcome!

by Pierre de Buyl at February 18, 2016 10:00 AM

February 17, 2016

Matthew Rocklin

Introducing Dask distributed

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We analyze JSON data on a cluster using pure Python projects.

Dask, a Python library for parallel computing, now works on clusters. During the past few months I and others have extended dask with a new distributed memory scheduler. This enables dask’s existing parallel algorithms to scale across 10s to 100s of nodes, and extends a subset of PyData to distributed computing. Over the next few weeks I and others will write about this system. Please note that dask+distributed is developing quickly and so the API is likely to shift around a bit.

Today we start simple with the typical cluster computing problem, parsing JSON records, filtering, and counting events using dask.bag and the new distributed scheduler. We’ll dive into more advanced problems in future posts.

A video version of this blogpost is available here.

GitHub Archive Data on S3

GitHub releases data dumps of its public event stream as gzip-compressed, line-delimited JSON. This data is too large to fit comfortably into memory, even on a sizable workstation. We could stream it from disk, but due to the compression and JSON encoding this takes a while and bogs down interactive use. For an interactive experience with data like this we need a distributed cluster.

Setup and Data

We provision nine m3.2xlarge nodes on EC2. These have eight cores and 30GB of RAM each. On this cluster we provision one scheduler and nine workers (see setup docs). (More on launching in later posts.) We have five months of data, from 2015-01-01 to 2015-05-31, in the githubarchive-data bucket in S3. This data is publicly available if you want to play with it on EC2. You can download the full dataset at https://www.githubarchive.org/.

The first record looks like the following:

 {'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/9152315?',
   'gravatar_id': '',
   'id': 9152315,
   'login': 'davidjhulse',
   'url': 'https://api.github.com/users/davidjhulse'},
  'created_at': '2015-01-01T00:00:00Z',
  'id': '2489368070',
  'payload': {'before': '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
   'commits': [{'author': {'email': 'david.hulse@live.com',
      'name': 'davidjhulse'},
     'distinct': True,
     'message': 'Altered BingBot.jar\n\nFixed issue with multiple account support',
     'sha': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
     'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'}],
   'distinct_size': 1,
   'head': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
   'push_id': 536740396,
   'ref': 'refs/heads/master',
   'size': 1},
  'public': True,
  'repo': {'id': 28635890,
   'name': 'davidjhulse/davesbingrewardsbot',
   'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'},
  'type': 'PushEvent'}

So we have a large dataset on S3 and a moderate sized play cluster on EC2, which has access to S3 data at about 100MB/s per node. We’re ready to play.

Play

We start an ipython interpreter on our local laptop and connect to the dask scheduler running on the cluster. For the purposes of timing, the cluster is on the East Coast while the local machine is in California on commercial broadband internet.

>>> from distributed import Executor, s3
>>> e = Executor('54.173.84.107:8786')
>>> e
<Executor: scheduler=54.173.84.107:8786 workers=72 threads=72>

Our seventy-two worker processes come from nine workers with eight processes each. We chose processes rather than threads for this task because computations will be bound by the GIL. We will change this to threads in later examples.

We start by loading a single month of data into distributed memory.

import json
text = s3.read_text('githubarchive-data', '2015-01', compression='gzip')
records = text.map(json.loads)
records = e.persist(records)

The data lives in S3 in hourly files as gzipped encoded, line delimited JSON. The s3.read_text and text.map functions produce dask.bag objects which track our operations in a lazily built task graph. When we ask the executor to persist this collection we ship those tasks off to the scheduler to run on all of the workers in parallel. The persist function gives us back another dask.bag pointing to these remotely running results. This persist function returns immediately, and the computation happens on the cluster in the background asynchronously. We gain control of our interpreter immediately while the cluster hums along.

The cluster takes around 40 seconds to download, decompress, and parse this data. If you watch the video embedded above you’ll see fancy progress-bars.

We ask for a single record. This returns in around 200ms, which is fast enough that it feels instantaneous to a human.

>>> records.take(1)
({'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/9152315?',
   'gravatar_id': '',
   'id': 9152315,
   'login': 'davidjhulse',
   'url': 'https://api.github.com/users/davidjhulse'},
  'created_at': '2015-01-01T00:00:00Z',
  'id': '2489368070',
  'payload': {'before': '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
   'commits': [{'author': {'email': 'david.hulse@live.com',
      'name': 'davidjhulse'},
     'distinct': True,
     'message': 'Altered BingBot.jar\n\nFixed issue with multiple account support',
     'sha': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
     'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'}],
   'distinct_size': 1,
   'head': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
   'push_id': 536740396,
   'ref': 'refs/heads/master',
   'size': 1},
  'public': True,
  'repo': {'id': 28635890,
   'name': 'davidjhulse/davesbingrewardsbot',
   'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'},
  'type': 'PushEvent'},)

This particular event is a 'PushEvent'. Let’s quickly see all the kinds of events. For fun, we’ll also time the interaction:

>>> %time records.pluck('type').frequencies().compute()
CPU times: user 112 ms, sys: 0 ns, total: 112 ms
Wall time: 2.41 s

[('ReleaseEvent', 44312),
 ('MemberEvent', 69757),
 ('IssuesEvent', 693363),
 ('PublicEvent', 14614),
 ('CreateEvent', 1651300),
 ('PullRequestReviewCommentEvent', 214288),
 ('PullRequestEvent', 680879),
 ('ForkEvent', 491256),
 ('DeleteEvent', 256987),
 ('PushEvent', 7028566),
 ('IssueCommentEvent', 1322509),
 ('GollumEvent', 150861),
 ('CommitCommentEvent', 96468),
 ('WatchEvent', 1321546)]

And we compute the total count of all commits for this month.

>>> %time records.count().compute()
CPU times: user 134 ms, sys: 133 µs, total: 134 ms
Wall time: 1.49 s

14036706

We see that it takes a few seconds to walk through the data (and perform all the scheduling overhead). The scheduler adds about a millisecond of overhead per task, and there are about 1000 partitions/files here (the GitHub data is split by hour and there are around 730 hours in a month), so most of the cost here is overhead.

Investigate Jupyter

We investigate the activities of Project Jupyter. We chose this project because it’s sizable and because we understand the players involved and so can check our accuracy. This will require us to filter our data to a much smaller subset, then find popular repositories and members.

>>> jupyter = (records.filter(lambda d: d['repo']['name'].startswith('jupyter/'))
                      .repartition(10))
>>> jupyter = e.persist(jupyter)

All records, regardless of event type, have a repository which has a name like 'organization/repository' in typical GitHub fashion. We filter all records that start with 'jupyter/'. Additionally, because this dataset is likely much smaller, we push all of these records into just ten partitions. This dramatically reduces scheduling overhead. The persist call hands this computation off to the scheduler and then gives us back our collection that points to that computing result. Filtering this month for Jupyter events takes about 7.5 seconds. Afterwards computations on this subset feel snappy.

>>> %time jupyter.count().compute()
CPU times: user 5.19 ms, sys: 97 µs, total: 5.28 ms
Wall time: 199 ms

747

>>> %time jupyter.take(1)
CPU times: user 7.01 ms, sys: 259 µs, total: 7.27 ms
Wall time: 182 ms

({'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/26679?',
   'gravatar_id': '',
   'id': 26679,
   'login': 'marksteve',
   'url': 'https://api.github.com/users/marksteve'},
  'created_at': '2015-01-01T13:25:44Z',
  'id': '2489612400',
  'org': {'avatar_url': 'https://avatars.githubusercontent.com/u/7388996?',
   'gravatar_id': '',
   'id': 7388996,
   'login': 'jupyter',
   'url': 'https://api.github.com/orgs/jupyter'},
  'payload': {'action': 'started'},
  'public': True,
  'repo': {'id': 5303123,
   'name': 'jupyter/nbviewer',
   'url': 'https://api.github.com/repos/jupyter/nbviewer'},
  'type': 'WatchEvent'},)

So the first event of the year was by 'marksteve' who decided to watch the 'nbviewer' repository on new year’s day.

Notice that these computations take around 200ms. I can’t get below this from my local machine, so we’re likely bound by communicating to such a remote location. A 200ms latency is not great if you’re playing a video game, but it’s decent for interactive computing.

Here are all of the Jupyter repositories touched in the month of January,

>>> %time jupyter.pluck('repo').pluck('name').distinct().compute()
CPU times: user 2.84 ms, sys: 4.03 ms, total: 6.86 ms
Wall time: 204 ms

['jupyter/dockerspawner',
 'jupyter/design',
 'jupyter/docker-demo-images',
 'jupyter/jupyterhub',
 'jupyter/configurable-http-proxy',
 'jupyter/nbshot',
 'jupyter/sudospawner',
 'jupyter/colaboratory',
 'jupyter/strata-sv-2015-tutorial',
 'jupyter/tmpnb-deploy',
 'jupyter/nature-demo',
 'jupyter/nbcache',
 'jupyter/jupyter.github.io',
 'jupyter/try.jupyter.org',
 'jupyter/jupyter-drive',
 'jupyter/tmpnb',
 'jupyter/tmpnb-redirector',
 'jupyter/nbgrader',
 'jupyter/nbindex',
 'jupyter/nbviewer',
 'jupyter/oauthenticator']

And the top ten most active people on GitHub.

>>> %time (jupyter.pluck('actor')
                  .pluck('login')
                  .frequencies()
                  .topk(10, lambda kv: kv[1])
                  .compute())
CPU times: user 8.03 ms, sys: 90 µs, total: 8.12 ms
Wall time: 226 ms

[('rgbkrk', 156),
 ('minrk', 87),
 ('Carreau', 87),
 ('KesterTong', 74),
 ('jhamrick', 70),
 ('bollwyvl', 25),
 ('pkt', 18),
 ('ssanderson', 13),
 ('smashwilson', 13),
 ('ellisonbg', 13)]

Nothing too surprising here if you know these folks.

Full Dataset

The full five months of data is too large to fit in memory, even for this cluster. When we represent semi-structured data like this with dynamic data structures like lists and dictionaries there is quite a bit of memory bloat. Some careful attention to efficient semi-structured storage here could save us from having to switch to such a large cluster, but that will have to be the topic of another post.
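
One simple mitigation (a sketch, not from the original post) is to map each parsed event down to just the fields we use before persisting; small flat dictionaries take far less memory than the full nested records. Applied to the records bag from above:

>>> slim = records.map(lambda d: {'type': d['type'],              # event type
...                               'repo': d['repo']['name'],      # 'organization/repository'
...                               'login': d['actor']['login']})  # GitHub username
>>> slim = e.persist(slim)  # persist the narrow projection rather than the full records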

Instead, we operate efficiently on this dataset by flowing it through memory, persisting only the records we care about. The distributed dask scheduler descends from the single-machine dask scheduler, which was quite good at flowing through a computation and intelligently removing intermediate results.

From a user API perspective, we call persist only on the jupyter dataset, and not the full records dataset.

>>> full = (s3.read_text('githubarchive-data', '2015', compression='gzip')
              .map(json.loads))

>>> jupyter = (full.filter(lambda d: d['repo']['name'].startswith('jupyter/'))
                   .repartition(10))

>>> jupyter = e.persist(jupyter)

It takes 2m36s on nine m3.2xlarge workers to download, decompress, and parse the five months of publicly available GitHub events and filter them down to just the Jupyter events.

There were seven thousand such events.

>>> jupyter.count().compute()
7065

We find which repositories saw the most activity during that time:

>>> %time (jupyter.pluck('repo')
                  .pluck('name')
                  .frequencies()
                  .topk(20, lambda kv: kv[1])
                  .compute())
CPU times: user 6.98 ms, sys: 474 µs, total: 7.46 ms
Wall time: 219 ms

[('jupyter/jupyterhub', 1262),
 ('jupyter/nbgrader', 1235),
 ('jupyter/nbviewer', 846),
 ('jupyter/jupyter_notebook', 507),
 ('jupyter/jupyter-drive', 505),
 ('jupyter/notebook', 451),
 ('jupyter/docker-demo-images', 363),
 ('jupyter/tmpnb', 284),
 ('jupyter/jupyter_client', 162),
 ('jupyter/dockerspawner', 149),
 ('jupyter/colaboratory', 134),
 ('jupyter/jupyter_core', 127),
 ('jupyter/strata-sv-2015-tutorial', 108),
 ('jupyter/jupyter_nbconvert', 103),
 ('jupyter/configurable-http-proxy', 89),
 ('jupyter/hubpress.io', 85),
 ('jupyter/jupyter.github.io', 84),
 ('jupyter/tmpnb-deploy', 76),
 ('jupyter/nbconvert', 66),
 ('jupyter/jupyter_qtconsole', 59)]

We see that projects like jupyterhub were quite active during that time while, surprisingly, nbconvert saw relatively little action.

Local Data

The Jupyter data is quite small and easily fits on a single machine. Let's bring the data to our local machine so that we can compare times:

>>> %time L = jupyter.compute()
CPU times: user 4.74 s, sys: 10.9 s, total: 15.7 s
Wall time: 30.2 s

It takes surprisingly long to download the data, but once it's here, we can iterate far more quickly with basic Python.

>>> from toolz.curried import pluck, frequencies, topk, pipe
>>> %time pipe(L, pluck('repo'), pluck('name'), frequencies,
               dict.items, topk(20, key=lambda kv: kv[1]), list)
CPU times: user 11.8 ms, sys: 0 ns, total: 11.8 ms
Wall time: 11.5 ms

[('jupyter/jupyterhub', 1262),
 ('jupyter/nbgrader', 1235),
 ('jupyter/nbviewer', 846),
 ('jupyter/jupyter_notebook', 507),
 ('jupyter/jupyter-drive', 505),
 ('jupyter/notebook', 451),
 ('jupyter/docker-demo-images', 363),
 ('jupyter/tmpnb', 284),
 ('jupyter/jupyter_client', 162),
 ('jupyter/dockerspawner', 149),
 ('jupyter/colaboratory', 134),
 ('jupyter/jupyter_core', 127),
 ('jupyter/strata-sv-2015-tutorial', 108),
 ('jupyter/jupyter_nbconvert', 103),
 ('jupyter/configurable-http-proxy', 89),
 ('jupyter/hubpress.io', 85),
 ('jupyter/jupyter.github.io', 84),
 ('jupyter/tmpnb-deploy', 76),
 ('jupyter/nbconvert', 66),
 ('jupyter/jupyter_qtconsole', 59)]

The difference here is about 20x, which is a good reminder that once you no longer have a large problem, you should probably eschew distributed systems and work locally.

Conclusion

Downloading, decompressing, parsing, filtering, and counting JSON records is the new wordcount. It’s the first problem anyone sees. Fortunately it’s both easy to solve and the common case. Woo hoo!

Here we saw that dask+distributed handle the common case decently well, and with a pure Python stack. Typically Python users rely on a JVM technology like Hadoop, Spark, or Storm to distribute their computations. Here we have Python distributing Python; there are some usability gains to be had, like nicer stack traces, a bit less serialization overhead, and attention to other Pythonic style choices.

Over the next few posts I intend to deviate from this common case. Most “Big Data” technologies were designed to solve typical data munging problems found in web companies or with simple database operations in mind. Python users care about these things too, but they also reach out to a wide variety of fields. In dask+distributed development we care about the common case, but also support less traditional workflows that are commonly found in the life, physical, and algorithmic sciences.

By designing to support these more extreme cases we’ve nailed some common pain points in current distributed systems. Today we’ve seen low latency and remote control; in the future we’ll see others.

What doesn’t work

I’ll have an honest section like this at the end of each upcoming post describing what doesn’t work, what still feels broken, or what I would have done differently with more time.

  • The imports for dask and distributed are still strange. They’re two separate codebases that play very nicely together. Unfortunately the functionality you need is sometimes in one or in the other and it’s not immediately clear to the novice user where to go. For example dask.bag, the collection we’re using for records, jupyter, etc. is in dask but the s3 module is within the distributed library. We’ll have to merge things at some point in the near-to-moderate future. Ditto for the API: there are compute methods both on the dask collections (records.compute()) and on the distributed executor (e.compute(records)) that behave slightly differently.

  • We lack an efficient distributed shuffle algorithm. This is very important if you want to use operations like .groupby (which you should avoid anyway). The user API doesn't even cleanly warn users that this is missing in the distributed case, which is kind of a mess. (It works fine on a single machine.) Efficient alternatives like foldby are available; see the sketch after this list.

  • I would have liked to run this experiment directly on the cluster to see how low we could have gone below the 200ms barrier we ran into here.
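
To make that foldby alternative concrete (a sketch, not from the original post), the per-repository counts computed earlier with frequencies can be expressed as a fold that counts within each partition and then combines the partial counts, so no shuffle is needed:

>>> (jupyter.foldby(key=lambda d: d['repo']['name'],    # group by repository name
...                 binop=lambda total, _: total + 1,   # count within each partition
...                 initial=0,
...                 combine=lambda a, b: a + b,         # merge per-partition counts
...                 combine_initial=0)
...         .topk(20, lambda kv: kv[1])
...         .compute())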

Links

  • dask, the original project
  • distributed, the distributed memory scheduler powering the cluster computing
  • dask.bag, the user API we’ve used in this post.
  • This post largely repeats work by Blake Griffith in a similar post last year, using an older iteration of the dask distributed scheduler.

February 17, 2016 12:00 AM

February 16, 2016

Titus Brown

Is mybinder 95% of the way to next-gen computational science publishing, or only 90%?

If you haven't seen mybinder.org, you should go check it out. It's a site that runs IPython/Jupyter Notebooks from GitHub for free, and I think it's a solution to publishing reproducible computational work.

For a really basic example, take a look at my demo Software Carpentry lesson. Clicking on that link brings up a Python notebook that will load and plot some data; it's interactive and you can go ahead and execute the cells.

There are a couple of really appealing things about this.

First, it's simple -- the source repository is just a directory that contains a Jupyter notebook and some data. It's pretty much what you'd have if you were actually, you know, doing data analysis with data in a repo.

Second, it's built on top of solid technology. mybinder's foundation is Docker and Jupyter Notebook, both of which are robust and have large, supportive communities. Sure, behind the scenes, mybinder is using some bleeding edge tech (Kubernetes running on Google Compute Engine) to provision and load balance the compute. But I bet it would take me no more than 30 minutes to cobble together something that would let me run any mybinder-compatible repository on my laptop, or on any compute server that can run Docker.

Third, it's super configurable -- because it's based on Docker, you can install pretty much anything you want. I tested this out myself by installing R and an R kernel for the Jupyter notebook -- see this mybinder link and repo. The Dockerfile that installs things is pretty straightforward (given that installing all that stuff isn't straightforward in the first place ;).

Fourth, it's open source. Go ahead, do what you want with it!

I believe mybinder is the hitherto missing link for publishing reproducible computation. Before this, we could give people our data and our data analysis scripts, but not our compute environment - or at least not in a way that was easy to run. We can now provide readers with a link that takes them straight from a paper to a running instantiation of that paper's data analysis.

What's missing and easy?

Authentication and Cost: right now, mybinder is running on someone else's dime -- specifically, the Freeman Lab's dime. This can't continue forever, especially since it's totes open for abuse. I expect that sooner or later someone will gin up a way to run this on many cloud compute services; once that works, it should be pretty easy to build a site that redirects you to your favorite compute service and lets you enter your own credentials, then boom there you go. Insta-compute.

(I think current mybinder limitations are in the several GB of RAM range, with a few dozen MB of disk space.)

Executing anything other than GitHub public repos: mybinder only runs off of public GitHub repositories right now, but that's not going to be too tricky to fix, either. BitBucket and GitLab and private repos can be served almost as easily.

RStudio and RMarkdown: currently, it's all Project Jupyter, but it should be pretty easy to get RStudio Server or some sort of RMarkdown application (Shiny?) working on it.

Pre-built Docker images: you need to derive your Dockerfile from the andrewosh/binder-base image, which means you can't use pre-built images from the Docker Hub (even if they were built from binder-base). I predict that will go away soon enough, since it's not a fundamental requirement of the way they're doing things.

What's missing and not easy?

Big files and big/long compute: Coordinating the delivery of big files, big compute, and big memory (potentially for a long period of time) is something that's out of easy reach, I think. Docker isn't ready for really data-intensive stuff, from what I can tell. More, I'm not sure when (if ever) people will completely automate the heavyweight data analysis - surely it's further away than most people using Jupyter or RStudio for data analysis.

The split that my lab has made here is to use a workflow engine (e.g. make, pydoit, or snakemake) for the compute & data intensive stuff, and then feed those intermediate results (assembly and mapping stats, quantification, etc.) into analysis notebooks. For mybinder purposes, there should be no problem saving those intermediate results into a github repo for us and everyone else to analyze and reanalyze.
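
As a purely hypothetical sketch of that split (the file names, shell script, and task below are invented, not taken from the lab's actual pipelines), a pydoit task can run the heavy step and emit only a small summary file, which gets committed to the repo and read by the analysis notebook:

# dodo.py -- run `doit` to (re)build the small intermediate result
def task_mapping_stats():
    """Heavy mapping step, summarized into a small CSV the notebook can read."""
    return {
        'file_dep': ['reads.fq.gz', 'reference.fa'],   # large inputs, kept out of git
        'targets': ['results/mapping-stats.csv'],      # small output, committed to git
        'actions': ['bash map_and_summarize.sh '
                    'reads.fq.gz reference.fa results/mapping-stats.csv'],
    }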

What are the next steps?

I bet this will evolve pretty fast, because it's an easy way for scientists to deliver compute to friends, colleagues, and students.

In terms of learning how to make use of mybinder, Software Carpentry already teaches scientists how to do much of the necessary stuff -- scripting, git, and Jupyter Notebooks are all part of the standard curriculum. It's not a big leap to show students some simple Docker files. There's virtually no unnecessary configuration required at the moment, which is a big win; we can talk about why everything is actually important for the end result, which is <ahem> not always the case in computing.

Beyond that, I'm not sure where the gaps are - as usual we're missing incentives for adopting this, so I'd be interested in thinking about how to structure things there. But I look forward to seeing how this evolves!

--titus

p.s. Everware does something similar to mybinder. I'm aware of it, I just haven't had the chance to try it out - I'd love to hear from people who have used it!

p.p.s. For a real, and super cool, example of mybinder, see: Min Ragan-Kelley's LIGO/gravitational waves tutorial.

by C. Titus Brown at February 16, 2016 11:00 PM

Three remote workshops: regex in Python, scipy.optimize, and building Web sites

We're running three half-day workshops for your remote viewing pleasure! All three will be live-streamed via YouTube (well, Hangouts on Air).

Today 2/17, at 9:15am PT, Tiffany Timbers (Simon Fraser U.) will be going through regular expressions in Python - see the workshop description.

On Friday, 2/19, same time, same place, Ariel Rokem (U. Washington) will be introducing scipy.optimize.

And, in just under two weeks, on Monday 2/29, Adelaide Rhodes (Oregon State University) will be showing us how to build static Web sites with Sphinx.

All three workshops will be live-streamed as well as available for archival viewing. We'll post the YouTube links on the workshop pages above, but you can always tweet us at @ctitusbrown or @jessicamizzi if you want the hook-up. (We'll also do live-stream support as possible.)

--titus

by C. Titus Brown at February 16, 2016 11:00 PM

February 14, 2016

Titus Brown

The likely challenges of (post) publication peer review

CORRECTION: I mistakenly linked to Geoff Bilder, Jennifer Lin, and Cameron's piece on infrastructure in the first posted version, rather than Cameron's post on culture in data sharing. Both are worth reading but the latter is more relevant to this post, and I also wanted to make sure I correctly attributed the former to Geoff, Jennifer, and Cameron all.


I am a huge fan of the idea of changing up scientific peer review.

Anyone paying attention has seen a nigh-daily drumbeat of blog posts, tweets, and op-eds pointing out the shortcomings of the current approach, which is dominated by anonymous pre-publication peer review.

For those who don't know how the current system works, it goes something like this:

  1. Author submits paper to journal
  2. Journal finds editor for paper.
  3. Editor decides whether paper is important enough to review, finds reviewers for paper.
  4. Reviewers review paper, send reviews to editor.
  5. Editor collates reviews, anonymizes them, and sends them to authors along with a decision (reject, major revisions, minor revisions, accept).
  6. Go to #1 (if it's a reject), or go to #4 (if revisions required), or go to...
  7. Publish!
  8. Garner academic fame, reap grant money, write another paper, go to #1.

This is how much of scientific communication works, and peer-reviewed pubs are how reputation is garnered and credit is assigned.

There are many concerns with this approach in the modern day world, with all the possibilities enabled by the Intertubes - papers are often not available prior to publication, which delays scientific progress; journals are often unnecessarily expensive, with high profit margins; papers have to find big effects to be considered significant enough to even review; reviewers are often inexpert, usually uncompensated, rarely timely, and sometimes quite obnoxious; the reviews are rarely published, which means the reviewer perspective is rarely heard; reviews are not re-used, so when a paper transitions from journal to journal, the reviews must be redone; editors make incomprehensible decisions; the process is slow; the process is opaque; there's little accountability; and probably many other things that I am missing.

A precis might be that, so far, publishing and peer review have really failed to take deep advantage of the opportunities provided by quick, easy, instantaneous worldwide communication.

In recognition of this idea that there may be better ways to communicate science, I've indulged in lots of experiments in alt-publishing since becoming an Assistant Professor - I blog and tweet a lot, I sign my reviews, I (used to) post many of my reviews on my blog, I've been trying out F1000Research's open review system, most of our papers are written openly & blogged & preprinted, I've indulged in authorship experiments, and we've tried pre- and post-publication peer review in journal clubs and on my blog.

As a result of all of these experiments, I've learned a lot about how the system works and reacts to attempts to change. It's mostly been positive (and I certainly have no complaints about my career trajectory). But the main conclusion I've reached is a tentative and disappointing one - the whole system is complicated, and is deeply rooted in the surprisingly conservative culture of science. And, because of this, I don't think there's a simple way to cut the Gordian Knot and move quickly to a new publication system.

In recent weeks, Michael Eisen has been very vocal about changing to a system of post-publication peer review (PPPR). (Also see this more detailed blog post.) This morning, I objected to some snark directed at Michael and other PPPR advocates, and this led to a decent amount of back and forth on Twitter (see this tweet and descendants). But, of course, Twitter isn't great for conveying complicated positions, so I thought I'd take to the blogs.

Below, I tried to distill my opinions down to three main points. Let's see how it goes.

1. Practice in this area is still pretty shallow in science.

I don't think we have enough experience to make decisions yet. We need more experiments.

There's (IMO) legitimate confusion over questions like anonymity and platform. The question of metrics (qualitative or quantitative) rears its ugly head - when is something adequately peer reviewed, and when does something "count" in whatever way it needs to count? And how are we supposed to evaluate a deeply technical paper that's outside of our field? What if it has a mixture of good and bad reviews? How do we focus our attention on papers outside of our immediate sphere? Anyone who thinks hard about these things and reaches a simple conclusion is kidding themselves.

(Of course, the fun bit is that if you think hard about these things you quickly reach the conclusion that our current practice is horrible, too. But that doesn't mean new practices are automatically better.)

2. The Internet is not a friendly place, and it takes a lot of work to create well-behaving communities.

As my corner of science has struggled to embrace "online", I've watched scientists repeat the same mistakes that many open source and social media communities made over the last two decades.

The following is a good generalization, in my experience:

Any worthwhile community these days has active (if light) moderation, a code of conduct laying out the rules, and a number of active participants who set the tone and invite new people into the community. These communities take time to grow and cannot be created by fiat.

I think any proposal for PPPR needs to explicitly address the questions of methods of moderation, selection of moderators, expected conduct, and what to do with bad actors. It also needs to address community growth and sustainability. The latter questions are where most PPPR commenters focus, but the former questions are critical as well.

3. Scholarship in this area is still pretty young.

There are a lot of knee jerk reactions (PPPR good! PPPR bad! I like blue!), but there isn't a lot of scholarship and discussion in this area, and most of it happens on blogs (which are largely invisible to 99% of scientists). There are only a few people thinking hard about the whole picture; I really like some of Cameron Neylon's posts, for example, but the fact that his work stands out to me so much is (to me) a bad sign. (I think that's a compliment to you and a diss on the field as a whole, Cameron ;).

Worse, I've been entirely spoiled by reading Gabriella Coleman's book on Anonymous, "Hacker, Hoaxer, Whistleblower, Spy". This kind of deep, immersive research and reporting on how technological communities operate is hard to find, and most of what I have found is ignorant of the new uses and opportunities of technology. I've found nothing equivalent on science. (Pointers VERY welcome!)

Concluding thoughts

At the end of the day, it's not clear to me that we will ever have an answer to Goodhart's Law -- "when a measure becomes a target, it ceases to be a good measure." There are tremendous pressures (livelihoods, reputation, funding) that will be placed on any system of peer review and publishing. I worry that any system we come up with will be even more easily perverted than the current system has been, and science (and scientists, and scientific progress) will suffer as a result.

Me? I'm going to continue experimenting, and talking with people, and seeing if I can identify and promulgate good practice from the bottom up. 'cause that's how I roll.

--titus

p.s. The reason I'm not posting reviews on my blog anymore has to do with time and energy - I've been overwhelmed for the last year or two. I think I need a better workflow for posting them that takes less of my focus.

by C. Titus Brown at February 14, 2016 11:00 PM