## May 29, 2016

### Matthieu Brucher

#### On modeling posts

I’m currently considering whether I should do more posts on preamp modeling or just keep implementing filters/plugins. Of course, it’s not one or the other; there are different options in this poll:

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.

So the idea is to ask my readers what they actually want. I can explain how the new triodes filters are implemented, how they behave, but I can also add new filters in Audio Toolkit (based on different preamp and amp stages, dedicated to guitars, bass, other instruments), try to optimize them, and finally I can include them in new plugins that could be used by users. Or I can do something completely different.

So if you have any ideas, feel free to say so!

## May 27, 2016

### Continuum Analytics news

#### Taking the Wheel: How Open Source is Driving Data Science

Posted Friday, May 27, 2016

The world is a big, exciting place—and thanks to cutting-edge technology, we now have amazing ways to explore its many facets. Today, self-driving cars, bullet trains and even private rocket ships allow humans to travel anywhere faster, more safely and more efficiently than ever before.

But technology's impact on our exploratory abilities isn't just limited to transportation: it's also revolutionizing how we navigate the Data Science landscape. More companies are moving toward Open Data Science and the open source technology that underlies it. As a result, we now have an amazing new fleet of vehicles for our data-related excursions.

We're no longer constrained to the single railroad track or state highway of a proprietary analytics product. We can use hundreds of freely available open source libraries for any need: web scraping, ingesting and cleaning data, visualization, predictive analytics, report generation, online integration and more. With these tools, any corner of the Data Science map—astrophysics, financial services, public policy, you name it—can be reached nimbly and efficiently.

But even in this climate of innovation, nobody can afford to completely abandon previous solutions, and traditional approaches remain viable. Fortunately, graceful interoperability is one of the hallmarks of Open Data Science. In appropriate scenarios, it accommodates the blending of legacy code or proprietary products with open source solutions. After all, sometimes taking the train is necessary and even preferable.

Regardless of which technology teams use, the open nature of Open Data Science allows you to travel across the data terrain in a way that is transparent and accessible for all participants.

Data Science in Overdrive

Let's take a look at six specific ways Open Data Science is propelling analytics for small and large teams.

1. Community. Open Data Science prioritizes inclusivity; community involvement is a big reason that open source software has boomed in recent years. Communities can test out new software faster and more thoroughly than any one vendor, accelerating innovation and remediation of any bugs.

Today, the open source software repository, GitHub, is home to more than 5 million open source projects and thousands of distinct communities. One such community is conda-forge, a community of developers that build infrastructure and packages for the conda package manager, a general, cross-platform and cross-language package manager with a large and growing number of data science packages available. Considering that Python is the most popular language in computer science classrooms at U.S. universities, open source communities will only continue to grow.

2. Innovation. The Open Data Science movement recognizes that no one software vendor has all the answers. Instead, it embraces the large—and growing—community of bright minds that are constantly working to build new solutions to age-old challenges.

Because of its adherence to free or low-cost technologies, non-restrictive licensing and shareable code, Open Data Science offers developers unparalleled flexibility to experiment and create innovative software.

One example of the innovation that is possible with Open Data Science is taxcalc, an Open Source Policy Modeling Center project publicly available via TaxBrain. Using open source software, the project brought developers from around the globe together to create a new kind of tax policy analysis. This software has the computational power to process the equivalent of more than 120 million tax returns, yet is easy to use and accessible to private citizens, policy professionals and journalists alike.

3. Inclusiveness. The Open Data Science movement unites dozens of different technologies and languages under a single umbrella. Data science is a team sport and the Open Data Science movement recognizes that complex projects require a multitude of tools and approaches.

This is why Open Data Science brings together leading open source data science tools under a single roof. It welcomes languages ranging from Python and R to FORTRAN and it provides a common base for data scientists, business analysts and domain experts like economists or biologists.

What's more, it can integrate legacy code or enterprise projects with newly developed code, allowing teams to take the most expedient path to solve their challenges. For example, with the conda package management system, developers can create conda packages from legacy code, allowing integration into a custom analytics platform with newer open source code. In fact, libraries like SciPy already leverage highly optimized legacy FORTRAN code.

4. Visualizations. Visualization has come a long way in the last decade, but many visualization technologies have been focused on reporting and static dashboards. Open Data Science, however, has unveiled intelligent web apps that offer rich, browser-based interactive visualizations, such as those produced with Bokeh. Visualizations empower data scientists and business executives to explore their data, revealing subtle nuances and hidden patterns.

One visualization solution, Anaconda's Datashader library, is a Big Data visualizer that plays to the strengths of the human visual system. The Datashader library—alongside the Bokeh visualization library—offers a clever solution to the problem of plotting an enormous number of points in a relatively limited number of pixels.
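The core idea of fitting many points into few pixels can be illustrated with a small sketch: instead of drawing each point, count how many points fall into each pixel and render the counts. The code below is only an illustration of that aggregation step in plain Python; the function name `rasterize_counts` and its parameters are hypothetical and are not Datashader's API.

```python
import random


def rasterize_counts(points, x_range, y_range, width, height):
    """Bin (x, y) points into a height x width grid of per-pixel counts."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip points outside the viewport
        # Map each coordinate to a pixel index; clamp the upper edge into the last pixel.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


# 100,000 random points reduced to an 8 x 4 grid of counts.
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]
grid = rasterize_counts(points, (-3, 3), (-3, 3), width=8, height=4)
total_in_view = sum(sum(row) for row in grid)
```

However many points come in, the output has a fixed size set by the pixel grid, so the cost of displaying the result no longer depends on the size of the data.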

Another choice for data scientists is the D3 JavaScript library, which exploded the number of visual tools for data. With wrappers for Python and other languages, D3 has prompted a real renaissance in data visualization.

5. Deep Learning. One of the hottest branches of data science is deep learning, a sub-segment of machine learning based on algorithms that work to model data abstractions using a multitude of processing layers. Open source technology, such as that embraced by Open Data Science, is critical to its expansion and improvement.

Some of the new entrants to the field, all of which are now open source, are Google's TensorFlow project, the Caffe deep learning framework, Microsoft's Computational Network Toolkit (CNTK), Amazon's Deep Scalable Sparse Tensor Network Engine (DSSTNE), Facebook's Torch framework and Nervana's Neon. These products enter a field with many participants, such as Theano, whose Lasagne extension allows easy construction of deep learning models, and Berkeley's Caffe, which is an open deep learning framework.

These are only some of the most interesting frameworks. There are many others, which is a testament to the lively and burgeoning Open Data Science community and its commitment to innovation and idea sharing allowing for even more future innovation.

6. Interoperability. Traditional, proprietary data science tools typically integrate well only with their own suite. They’re either closed to outside tools or provide inferior, slow methods of integration. Open Data Science, by contrast, rejects these restrictions, instead allowing diverse tools to cooperate and interact in ever more closely connected ways.

For example, Anaconda includes open source distributions of the Python and R languages, which interoperate very well together, enabling data scientists to use the technologies that make sense for them. A business analyst might start with Excel, then work with predictive models in R and later fire up Tableau for data visualizations. Interoperable tools speed analysis, eliminate the need for switching between multiple toolsets and improve collaboration.

It's clear that open source tools will lead the charge towards innovation in Data Science and many of the top technology companies are moving in this direction. IBM, Microsoft, Google, Facebook, Amazon and others are all joining the Open Data Science revolution, making their technology available with APIs and open source code. This benefits technology companies and individual developers, as it empowers a motivated user base to improve code, create new software and use existing technologies in new contexts.

That's the power of open source software and inclusive Open Data Science platforms like Anaconda. Thankfully, today's user-friendly languages—like Python—make joining this new future easier than ever.

If you're considering open source for your next data project, now’s the time to grab the wheel. Join the Open Data Science movement and shift your analyses into overdrive.

## May 26, 2016

### NeuralEnsemble

#### Updated Docker images for biological neuronal network simulations with Python

The NeuralEnsemble Docker images for biological neuronal network simulations with Python have been updated to contain NEST 2.10, NEURON 7.4, Brian 2.0rc1 and PyNN 0.8.1.

In addition, the default images (which are based on NeuroDebian Jessie) now use Python 3.4. Images with Python 2.7 and Brian 1.4 are also available (using the "py2" tag). There is also an image with older versions (NEST 2.2 and PyNN 0.7.5).

The images are intended as a quick way to get simulation projects up-and-running on Linux, OS X and Windows. They can be used for teaching or as the basis for reproducible research projects that can easily be shared with others.

The images are available on Docker Hub.

To quickly get started, once you have Docker installed, run

    docker pull neuralensemble/simulation
    docker run -i -t neuralensemble/simulation /bin/bash

For Python 2.7:

    docker pull neuralensemble/simulation:py2

For older versions:

    docker pull neuralensemble/pynn07

For ssh/X11 support, use the "simulationx" image instead of "simulation". Full instructions are available here.

If anyone would like to help out, or suggest other tools that should be installed, please contact me, or open a ticket on GitHub.

#### PyNN 0.8.1 released

Having forgotten to blog about the release of PyNN 0.8.0, here is an announcement of PyNN 0.8.1!

For all the API changes between PyNN 0.7 and 0.8 see the release notes for 0.8.0. The main change with PyNN 0.8.1 is support for NEST 2.10.

PyNN 0.8.1 can be installed with pip from PyPI.

### What is PyNN?

PyNN (pronounced 'pine') is a simulator-independent language for building neuronal network models.

In other words, you can write the code for a model once, using the PyNN API and the Python programming language, and then run it without modification on any simulator that PyNN supports (currently NEURON, NEST and Brian as well as the SpiNNaker and BrainScaleS neuromorphic hardware systems).

Even if you don't wish to run simulations on multiple simulators, you may benefit from writing your simulation code using PyNN's powerful, high-level interface. In this case, you can use any neuron or synapse model supported by your simulator, and are not restricted to the standard models.

The code is released under the CeCILL licence (GPL-compatible).

### Paul Ivanov

#### in transit

Standing impatient, platform teeming, almost noon
Robo voices read off final destinations
But one commuter's already at his
He reached for life's third rail

There is no why in the abyss
There's only closing credit hiss
The soundtrack's gone, he didn't miss
Reaching for life's third rail

We ride on, now, relieved and moving forward
Each our own lives roll forth, for now
But now is gone, for one among us
Who reached for life's third rail

We rock, to-fro, and reach each station
Weight shifting onto forward foot
Flesh, bone ground up in violent elation
And bloody rags, hours ago a well worn suit

I ride the escalator up and pensive
About what did and not occur today
Commuter glut, flow restricted
A crooked kink in public transport hose resolved.


## Early Experience with Clusters

My first real experience with cluster computing came in 1999 during my graduate school days at the Mayo Clinic.  These were wonderful times.   My advisor was Dr. James Greenleaf.   He was very patient with allowing me to pester a bunch of IT professionals throughout the hospital to collect their aging Mac Performa machines and build my own home-grown cluster.   He also let me use a bunch of space in his ultrasound lab to host the cluster for about 6 months.

#### Building my own cluster

The form-factor for those Mac machines really made it easy to stack them. I ended up with 28 machines in two stacks with 14 machines in each stack (all plugged into a few power strips and a standard lab-quality outlet). With the recent release of Yellow Dog Linux, I wiped the standard OS from all the machines and installed Linux on all those Macs to create a beautiful cluster of UNIX goodness I could really get excited about. I called my system "The Orchard" and thought it would be difficult to come up with 28 different apple varieties to name each machine after. It wasn't difficult. It turns out there are over 7,500 varieties of apples grown throughout the world.

Me smiling alongside my smoothly humming "Orchard" of interconnected Macs

The reason I put this cluster together was to simulate Magnetic Resonance Elastography (MRE) which is a technique to visualize motion using Magnetic Resonance Imaging (MRI).  I wanted to simulate the Bloch equations with a classical model for how MRI images are produced.  The goal was to create a simulation model for the MRE experiment that I could then use to both understand the data and perhaps eventually use this model to determine material properties directly from the measurements using Bayesian inversion (ambitiously bypassing the standard sequential steps of inverse FFT and local-frequency estimation).

Now I just had to get all these machines to talk to each other, and then I would be poised to do anything.  I read up a bit on MPI, PVM, and anything else I could find about getting computers to talk to each other.  My unfamiliarity with the field left me puzzled as I tried to learn these frameworks in addition to figuring out how to solve my immediate problem.  Eventually, I just settled down with a trusted UNIX book by the late W. Richard Stevens.    This book explained how the internet works.   I learned enough about TCP/IP and sockets so that I could write my own C++ classes representing the model.  These classes communicated directly with each other over raw sockets.   While using sockets directly was perhaps not the best approach, it did work and helped me understand the internet so much better.  It also makes me appreciate projects like tornado and zmq that much more.

#### Lessons Learned

I ended up with a system that worked reasonably well, and I could simulate MRE to some manner of fidelity with about 2-6 hours of computation. This little project didn't end up being critical to my graduation path and so it was abandoned after about 6 months.  I still value what I learned about C++, how abstractions can ruin performance, how to guard against that, and how to get machines to communicate with each other.

Using Numeric, Python, and my recently-linked ODE library (early SciPy), I built a simpler version of the simulator that was actually faster on one machine than my cluster-version was in C++ on 20+ machines. I certainly could have optimized the C++ code, but I could have also optimized the Python code. The Python code took me about 4 days to write, the C++ code took me about 4 weeks. This experience has markedly influenced my thinking for many years about both premature parallelization and premature use of C++ and other compiled languages.

Fast forward over a decade. My computer efforts until 2012 were spent on sequential array-oriented programming, creating SciPy, writing NumPy, solving inverse problems, and watching a few parallel computing paradigms emerge while I worked on projects to provide for my family. I didn't personally get to work on parallel computing problems during that time, though I always dreamed of going back and implementing this MRE simulator using a parallel construct with NumPy and SciPy directly. When I needed to do the occasional parallel computing example during this intermediate period, I would either use IPython parallel or multiprocessing.

## Parallel Plans at Continuum

In 2012, Peter Wang and I started Continuum, created PyData, and released Anaconda.   We also worked closely with members of the community to establish NumFOCUS as an independent organization.  In order to give NumFOCUS the attention it deserved, we hired the indefatigable Leah Silen and donated her time entirely to the non-profit so she could work with the community to grow PyData and the Open Data Science community and ecosystem.  It has been amazing to watch the community-based, organic, and independent growth of NumFOCUS.    It took effort and resources to jump-start,  but now it is moving along with a diverse community driving it.   It is a great organization to join and contribute effort to.

A huge reason we started Continuum was to bring the NumPy stack to parallel computing --- for both scale-up (many cores) and scale-out (many nodes).   We knew that we could not do this alone and it would require creating a company and rallying a community to pull it off.   We worked hard to establish PyData as a conference and concept and then transitioned the effort to the community through NumFOCUS to rally the community behind the long-term mission of enabling data-, quantitative-, and computational-scientists with open-source software.  To ensure everyone in the community could get the software they needed to do data science with Python quickly and painlessly, we also created Anaconda and made it freely available.

In addition to important community work, we knew that we would need to work alone on specific, hard problems to also move things forward.   As part of our goals in starting Continuum we wanted to significantly improve the status of Python in the JVM-centric Hadoop world.   Conda, Bokeh, Numba, and Blaze were the four technologies we started specifically related to our goals as a company beginning in 2012.   Each had a relationship to parallel computing including Hadoop.

Conda enables easy creation and replication of environments built around deep and complex software dependencies that often exist in the data-scientist workflow.   This is a problem on a single node --- it's an even bigger problem when you want that environment easily updated and replicated across a cluster.

Bokeh  allows visualization-centric applications backed by quantitative-science to be built easily in the browser --- by non web-developers.   With the release of Bokeh 0.11 it is extremely simple to create visualization-centric-web-applications and dashboards with simple Python scripts (or also R-scripts thanks to rBokeh).

With Bokeh, Python data scientists now have the power of both d3 and Shiny, all in one package. One of the driving use-cases of Bokeh was also easy visualization of large data. Connecting the visualization pipeline with large-scale cluster processing was always a goal of the project. With datashader, this goal is now being realized: billions of points can be visualized in seconds and displayed in the browser.

Our scale-up computing efforts centered on the open-source Numba project as well as our Accelerate product. Numba has made tremendous progress in the past couple of years, and is in production use in multiple places. Many are taking advantage of numba.vectorize to create array-oriented solutions and program the GPU with ease. The CUDA Python support in Numba makes it the easiest way to program the GPU that I'm aware of. The CUDA simulator provided in Numba makes it much simpler to debug in Python the logic of CUDA-based GPU programming. The addition of parallel-contexts to numba.vectorize means that any many-core architecture can now be exploited in Python easily. Early HSA support is also in Numba now, meaning that Numba can be used to program novel hardware devices from many vendors.

### Summarizing Blaze

The ambitious Blaze project will require another blog-post to explain its history and progress well. I will only try to summarize the project and where it's heading. Blaze came out of a combination of deep experience with industry problems in finance, oil&gas, and other quantitative domains that would benefit from a large-scale logical array solution that was easy to use and connected with the Python ecosystem. We observed that the MapReduce engine of Hadoop was definitely not what was needed. We were also aware of Spark and RDDs but felt that they, too, were not general enough (nor flexible enough) for the demands of distributed array computing we encountered in those fields.

#### DyND, Datashape, and a vision for the future of Array-computing

After early work trying to extend the NumPy code itself led to struggles because of both the organic complexity of the code base and the stability needs of a mature project, the Blaze effort began as an attempt to re-build the core functionality of NumPy and Pandas to fix some major warts of NumPy that had been on my mind for some time. With Continuum support, Mark Wiebe decided to continue to develop a C++ library that could then be used by Python and any other data-science language (DyND). This necessitated defining a new data-description language (datashape) that generalizes NumPy's dtype to structures of arrays (column-oriented layout) as well as variable-length strings and categorical types. This work continues today and is making rapid progress, which I will leave to others to describe in more detail. I do want to say, however, that DyND is implementing my "Pluribus" vision for the future of array-oriented computing in Python. We are factoring the core capability into 3 distinct parts: the type-system (or data-declaration system), a generalized function mechanism that can interact with any "typed" memory-view or "typed" buffer, and finally the container itself. We are nearing release of a separated type-library and are working on a separate C-API to the generalized function mechanism. This is where we are heading and it will allow maximum flexibility and re-use in the dynamic and growing world of Python and data-analysis. The DyND project is worth checking out right now (if you have desires to contribute) as it has made rapid progress in the past 6 months.

Our work on the distributed aspects of Blaze centered on the realization that to scale array computing to many machines you fundamentally have to move code and not data. To do this well means that how the computer actually sees and makes decisions about the data must be exposed. This information is usually part of the type system that is hidden either inside the compiler, in the specifics of the database schema, or implied as part of the runtime. To fundamentally solve the problem of moving code to data in a general way, a first-class and widespread data-description language must be created and made available. Python users will recognize that a subset of this kind of information is contained in the struct module (the struct "format" strings), in the Python 3 extended buffer protocol definition (PEP 3118), and in NumPy's dtype system. Extending these concepts to any language is the purpose of datashape.
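As a small illustration of the kind of information meant here, the standard library already lets a short string describe how to interpret raw bytes. The sketch below uses only the `struct` and `array` modules; the record layout and variable names are made up for the example, and it says nothing about datashape's own syntax.

```python
import array
import struct

# A format string describes the byte layout of a record:
# little-endian, 32-bit int, 64-bit float, 4-byte string.
record_format = "<id4s"
payload = struct.pack(record_format, 7, 2.5, b"abcd")

# Any consumer that knows the format string can interpret the raw bytes.
record_id, value, tag = struct.unpack(record_format, payload)
record_size = struct.calcsize(record_format)

# The buffer protocol (PEP 3118) carries a similar description alongside the memory.
samples = array.array("d", [1.0, 2.0, 3.0])
view = memoryview(samples)
view_description = (view.format, view.itemsize, view.shape)
```

The bytes in `payload` mean nothing on their own; the format string is what lets code arriving from elsewhere operate on them without copying or converting the data first.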

In addition, run-times that understand this information and can execute instructions on variables that expose this information must be adapted or created for every system.  This is part of the motivation for DyND and why very soon the datashape system and its C++ type library will be released independently from the rest of DyND and Blaze.   This is fundamentally why DyND and datashape are such important projects to me.  I see in them the long-term path to massive code-reuse, the breaking down of data-silos that currently cause so much analytics algorithm duplication and lack of cooperation.

Simple algorithms from data-munging scripts to complex machine-learning solutions must currently be re-built for every kind of data-silo unless there is a common way to actually functionally bring code to data. Datashape and the type-library runtime from DyND (ndt) will allow this future to exist. I am eager to see the Apache Arrow project succeed as well because it has related goals (though more narrowly defined).

The next step in this direction is an on-disk and in-memory data-fabric that allows data to exist in a distributed file-system or a shared-memory across a cluster with a pointer to the head of that data along with a data-shape description of how to interpret that pointer so that any language that can understand the bytes in that layout can be used to execute analytics on those bytes.  The C++ type run-time stands ready to support any language that wants to parse and understand data-shape-described pointers in this future data-fabric.

From one point of view, this DyND and data-fabric effort is a natural evolution of the efforts I started in 1998 that led to the creation of SciPy and NumPy. We built a system that allows existing algorithms in C/C++ and Fortran to be applied to any data in Python. The evolution of that effort will allow algorithms from many other languages to be applied to any data in memory across a cluster.

#### Blaze Expressions and Server

The key part of Blaze that is also important to mention is the notion of the Blaze server and user-facing Blaze expressions and functions.   This is now what Blaze the project actually entails --- while other aspects of Blaze have been pushed into their respective projects.  Functionally, the Blaze server allows the data-fabric concept on a machine or a cluster of machines to be exposed to the rest of the internet as a data-url (e.g. http://mydomain.com/catalog/datasource/slice).   This data-url can then be consumed as a variable in a Blaze expression --- first across entire organizations and then across the world.

This is the truly exciting part of Blaze that would enable all the data in the world to be as accessible as an already-loaded data-frame or array. The logical expressions and transformations you can then write on those data to be your "logical computer" will then be translated at compute time to the actual run-time instructions as determined by the Blaze server, which is mediating communication with various backends depending on where the data is actually located. We are realizing this vision on many data-sets and a certain set of expressions already with a growing collection of backends. It is allowing true "write-once and run anywhere" to be applied to data-transformations and queries and eventually data-analytics. Currently, the data scientist finds herself in a situation similar to the assembly programmer in the 1960s who had to know what machine the code would run on before writing the code. Before beginning a data analytics task, you have to determine which data-silo the data is located in before tackling the task. SQL has provided a database-agnostic layer for years, but it is too limiting for advanced analytics, and user-defined functions are still database specific.
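The notion of an expression that is written once and translated at compute time for whichever backend holds the data can be sketched in a few lines. This is a toy under stated assumptions, not Blaze's API: the `Field` and `GreaterThan` classes and the two `compute_*` functions are hypothetical names, and the only backends are an in-memory list and an in-memory SQLite table.

```python
import sqlite3


class Field:
    """A symbolic column reference; comparing it builds an expression, not a result."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, threshold):
        return GreaterThan(self, threshold)


class GreaterThan:
    """Deferred 'field > threshold' predicate."""

    def __init__(self, field, threshold):
        self.field = field
        self.threshold = threshold


def compute_in_memory(expr, rows):
    """Backend 1: evaluate the expression against a list of dicts."""
    return [row for row in rows if row[expr.field.name] > expr.threshold]


def compute_in_sqlite(expr, connection, table):
    """Backend 2: translate the same expression to SQL and let the database run it."""
    # Identifiers are interpolated only because this is a toy; the value is a bound parameter.
    query = (
        f'SELECT name, amount FROM "{table}" '
        f'WHERE "{expr.field.name}" > ? ORDER BY name'
    )
    cursor = connection.execute(query, (expr.threshold,))
    return [{"name": name, "amount": amount} for name, amount in cursor]


rows = [
    {"name": "a", "amount": 50},
    {"name": "b", "amount": 150},
    {"name": "c", "amount": 300},
]
expr = Field("amount") > 100  # written once, with no backend in mind

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE payments (name TEXT, amount INTEGER)")
connection.executemany(
    "INSERT INTO payments VALUES (?, ?)", [(r["name"], r["amount"]) for r in rows]
)

from_memory = compute_in_memory(expr, rows)
from_sqlite = compute_in_sqlite(expr, connection, "payments")
```

The same `expr` object yields the same rows from both backends; only the translation step knows where the data lives.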

Continuum's support of blaze development is currently taking place as defined by our consulting customers as well as by the demands of our Anaconda platform and the feature-set of an exciting new product for the Anaconda Platform that will be discussed in the coming weeks and months. This new product will provide a simplified graphical user-experience on top of Blaze expressions, and Bokeh visualizations for rapidly connecting quantitative analysts to their data and allowing explorations that retain provenance and governance.  General availability is currently planned for August.

Blaze also spawned additional efforts around fast compressed storage of data (blz which formed the inspiration and initial basis for bcolz) and experiments with castra as well as a popular and straight-forward tool for quickly copying data from one data-silo kind to another (odo).

The most important development to come out of Blaze, however, will have tremendous impact in the short term well before the full Blaze vision is completed.  This project is Dask and I'm excited for what Dask will bring to the community in 2016.   It is helping us finally deliver on scaled-out NumPy / Pandas and making Anaconda a first-class citizen in Hadoop.

In 2014, Matthew Rocklin started working at Continuum on the Blaze team.   Matthew is the well-known author of many functional tools for Python.  He has a great blog you should read regularly.   His first contribution to Blaze was to adapt a multiple-dispatch system he had built which formed the foundation of both odo and Blaze.  He also worked with Andy Terrel and Phillip Cloud to clarify the Blaze library as a front-end to multiple backends like Spark, Impala, Mongo, and NumPy/Pandas.
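For readers unfamiliar with multiple dispatch, the idea is to choose an implementation from the runtime types of all arguments rather than only the first. The sketch below is a deliberately simplified registry with hypothetical names (`MultiDispatcher`, `append`); it picks the first registered match, whereas a real multiple-dispatch library would resolve the most specific signature, and it is not the system used in odo or Blaze.

```python
class MultiDispatcher:
    """Pick an implementation from the runtime types of all positional arguments."""

    def __init__(self, name):
        self.name = name
        self._implementations = {}

    def register(self, *types):
        def decorator(func):
            self._implementations[types] = func
            return func

        return decorator

    def __call__(self, *args):
        # First registered signature that matches wins (no specificity ordering).
        for types, func in self._implementations.items():
            if len(types) == len(args) and all(
                isinstance(arg, typ) for arg, typ in zip(args, types)
            ):
                return func(*args)
        names = [type(arg).__name__ for arg in args]
        raise TypeError(f"{self.name}: no implementation for {names}")


append = MultiDispatcher("append")


@append.register(list, list)
def _(target, source):
    return target + source


@append.register(list, dict)
def _(target, source):
    return target + sorted(source.items())


@append.register(set, list)
def _(target, source):
    return target | set(source)
```

Calling `append([1], {"a": 1})` and `append({1}, [1, 2])` reaches different implementations based on both argument types, which is the property that makes this pattern convenient for moving data between many pairs of container kinds.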

With these steps taken, it was clear that the Blaze project needed its own first-class backend as well as something that the community could rally around to ensure that Python remained a first-class participant in the scale-out conversation, especially where systems that connected with Hadoop were being promoted. Python should not ultimately be relegated to being a mere front-end system that scripts Spark or Hadoop, unable to talk directly to the underlying data. This is not how Python achieved its place as a de-facto data-science language. Python should be able to access and execute on the data directly inside Hadoop.

Getting there took time.  The first version of dask was released in early 2015 and while distributed work-flows were envisioned, the first versions were focused on out-of-core work-flows --- allowing problem-sizes that were too big to fit in memory to be explored with simple pandas-like and numpy-like APIs.
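Out-of-core processing, in its simplest form, means streaming a dataset through memory in fixed-size chunks and combining partial results. The sketch below shows that pattern with the standard library only; `chunked_mean` and the one-number-per-line file are assumptions made for the example, not part of dask.

```python
import os
import tempfile


def chunked_mean(path, chunk_lines=1000):
    """Mean of a one-number-per-line file, holding at most chunk_lines values in memory."""
    total = 0.0
    count = 0
    chunk = []
    with open(path) as handle:
        for line in handle:
            chunk.append(float(line))
            if len(chunk) == chunk_lines:
                # Reduce the full chunk to a partial result and release it.
                total += sum(chunk)
                count += len(chunk)
                chunk = []
    # Fold in the final, possibly partial chunk.
    total += sum(chunk)
    count += len(chunk)
    return total / count


# Write a small sample file, then process it 256 lines at a time.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.writelines(f"{value}\n" for value in range(10_000))
    data_path = handle.name

mean_value = chunked_mean(data_path, chunk_lines=256)
os.remove(data_path)
```

Peak memory depends on `chunk_lines`, not on the size of the file, which is what lets the same code handle inputs larger than RAM.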

When Matthew showed me his first version of dask, I was excited.  I loved three things about it:  1) It was simple and could, therefore, be used as a foundation for parallel PyData.  2) It leveraged already existing code and infrastructure in NumPy and Pandas.  3) It had very clean separation between collections like arrays and data-frames, the directed graph representation, and the schedulers that executed those graphs.   This was the missing piece we needed in the Blaze ecosystem.   I immediately directed people on the Blaze team to work with Matt Rocklin on Dask and asked Matt to work full-time on it.
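The separation between a graph and the scheduler that executes it can be sketched with a plain dictionary, in the spirit of the dictionary-based task graphs Dask uses. The `get` function below is a toy synchronous scheduler written for this illustration, not Dask's implementation; it has no caching or parallelism and assumes string keys.

```python
from operator import add, mul


def get(graph, key):
    """Synchronously evaluate `key` by recursively evaluating what it depends on."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Arguments that name other keys are computed first; anything else is a literal.
        resolved = [
            get(graph, arg) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        return func(*resolved)
    return task  # a literal value stored directly in the graph


# The graph is plain data: keys map to literals or (function, *arguments) tuples.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "scaled": (mul, "sum", 10),
}
result = get(graph, "scaled")
```

Because the graph is just data, a different scheduler (threaded, multi-process, or distributed) could execute the same dictionary without the code that built it changing.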

He and the team made great progress and by summer of 2015 had a very nice out-of-core system working with two functioning parallel-schedulers (multi-processing and multi-threaded).  There was also a "synchronous" scheduler that could be used for debugging the graph and the system showed well enough throughout 2015 to start to be adopted by other projects (scikit-image and xarray).

In the summer of 2015, Matt began working on the distributed scheduler. By fall of 2015, he had a very nice core system leveraging the hard work of the Python community. He built the API around the concepts of asynchronous computing already being promoted in Python 3 (futures) and built dask.distributed on top of tornado. The next several months were spent improving the scheduler by exposing it to as many work-flows as possible from computational science and quantitative science. By February of 2016, the system was ready to be used by a variety of people interested in distributed computing with Python. This process continues today.
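The futures concept mentioned here is available in the Python 3 standard library as `concurrent.futures`. The sketch below uses a local thread pool rather than a cluster, and `simulate` is a made-up stand-in for real work; it shows only the submit-then-collect style of the interface.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulate(parameter):
    """Stand-in for an expensive task."""
    return parameter * parameter


with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns immediately with a Future; the work runs in the background.
    futures = {executor.submit(simulate, p): p for p in range(5)}
    # Collect each result as soon as its task finishes, in whatever order that is.
    results = {futures[future]: future.result() for future in as_completed(futures)}
```

The calling code never says where or when each task runs, which is why the same style carries over from a thread pool to a scheduler spread across many machines.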

Using dask.dataframe and dask.array you can quickly build table- and array-based work-flows with a Pandas-like and NumPy-like syntax respectively that work on data sitting across a cluster.

Anaconda and the PyData ecosystem now had another solution for the scale-out problem --- one whose design and implementation was something I felt could be a default run-time backend for Blaze.  As a result, I could get motivated to support, market, and seek additional funding for this effort.  Continuum has received some DARPA funding under the XDATA program.  However, this money was now spread pretty thin among Bokeh, Numba, Blaze, and now Dask.

With the distributed scheduler basically working and beginning to improve, two problems remained with respect to Hadoop interoperability: 1) direct access to the data sitting in HDFS and 2) interaction with the resource schedulers running most Hadoop clusters (YARN or mesos).

To see how important the next developments are, it is useful to describe an anecdote from early on in our XDATA experience.  In the summer of 2013, when the DARPA XDATA program first kicked-off, the program organizers had reserved a large Hadoop cluster (which even had GPUs on some of the nodes).  They loaded many data sets onto the cluster and communicated about its existence to all of the teams who had gathered to collaborate on getting insights out of "Big Data."    However, a large number of the people collaborating were using Python, R, or C++.  To them the Hadoop cluster was inaccessible as there was very little they could use to interact with the data stored in HDFS (beyond some high-latency and low-bandwidth streaming approaches) and nothing they could do to interact with the scheduler directly (without writing Scala or Java code). The Hadoop cluster sat idle for most of the summer while teams scrambled to get their own hardware to run their code on and deliver their results.

The same situation we encountered in 2013 exists in many organizations today. People have large Hadoop infrastructures, but are not connecting that infrastructure effectively to their data-scientists, who are more comfortable in Python, R, or some other high-level (non-JVM) language.

With dask working reasonably well, tackling this data-connection problem head on became an important part of our Anaconda for Hadoop story, and so in December of 2015 we began two initiatives to connect Anaconda directly to Hadoop. Getting data from HDFS turned out to be much easier than we had initially expected because of the hard work of many others. There had been quite a bit of work building a C++ interface to Hadoop at Pivotal that had culminated in a library called libhdfs3. Continuum wrote a Python interface to that library quickly, and it now exists as the hdfs3 library under the Dask organization on GitHub.

The second project was a little more involved as we needed to integrate with YARN directly.   Continuum developers worked on this and produced a Python library that communicates directly to the YARN classes (using Scala) in order to allow the Python developer to control computing resources as well as spread files to the Hadoop cluster.   This project is called knit, and we expect to connect it to mesos and other cluster resource managers in the near future (if you would like to sponsor this effort, please get in touch with me).

Early releases of hdfs3 and knit were available by the end of February 2016. At that time, these projects were joined with dask.distributed and the dask code-base into a new GitHub organization called Dask. The graduation of Dask into its own organization signified an important milestone: dask was now ready for rapid improvement and growth alongside Spark as a first-class execution engine in the Hadoop ecosystem.

Our initial goals for Dask are to build enough examples, capability, and awareness so that every PySpark user tries Dask to see if it helps them.    We also want Dask to be a compatible and respected member of the growing Hadoop execution-framework community.   We are also seeking to enable Dask to be used by scientists of all kinds who have both array and table data stored on central file-systems and distributed file-systems outside of the Hadoop ecosystem.

### Anaconda as a first-class execution ecosystem for Hadoop

With Dask (including hdfs3 and knit), Anaconda is now able to participate on an equal footing with every other execution framework for Hadoop. Because of the vast reach of the Anaconda Python and Anaconda R communities, this means that a lot of native code can now be integrated with Hadoop much more easily, and any company that has stored their data in HDFS or another distributed file system (like s3fs or gpfs) can now connect that data easily to the entire Python and/or R computing stack.

This is exciting news! While we are cautious because these integrative technologies are still young, they are connected to and leveraging the very mature PyData ecosystem. While benchmarks can be misleading, we have a few benchmarks that I believe accurately reflect the reality of what parallel and distributed Anaconda can do and how it relates to other Hadoop systems. For array-based and table-based computing workflows, Dask will be 10x to 100x faster than an equivalent PySpark solution. For applications where you are not using arrays or tables (e.g. word-count using a dask.bag), Dask is a little bit slower than a similar PySpark solution. However, I would argue that Dask is much more Pythonic and easier to understand for someone who has learned Python.

It will be very interesting to see what the next year brings as more and more people realize what is now available to them in Anaconda.  The PyData crowd will now have instant access to cluster computing at a scale that has previously been accessible only by learning complicated new systems based on the JVM or paying an unfortunate performance penalty.   The Hadoop crowd will now have direct and optimized access to entire classes of algorithms from Python (and R) that they have not previously been used to.

It will take time for this news and these new capabilities to percolate, be tested, and find use-cases that resonate with the particular problems people actually encounter in practice.  I look forward to helping many of you take the leap into using Anaconda at scale in 2016.

We will be showing off aspects of the new technology at Strata in San Jose in the Continuum booth #1336 (look for the Anaconda logo and mark). We have already announced some of the capabilities at a high level. Peter and I will both be at Strata along with several of the talented people at Continuum. If you are attending, drop by and say hello.

We first came to Strata on behalf of Continuum in 2012 in Santa Clara. We announced that we were going to bring you scaled-out NumPy. We are now beginning to deliver on this promise with Dask. We brought you scaled-up NumPy with Numba. Blaze and Bokeh will continue to bring them together along with the rest of the larger data community to provide real insight on data --- wherever it is stored. Try out Dask and join the new scaled-out PyData story which is richer than ever before, has a larger community than ever before, and has a brighter future than ever before.

## May 24, 2016

### Filipe Saraiva

#### if (LaKademy 2016) goto Rio de Janeiro

Rio de Janeiro, the “Cidade Maravilhosa”, land of the eternal Summer. The sunlight here is always clear and hot, the sea is refreshing, the sand is comfortable. The people are happy; Rio de Janeiro has good music, food, the craziest parties in the world, and beautiful bodies having fun with beach games (do you know futevolei?).

But while Rio de Janeiro is boiling, some Gearheads based in Latin America will be working together in a cold and dark room in the city, attending our “multi-area” sprint named Latin America Akademy – LaKademy 2016.

I plan to do a lot of work on Cantor, including a thorough bug triage and several tests with some IPC technologies. I would like to choose one to be the “official” technology for implementing Cantor backends. Cantor needs an IPC technology with good multiplatform support for the main desktop operating systems. I am thinking about DBus… do you have other suggestions or tips?

Other contributors also want to work in Cantor. Wagner wants to build and test the application in Windows and begin an implementation of a backend for a new programming language. Fernando, my SoK 2015 student, wants to fix the R backend. I will be very happy seeing these developers dirtying their hands in Cantor source code, so I will help them in those tasks.

During LaKademy I intend to present to the attendees some ideas and prototypes of two new pieces of software I am working on. I expect to get some feedback, and then I will think about the next steps for them. Maybe I can submit them as new KDE projects… Well, let’s see.

Wait for more news from the cold and dark room of our LaKademy event in Rio de Janeiro.

### Titus Brown

#### Increasing postdoc pay

I just gave all of my postdocs a $10,000-a-year raise. Both of my current postdocs got a $10k raise over their current salary, and the four postdocs coming on board over the next 6 months will start at $10k over the NIH base salary we pay them already. (This means that my starting postdocs will get something like $52k/year, plus benefits.)

I already pay most of my grad students more than I'm required to by UC Davis regulations. While I'm a pretty strong believer that graduate school is school, and that it's pretty good training (see Claus Wilke's extended discussion here), there's something to be said for enabling them to think more about their work and less about whether or not they can afford a slightly more comfortable life. (I pay all my grad students the same salary, independent of graduate program; see below.)

Why did I increase the postdoc salaries now? I've been thinking about it for a while, but the main reason that motivated me to do the paperwork was the change in US labor regulations. There's the cold-blooded calculation that, hey, I don't want to pay overtime; but if it were just that, I could have given smaller raises to my existing postdocs. A bigger factor is that I really don't want the postdocs to have to think about tracking their time. I also hope it will decrease postdoc turnover, which can be a real problem: it takes a lot of time to recruit a new person to the lab, and if it takes a year to recruit a postdoc and they leave sooner because the salary sucks, well, that's really a net loss to me.

More broadly, I view my job as flying cover for my employees. If they worry a little bit less because of a (let's face it) measly $10k, well, so much the better.

A while ago I decided to pay all my postdocs on the same scale; there are some people who are good at negotiating and asking, and others who aren't, and it's baldly unfair to listen more to the former. (I've had people who pushed for a raise every 6 months; I've had other people who offered to pay me back by personal check when they were out sick for a week.) I'm also really uncomfortable trying to judge a person's personal need - sure, one postdoc may have a family, and another postdoc may look free as a bird and capable of living out of the office (which has also happened due to low pay...), but everyone's lives are more complicated than they appear, and it's not my place to get that involved. So paying everyone the same salary and explaining that up front reduces any friction that might arise there, I think.

There's also the fact that I can afford it at the moment, between my startup and my Moore Foundation grant. The $10k/person increase means that I'm paying somewhere around $80k extra per year, once you include the increase in benefits -- basically, an entire additional postdoc's salary. But being the world's worst manager, I'm not sure how I will deal with a lab with 9 people in it; a 10th would probably not have helped there. So maybe it's not such a bad thing to avoid hiring one more person :). And in the future I will simply budget it into grants. (I do have one grant out in review at the moment where I underbudgeted; if I get it, I'll have to supplement that with my startup.)

The interesting thing is that I didn't realize how large a salary many of my future postdocs were turning down. In order to justify the raise to the admins, I asked for information on other offers the postdocs had received - I'd heard that some of them had turned down larger salaries, but hadn't asked for details before.
Two of my future postdocs had offers in the $80k range; another was leaving a postdoc that paid north of $60k (not uncommon in some fields) to come to my lab. I'm somewhat surprised (and frankly humbled) that they were planning to come to my lab before this raise; even with this raise, I'm not approaching what they'd already been offered!

There are some downsides for the postdocs here (although I think they're pretty mild, all things considered). First, I won't have as much unbudgeted cash lying around, so supply and travel expenditures will be a bit more constrained. Second, I can't afford to keep them all on for quite as long now, so some postdoc jobs may end sooner than they otherwise would have. Third, if they want to transition to a new postdoc at UC Davis, they will probably have to find someone willing to pay them the extra money - it's very hard to lower someone's salary within an institution. (I don't expect this to be a problem, but it's an issue to consider.)

There are also some downsides for me that I don't think my employees always appreciate, too. I worry a lot - like, an awful lot - about money. I'm deathly afraid of overpromising employment and having to lay off a postdoc before they have a next step, or, worse, move a grad student to a lot of TAing. So this salary increase puts me a little bit more on edge, and makes me think more about writing grants, and less about research and other, more pleasant things. I can't help but resent that a teensy bit. On the flip side, that is my job and all things considered I'm at a pretty awesome place in a pretty awesome gig so shrug.

There may be other downsides I hadn't considered - there usually are ;) -- and upsides as well. I'll follow up if anything interesting happens.

--titus

### Matthieu Brucher

#### Announcement: Audio TK 1.3.0

ATK is updated to 1.3.0 with new features and optimizations.
Download link: ATK 1.3.0

Changelog:

1.3.0

* Added a family of triode preamplification filters with Python wrappers (requires Eigen)
* Added a class A NPN preamplification filter with Python wrappers (requires Eigen)
* Added a buffer filter with Python wrappers
* Added a new Diode clipper with trapezoidal rule with Python wrappers
* Added a new version of the SD1 distortion with ZDF mode and Python wrappers

1.2.0

* Added SecondOrderSVF filters from cytomic with Python wrappers
* Implemented a LowPassReverbFilter with Python wrappers
* Added Python wrappers to AllPassReverbFilter
* Distortion filters optimization
* Bunch of fixes (Linux compil, calls…)

## May 20, 2016

### Continuum Analytics news

#### New Pip 8.1.2 Release Leaves Anaconda Cloud Broken - Fix in Progress

Posted Friday, May 20, 2016

This is an important update for Anaconda Cloud users who are upgrading to the latest version of Pip. Due to changes in a recent release of Pip v8.1.2, Anaconda Cloud users that are installing packages from the PyPI channel where the package name contains a "." or "-" (period or hyphen) will be unable to install those packages.

The short-term fix is to downgrade Pip to v8.1.1. (Because of this issue, the Pip 8.1.2 conda package has been removed from repo.continuum.io, so it is not currently conda-installable; it will be restored to the repo as soon as the issue is resolved in Anaconda Cloud.)

We anticipate having an updated version of Anaconda Cloud released in the next 1-2 weeks to address this issue and allow users to upgrade to 8.1.2. An update to this post will be shared when it's resolved.

To read more about the underlying nature of the issue, please refer to this issue: pypa/pip#3666

## May 19, 2016

### Gaël Varoquaux

#### Better Python compressed persistence in joblib

## Problem setting: persistence for big data

Joblib is a powerful Python package for management of computation: parallel computing, caching, and primitives for out-of-core computing.
It is handy when working on so-called big data, which can consume more than the available RAM (several GB nowadays). In such situations, objects in the working space must be persisted to disk, for out-of-core computing, distribution of jobs, or caching.

An efficient strategy to write code dealing with big data is to rely on numpy arrays to hold large chunks of structured data. The code then handles objects or arbitrary containers (list, dict) with numpy arrays. For data management, joblib provides transparent disk persistence that is very efficient with such objects. The internal mechanism relies on specializing pickle to better handle numpy arrays.

Recent improvements vastly reduce the memory overhead of data persistence.

### Limitations of the old implementation

❶ Dumping/loading persisted data with compression was a memory hog, because of internal copies of data, limiting the maximum size of usable data with compressed persistence:

We see the increased memory usage during the calls to dump and load functions, profiled using the memory_profiler package with this gist

❷ Another drawback was that large numpy arrays (>10MB) contained in an arbitrary Python object were dumped in separate .npy files, increasing the load on the file system [1]:

>>> import numpy as np
>>> import joblib # joblib version: 0.9.4
>>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))]

# 3 files are generated:
>>> joblib.dump(obj, '/tmp/test.pkl', compress=True)
['/tmp/test.pkl', '/tmp/test.pkl_01.npy.z', '/tmp/test.pkl_02.npy.z']
>>> joblib.load('/tmp/test.pkl')
[array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]

## What’s new: compression, low memory…

Memory usage is now stable:

All numpy arrays are persisted in a single file:

>>> import numpy as np
>>> import joblib # joblib version: 0.10.0 (dev)
>>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))]

# only 1 file is generated:
>>> joblib.dump(obj, '/tmp/test.pkl', compress=True)
['/tmp/test.pkl']
>>> joblib.load('/tmp/test.pkl')
[array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]

Persistence in a file handle (ongoing work in a pull request)

More compression formats are available

Backward compatibility

Existing joblib users can be reassured: the new version is still compatible with pickles generated by older versions (>= 0.8.4). You are encouraged to update (rebuild?) your cache if you want to take advantage of this new version.

## Benchmarks: speed and memory consumption

Joblib strives to have minimum dependencies (only numpy) and to be agnostic to the input data. Hence the goals are to deal with any kind of data while trying to be as efficient as possible with numpy arrays.

To illustrate the benefits and cost of the new persistence implementation, let’s now compare a real life use case (LFW dataset from scikit-learn) with different libraries:

• Joblib, with 2 different versions, 0.9.4 and master (dev),
• Pickle
• Numpy

The first four lines use non compressed persistence strategies, the last four use persistence with zlib/gzip [2] strategies. Code to reproduce the benchmarks is available on this gist.

Speed: the results between joblib 0.9.4 and 0.10.0 (dev) are similar whereas numpy and pickle are clearly slower than joblib in both compressed and non compressed cases.

Memory consumption: Without compression, old and new joblib versions are the same; with compression, the new joblib version is much better than the old one. Joblib clearly outperforms pickle and numpy in terms of memory consumption. This can be explained by the fact that numpy relies on pickle if the object is not a pure numpy array (a list or a dict with arrays for example), so in this case it inherits the memory drawbacks from pickle. When persisting pure numpy arrays (not tested here), numpy uses its internal save/load functions which are efficient in terms of speed and memory consumption.
Disk used: results are as expected: non compressed files have the same size as the in-memory data; compressed files are smaller.

Caveat Emptor: performance is data-dependent

Different data compress more or less easily. Speed and disk used will vary depending on the data. Key considerations are:

• Fraction of data in arrays: joblib is efficient if much of the data is contained in numpy arrays. The worst case scenario is something like a large dictionary of random numbers as keys and values.
• Entropy of the data: an array full of zeros will compress well and fast. A fully random array will compress slowly, and use a lot of disk. Real data is often somewhere in the middle.

## Extra improvements in compressed persistence

### New compression formats

Joblib can use new compression formats based on Python standard library modules: zlib, gzip, bz2, lzma and xz (the last 2 are available for Python 3.3 and later). The compressor is selected automatically when the file name has an explicit extension:

>>> joblib.dump(obj, '/tmp/test.pkl.z') # zlib
['/tmp/test.pkl.z']
>>> joblib.dump(obj, '/tmp/test.pkl.gz') # gzip
['/tmp/test.pkl.gz']
>>> joblib.dump(obj, '/tmp/test.pkl.bz2') # bz2
['/tmp/test.pkl.bz2']
>>> joblib.dump(obj, '/tmp/test.pkl.lzma') # lzma
['/tmp/test.pkl.lzma']
>>> joblib.dump(obj, '/tmp/test.pkl.xz') # xz
['/tmp/test.pkl.xz']

One can tune the compression level, setting the compressor explicitly:

>>> joblib.dump(obj, '/tmp/test.pkl.compressed', compress=('zlib', 6))
['/tmp/test.pkl.compressed']
>>> joblib.dump(obj, '/tmp/test.compressed', compress=('lzma', 6))
['/tmp/test.compressed']

On loading, joblib uses the magic number of the file to determine the right decompression method.
This makes loading compressed pickles transparent:

>>> joblib.load('/tmp/test.compressed')
[array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]

Importantly, the generated compressed files use a standard compression file format: for instance, regular command line tools (zip/unzip, gzip/gunzip, bzip2, lzma, xz) can be used to compress/uncompress a pickled file generated with joblib. Joblib will be able to load caches compressed with those tools.

Toward more and faster compression

Specific compression strategies have been developed for fast compression, sometimes even faster than disk reads, such as snappy, blosc, LZO or LZ4. With a file-like interface, they should be readily usable with joblib.

In the benchmarks above, loading and dumping with compression is slower than without (though only by a factor of 3 for loading). These were done on a computer with an SSD, hence with very fast I/O. In a situation with slower I/O, as on a network drive, compression could save time. With faster compressors, compression will save time on most hardware.
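
The magic-number detection used on loading can be sketched with the standard library alone. This is an illustration of the idea, not joblib's actual code: the prefix table and the helper names `detect_compressor` and `load_any` are assumptions made for the example, and raw zlib streams are left out because their header bytes vary with the compression level.

```python
import bz2
import gzip
import lzma
import pickle

# Leading bytes of each container format (prefix lengths differ).
MAGIC_PREFIXES = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
]

DECOMPRESSORS = {
    "gzip": gzip.decompress,
    "bz2": bz2.decompress,
    "xz": lzma.decompress,
    "none": lambda raw: raw,
}


def detect_compressor(data):
    """Guess the compression format from the first bytes of `data`."""
    for prefix, name in MAGIC_PREFIXES:
        if data.startswith(prefix):
            return name
    return "none"


def load_any(data):
    """Unpickle `data`, decompressing first if a known magic number is found."""
    return pickle.loads(DECOMPRESSORS[detect_compressor(data)](data))


payload = {"weights": [0.1, 0.2, 0.3]}
raw = pickle.dumps(payload)
for blob in (raw, gzip.compress(raw), bz2.compress(raw), lzma.compress(raw)):
    print(detect_compressor(blob), load_any(blob) == payload)
```

Since the format is read from the content rather than the file name, a file compressed with an external tool and then renamed would still load.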
### Compressed persistence into a file handle

Now that everything is stored in a single file using standard compression formats, joblib can persist in an open file handle:

>>> with open('/tmp/test.pkl', 'wb') as f:
...     joblib.dump(obj, f)
['/tmp/test.pkl']
>>> with open('/tmp/test.pkl', 'rb') as f:
...     print(joblib.load(f))
[array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]

This also works with the compression file objects available in the standard library, like gzip.GzipFile, bz2.BZ2File or lzma.LZMAFile:

>>> import gzip
>>> with gzip.GzipFile('/tmp/test.pkl.gz', 'wb') as f:
...     joblib.dump(obj, f)
['/tmp/test.pkl.gz']
>>> with gzip.GzipFile('/tmp/test.pkl.gz', 'rb') as f:
...     print(joblib.load(f))
[array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]

Be sure that you use a decompressor matching the internal compression when loading with the above method. If unsure, simply use open; joblib will select the right decompressor:

>>> with open('/tmp/test.pkl.gz', 'rb') as f:
...     print(joblib.load(f))
[array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]

Towards dumping to elaborate stores

Working with file handles opens the door to storing cache data in database blobs or cloud storage such as Amazon S3, Amazon Glacier and Google Cloud Storage (for instance via the Python package boto).

## Implementation

A Pickle Subclass: joblib relies on subclassing the Python Pickler/Unpickler [3]. These are state machines that walk the graph of nested objects (a dict may contain a list, that may contain…), creating a string representation of each object encountered.
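
To give a feel for what subclassing the Pickler/Unpickler makes possible, here is a minimal standard-library sketch. It is not joblib's mechanism: it uses pickle's documented `persistent_id`/`persistent_load` hooks, `array.array` stands in for numpy arrays, and the class names and the side list of buffers are hypothetical.

```python
import array
import io
import pickle


class ArrayAwarePickler(pickle.Pickler):
    """Pickler that diverts array.array payloads into a side list."""

    def __init__(self, file, side_buffers):
        super().__init__(file)
        self.side_buffers = side_buffers

    def persistent_id(self, obj):
        if isinstance(obj, array.array):
            # Keep only lightweight metadata in the pickle stream.
            self.side_buffers.append(obj.tobytes())
            return ("array", obj.typecode, len(self.side_buffers) - 1)
        return None  # everything else is pickled normally


class ArrayAwareUnpickler(pickle.Unpickler):
    """Unpickler that rebuilds arrays from the side list."""

    def __init__(self, file, side_buffers):
        super().__init__(file)
        self.side_buffers = side_buffers

    def persistent_load(self, pid):
        tag, typecode, index = pid
        if tag != "array":
            raise pickle.UnpicklingError("unknown persistent id: %r" % (pid,))
        return array.array(typecode, self.side_buffers[index])


def roundtrip(obj):
    """Dump `obj` with the custom Pickler and load it back."""
    stream, buffers = io.BytesIO(), []
    ArrayAwarePickler(stream, buffers).dump(obj)
    stream.seek(0)
    return ArrayAwareUnpickler(stream, buffers).load()


nested = {"name": "demo", "data": [array.array("d", [1.0, 2.0, 3.0])]}
print(roundtrip(nested) == nested)
```

The pickler still walks the whole nested structure; only the handling of the array-like leaves is specialized, which is the general idea behind treating numpy arrays differently from other objects.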
The new implementation proceeds as follows:

• Pickling an arbitrary object: when an np.ndarray object is reached, instead of using the default pickling functions (__reduce__()), the joblib Pickler replaces the ndarray in the pickle stream with a wrapper object containing all important array metadata (shape, dtype, flags). Then it writes the array content in the pickle file. Note that this step breaks the pickle compatibility. One benefit is that it enables using fast code for copyless handling of the numpy array. For compression, we pass chunks of the data to a compressor object (using the buffer protocol to avoid copies).
• Unpickling from a file: when pickle reaches the array wrapper, as the object is in the pickle stream, the file handle is at the beginning of the array content. So at this point the Unpickler simply constructs an array based on the metadata contained in the wrapper and then fills the array buffer directly from the file. The object returned is the reconstructed array, the array wrapper being dropped. A benefit is that if the data is stored uncompressed, the array can be directly memory mapped from the storage (the mmap_mode option of joblib.load).

This technique allows joblib to pickle all objects in a single file but also to have memory-efficient dump and load.

A fast compression stream: as the pickling refactoring opens the door to file objects usage, joblib is now able to persist data in any kind of file object: open, gzip.GzipFile, bz2.BZ2File and lzma.LZMAFile. For performance and usability reasons, the new joblib version uses its own file object BinaryZlibFile for zlib compression. Compared to GzipFile, it disables crc computation, which brings a performance gain of 15%.

Speed penalties of on-the-fly writes

There’s also a small speed difference with dict/list objects between new/old joblib when using compression.
The old version pickles the data inside an io.BytesIO buffer and then compresses it in one go, whereas the new version writes compressed chunks of pickled data to the file “on the fly”. Because of this internal buffer, the old implementation is not memory safe, as it copies the data in memory before compressing. The small speed difference was judged acceptable compared to this memory duplication.

## Conclusion and future work

Memory copies were a limitation when caching very large numpy arrays on disk, e.g. arrays with a size close to the available RAM on the computer. The problem was solved via intensive buffering and a lot of hacking on top of pickle and numpy. Unfortunately, our strategy has poor performance with big dictionaries or lists compared to cPickle, hence try to use numpy arrays in your internal data structures (note that something like scipy sparse matrices works well, as it builds on arrays).

For the future, maybe numpy’s pickle methods could be improved and make better use of the 64-bit opcodes for large objects that were introduced in Python recently.

Pickling using file handles is a first step toward pickling in sockets, enabling broadcasting of data between computing units on a network. This will be priceless with joblib’s new distributed backends.

Other improvements will come from better compressors, making everything faster.

Note

The pull request was implemented by @aabadie. He thanks @lesteve, @ogrisel and @GaelVaroquaux for the valuable help, reviews and support.

[1] The load created by multiple files on the filesystem is particularly detrimental for network filesystems, as it triggers multiple requests and isn’t cache friendly.

[2] gzip is based on zlib with additional crc checks and a default compression level of 3.

[3] A drawback of subclassing the Python Pickler/Unpickler is that it is done for the pure-Python version, and not the “cPickle” version. The latter is much faster when dealing with a large number of Python objects.
Once again, joblib is efficient when most of the data is represented as numpy arrays or subclasses.

## May 18, 2016

### Martin Fitzpatrick

#### Can I use setup.py to pack an app that requires PyQt5?

## Can I require PyQt5 via setup.py?

In a word, yes, as long as you restrict your support to PyQt5 and Python3.

The requirements specified in setup.py are typically provided by requesting packages from the Python Package Index (PyPI). Until recently these packages were source only, meaning that an installation depending on PyQt5 would only work on a system where it was possible to build it from source. Building on Windows in particular requires quite a lot of set up, and this would therefore put your application out of reach for anyone unable or unwilling to do this.

Note: As far as I am aware, it was never actually possible to build from source via PyPI. The standard approach was to download the source/binaries from Riverbank Computing and build/install from there.

This problem was solved by the introduction of Python Wheels, which provide a means to install C extension packages without the need for compilation on the target system. This is achieved by platform-specific .whl files. Wheels for PyQt5 on Python3 are available on PyPI for multiple platforms, including MacOS X, Linux (any), Win32 and Win64, which should cover most uses.

For example, this is the output when pip-installing PyQt5 on Python3 on a Mac:

mfitzp@MacBook-Air ~$ pip3 install pyqt5
Collecting pyqt5
100% |████████████████████████████████| 73.2MB 2.5kB/s
Collecting sip (from pyqt5)
100% |████████████████████████████████| 49kB 1.8MB/s
Installing collected packages: sip, pyqt5
Successfully installed pyqt5-5.6 sip-4.18


To set PyQt5 as a dependency of your own package simply specify it as normal in your setup.py e.g. install_requires=['PyQt5']
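
As a rough illustration, a minimal setup.py declaring the dependency might look like the following sketch. The project name, version and module name are placeholders, `python_requires` is one possible way (an assumption, not something from the text above) to express the Python3 restriction, and the `sys.argv` guard is only there so the file does nothing when run without a command.

```python
import sys

# Keyword arguments for setuptools.setup(); the project name, version and
# module name are placeholders for illustration.
SETUP_KWARGS = dict(
    name="my-pyqt-app",
    version="0.1.0",
    py_modules=["my_pyqt_app"],
    python_requires=">=3",       # assumption: restrict installs to Python3
    install_requires=["PyQt5"],  # pip fetches the PyQt5 wheel from PyPI
)

# Only hand over to setuptools when a command (e.g. `install`) was given.
if __name__ == "__main__" and len(sys.argv) > 1:
    from setuptools import setup

    setup(**SETUP_KWARGS)
```

With this in place, `pip3 install .` run from the project directory should pull in PyQt5 (and sip) as shown in the output above, provided a wheel exists for the target platform.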

## What’s the proper way of distributing a Python GUI application?

Here you have a few options. The above means that anyone with Python3 installed can now install your application using pip. However, this assumes that the end-user has Python and knows what pip is. If you are looking to distribute your application with a Windows installer, MacOSX ‘app’ bundle, or Linux package, you will need to use one of the tools dedicated to that purpose.

### Windows

• cx_Freeze is a cross-platform packager that can package Python applications for Windows, Mac and Linux. It works by analysing your project and freezing the required packages and subpackages. Success depends largely on which packages you depend on and their complexity/correctness.
• PyInstaller is another cross-platform packager that can package Python applications for Windows, Mac and Linux. This works in a similar way to cx_Freeze and may perform better or worse depending on the packages involved.
• PyNSISt builds NSIS installer packages for Windows. This has the advantage of being very straightforward: it simply packages all the files together as-is, without ‘freezing’. The downside is that packages can end up very large and slower to install (but see the file-filter options). It now supports bundling of .whl files which will solve this in many cases. By far the easiest if you’re targeting Windows-only.

### MacOSX

• cx_Freeze see above.
• PyInstaller see above.
• Py2app creates .app bundles from the definition in your setup.py. Its big advantage is the custom handlers that allow you to adjust packaging of troublesome packages. If you’re only targeting MacOSX this is probably your best option.

### Linux

• cx_Freeze see above.
• PyInstaller see above.
• stdeb builds Debian-style packages from your setup.py definition.

Note: It is possible to write a very complex setup.py that allows you to build using one or more tools on different platforms, but I have usually ended up storing separate configs (e.g. setup-py2app.py) for clarity.

## May 17, 2016

### Matthieu Brucher

#### Audio Toolkit: Parameter smoothing

Audio Toolkit shines when the pipeline is fixed (filter-wise and parameter-wise). But in DAWs, automated parameters are often used, and to avoid glitches, it’s interesting to additionally smooth parameters of the pipeline. So let’s see how this can be efficiently achieved.

Although automation in a DAW would already smooth parameters and although some filters can have a heavy state (EQ, but also the dynamics filters, even with their threaded updates), it’s interesting to implement this pattern in some cases. So here it is:

#include <algorithm>
#include <cstdint>

// memory sets how fast the parameter moves toward its target (0 < memory <= 1)
// max_interval_process is the number of samples processed between two updates
class ProcessingClass
{
  double parameter_target;
  double parameter_current;
  double memory;

  int64_t interval_process;
  int64_t max_interval_process;
public:
  ProcessingClass(double memory, int64_t max_interval_process)
  :parameter_target(0), parameter_current(0), memory(memory), interval_process(0), max_interval_process(max_interval_process)
  {}

  void update()
  {
    // Move the current value a fraction of the way toward the target
    parameter_current = parameter_current * (1 - memory) + parameter_target * memory;
    interval_process = 0;
  }

  void process(double** in, double** out, int64_t size)
  {
    // Setup the input/outputs of the pipeline as usual

    int64_t processed_size = 0;
    do
    {
      // We can only process max_interval_process elements at a time, but if we already have some elements in the buffer,
      // we need to take them into account.
      int64_t size_to_process = std::min(max_interval_process - interval_process, size - processed_size);

      // pipeline_exit is the last filter of the Audio Toolkit pipeline
      pipeline_exit.process(size_to_process);

      interval_process += size_to_process;
      processed_size += size_to_process;
      if(interval_process == max_interval_process)
      {
        update(); // also resets interval_process
      }
    }while(processed_size != size);
  }
};

I’m considering that ProcessingClass has an Audio Toolkit pipeline and that it is embedded in a VST or AU plugin. A call to the parameter update function would update parameter_target and make a call to update(). During the call to process(), where the plugin does its processing, the snippet cuts the input and output arrays into chunks of at most max_interval_process elements, calls the pipeline for each chunk and then updates the underlying parameters if required.

In this snippet, I’m calling update after the pipeline call, but I could also do it before the pipeline call and remove the call to the update function from the parameter change function. It’s a matter of taste.
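To make the chunking and smoothing behaviour concrete, here is a minimal Python sketch of the same pattern. It is only an illustration under stated assumptions: the "pipeline" is replaced by a plain gain, the class name `SmoothedGain` and the method `set_gain` are hypothetical, and the setter only stores the target (the smoothing itself happens inside `process`).

```python
class SmoothedGain:
    """Toy stand-in for the plugin: the 'pipeline' is a plain gain."""

    def __init__(self, memory=0.5, max_interval_process=4):
        self.memory = memory
        self.max_interval_process = max_interval_process
        self.parameter_target = 0.0
        self.parameter_current = 0.0
        self.interval_process = 0

    def set_gain(self, value):
        # Assumption: the setter only stores the target value
        self.parameter_target = value

    def update(self):
        # Move the current value a fraction of the way toward the target
        self.parameter_current = (self.parameter_current * (1 - self.memory)
                                  + self.parameter_target * self.memory)
        self.interval_process = 0

    def process(self, samples):
        output = []
        processed_size = 0
        while processed_size < len(samples):
            # Never cross an update boundary inside one chunk
            size_to_process = min(self.max_interval_process - self.interval_process,
                                  len(samples) - processed_size)
            chunk = samples[processed_size:processed_size + size_to_process]
            output.extend(sample * self.parameter_current for sample in chunk)
            self.interval_process += size_to_process
            processed_size += size_to_process
            if self.interval_process == self.max_interval_process:
                self.update()
        return output


plugin = SmoothedGain(memory=0.5, max_interval_process=4)
plugin.set_gain(1.0)
first_block = plugin.process([1.0] * 10)
second_block = plugin.process([1.0] * 4)
print(first_block)
print(second_block)
```

The second call starts with two samples already counted in the current interval, so the gain changes after two samples rather than four. This is the reason the chunk size takes `interval_process` into account.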

## May 10, 2016

### Matthieu Brucher

#### Announcement: ATKAutoSwell 1.0.0

I’m happy to announce the release of a mono autoswell based on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats.

This plugin applies a ratio to the global gain of a signal once it is higher than a given threshold. This means that contrary to a compressor where the power of the signal will never go lower than the threshold, for AutoSwell, it can.

ATKAutoSwell

The supported formats are:

• VST2 (32bits/64bits on Windows, 64bits on OS X)
• VST3 (32bits/64bits on Windows, 64bits on OS X)
• Audio Unit (64bits, OS X)

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

## May 09, 2016

### Enthought

#### Webinar: Fast Forward Through the “Dirty Work” of Data Analysis: New Python Data Import and Manipulation Tool Makes Short Work of Data Munging Drudgery

No matter whether you are a data scientist, quantitative analyst, or an engineer, whether you are evaluating consumer purchase behavior, stock portfolios, or design simulation results, your data analysis workflow probably looks a lot like this: Acquire > Wrangle > Analyze and Model > Share and Refine > Publish The problem is that often 50 to 80 […]

### Continuum Analytics news

#### Community Powered conda Packaging: conda-forge

Posted Monday, May 9, 2016

conda-forge is a community led collection of recipes, build infrastructure and packages for the conda package manager.

### The Problem

Historically, the scientific Python community has always wanted a cross-platform package manager that does not require elevated privileges, handles all types of packages, including compiled Python packages and non-Python packages, and generally lets Python be the awesome scientific toolbox of choice.

The conda package manager solved that problem, but in doing so has created new ones:

• How to get niche tools that are not packaged by Continuum in the “default” channel for Anaconda and Miniconda?
• Where should built packages be hosted?
• How should binaries be built to ensure they are compatible with other systems and with the packages from the default channel?

Continuum Analytics does its best to produce reliable conda packages on the default channel, but it can be difficult to keep pace with the many highly specialized communities and their often complex build requirements. The default channel is therefore occasionally out of date, built without particular features or, in some situations, even broken. In response, Continuum provided Anaconda Cloud as a platform for hosting conda packages. Many communities have created their own channels on Anaconda Cloud to provide a collection of reliable packages that they know will work for their users. This has improved the situation within these communities but has also led to duplication of effort, recipe fragmentation and some unstable environments when combining packages from different channels.

### The Solution

conda-forge is an effort towards unification of these fragmented users and communities. The conda-forge organization was created to be a transparent, open and community-led organization to centralize and standardize package building and recipe hosting, while improving distribution of the maintenance burden.

### What Exactly is conda-forge?

In a nutshell, conda-forge is a GitHub organization containing repositories of conda recipes and a framework of automation scripts for facilitating CI setup for these recipes. In its current implementation, free services from AppVeyor, CircleCI and Travis CI power the continuous build service on Windows, Linux and OS X, respectively. Each recipe is treated as its own repository, referred to as a feedstock, and is automatically built in a clean and repeatable way on each platform.

The built distributions are uploaded to the central repository at anaconda.org/conda-forge and can be installed with conda. For example, to install a conda-forge package into an existing conda environment:

$ conda install --channel conda-forge <package-name>

Or, to add conda-forge to your channels so that it is always searched:

$ conda config --add channels conda-forge

### How Can I be Part of the conda-forge Community?

Users can contribute in a number of ways. These include reporting issues (as can be seen by this sample issue), updating any of our long list of existing recipes (as can be seen by this sample PR) or by adding new recipes to the collection.

Adding new recipes starts with a PR to staged-recipes. The recipe will be built on Windows, Linux and OS X to ensure the package builds and the recipe’s tests pass. The PR will also be reviewed by the community to ensure the recipes are written in a clear and maintainable way. Once the recipe is ready it will be merged and a new feedstock repository will automatically be created for the recipe by the staged-recipes automation scripts. The feedstock repository has a team with commit rights automatically created using the GitHub handles listed in the recipe extra/recipe-maintainers field. The build and upload processes take place in the feedstock and, once completed, the package will be available on the conda-forge channel.

A full example of this process can be seen with the “colorlog” package. A PR was raised at staged-recipes proposing the new recipe. It was then built and tested automatically, and, after some iteration, it was merged. Subsequently, the colorlog-feedstock repository was automatically generated with full write access for everybody listed in the recipe-maintainers section.

### Feedstock Model vs Single Repository Model

Many communities are familiar with the “single repository” model - repositories like github.com/conda/conda-recipes that contain many folders of conda recipes.  This model is not ideal for community maintenance, as it lacks granularity of permissions and struggles to scale beyond tens of recipes. With the feedstock model, in which there is one repo per recipe, each recipe has its own team of maintainers and its own CI. The conda-forge/feedstocks repository puts the recipes back into the more familiar single repository model for those workflows which require it.

### Technical Build Details

The build centralization of conda-forge has provided an opportunity to standardize the build tools used in the ecosystem. By itself, Anaconda Cloud imposes no constraints on build tools. This results in some packages working with only a subset of user systems due to platform incompatibilities. For example, packages built on newer Linux systems will often not run on older Linux systems due to glibc compatibility issues. By unifying and solving these problems, together we are improving the likelihood that any package from the conda-forge channel will be compatible with most user systems. Additionally, pooling knowledge has led to better centralized build tools and documentation than any single community had before. Some of this documentation is at https://github.com/conda-forge/staged-recipes/wiki/Helpful-conda-links

### What's Next?

conda-forge is growing rapidly (~60 contributors, ~400 packages, and >118000 downloads). With more community involvement, everyone benefits: package compatibility is improved, packages stay current and we have a larger pool of knowledge to tackle more difficult issues. We can all go get work done, instead of fighting packaging!

conda-forge is open, transparent, and growing quickly. We would love to see more communities joining the effort to improve software packaging for the scientific Python community.

## May 03, 2016

### Matthieu Brucher

#### Analog modeling of a diode clipper (4): DK-method

The DK method is explained at length by David Yeh in his thesis. It’s based around nodal analysis and was also extensively used by cytomic in his papers.

When analyzing a circuit from scratch, we need to replace all capacitors by an equivalent circuit and solve the equation with this modified circuit. Then, the equivalent currents need to be updated with the proper formula.

# What does the formula mean?

So this is the update formula:

$i_{eq_{n+1}} = \frac{C}{\Delta t}V_{n+1} - i_{eq_{n}}$

Let’s write it differently:

$i_{eq_{n+1}} + i_{eq_{n}} = C\frac{V_{n+1}}{\Delta t}$

If we consider $V_{n+1}$ as being a difference, then this is a derivative with a trapezoidal approximation. In conjunction with the original equation, this means that we have a system of several equations that are staggered. Actually, $V_n$ does not have the same time constraints as $i_{eq_n}$: it lags it by half a sample.

On the one hand, there are several reasons why this is good. Staggered systems are easier to write, and also if the conditions are respected, they are more accurate. For instance, for wave equations, using central difference instead of the staggered system leads to HF instabilities.

The issue, on the other hand, is that we do a linear update. This is fine for the SD1 circuit, but not for the two clippers here, as the current in the capacitor depends on the diode function (which is not the case for the SD1 circuit, where only the input voltage impacts it). Still, it’s a good approximation.

# Usage on the clippers

OK, let’s see how to apply this on the first clipper:

$V_{in} = V_{on} + I_s sinh(\frac{V_{on}}{nV_t})(\frac{h}{C_1} + 2 R_1) - \frac{hI_{eq_n}}{2C_1}$

The time dependency is kept inside $I_{eq_n}$, and we don’t need the rest like for the trapezoidal rule:

$V_{in+1} - V_{in} - I_s sinh(\frac{V_{on+1}}{nV_t}) (\frac{h}{C_1} + 2 R_1) - I_s sinh(\frac{V_{on}}{nV_t}) (\frac{h}{C_1} - 2 R_1) - V_{on+1} + V_{on} = 0$

Quite obviously, it is simpler! But the update rule is actually a little bit more complicated:

$I_{eq_{n+1}} = \frac{2 C_1}{h} (V_{in} - V_{on} - R_1 I_s sinh(\frac{V_{on}}{nV_t})) - I_{eq_n}$

Actually, as we computed all the intermediate values, this only costs a few additions and multiplications, so it’s good.

Let’s try the second clipper:

$V_{in} = V_{on} (1 + \frac{2 R_1 C_1}{h}) + R_1 I_s sinh(\frac{V_{on}}{nV_t}) + I_{eq_n} R_1$

Compared to:

$V_{on+1} - V_{on} = h(\frac{V_{in+1}}{R_1 C_1} - \frac{V_{on+1} + V_{on}}{2 R_1 C_1} - \frac{I_s}{C_1}(sinh(\frac{V_{on+1}}{nV_t}) + sinh(\frac{V_{on}}{nV_t})))$

And in this case, the update formula is simple, as the voltage across the capacitor is the output voltage:

$I_{eq_{n+1}} = \frac{2 C_1}{h} V_{on} - I_{eq_n}$

Once again, the dependency is hidden inside $I_{eq_n}$ which means simpler and also faster optimization.

# Conclusion

Using the equivalent currents transformation is actually really easy to implement, and it simplifies the function to optimize. It doesn’t change the function itself compared to the trapezoidal rule, because they are actually (in my opinion, I have done the actual math) two sides of the same coin.

I’ve applied this to the SD1 filter. The simplification in the equation also leads to an improvement in the computation time, but for low sampling rates the filter does not converge. But the higher the sampling rate, the better the improvement over the traditional trapezoidal rule.

## April 29, 2016

### Continuum Analytics news

#### Open Data Science: Bringing “Magic” to Modern Analytics

Posted Friday, April 29, 2016

Science fiction author Arthur C. Clarke once wrote, “any sufficiently advanced technology is indistinguishable from magic.”

We’re nearer than ever to that incomprehensible, magical future. Our gadgets understand our speech, driverless cars have made their debut and we’ll soon be viewing virtual worlds at home.

These “magical” technologies spring from a 21st-century spirit of innovation—but not only from big companies. Thanks to the Internet—and to the open source movement—companies of all sizes are able to spur advancements in science and technology.

In the past, our analytics tools were proprietary, product-oriented solutions. These were necessarily limited in flexibility and they locked customers into the slow innovation cycles and whims of vendors. These closed-source solutions forced a “one size fits all” approach to analytics with monolithic tools that did not offer easy customization for different needs.

Open Data Science has changed that. It offers innovative software—free of proprietary restrictions and tailorable for all varieties of data science teams—created in the transparent collaboration that is driving today’s tech boom.

The Magic 8-Ball of Automated Modeling

One of Open Data Science's most visible points of innovation is in the sphere of data science modeling.

Initially, models were created exclusively by statisticians and analysts for business professionals, but demand from the business sector for software that could do this job gave rise to automatic model fitting—often called “black box” analytics—in which analysts let software algorithmically generate models that fit data and create predictive models.

Such a system creates models, but much like a magic 8-ball, it offers its users answers without business explanations. Mysteries are fun for toys, but no business will bet on them. Quite understandably, no marketing manager or product manager wants to approach the CEO with predictions, only to be stumped when he asks how the manager arrived at them. As Clarke knew, it’s not really magic creating the models, it’s advanced technology and it too operates under assumptions that might or might not make sense for the business.

App Starters Means More Transparent Modeling

Today’s business professionals want faster time-to-value and are dazzled by advanced technologies like automated model fitting, but they also want to understand exactly how and why they work.

That’s why Continuum Analytics is hard at work on Open Data Science solutions including Anaconda App Starters, expected to debut later this year. App Starters are solution “templates” that aim to provide 60-80 percent of a data science solution, giving businesses an easy starting point. App Starters serve the same purpose as the “black box” (faster time-to-value) but are not a “black box”: they allow analysts to see exactly how the model was created and to tweak models as desired.

Because the App Starters are based on Open Data Science, they don’t include proprietary restrictions that keep business professionals or data scientists in the dark regarding the analytics pipeline, including the algorithms. They still provide the value of “automagically” creating models, but the details of how they do so are transparent and accessible to the team. With App Starters, business professionals will finally have confidence in the models they’re using to formulate business strategies, while getting faster time-to-value from their growing data.

Over time App Starters will get more sophisticated and will include recommendations, much as Netflix offers up movie and TV show recommendations, that will learn and suggest algorithms and visualizations that best fit the data. Unlike “black boxes”, the entire narrative as to why recommendations are offered will be available for the business analyst to learn from and gain confidence in the recommendations. The business analyst can choose to use the recommendation, tweak it, use the template without recommendations, or try tuning the suggested models to find a perfect fit. This type of innovation will further the advancement of sophisticated data science solutions that realize more business value, while instilling confidence in the solution.

Casting Spells with Anaconda

Although App Starters are about to shake up automated modeling, businesses require melding new ideas with tried-and-true solutions. In business analytics, for instance, tools like Microsoft Excel are a staple of the field and being able to integrate them with newer “magic” is highly desirable.

Fortunately, interoperability is one of the keystones of the Open Data Science philosophy and Anaconda provides a way to bridge the reliable old world with the magical new one. With Anaconda, analysts who are comfortable using Excel have an entry point into the world of predictive analytics from the comfort of their spreadsheets. By using the same familiar interface, analysts can access powerful Python libraries to apply cutting-edge analytics to their data. Anaconda recognizes that business analysts want to improve—not disrupt—a proven workflow.

Because Anaconda leverages the Python ecosystem, analysts using Anaconda will achieve powerful results. They might apply a formula to an Excel sheet with a million data rows to predict repeat customers or they may create beautiful, informative visualizations to show how sales have shifted to a new demographic after the company’s newest marketing campaign kicked off. With Anaconda, business analysts can continue using Excel as their main interface, while harnessing the newest “magic” available in the open source community.

Open Data Science for Wizards…and Apprentices

Open Data Science is an inclusive movement. Although open source languages like Python and R dominate data science and allow for the most advanced—and therefore “magical”—analytics technology available, the community is open to all levels of expertise.

Anaconda is a great way for business analysts, for example, to embark on the road toward advanced analytics. But solutions, like App Starters, give advanced wizards the algorithmic visibility to alter and improve models as they see fit.

Open Data Science gives us the “sufficiently advanced technology” that Arthur C. Clarke mentioned—but it puts the power of that magic in our hands.

## April 26, 2016

### Matthieu Brucher

#### Analog modeling of a diode clipper (3b): Simulation

Let’s dive directly inside the second diode clipper and follow exactly the same pattern.

# Second diode clipper

So first let’s remember the equation:

$\frac{dV_o}{dt} = \frac{V_i - V_o}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t})$

## Forward Euler

The forward Euler approximation is then:

$V_{on+1} = V_{on} + h(\frac{V_{in+1} - V_{on}}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_{on}}{nV_t}))$

## Backward Euler

Backward Euler approximation is now:

$V_{on+1} - V_{on} = h(\frac{V_{in+1} - V_{on+1}}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_{on+1}}{nV_t}))$

(The equations are definitely easier to derive…)
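Because the unknown $V_{on+1}$ appears inside the sinh, the backward Euler equation has to be solved numerically at every sample. The following is a minimal sketch of one way to do it with Newton-Raphson, using the previous output as the starting estimate. The component values and the function names are assumptions chosen for illustration; they are not taken from the post.

```python
import math

# Illustrative component values (assumptions, not taken from the post)
R1 = 2200.0           # series resistor, ohms
C1 = 10e-9            # capacitor, farads
I_S = 1e-14           # diode saturation current, amperes
N_VT = 1.24 * 26e-3   # n * V_t, volts


def residual(v, v_in_next, v_out_prev, h):
    """Backward Euler equation written as g(v) = 0, with v = V_o[n+1]."""
    rhs = (v_in_next - v) / (R1 * C1) - 2 * I_S / C1 * math.sinh(v / N_VT)
    return v - v_out_prev - h * rhs


def residual_derivative(v, h):
    """dg/dv, needed by Newton-Raphson."""
    return 1 + h * (1 / (R1 * C1) + 2 * I_S / (C1 * N_VT) * math.cosh(v / N_VT))


def backward_euler_step(v_in_next, v_out_prev, h, tol=1e-9, max_iter=100):
    v = v_out_prev  # starting estimate: copy the previous output
    for _ in range(max_iter):
        step = residual(v, v_in_next, v_out_prev, h) / residual_derivative(v, h)
        v -= step
        if abs(step) < tol:
            break
    return v


def simulate(v_in, h):
    v_out = []
    previous = 0.0
    for sample in v_in:
        previous = backward_euler_step(sample, previous, h)
        v_out.append(previous)
    return v_out


# Two periods of a 1 kHz sine with 1 V amplitude, sampled at 48 kHz
sample_rate = 48000
h = 1.0 / sample_rate
v_in = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(96)]
v_out = simulate(v_in, h)
print(max(v_out), min(v_out))
```

With these assumed values the output stays well below the 1 V input peak, which is the clipping behaviour one would expect from the diodes.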

## Trapezoidal rule

And finally trapezoidal rule gives:

$V_{on+1} - V_{on} = h(\frac{V_{in+1}}{R_1 C_1} - \frac{V_{on+1} + V_{on}}{2 R_1 C_1} - \frac{I_s}{C_1}(sinh(\frac{V_{on+1}}{nV_t}) + sinh(\frac{V_{on}}{nV_t})))$

## Starting estimates

For the estimates, we use exactly the same methods as for the previous clipper, so I won’t repeat them.

## Graphs

Numerical optimization comparison

The first obvious change is that the forward Euler can give pretty good results. This makes me think I may have made a mistake in the previous circuit, but as I had to derive the equation before doing the approximation, this may be the reason.

For the original estimates, just like last time, the results are identical:

Original estimates comparison

OK, let’s compare the result of the first iteration with different original estimates:

One step comparison

All estimates give a similar result, but the affine estimates give a better estimate than linear which gives a far better result than the default/copying estimate.

# Conclusion

Just for fun, let’s display the difference between the two clippers:

Diode clippers comparison

Obviously, the second clipper is more symmetric than the first one and thus will create fewer harmonics (which is confirmed by a spectrogram), and it is also easier to optimize (the second clipper uses at least one fewer iteration than the first one).

All things considered, the Newton-Raphson algorithm is always efficient, with around three or fewer iterations for these circuits. Trying bisection or something else may not be that interesting, except if you are heavily using SIMD instructions. In this case, the optimization may be faster because you have a similar number of iterations.

Original estimates based on the last optimized value always work well, although affine estimates are usually faster. The tricky part is deriving the equations. And more often than not, you make mistakes when implementing them!

Next step: DK method…

## April 25, 2016

### Continuum Analytics news

#### Accelerate 2.2 Released!

Posted Monday, April 25, 2016

We're happy to announce the latest update to Accelerate with the release of version 2.2. This version of Accelerate adds compatibility with the recently released Numba 0.25, and also expands the Anaconda Platform in two new directions:

• Data profiling
• MKL-accelerated ufuncs

I'll discuss each of these in detail below.

### Data Profiling

We've built up quite a bit of experience over the years optimizing numerical Python code for our customers, and these projects follow some common patterns. First, the most important step in the optimization process is profiling a realistic test case. You can't improve what you can't measure, and profiling is critical to identify the true bottlenecks in an application. Even experienced developers are often surprised by profiling results when they see which functions are consuming the most time. Ensuring the test case is realistic (but not necessarily long) is also very important, as unit and functional tests for applications tend to use smaller, or differently shaped, input data sets. The scaling behavior of many algorithms is non-linear, so profiling with a very small input can give misleading results.

The second step in optimization is to consider alternative implementations for the critical functions identified in the first step, possibly adopting a different algorithm, parallelizing the calculation to make use of multiple cores or a GPU, or moving up a level to eliminate or batch unnecessary calls to the function. In this step of the process, we often found ourselves lacking a critical piece of information: what data types and sizes were being passed to this function? The best approach often depends on this information. Are these NumPy arrays or custom classes? Are the arrays large or small? 32-bit or 64-bit float? What dimensionality? Large arrays might benefit from GPU acceleration, but small arrays often require moving up the call stack in order to see if calculations can be batched.

Rather than having to manually modify the code to collect this data type information in an ad-hoc way, we've added a new profiling tool to Accelerate that can record this type information as a part of normal profiling. For lack of a better term, we're calling this "data profiling."

We collect this extra information using a modified version of the built-in Python profiling mechanism, and can display it using the standard pstats-style table:

ncalls  tottime percall cumtime percall filename:lineno(function)
300/100 0.01313 0.0001313 0.03036 0.0003036  linalg.py:532(cholesky(a:ndarray(dtype=float64, shape=(3, 3))))
200/100 0.004237 4.237e-05 0.007189 7.189e-05 linalg.py:139(_commonType())
200/100 0.003431 3.431e-05 0.005312 5.312e-05 linalg.py:106(_makearray(a:ndarray(dtype=float64, shape=(3, 3))))
400/200 0.002663 1.332e-05 0.002663 1.332e-05 linalg.py:111(isComplexType(t:type))
300/100 0.002185 2.185e-05 0.002185 2.185e-05 linalg.py:209(_assertNdSquareness())
200/100 0.001592 1.592e-05 0.001592 1.592e-05 linalg.py:124(_realType(t:type, default:NoneType))
200/100 0.00107 1.07e-05 0.00107 1.07e-05 linalg.py:198(_assertRankAtLeast2())
100 0.000162 1.62e-06 0.000162 1.62e-06 linalg.py:101(get_linalg_error_extobj(callback:function))

The recorded function signatures now include data types, and NumPy arrays also have dtype and shape information. In the above example, we've selected only the linear algebra calls from the execution of a PyMC model. Here we can clearly see the Cholesky decomposition is being done on 3x3 matrices, which would dictate our optimization strategy if cholesky was the bottleneck in the code (in this case, it is not).
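For comparison, the plain pstats-style table (without the data type annotations) can be produced with the standard library alone. The sketch below uses only `cProfile` and `pstats`; the function `busy_sum` is a made-up workload, and none of this is the Accelerate API.

```python
import cProfile
import io
import pstats


def busy_sum(n):
    """Made-up workload to have something to profile."""
    return sum(i * i for i in range(n))


# Collect profile data around the call of interest
profile = cProfile.Profile()
profile.enable()
busy_sum(10000)
profile.disable()

# Render the familiar table, sorted by cumulative time
buffer = io.StringIO()
stats = pstats.Stats(profile, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

The standard table shows `busy_sum` by name only; there is no record of what `n` was, which is the gap that the recorded signatures are meant to fill.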

We've also integrated the SnakeViz profile visualization tool into the Accelerate profiler, so you can easily collect and view profile information right inside your Jupyter notebooks:

Figure: SnakeViz profile view in a Jupyter notebook (profiling.png)

All it takes to profile a function and view it in a notebook is a few lines:

from accelerate import profiler

p = profiler.Profile()

p.run('my_function_to_profile()')

profiler.plot(p)

### MKL-Accelerated Ufuncs

MKL is perhaps best known for high performance, multi-threaded linear algebra functionality, but MKL also provides highly optimized math functions, like sin() and cos() for arrays. Anaconda already ships with the numexpr library, which is linked against MKL to provide fast array math support. However, we have future plans for Accelerate that go beyond what numexpr can provide, so in the latest release of Accelerate, we've exposed the MKL array math functions as NumPy ufuncs you can call directly.

For code that makes extensive use of special math functions on arrays with many thousands of elements, the performance speedup is quite amazing:

import numpy as np
from accelerate.mkl import ufuncs as mkl_ufuncs

def spherical_to_cartesian_numpy(r, theta, phi):
    cos_theta = np.cos(theta)
    sin_theta = np.sin(theta)
    cos_phi = np.cos(phi)
    sin_phi = np.sin(phi)

    x = r * sin_theta * cos_phi
    y = r * sin_theta * sin_phi
    z = r * cos_theta
    return x, y, z

def spherical_to_cartesian_mkl(r, theta, phi):
    cos_theta = mkl_ufuncs.cos(theta)
    sin_theta = mkl_ufuncs.sin(theta)
    cos_phi = mkl_ufuncs.cos(phi)
    sin_phi = mkl_ufuncs.sin(phi)

    x = r * sin_theta * cos_phi
    y = r * sin_theta * sin_phi
    z = r * cos_theta
    return x, y, z

n = 100000
r, theta, phi = np.random.uniform(1, 10, n), np.random.uniform(0, np.pi, n), np.random.uniform(-np.pi, np.pi, n)

%timeit spherical_to_cartesian_numpy(r, theta, phi)
%timeit spherical_to_cartesian_mkl(r, theta, phi)

    100 loops, best of 3: 7.01 ms per loop
    1000 loops, best of 3: 978 µs per loop

A speedup of 7x is not bad for a 2.3 GHz quad core laptop CPU from 2012. In future releases, we are looking to expand and integrate this functionality further into the Anaconda Platform, so stay tuned!

### Summary

You can install Accelerate with conda and use it free for 30 days:

conda install accelerate

Try it out, and let us know what you think. Academic users can get a free subscription to Anaconda (including several useful tools, like Accelerate) by following these instructions. Contact sales@continuum.io to find out how to get a subscription to Anaconda at your organization.

## April 20, 2016

### Matthew Rocklin

#### Ad Hoc Distributed Random Forests

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

A screencast version of this post is available here: https://www.youtube.com/watch?v=FkPlEqB8AnE

## TL;DR.

Dask.distributed lets you submit individual tasks to the cluster. We use this ability combined with Scikit Learn to train and run a distributed random forest on distributed tabular NYC Taxi data.

Our machine learning model does not perform well, but we do learn how to execute ad-hoc computations easily.

## Motivation

In the past few posts we analyzed data on a cluster with Dask collections:

Often our computations don’t fit neatly into the bag, dataframe, or array abstractions. In these cases we want the flexibility of normal code with for loops, but still with the computational power of a cluster. With the dask.distributed task interface, we achieve something close to this.

## Application: Naive Distributed Random Forest Algorithm

As a motivating application we build a random forest algorithm from the ground up using the single-machine Scikit Learn library, and dask.distributed’s ability to quickly submit individual tasks to run on the cluster. Our algorithm will look like the following:

1. Pull data from some external source (S3) into several dataframes on the cluster
2. For each dataframe, create and train one RandomForestClassifier
3. Scatter a single testing dataframe to all machines
4. For each RandomForestClassifier, predict output on the test dataframe
5. Aggregate independent predictions from each classifier together by a majority vote. To avoid bringing too much data to any one machine, perform this majority vote as a tree reduction.
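The majority vote in the last step can be sketched locally, without a cluster. The snippet below is a hypothetical illustration in plain Python: each classifier's predictions become per-sample vote tallies, and the tallies are merged pairwise, level by level, which is the shape a tree reduction takes. The helper names and the example predictions are made up.

```python
from collections import Counter


def to_votes(predictions):
    """Turn one classifier's predicted labels into per-sample vote tallies."""
    return [Counter([label]) for label in predictions]


def merge_votes(left, right):
    """Combine two per-sample vote tallies of equal length."""
    return [a + b for a, b in zip(left, right)]


def tree_reduce(items, combine):
    """Pairwise-combine items level by level until one remains."""
    items = list(items)
    while len(items) > 1:
        paired = [combine(items[i], items[i + 1])
                  for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:  # an odd item carries over to the next level
            paired.append(items[-1])
        items = paired
    return items[0]


def majority_vote(all_predictions):
    tallies = tree_reduce((to_votes(p) for p in all_predictions), merge_votes)
    return [tally.most_common(1)[0][0] for tally in tallies]


# Three hypothetical classifiers predicting passenger counts for four rides
predictions = [
    [1, 2, 1, 5],
    [1, 1, 1, 5],
    [2, 2, 1, 1],
]
result = majority_vote(predictions)
print(result)
```

On a cluster, each `merge_votes` call would be one task, so no single machine ever has to hold the predictions of every classifier at once.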

## Data: NYC Taxi 2015

As in our blogpost on distributed dataframes we use the data on all NYC Taxi rides in 2015. This is around 20GB on disk and 60GB in RAM.

We predict the number of passengers in each cab given the other numeric columns like pickup and destination location, fare breakdown, distance, etc.

We do this first on a small bit of data on a single machine and then on the entire dataset on the cluster. Our cluster is composed of twelve m4.xlarges (4 cores, 15GB RAM each).

Disclaimer and Spoiler Alert: I am not an expert in machine learning. Our algorithm will perform very poorly. If you’re excited about machine learning you can stop reading here. However, if you’re interested in how to build distributed algorithms with Dask then you may want to read on, especially if you happen to know enough machine learning to improve upon my naive solution.

## API: submit, map, gather

We use a small number of dask.distributed functions to build our computation:

futures = executor.scatter(data)                     # scatter data
future = executor.submit(function, *args, **kwargs)  # submit single task
futures = executor.map(function, sequence)           # submit many tasks
results = executor.gather(futures)                   # gather results
executor.replicate(futures, n=number_of_replications)


In particular, functions like executor.submit(function, *args) let us send individual functions out to our cluster thousands of times a second. Because these functions can take the futures returned by earlier calls as inputs, we can create complex workflows that stay entirely on the cluster and trust the distributed scheduler to move data around intelligently.
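The submit/map/gather pattern follows the same shape as Python's standard concurrent.futures interface, so it can be tried without a cluster. The sketch below is a single-machine stand-in and not dask.distributed itself: ThreadPoolExecutor takes the place of the cluster Executor, and the inc and add functions are made up for illustration. One difference worth noting is that with concurrent.futures we have to call .result() before passing a value on, whereas dask.distributed accepts futures directly as arguments.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy functions, used only for illustration
def inc(x):
    return x + 1

def add(x, y):
    return x + y

with ThreadPoolExecutor(max_workers=4) as pool:
    a = pool.submit(inc, 1)                               # submit a single task
    b = pool.submit(inc, 10)
    total = pool.submit(add, a.result(), b.result())      # chain on earlier results
    squares = list(pool.map(lambda x: x * x, range(5)))   # submit many tasks
    total_value = total.result()                          # bring the result back

print(total_value)  # 13
print(squares)      # [0, 1, 4, 9, 16]
```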

First we load data from Amazon S3. We use the s3.read_csv(..., collection=False) function to load 178 Pandas DataFrames on our cluster from CSV data on S3. We get back a list of Future objects that refer to these remote dataframes. The use of collection=False gives us this list of futures rather than a single cohesive Dask.dataframe object.

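from distributed import Executor, s3
e = Executor('52.91.1.177:8786')

# The S3 location is not shown in this copy of the post; '...' is a placeholder
dfs = s3.read_csv('...',
                  parse_dates=['tpep_pickup_datetime',
                               'tpep_dropoff_datetime'],
                  collection=False)
dfs = e.compute(dfs)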


Each of these is a lightweight Future pointing to a pandas.DataFrame on the cluster.

>>> dfs[:5]
[<Future: status: finished, type: DataFrame, key: finalize-a06c3dd25769f434978fa27d5a4cf24b>,
<Future: status: finished, type: DataFrame, key: finalize-7dcb27364a8701f45cb02d2fe034728a>,
<Future: status: finished, type: DataFrame, key: finalize-b0dfe075000bd59c3a90bfdf89a990da>,
<Future: status: finished, type: DataFrame, key: finalize-1c9bb25cefa1b892fac9b48c0aef7e04>,
<Future: status: finished, type: DataFrame, key: finalize-c8254256b09ae287badca3cf6d9e3142>]


If we’re willing to wait a bit then we can pull data from any future back to our local process using the .result() method. We don’t want to do this too much though: data transfer can be expensive and we can’t hold the entire dataset in the memory of a single machine. Here we just bring back one of the dataframes:

>>> df = dfs[0].result()

|   | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | RateCodeID | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2015-01-15 19:05:39 | 2015-01-15 19:23:42 | 1 | 1.59 | -73.993896 | 40.750111 | 1 | N | -73.974785 | 40.750618 | 1 | 12.0 | 1.0 | 0.5 | 3.25 | 0 | 0.3 | 17.05 |
| 1 | 1 | 2015-01-10 20:33:38 | 2015-01-10 20:53:28 | 1 | 3.30 | -74.001648 | 40.724243 | 1 | N | -73.994415 | 40.759109 | 1 | 14.5 | 0.5 | 0.5 | 2.00 | 0 | 0.3 | 17.80 |
| 2 | 1 | 2015-01-10 20:33:38 | 2015-01-10 20:43:41 | 1 | 1.80 | -73.963341 | 40.802788 | 1 | N | -73.951820 | 40.824413 | 2 | 9.5 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 10.80 |
| 3 | 1 | 2015-01-10 20:33:39 | 2015-01-10 20:35:31 | 1 | 0.50 | -74.009087 | 40.713818 | 1 | N | -74.004326 | 40.719986 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 4.80 |
| 4 | 1 | 2015-01-10 20:33:39 | 2015-01-10 20:52:58 | 1 | 3.00 | -73.971176 | 40.762428 | 1 | N | -74.004181 | 40.742653 | 2 | 15.0 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 16.30 |

## Train on a single machine

To start, let’s go through the standard Scikit Learn fit/predict/score cycle with this small bit of data on a single machine.

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split

df_train, df_test = train_test_split(df)

columns = ['trip_distance', 'pickup_longitude', 'pickup_latitude',
'dropoff_longitude', 'dropoff_latitude', 'payment_type',
'fare_amount', 'mta_tax', 'tip_amount', 'tolls_amount']

est = RandomForestClassifier(n_estimators=4)
est.fit(df_train[columns], df_train.passenger_count)


This builds a RandomForestClassifier with four decision trees and then trains it against the numeric columns in the data, trying to predict the passenger_count column. It takes around 10 seconds to train on a single core. We now see how well we do on the holdout testing data:

>>> est.score(df_test[columns], df_test.passenger_count)
0.65808188654721012


This 65% accuracy is actually pretty poor. About 70% of the rides in NYC have a single passenger, so the model of “always guess one” would out-perform our fancy random forest.

>>> from sklearn.metrics import accuracy_score
>>> import numpy as np
>>> accuracy_score(df_test.passenger_count,
...                np.ones_like(df_test.passenger_count))
0.70669390028780987


This is where my ignorance of machine learning really kills us. There is likely a simple way to improve this. However, because I’m more interested in showing how to build distributed computations with Dask than in actually doing machine learning, I’m going to go ahead with this naive approach. Spoiler alert: we’re going to do a lot of computation and still not beat the “always guess one” strategy.

## Fit across the cluster with executor.map

First we build a function that does just what we did before: it builds a random forest and then trains it on a dataframe.

def fit(df):
    est = RandomForestClassifier(n_estimators=4)
    est.fit(df[columns], df.passenger_count)
    return est


Second we call this function on all of our training dataframes on the cluster using the standard e.map(function, sequence) function. This sends out many small tasks for the cluster to run. We use all but the last dataframe for training data and hold out the last dataframe for testing. There are more principled ways to do this, but again we’re going to charge ahead here.

train = dfs[:-1]
test = dfs[-1]

estimators = e.map(fit, train)


This takes around two minutes to train on all of the 177 dataframes and now we have 177 independent estimators, each capable of guessing how many passengers a particular ride had. There is relatively little overhead in this computation.

## Predict on testing data

Recall that we kept separate a future, test, that points to a Pandas dataframe on the cluster that was not used to train any of our 177 estimators. We’re going to replicate this dataframe across all workers on the cluster and then ask each estimator to predict the number of passengers for each ride in this dataset.

e.replicate([test], n=48)

def predict(est, X):
    return est.predict(X[columns])

predictions = [e.submit(predict, est, test) for est in estimators]


Here we used the executor.submit(function, *args, **kwargs) function in a list comprehension to individually launch many tasks. The scheduler determines when and where to run these tasks for optimal computation time and minimal data transfer. As with all functions, this returns futures that we can use to collect data if we want in the future.

Developer’s note: we explicitly replicate here in order to take advantage of efficient tree-broadcasting algorithms. This is purely a performance consideration. Everything would have worked fine without this, but the explicit broadcast turns a 30s communication+computation into a 2s communication+computation.
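To get a feel for why a tree broadcast helps, here is a back-of-the-envelope sketch. It is an idealized model and an assumption on my part, not a description of how dask.distributed actually implements replicate: it supposes that in each round every worker already holding the data forwards one copy to a worker that lacks it. The function name rounds_to_replicate is hypothetical.

```python
def rounds_to_replicate(n_copies):
    """Rounds needed when every current holder forwards one copy per round."""
    holders, rounds = 1, 0
    while holders < n_copies:
        holders *= 2      # each holder hands the data to one new worker
        rounds += 1
    return rounds

tree_rounds = rounds_to_replicate(48)   # the n=48 used in e.replicate above
sequential_sends = 48 - 1               # one worker serving every other copy in turn

print(tree_rounds)        # 6
print(sequential_sends)   # 47
```

Under this model the number of communication rounds grows with the logarithm of the number of copies rather than linearly, which is the general direction of the 30s to 2s improvement reported above.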

## Aggregate predictions by majority vote

For each estimator we now have an independent prediction of the passenger counts for all of the rides in our test data. In other words for each ride we have 177 different opinions on how many passengers were in the cab. By combining these opinions we hope to achieve a more accurate consensus opinion.

For example, consider the first four prediction arrays:

>>> a_few_predictions = e.gather(predictions[:4])  # remote futures -> local arrays
>>> a_few_predictions
[array([1, 2, 1, ..., 2, 2, 1]),
array([1, 1, 1, ..., 1, 1, 1]),
array([2, 1, 1, ..., 1, 1, 1]),
array([1, 1, 1, ..., 1, 1, 1])]


For the first ride/column we see that three of the four predictions are for a single passenger while one prediction disagrees and is for two passengers. We create a consensus opinion by taking the mode of the stacked arrays:

from scipy.stats import mode
import numpy as np

def mymode(*arrays):
    array = np.stack(arrays, axis=0)
    return mode(array)[0][0]

>>> mymode(*a_few_predictions)
array([1, 1, 1, ..., 1, 1, 1])


And so when we combine these four prediction arrays by majority vote we see that the majority opinion of one passenger dominates for all of the six rides visible here.
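For readers without NumPy and SciPy at hand, the same column-wise majority vote can be sketched in plain Python. This is an illustrative re-implementation rather than the code used in the post; the function name majority_vote is made up, and the input uses only the six values of each array that are visible in the output above.

```python
from collections import Counter

def majority_vote(*predictions):
    """Return the most common value in each column of the stacked predictions."""
    consensus = []
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        # On a tie, pick the smallest value, which matches scipy.stats.mode
        consensus.append(min(v for v, c in counts.items() if c == best))
    return consensus

# The six visible entries of each of the four prediction arrays
a_few_predictions = [
    [1, 2, 1, 2, 2, 1],
    [1, 1, 1, 1, 1, 1],
    [2, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]

consensus = majority_vote(*a_few_predictions)
print(consensus)  # [1, 1, 1, 1, 1, 1]
```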

## Tree Reduction

We could call our mymode function on all of our predictions like this:

>>> mode_prediction = e.submit(mymode, *predictions)  # this doesn't scale well


Unfortunately this would move all of our results to a single machine to compute the mode there. This might swamp that single machine.

Instead we batch our predictions into groups of size 10, take the mode of each group, and then repeat the process with the smaller set of predictions until we have only one left. This sort of multi-step reduction is called a tree reduction. We can write it up with a while loop, a list comprehension, and executor.submit. This is only an approximation of the mode, but it’s a much more scalable computation. This finishes in about 1.5 seconds.

from toolz import partition_all

while len(predictions) > 1:
    predictions = [e.submit(mymode, *chunk)
                   for chunk in partition_all(10, predictions)]

result = e.gather(predictions)[0]

>>> result
array([1, 1, 1, ..., 1, 1, 1])


## Final Score

Finally, after completing all of our work on our cluster we can see how well our distributed random forest algorithm does.

>>> accuracy_score(result, test.result().passenger_count)
0.67061974451423045


Still worse than the naive “always guess one” strategy. This just goes to show that, no matter how sophisticated your Big Data solution is, there is no substitute for common sense and a little bit of domain expertise.

## What didn’t work

As always I’ll have a section like this that honestly says what doesn’t work well and what I would have done with more time.

• Clearly this would have benefited from more machine learning knowledge. What would have been a good approach for this problem?
• I’ve been thinking a bit about memory management of replicated data on the cluster. In this exercise we specifically replicated out the test data. Everything would have worked fine without this step but it would have been much slower as every worker gathered data from the single worker that originally had the test dataframe. Replicating data is great until you start filling up distributed RAM. It will be interesting to think of policies about when to start cleaning up redundant data and when to keep it around.
• Several people, both open source users and Continuum customers, have asked about a general Dask library for machine learning, something akin to Spark’s MLlib. Ideally a future Dask.learn module would leverage Scikit-Learn in the same way that Dask.dataframe leverages Pandas. It’s not clear how to cleanly break up and parallelize Scikit-Learn algorithms.

## Conclusion

This blogpost gives a concrete example using basic task submission with executor.map and executor.submit to build a non-trivial computation. This approach is straightforward and not restrictive. Personally this interface excites me more than collections like Dask.dataframe; there is a lot of freedom in arbitrary task submission.

## April 19, 2016

### Matthieu Brucher

#### Book review: The Culture Map: Decoding How People Think and Get Things Done in a Global World

I work in an international company, and there are lots of people from different cultures around me with whom I need to interact. At first glance, it feels like it’s easy to work with all of them. I mean, how difficult could it be?

Actually, it’s easy, but sometimes interactions are intriguing and people do not react the way you expect them to react. And why is that? Lots of reasons, of course, but one of them is that they have a different culture and do not expect you to explicitly tell them what they did wrong (which is something I do. A lot).

#### Content and opinions

Enter Erin Meyer. She had to navigate between these cultures, as she’s American, married to a French guy, in France. In her book, she presents 8 scales, and each culture is placed differently on each scale.

I won’t go into all the details of the different scales, but they cover the different ways people interact with each other. Whether it’s about scheduling, feedback, or decision-making, all cultures are different. And sometimes, even if the cultures are close geographically, they can be quite different on some scales. After all, they are all influenced by their short or long history, their philosophers…

All the scales are illustrated with stories from Meyer’s experience in teaching them and stories from her students, and they are always spot on.

#### Conclusion

Of course, the book only tells you the differences, what to look for. It doesn’t teach you to do the right thing. This takes practice, and it requires work.

Also it doesn’t solve all interaction problems. Everyone is different within one’s own culture (not even talking about people having several cultures…), on the left or on the right compared to the average of one’s culture on each scale. So you can’t reduce someone to a culture. But if you want to learn more about interacting with people, you already know that.

## April 18, 2016

### Continuum Analytics news

#### Conda + Spark

Posted Tuesday, April 19, 2016

In my previous post, I described different scenarios for bootstrapping Python on a multi-node cluster. I offered a general solution using Anaconda for cluster management and a solution using a custom conda env deployed with Knit.

In a follow-up to that post, I was asked if the machinery in Knit would also work for Spark. Sure--of course! In fact, much of Knit's design comes from Spark's deploy codebase. Here, I am going to demonstrate how we can ship a Python environment, complete with desired dependencies, as part of a Spark job without installing Python on every node.

## Spark YARN Deploy

First, I want to briefly describe key points in Spark's YARN deploy methodologies. After negotiating which resources to provision with YARN's Resource Manager, Spark asks for a directory to be constructed on HDFS: /user/ubuntu/.sparkStaging/application_1460665326796_0065/ The directory will always be in the user's home, and the application ID issued by YARN is appended to the directory name (thinking about this now, perhaps this is obvious and straightforward to JAVA/JVM folks where bundling Uber JARs has long been the practice in traditional Map-Reduce jobs). In any case, Spark then uploads itself to the stagingDirectory, and when YARN provisions a container, the contents of the directory are pulled down and the spark-assembly jar is executed. If you are using PySpark or sparkR, a corresponding pyspark.zip and sparkr.zip will be found in the staging directory as well.

Occasionally, users see FileNotFoundException errors -- this can be caused by a few things: incorrect Spark Contexts, incorrect SPARK_HOME, and I have a faint recollection that there was a packaging problem once where pyspark.zip or sparkr.zip was missing, or could not be created due to permissions. Anyway -- below is the output you will see when Spark works cleanly.

16/04/15 13:01:03 INFO Client: Uploading resource file:/opt/anaconda/share/spark-1.6.0/lib/spark-assembly-1.6.0-hadoop2.6.0.jar -> hdfs://ip-172-31-50-60:9000/user/ubuntu/.sparkStaging/application_1460665326796_0065/spark-assembly-1.6.0-hadoop2.6.0.jar

16/04/15 13:01:07 INFO Client: Uploading resource file:/opt/anaconda/share/spark-1.6.0/python/lib/pyspark.zip -> hdfs://ip-172-31-50-60:9000/user/ubuntu/.sparkStaging/application_1460665326796_0065/pyspark.zip

Not terribly exciting, but positive confirmation that Spark is uploading local files to HDFS.

## Bootstrap-Fu Redux

Most of what I described above is what the YARN framework allows developers to do -- it's more that Spark implements a YARN application than Spark doing magical things (and Knit as well!). If I were using Scala/Java, I would package up everything in a jar and use spark-submit -- Done!

Unfortunately, there's a little more work to be done for an Uber Python jar equivalent.

One of the killer features of conda is environment management. When conda creates a new environment, it uses hard-links when possible. Generally, this greatly reduces disk usage. But, if we move the directory to another machine, we're probably just moving a handful of hard-links and not the files themselves. Fortunately, we can tell conda: "No! Copy the files!"

For example:

conda create -p /home/ubuntu/dev --copy -y -q python=3 pandas scikit-learn

By using --copy, we "install all packages using copies instead of hard or soft-linking." The headers in various files in the bin/ directory may have lines like #!/home/ubuntu/dev/bin/python. But, we don't need to be concerned about that -- we're not going to be using 2to3, idle, pip, etc. If we zipped up the environment, we could move this onto another machine of a similar OS type, execute Python, and we'd be able to load any library in the lib/python3.5/site-packages directory.
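To make the difference between a hard link and a copy concrete, here is a small sketch using only the Python standard library. It assumes a POSIX filesystem that supports hard links, and the file names are made up. It does not involve conda itself. It only shows what the two kinds of files look like on disk.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "package_file.py")   # hypothetical package file
    with open(original, "w") as f:
        f.write("print('hello')\n")

    linked = os.path.join(tmp, "linked.py")
    copied = os.path.join(tmp, "copied.py")
    os.link(original, linked)       # hard link: a second name for the same inode
    shutil.copy(original, copied)   # real copy: a new inode with its own data

    same_inode_linked = os.stat(original).st_ino == os.stat(linked).st_ino
    same_inode_copied = os.stat(original).st_ino == os.stat(copied).st_ino
    link_count = os.stat(original).st_nlink

print(same_inode_linked)  # True
print(same_inode_copied)  # False
print(link_count)         # 2
```

A hard-linked file shares its data with the original, which is what saves disk space. A copied file stands on its own, which is what we want in an environment meant to be zipped up and shipped elsewhere.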

We're very close to our Uber Python jar -- now with a zipped conda directory in mind, let's proceed.

zip -r dev.zip dev

## Death by ENV Vars

We are going to need a handful of specific command line options and environment variables: Spark Yarn Configuration and Spark Environment Variables. We'll be using:

• PYSPARK_PYTHON: The Python binary Spark should use
• spark.yarn.appMasterEnv.PYSPARK_PYTHON (though this one could be wrong/unnecessary/only used for --master yarn-cluster)
• --archives: include local tgz/jar/zip in .sparkStaging directory and pull down into temporary YARN container

We'll also need a test script. The following is a reasonable test to prove which Python Spark is using -- we're writing a no-op function which returns the hostname, the paths Python uses to find libraries, and the names of the environment variables.

# test_spark.py
import os
import sys

from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf()
conf.setAppName("get-hosts")
sc = SparkContext(conf=conf)

def noop(x):
    import socket
    import sys
    return socket.gethostname() + ' '.join(sys.path) + ' '.join(os.environ)

rdd = sc.parallelize(range(1000), 100)
hosts = rdd.map(noop).distinct().collect()
print(hosts)

And executing everything together:

PYSPARK_PYTHON=./ANACONDA/dev/bin/python spark-submit \
 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ANACONDA/dev/bin/python \
 --master yarn-cluster \
 --archives /home/ubuntu/dev.zip#ANACONDA \
 /home/ubuntu/test_spark.py

We'll get the following output in the yarn logs:

It's a little hard to parse -- what should be noted are file paths like:

.../container_1460665326796_0070_01_000002/ANACONDA/dev/lib/python3.5/site-packages

This is demonstrating that Spark is using the unzipped directory in the YARN container. Ta-da!
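Since the raw log output is hard to read, a tiny helper can do the checking for us. This is a sketch under assumptions: the function name uses_shipped_env is hypothetical, and it simply looks for the archive alias and environment name used in the spark-submit call above (ANACONDA and dev) in a list of path strings such as the ones the test script collects.

```python
def uses_shipped_env(paths, alias="ANACONDA", env_name="dev"):
    """Return True if any path runs through the unpacked archive directory."""
    marker = "/{}/{}/".format(alias, env_name)
    return any(marker in p for p in paths)

# The path from the YARN logs quoted above, plus a made-up system path for contrast
from_yarn_log = [
    ".../container_1460665326796_0070_01_000002/ANACONDA/dev/lib/python3.5/site-packages",
]
system_python = ["/usr/lib/python2.7/dist-packages"]

print(uses_shipped_env(from_yarn_log))   # True
print(uses_shipped_env(system_python))   # False
```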

## Thoughts

Okay, perhaps that's not super exciting, so let's zoom out again:

1. We create a zipped conda environment with dependencies: pandas, python=3,...
2. We successfully launched a Python Spark job without any Python binaries or libraries previously installed on the nodes.

There is an open JIRA ticket discussing the option of having Spark ingest a requirements.txt and build the Python environment as a preamble to a Spark job. This is also a fairly novel approach to the same end -- using Spark to bootstrap a runtime environment. It's even a bit more general, since the method described above relies on YARN. I first saw this strategy in use with streamparse. Similar to the implementation in the JIRA ticket, streamparse can ship a Python requirements.txt and construct a Python environment as part of a Streamparse Storm job!

## Rrrrrrrrrrrrrr

Oh, and R conda environments work as well...but it's more involved.

## Create/Munge R Env

First, it's pretty cool that conda can install and manage R environments. Again, we create a conda environment with R binaries and libraries:

conda create -p /home/ubuntu/r_env --copy -y -q r-essentials -c r

R is not exactly relocatable so we need to munge a bit:

sed -i "s/home\/ubuntu/.r_env.zip/g" /home/ubuntu/r_env/bin/R

zip -r r_env.zip r_env

My R skills are at a below-novice level, so the following test script could probably be improved:

# /home/ubuntu/test_spark.R
library(SparkR)

sc <- sparkR.init(appName="get-hosts-R")

noop <- function(x) {
  path <- toString(.libPaths())
  host <- toString(Sys.info()['nodename'])
  host_path <- toString(cbind(host,path))
  host_path
}

rdd <- SparkR:::parallelize(sc, 1:1000, 100)
hosts <- SparkR:::map(rdd, noop)
d_hosts <- SparkR:::distinct(hosts)
out <- SparkR:::collect(d_hosts)
print(out)

Execute (and the real death by options):

SPARKR_DRIVER_R=./r_env.zip/r_env/lib/R spark-submit --master yarn-cluster \
--conf spark.yarn.appMasterEnv.R_HOME=./r_env.zip/r_env/lib64/R \
--conf spark.yarn.appMasterEnv.RHOME=./r_env.zip/r_env \
--conf spark.yarn.appMasterEnv.R_SHARE_DIR=./r_env.zip/r_env/lib/R/share \
--conf spark.yarn.appMasterEnv.R_INCLUDE_DIR=./r_env.zip/r_env/lib/R/include \
--conf spark.executorEnv.R_HOME=./r_env.zip/r_env/lib64/R \
--conf spark.executorEnv.RHOME=./r_env.zip/r_env \
--conf spark.executorEnv.R_SHARE_DIR=./r_env.zip/r_env/lib/R/share \
--conf spark.executorEnv.R_INCLUDE_DIR=./r_env.zip/r_env/lib/R/include \
--conf spark.r.command=./r_env.zip/r_env/bin/Rscript \
--archives r_env.zip \
/home/ubuntu/test_spark.R

Example output:

[1] "ip-172-31-50-59, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/230/sparkr.zip, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/229/r_env.zip/r_env/lib64/R/library"

[1] "ip-172-31-50-61, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/183/sparkr.zip, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/182/r_env.zip/r_env/lib64/R/library"

This post is also published on Ben's website here.

Posted Monday, April 18, 2016

## Overview

Dask is a flexible open source parallel computation framework that lets you comfortably scale up and scale out your analytics. If you’re running into memory issues, storage limitations, or CPU boundaries on a single machine when using Pandas, NumPy, or other computations with Python, Dask can help you scale up on all of the cores on a single machine, or scale out on all of the cores and memory across your cluster.

Dask enables distributed computing in pure Python and complements the existing numerical and scientific computing capability within Anaconda. Dask works well on a single machine to make use of all of the cores on your laptop and process larger-than-memory data, and it scales up resiliently and elastically on clusters with hundreds of nodes.

Dask works natively from Python with data in different formats and storage systems, including the Hadoop Distributed File System (HDFS) and Amazon S3. Anaconda and Dask can work with your existing enterprise Hadoop distribution, including Cloudera CDH and Hortonworks HDP.

In this post, we’ll show you how you can use Anaconda with Dask for distributed computations and workflows, including distributed dataframes, arrays, text processing, and custom parallel workflows that can help you make the most of Anaconda and Dask on your cluster. We’ll work with Anaconda and Dask interactively from the Jupyter Notebook while the heavy computations are running on the cluster.

There are many different ways to get started with Anaconda and Dask on your Hadoop or HPC cluster, including manual setup via SSH; by integrating with resource managers such as YARN, SGE, or Slurm; launching instances on Amazon EC2; or by using the enterprise-ready Anaconda for cluster management.

Anaconda for cluster management makes it easy to install familiar packages from Anaconda (including NumPy, SciPy, Pandas, NLTK, scikit-learn, scikit-image, and access to 720+ more packages in Anaconda) and the Dask parallel processing framework on all of your bare-metal or cloud-based cluster nodes. You can provision centrally managed installations of Anaconda, Dask and the Jupyter notebook using two simple commands with Anaconda for cluster management:

$ acluster create dask-cluster -p dask-cluster
$ acluster install dask notebook

Additional features of Anaconda for cluster management include:

• Easily install Python and R packages across multiple cluster nodes
• Manage multiple conda environments across a cluster
• Push local conda environments to all cluster nodes
• Works on cloud-based and bare-metal clusters with existing Hadoop installations

Once you’ve installed Anaconda and Dask on your cluster, you can perform many types of distributed computations, including text processing (similar to Spark), distributed dataframes, distributed arrays, and custom parallel workflows. We’ll show some examples in the following sections.

## Distributed Text and Language Processing (Dask Bag)

Dask works well with standard computations such as text processing and natural language processing and with data in different formats and storage systems (e.g., HDFS, Amazon S3, local files). The Dask Bag collection is similar to other parallel frameworks and supports operations like filter, count, fold, frequencies, pluck, and take, which are useful for working with a collection of Python objects such as text.

For example, we can use the natural language processing toolkit (NLTK) in Anaconda to perform distributed language processing on a Hadoop cluster, all while working interactively in a Jupyter notebook.

In this example, we'll use a subset of the data set that contains comments from the reddit website from January 2015 to August 2015, which is about 242 GB on disk. This data set was made available in July 2015 in a reddit post. The data set is in JSON format (one comment per line) and consists of the comment body, author, subreddit, timestamp of creation and other fields.

First, we import libraries from Dask and connect to the Dask distributed scheduler:

>>> import dask

>>> from distributed import Executor, hdfs, progress

>>> e = Executor('54.164.41.213:8786')

Next, we load 242 GB of JSON data from HDFS using pure Python:

>>> import json

>>> lines = hdfs.read_text('/user/ubuntu/RC_2015-*.json')

>>> js = lines.map(json.loads)

We can filter and load the data into distributed memory across the cluster:

>>> movies = js.filter(lambda d: 'movies' in d['subreddit'])

>>> movies = e.persist(movies)

Once we’ve loaded the data into distributed memory, we can import the NLTK library from Anaconda and construct stacked expressions to tokenize words, tag parts of speech, and filter out non-words from the dataset.

>>> import nltk
>>> pos = e.persist(movies.pluck('body')
...                       .map(nltk.word_tokenize)
...                       .map(nltk.pos_tag)
...                       .concat()
...                       .filter(lambda (word, pos): word.isalpha()))

In this example, we’ll generate a list of the top 10 proper nouns from the movies subreddit.

>>> f = e.compute(pos.filter(lambda (word, type): type == 'NNP')
...                  .pluck(0)
...                  .frequencies()
...                  .topk(10, lambda (word, count): count))
>>> f.result()
[(u'Marvel', 35452),
 (u'Star', 34849),
 (u'Batman', 31749),
 (u'Wars', 28875),
 (u'Man', 26423),
 (u'John', 25304),
 (u'Superman', 22476),
 (u'Hollywood', 19840),
 (u'Max', 19558),
 (u'CGI', 19304)]
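To see what the filter, pluck, frequencies and topk steps compute, here is the same logic on a handful of made-up (word, tag) pairs in plain Python 3. This is only a single-machine sketch of the semantics and not Dask Bag itself. The tagged list is invented for illustration. Note also that the lambdas above use Python 2 tuple-parameter syntax, which the list comprehension below sidesteps.

```python
from collections import Counter

# Made-up output of a tokenize + part-of-speech tagging step
tagged = [("Marvel", "NNP"), ("makes", "VBZ"), ("movies", "NNS"),
          ("Batman", "NNP"), ("fights", "VBZ"), ("Superman", "NNP"),
          ("Marvel", "NNP"), ("and", "CC"), ("Batman", "NNP"),
          ("Marvel", "NNP")]

# filter on the tag, then pluck(0) to keep only the word
proper_nouns = [word for word, tag in tagged if tag == "NNP"]

# frequencies: count how often each word occurs
counts = Counter(proper_nouns)

# topk: keep the k most frequent entries
top2 = counts.most_common(2)
print(top2)  # [('Marvel', 3), ('Batman', 2)]
```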

Finally, we can use Bokeh to generate an interactive plot of the resulting data:

## Analysis with Distributed Dataframes (Dask DataFrame)

Dask allows you to work with familiar Pandas dataframe syntax on a single machine or on many nodes on a Hadoop or HPC cluster. You can work with data stored in different formats and storage systems (e.g., HDFS, Amazon S3, local files). The Dask DataFrame collection mimics the Pandas API, uses Pandas under the hood, and supports operations like head, groupby, value_counts, merge, and set_index.

For example, we can use Dask to perform computations with dataframes on a Hadoop cluster with data stored in HDFS, all while working interactively in a Jupyter notebook.

First, we import libraries from Dask and connect to the Dask distributed scheduler:

>>> import dask

>>> from distributed import Executor, hdfs, progress, wait, s3

>>> e = Executor('54.164.41.213:8786')

Next, we’ll load the NYC taxi data in CSV format from HDFS using pure Python and persist the data in memory:

>>> df = hdfs.read_csv('/user/ubuntu/nyc/yellow_tripdata_2015-*.csv',
...                    parse_dates=['tpep_pickup_datetime','tpep_dropoff_datetime'],
...                    header='infer')
>>> df = e.persist(df)

We can perform familiar operations such as computing value counts on columns and statistical correlations:

>>> df.payment_type.value_counts().compute()
1    91574644
2    53864648
3      503070
4      170599
5          28
Name: payment_type, dtype: int64

>>> df2 = df.assign(payment_2=(df.payment_type == 2),
...                 no_tip=(df.tip_amount == 0))
>>> df2.astype(int).corr().compute()
             no_tip  payment_2
no_tip     1.000000   0.943123
payment_2  0.943123   1.000000

Dask runs entirely asynchronously, leaving us free to explore other cells in the notebook while computations happen in the background. Dask also takes care of the messy CSV schema handling for us automatically.
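The correlation above is an ordinary Pearson correlation between two true/false columns that were cast to integers. As a rough illustration of that arithmetic, here is a plain-Python sketch on eight made-up rides. The data and the pearson helper are assumptions for the example, and the result has nothing to do with the real 0.943 figure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Eight made-up rides
payment_type = [1, 2, 2, 1, 2, 1, 1, 2]
tip_amount = [2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 3.0, 0.0]

# Same two derived columns as in the post, cast from bool to int
payment_2 = [int(p == 2) for p in payment_type]
no_tip = [int(t == 0) for t in tip_amount]

r = pearson(payment_2, no_tip)
print(round(r, 4))  # 0.7746
```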

Finally, we can use Bokeh to generate an interactive plot of the resulting data:

## Numerical, Statistical and Scientific Computations with Distributed Arrays (Dask Array)

Dask works well with numerical and scientific computations on n-dimensional array data. The Dask Array collection mimics a subset of the NumPy API, uses NumPy under the hood, and supports operations like dot, flatten, max, mean, and std.

For example, we can use Dask to perform computations with arrays on a cluster with global temperature/weather data stored in NetCDF format (like HDF5), all while working interactively in a Jupyter notebook. The data files contain measurements that were taken every six hours at every quarter degree latitude and longitude.

First, we import the netCDF4 library and point to the data files stored on disk:

>>> import netCDF4
>>> from glob import glob
>>> filenames = sorted(glob('2014-*.nc3'))
>>> t2m = [netCDF4.Dataset(fn).variables['t2m'] for fn in filenames]
>>> t2m[0]
<class 'netCDF4._netCDF4.Variable'>
int16 t2m(time, latitude, longitude)
    scale_factor: 0.00159734395579
    add_offset: 268.172358066
    _FillValue: -32767
    missing_value: -32767
    units: K
    long_name: 2 metre temperature
unlimited dimensions:
current shape = (4, 721, 1440)
filling off

We then import Dask and read in the data from the NumPy-like netCDF variables:

>>> import dask.array as da

>>> xs = [da.from_array(t, chunks=t.shape) for t in t2m]

>>> x = da.concatenate(xs, axis=0)

We can then perform distributed computations on the cluster, such as computing the mean temperature, the standard deviation of the temperature over time, and the normalized temperature. We can view the progress of the computations as they run on the cluster nodes and continue to work in other cells in the notebook:

>>> avg, std = da.compute(x.mean(axis=0), x.std(axis=0))

>>> z = (x - avg) / std

>>> progress(z)
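The normalization z = (x - avg) / std is a standard score computed along the time axis. For a single grid point it reduces to the following plain-Python sketch. The five temperature readings are made up, and pstdev (the population standard deviation) is used because that matches NumPy's default std.

```python
from statistics import mean, pstdev

# Made-up temperatures in kelvin at one grid point over five time steps
temps_kelvin = [270.0, 272.0, 274.0, 276.0, 278.0]

avg = mean(temps_kelvin)     # mean over time
std = pstdev(temps_kelvin)   # population standard deviation over time
z = [(t - avg) / std for t in temps_kelvin]

print(avg)                        # 274.0
print([round(v, 4) for v in z])   # [-1.4142, -0.7071, 0.0, 0.7071, 1.4142]
```

After normalization the values at each grid point have mean 0 and standard deviation 1, which makes locations with very different climates comparable.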

We can plot the resulting normalized temperature using matplotlib:

We can also create interactive widgets in the notebook to interact with and visualize the data in real-time while the computations are running across the cluster:

## Creating Custom Parallel Workflows

When one of the standard Dask collections isn’t a good fit for your workflow, Dask gives you the flexibility to work with different file formats and custom parallel workflows. The Dask Imperative collection (used below through dask.delayed) lets you wrap functions in existing Python code and run the computations on a single machine or across a cluster.

In this example, we have multiple files stored hierarchically in a custom file format (Feather, for reading and writing Python and R dataframes on disk). We can build a custom workflow by wrapping the code with Dask Imperative (the delayed function imported below) and making use of the Feather library:

>>> import feather

>>> from dask import delayed

>>> from glob import glob

>>> import os

>>> import pandas as pd

>>> lazy_dataframes = []

>>> for directory in glob('2016-*'):

...     for symbol in os.listdir(directory):

...         filename = os.path.join(directory, symbol)

...         df = delayed(feather.read_dataframe)(filename)

...         df = delayed(pd.DataFrame.assign)(df,

...                 date=pd.Timestamp(directory),

...                 symbol=symbol)

...         lazy_dataframes.append(df)
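The loop above only builds a list of lazy objects; nothing is read from disk until the results are requested. As a rough sketch of how that works, the toy example below wraps two plain functions with delayed and then asks Dask to compute them. The functions load and tag are hypothetical stand-ins for feather.read_dataframe and pd.DataFrame.assign, and the file names are invented.

```python
import dask
from dask import delayed

def load(filename):
    # Hypothetical stand-in for feather.read_dataframe.
    return {"rows": len(filename)}

def tag(record, **fields):
    # Hypothetical stand-in for pd.DataFrame.assign.
    return dict(record, **fields)

lazy_records = []
for symbol in ["AAA.feather", "BB.feather"]:
    rec = delayed(load)(symbol)              # nothing runs yet
    rec = delayed(tag)(rec, symbol=symbol)   # extends the task graph
    lazy_records.append(rec)

# dask.compute walks the task graph and runs the wrapped functions.
results = dask.compute(*lazy_records)
print(results)
```

Because each file gets its own independent chain of tasks, the scheduler is free to run the chains in parallel, on one machine or across a cluster.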

You can get started with Anaconda and Dask for free by using Anaconda for cluster management on up to 4 cloud-based or bare-metal cluster nodes. Log in with your Anaconda Cloud account:

$ conda install anaconda-client -n root

$ anaconda login

$ conda install anaconda-cluster -c anaconda-cluster

In addition to Anaconda subscriptions, there are many different ways that Continuum can help you get started with Anaconda and Dask to construct parallel workflows, parallelize your existing code, or integrate with your existing Hadoop or HPC cluster, including:

• Architecture consulting and review
• Manage Python packages and environments on a cluster
• Develop custom package management solutions on existing clusters
• Migrate and parallelize existing code with Python and Dask
• Architect parallel workflows and data pipelines with Dask
• Build proof of concepts and interactive applications with Dask
• Custom product/OSS core development
• Training on parallel development with Dask

For more information about the above solutions, or if you’d like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at sales@continuum.io.

## April 17, 2016

### Titus Brown

#### MinHash signatures as ways to find samples, and collaborators?

As I wrote last week, my latest enthusiasm is MinHash sketches, applied (for the moment) to RNAseq data sets. Briefly, these are small "signatures" of data sets that can be used to compare data sets quickly. In the previous blog post, I talked a bit about their effectiveness and showed that (at least in my hands, and on a small data set of ~200 samples) I could use them to cluster RNAseq data sets by species.

What I didn't highlight in that blog post is that they could potentially be used to find samples of interest as well as (maybe) collaborators.

## Finding samples of interest

The "samples of interest" idea is pretty clear - suppose we had a collection of signatures from all the RNAseq in the Sequence Read Archive? Then we could search the entire SRA for data sets that were "close" to ours, and then just use those to do transcriptome studies.
It's not yet clear how well this might work for finding RNAseq data sets with similar expression patterns, but if you're working with non-model species, then it might be a good way to pick out all the data sets that you should use to generate a de novo assembly.

More generally, as we get more and more data, finding relevant samples may get harder and harder. This kind of approach lets you search on sequence content, not annotations or metadata, which may be incomplete or inaccurate for all sorts of reasons.

In support of this general idea, I have defined a provisional file format (in YAML) that can be used to transport around these signatures. It's rather minimal and fairly human readable - we would need to augment it with additional metadata fields for any serious use in databases (but see below for more discussion on that). Each record (and there can currently only be one record per signature file) can contain multiple different sketches, corresponding to different k-mer sizes used in generating the sketch. (For different sized sketches with the same k-mers, you just store the biggest one, because we're using bottom sketches so the bigger sketches properly include the smaller sketches.)

If you want to play with some signatures, you can -- here's an executable binder with some examples of generating distance matrices between signatures, and plotting them. Note that by far the most time is spent in loading the signatures - the comparisons are super quick, and in any case could be sped up a lot by moving them from pure Python over to C.

I've got a pile of all echinoderm SRA signatures already built, for those who are interested in looking at a collection -- look here.

## Finding collaborators

Searching public databases is all well and good, and is a pretty cool application to enable with a few dozen lines of code. But I'm also interested in enabling the search of pre-publication data and doing matchmaking between potential collaborators. How could this work?
Well, the interesting thing about these signatures is that they are irreversible signatures with a one-sided error (a match means something; no match means very little). This means that you can't learn much of anything about the original sample from the signature unless you have a matching sample, and even then all you know is the species and maybe something about the tissue/stage being sequenced.

In turn, this means that it might be possible to convince people to publicly post signatures of pre-publication mRNAseq data sets.

Why would they do this??

An underappreciated challenge in the non-model organism world is that building reference transcriptomes requires a lot of samples. Sure, you can go sequence just the tissues you're interested in, but you have to sequence deeply and broadly in order to generate good enough data to produce a good reference transcriptome so that you can interpret your own mRNAseq. In part because of this (as well as many other reasons), people are slow to publish on their mRNAseq - and, generally, data isn't made available pre-publication.

What if you could go fishing for collaborators on building a reference transcriptome? Very few people are excited about just publishing a transcriptome (with some reason, when you see papers that publish 300), but those are really valuable building blocks for the field as a whole.

So, suppose you had some RNAseq, and you wanted to find other people with RNAseq from the same organism, and there was this service where you could post your RNAseq signature and get notified when similar signatures were posted? You wouldn't need to do anything more than supply an e-mail address along with your signature, and if you're worried about leaking information about who you are, it's easy enough to make new e-mail addresses.

I dunno. Seems interesting. Could work. Right?

One fun point is that this could be a distributed service.
The signatures are small enough (~1-2 kb) that you can post them on places like github, and then have aggregators that collect them. The only "centralized" service involved would be in searching all of them, and that's pretty lightweight in practice.

Another fun point is that we already have a good way to communicate RNAseq for the limited purpose of transcriptome assembly -- diginorm. Abundance-normalized RNAseq is useless for doing expression analysis, and if you normalize a bunch of samples together you can't even figure out what the original tissue was. So, if you're worried about other people having access to your expression levels, you can simply normalize the data all together before handing it over.

## Further thoughts

As I said in the first post, this was all nucleated by reading the mash and MetaPalette papers. In my review for MetaPalette, I suggested that they look at mash to see if MinHash signatures could be used to dramatically reduce their database size, and now that I actually understand MinHash a bit more, I think the answer is clearly yes.

Which leads to another question - the Mash folk are clearly planning to use MinHash & mash to search assembled genomes, with a side helping of unassembled short and long reads. If we can all agree on an interchange format or three, why couldn't we just start generating public signatures of all the things, mRNAseq and genomic and metagenomic all? I see many, many uses, all somewhat dimly... (Lest anyone think I believe this to be a novel observation, clearly the Mash folk are well ahead of me here -- they undersold it in their paper, so I didn't notice until I re-read it with this in mind, but it's there :).

Anyway, it seems like a great idea and we should totally do it. Who's in? What are the use cases? What do we need to do? Where is it going to break?

--titus

p.s. Thanks to Luiz Irber for some helpful discussion about YAML formats!
## April 14, 2016

### Matthew Rocklin

#### Fast Message Serialization

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Very high performance isn’t about doing one thing well, it’s about doing nothing poorly.

This week I optimized the inter-node communication protocol used by dask.distributed. It was a fun exercise in optimization that involved several different and unexpected components. I separately had to deal with Pickle, NumPy, Tornado, MsgPack, and compression libraries.

This blogpost is not advertising any particular functionality, rather it’s a story of the problems I ran into when designing and optimizing a protocol to quickly send both very small and very large numeric data between machines on the Python stack.

We care very strongly about both the many small messages case (thousands of 100 byte messages per second) and the very large messages case (100-1000 MB). This spans an interesting range of performance space. We end up with a protocol that costs around 5 microseconds in the small case and operates at 1-1.5 GB/s in the large case.

## Identify a Problem

This came about as I was preparing a demo using dask.array on a distributed cluster for a Continuum webinar. I noticed that my computations were taking much longer than expected. The Web UI quickly pointed me to the fact that my machines were spending 10-20 seconds moving 30 MB chunks of numpy array data between them. This is very strange because I was on a 100 MB/s network, and so I expected these transfers to happen in more like 0.3s than 15s.

The Web UI made this glaringly apparent, so my first lesson was how valuable visual profiling tools can be when they make performance issues glaringly obvious. Thanks go to the Bokeh developers who helped with the development of the Dask real-time Web UI.

## Problem 1: Tornado’s sentinels

Dask’s networking is built off of Tornado’s TCP IOStreams.
There are two common ways to delineate messages on a socket: sentinel values that signal the end of a message, and prefixing a length before every message. Early on we tried both in Dask but found that prefixing a length before every message was slow. It turns out that this was because TCP sockets try to batch small messages to increase bandwidth. Turning this optimization off ended up being an effective and easy solution, see the TCP_NODELAY parameter.

However, before we figured that out we used sentinels for a long time. Unfortunately Tornado does not handle sentinels well for large messages. At the receipt of every new message it reads through all buffered data to see if it can find the sentinel. This makes lots and lots of copies and reads through lots and lots of bytes. This isn’t a problem if your messages are a few kilobytes, as is common in web development, but it’s terrible if your messages are millions or billions of bytes long.

Switching back to prefixing messages with lengths and turning off the batching optimization (by setting TCP_NODELAY) moved our bandwidth up from 3MB/s to 20MB/s per node. Thanks go to Ben Darnell (main Tornado developer) for helping us to track this down.

## Problem 2: Memory Copies

A nice machine can copy memory at 5 GB/s. If your network is only 100 MB/s then you can easily suffer several memory copies in your system without caring. This leads to code that looks like the following:

    socket.send(header + payload)

This code concatenates two bytestrings, header and payload, before sending the result down a socket. If we cared deeply about avoiding memory copies then we might instead send these two separately:

    socket.send(header)
    socket.send(payload)

But who cares, right? At 5 GB/s copying memory is cheap!

Unfortunately this breaks down under either of the following conditions:

1. You are sloppy enough to do this multiple times
2. You find yourself on a machine with surprisingly low memory bandwidth, like 10 times slower, as is the case on some EC2 machines.

Both of these were true for me but fortunately it’s usually straightforward to reduce the number of copies down to a small number (we got down to three), with moderate effort.

## Problem 3: Unwanted Compression

Dask compresses all large messages with LZ4 or Snappy if they’re available. Unfortunately, if your data isn’t very compressible then this is mostly lost time. Doubly unfortunate is that you also have to decompress the data on the recipient side. Decompressing not-very-compressible data was surprisingly slow.

Now we compress with the following policy:

1. If the message is less than 10kB, don’t bother
2. Pick out five 10kB samples of the data and compress those. If the result isn’t well compressed then don’t bother compressing the full payload.
3. Compress the full payload, if it doesn’t compress well then just send along the original to spare the receiver’s side from decompressing.

In this case we use cheap checks to guard against unwanted compression. We also avoid any cost at all for small messages, which we care about deeply.

## Problem 4: Cloudpickle is not as fast as Pickle

This was surprising, because cloudpickle mostly defers to Pickle for the easy stuff, like NumPy arrays.

    In [1]: import numpy as np
    In [2]: data = np.random.randint(0, 255, dtype='u1', size=10000000)
    In [3]: import pickle, cloudpickle
    In [4]: %time len(pickle.dumps(data, protocol=-1))
    CPU times: user 8.65 ms, sys: 8.42 ms, total: 17.1 ms
    Wall time: 16.9 ms
    Out[4]: 10000161
    In [5]: %time len(cloudpickle.dumps(data, protocol=-1))
    CPU times: user 20.6 ms, sys: 24.5 ms, total: 45.1 ms
    Wall time: 44.4 ms
    Out[5]: 10000161

But it turns out that cloudpickle is using the Python implementation, while pickle itself (or cPickle in Python 2) is using the compiled C implementation.
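The gap between a compiled and an interpreted pickler can be seen with the standard library alone. The sketch below times the C-accelerated pickle.dumps against pickle._dumps, which in CPython is the pure-Python pickler. Note the assumptions: _dumps is a private name that may change, it is used here only to illustrate the effect and is not a claim about what cloudpickle calls internally, and the payload is a plain list rather than the NumPy array used in the session above. Timings vary by machine, so none are shown.

```python
import pickle
import time

# A payload of many small objects; the post itself used a large NumPy array.
data = list(range(200_000))

def timed(dumps):
    """Serialize `data` with the given dumps function and time the call."""
    start = time.perf_counter()
    blob = dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
    return blob, time.perf_counter() - start

# pickle.dumps is the C implementation when the _pickle module is available.
c_blob, c_seconds = timed(pickle.dumps)
# pickle._dumps is CPython's pure-Python pickler (private name, may change).
py_blob, py_seconds = timed(pickle._dumps)

print(f"C pickler:      {c_seconds * 1e3:.1f} ms")
print(f"Python pickler: {py_seconds * 1e3:.1f} ms")
```

Both calls produce data that round-trips through pickle.loads, so the choice between them is purely one of speed.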
Fortunately this is easy to correct, and a quick typecheck on common large dataformats in Python (NumPy and Pandas) gets us this speed boost.

## Problem 5: Pickle is still slower than you’d expect

Pickle runs at about half the speed of memcopy, which is what you’d expect from a protocol that is mostly just “serialize the dtype, strides, then tack on the data bytes”. There must be an extraneous memory copy in there.

See issue 7544

## Problem 6: MsgPack is bad at large bytestrings

Dask serializes most messages with MsgPack, which is ordinarily very fast. Unfortunately the MsgPack spec doesn’t support bytestrings greater than 4GB (which do come up for us) and the Python implementations don’t pass through large bytestrings very efficiently. So we had to handle large bytestrings separately. Any message that contains bytestrings over 1MB in size will have them stripped out and sent along in a separate frame. This both avoids the MsgPack overhead and avoids a memory copy (we can send the bytes directly to the socket).

## Problem 7: Tornado makes a copy

Sockets on Windows don’t accept payloads greater than 128kB in size. As a result Tornado chops up large messages into many small ones. On Linux this memory copy is extraneous. It can be removed with a bit of logic within Tornado. I might do this in the moderate future.

## Results

We serialize small messages in about 5 microseconds (thanks msgpack!) and move large bytes around in the cost of three memory copies (about 1-1.5 GB/s) which is generally faster than most networks in use.

Here is a profile of sending and receiving a gigabyte-sized NumPy array of random values through to the same process over localhost (500 MB/s on my machine.)
    381360 function calls (381323 primitive calls) in 1.451 seconds

    Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.366    0.366    0.366    0.366 {built-in method dumps}
         8    0.289    0.036    0.291    0.036 iostream.py:360(write)
     15353    0.228    0.000    0.228    0.000 {method 'join' of 'bytes' objects}
     15355    0.166    0.000    0.166    0.000 {method 'recv' of '_socket.socket' objects}
     15362    0.156    0.000    0.398    0.000 iostream.py:1510(_merge_prefix)
      7759    0.101    0.000    0.101    0.000 {method 'send' of '_socket.socket' objects}
     17/14    0.026    0.002    0.686    0.049 gen.py:990(run)
     15355    0.021    0.000    0.198    0.000 iostream.py:721(_read_to_buffer)
         8    0.018    0.002    0.203    0.025 iostream.py:876(_consume)
        91    0.017    0.000    0.335    0.004 iostream.py:827(_handle_write)
        89    0.015    0.000    0.217    0.002 iostream.py:585(_read_to_buffer_loop)
    122567    0.009    0.000    0.009    0.000 {built-in method len}
     15355    0.008    0.000    0.173    0.000 iostream.py:1010(read_from_fd)
     38369    0.004    0.000    0.004    0.000 {method 'append' of 'list' objects}
      7759    0.004    0.000    0.104    0.000 iostream.py:1023(write_to_fd)
         1    0.003    0.003    1.451    1.451 ioloop.py:746(start)

Dominant unwanted costs include the following:

1. 400ms: Pickling the NumPy array
2. 400ms: Bytestring handling within Tornado

After this we’re just bound by pushing bytes down a wire.

## Conclusion

Writing fast code isn’t about writing any one thing particularly well, it’s about mitigating everything that can get in your way. As you approach peak performance, previously minor flaws suddenly become your dominant bottleneck. Success here depends on frequent profiling and keeping your mind open to unexpected and surprising costs.

## April 13, 2016

### Titus Brown

#### Applying MinHash to cluster RNAseq samples

(I gave a talk on this on Monday, April 11th - you can see the slides here, on figshare.

This is a Reproducible Blog Post. You can regenerate all the figures and play with this software yourself on binder.)

So, my latest enthusiasm is MinHash sketches.
A few weeks back, I had the luck to be asked to review both the mash paper (preprint here) and the MetaPalette paper (preprint here). The mash paper made me learn about MinHash sketches, while the MetaPalette paper made some very nice points about shared k-mers and species identification.

After reading, I got to thinking.

I wondered to myself, hey, could I use MinHash signatures to cluster unassembled Illumina RNAseq samples? While the mash folk showed that MinHash could be applied to raw reads nicely, I guessed that the greater dynamic range of gene expression would cause problems - mainly because high-abundance transcripts would yield many, many erroneous k-mers. Conveniently, however, my lab has some not-so-secret sauce for dealing with this problem - would it work, here? I thought it might.

Combined with all of this, my former grad student, Dr. Qingpeng Zhang (first author on the not-so-secret sauce, above) has some other still-unpublished work showing that the first ~1m reads of metagenome samples can be used to cluster samples together.

So, I reasoned, perhaps it would work well to stream the first million or so reads from the beginning of RNAseq samples through our error trimming approach, compute a MinHash signature, and then use that signature to identify the species from which the RNAseq was isolated (and perhaps even closely related samples).

tl; dr? It seems to work, with some modifications.

For everything below, I used a k-mer hash size of 32 and only chose read data sets with reads of length 72 or higher.

(Here's a nice presentation on MinHash, via Luiz Irber.)

## MinHash is super easy to implement

I implemented MinHash in only a few lines of Python; see the repository at https://github.com/dib-lab/sourmash/. The most relevant code is sourmash_lib.py. Here, I'm using a bottom sketch, and at the moment I'm building some of it on top of khmer, although I will probably remove that requirement soon.
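To give a feel for how little code is involved, here is a minimal bottom-sketch MinHash in pure Python. It is a sketch under assumptions, not the sourmash implementation: it hashes k-mers with MD5 from hashlib (an arbitrary choice), ignores reverse complements, and uses a small k and sketch size on made-up random sequences. All function names are hypothetical.

```python
import hashlib
import random

def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")

def bottom_sketch(seq, k, size):
    """Keep the `size` smallest distinct hash values seen in seq."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:size]

def similarity(a, b, size):
    """Estimate Jaccard similarity from two bottom sketches."""
    merged = sorted(set(a) | set(b))[:size]   # bottom sketch of the union
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)

random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(5000))
other = "".join(random.choice("ACGT") for _ in range(5000))

s1 = bottom_sketch(genome, k=21, size=100)
s2 = bottom_sketch(genome[500:], k=21, size=100)   # overlaps heavily with s1
s3 = bottom_sketch(other, k=21, size=100)          # unrelated sequence

print(similarity(s1, s1, 100), similarity(s1, s2, 100), similarity(s1, s3, 100))
```

Because a bottom sketch keeps the smallest hashes, a sketch of size 50 is simply the first 50 entries of the size-100 sketch, which is why only the largest sketch needs to be stored.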
After lots of trial and error (some of it reported below), I settled on using a k-mer size of k=32, and a sketch size of 500. (You can go down to a sketch size of 100, but you lose resolution. Lower k-mer sizes have the expected effect of slowly decreasing resolution; odd k-mer sizes effectively halve the sketch size.)

## How fast is it, and how much memory does it use, and how big are the sketches?

I haven't bothered benchmarking it, but

• everything but the hash function itself is in Python;
• on my 3-year-old laptop it takes about 5 minutes to add 1m reads;
• the memory usage of sourmash itself is negligible - error trimming the reads requires about 1 GB of RAM;
• the sketches are tiny - less than a few kb - and the program is dominated by the Python overhead.

So it's super fast and super lightweight.

## Do you need to error trim the reads?

The figure below shows a dendrogram next to a distance matrix of 8 samples - four mouse samples, untrimmed, and the same four mouse samples, trimmed at low-abundance k-mers. (You can see the trimming command here, using khmer's trim-low-abund command.)

The two house mouse samples are replicates, and they always cluster together. However, they are much further apart without trimming.

The effect of trimming on the disease mouse samples (which are independent biological samples, I believe) is much less; it rearranges the tree a bit but it's not as convincing as with the trimming.

So you seem to get better resolution when you error trim the reads, which is expected. The signal isn't as strong as I thought it'd be, though. Have to think about that; I'm surprised MinHash is that robust to errors!

## Species group together pretty robustly with only 1m reads

How many reads do you need to use? If you're looking for species groupings, not that many -- 1m reads is enough to cluster mouse vs yeast separately. (Which is good, right? If that didn't work...)
Approximately 1m reads turns out to work equally well for 200 echinoderm (sea urchin and sea star) samples, too.

Here, I downloaded all 204 echinoderm HiSeq mRNAseq data sets from SRA, trimmed them as above, and computed the MinHash signatures, and then compared them all to each other. The blocks of similarity are all specific species, and all the species groups cluster properly, and none of them (with one exception) cluster with other species.

This is also an impressive demonstration of the speed of MinHash - you can do all 204 samples against each other in about 10 seconds. Most of that time is spent loading my YAML format into memory; the actual comparison takes < 1s!

(The whole notebook for making all of these figures takes less than 30 seconds to run, since the signatures are already there; check it out!)

## Species that do group together may actually belong together

In the urchin clustering above, there's only one "confused" species grouping where one cluster contains more than one species - that's Patiria miniata and Patiria pectinifera, which are both bat stars.

I posted this figure on Facebook and noted the grouping, and Dan Rokhsar pointed out that on Wikipedia, Patiria has been identified as a complex of three closely related species in the Pacific.

So that's good - it seems like the only group that has cross-species clustering is, indeed, truly multi-species.

## You can sample any 1m reads and get pretty similar results

In theory, FASTQ files from shotgun sequencing are perfectly random, so you should be able to pick any 1m reads you want - including the first 1m. In practice, of course, this is not true. How similar are different subsamples?

Answer: quite similar. All seven 1m read subsamples (5 random, one from the middle, one from the end) are above 70% in similarity.

## (Very) widely divergent species don't cross-compare at all

If you look at (say) yeast and mouse, there's simply no similarity there at all.
32-mer signatures are apparently very specific.

(The graph below is kind of stupid. It's just looking at similarity between mouse and yeast data sets as you walk through the two data streams. It's 0.2% all the way.)

## Species samples get more similar (or stay the same) as you go through the stream

What happens when you look at more than 1m reads? Do the streams get more or less similar?

If you walk through two streams and update the MinHash signature regularly, you see either constant similarity or a general increase in similarity; in the mouse replicates, it's constant and high, and between disease mouse and house mouse, it grows as you step through the stream.

(The inflection points are probably due to how we rearrange the reads during the abundance trimming. More investigation needed.)

Yeast replicates also maintain high similarity through the data stream.

## What we're actually doing is mostly picking k-mers from the transcriptome

(This is pretty much what we expected, but as my dad always said, "trust but verify.")

The next question is, what are we actually seeing signatures of?

For example, in the above mouse example, we see growing similarity between two mouse data sets as we step through the data stream. Is this because we're counting more sequencing artifacts as we look at more data, or is this because we're seeing true signal?

To investigate, I calculated the MinHash signature of the mouse RNA RefSeq file, and then asked if the streams were getting closer to that as we walked through them. They are:

So, it seems like part of what's happening here is that we are looking at the True Signature of the mouse transcriptome. Good to know.

And that's it for today, folks.

## What can this be used for?

So, it all seems to work pretty well - the mash folk are dead-on right, and this is a pretty awesome and simple way to look at sequences.

Right now, my approach above seems like it's most useful for identifying what species some RNAseq is from.
If we can do that, then we can start thinking about other uses. If we can't do that pretty robustly, then that's a problem ;). So that's where I started.

It might be fun to run against portions of the SRA to identify mislabeled samples. Once we have the SRA digested, we can make that available to people who are looking for more samples from their species of interest; whether this is useful will depend. I'm guessing that it's not immediately useful, since the SRA species identifications seem pretty decent.

One simple idea is to simply run this on each new sample you get back from a sequencing facility. "Hey, this looks like Drosophila. ...did you intend to sequence Drosophila?" It won't work for identifying low-lying contamination that well, but it could identify mis-labeled samples pretty quickly.

Tracy Teal suggested that this could be used in-house in large labs to find out if others in the lab have samples of interest to you. Hmm. More on that idea later.

## Some big remaining questions

• Do samples actually cluster by expression similarity? Maybe - more work needed.
• Can this be used to compare different metagenomes using raw reads? No, probably not very well. At least, the metagenomes I care about are too diverse; you will probably need a different strategy. I'm thinking about it.

## One last shoutout

I pretty much reimplemented parts of mash; there's nothing particularly novel here, other than exploring it in my own code on public data :). So, thanks, mash authors!

--titus

## April 12, 2016

### Continuum Analytics news

#### Using Anaconda with PySpark for Distributed Language Processing on a Hadoop Cluster

Posted Tuesday, April 12, 2016

### Overview

Working with your favorite Python packages along with distributed PySpark jobs across a Hadoop cluster can be difficult due to tedious manual setup and configuration issues, which is a problem that becomes more painful as the number of nodes in your cluster increases.
Anaconda makes it easy to manage packages (including Python, R and Scala) and their dependencies on an existing Hadoop cluster with PySpark, including data processing, machine learning, image processing and natural language processing.

In a previous post, we demonstrated how you can use libraries in Anaconda to query and visualize 1.7 billion comments on a Hadoop cluster.

In this post, we’ll use Anaconda to perform distributed natural language processing with PySpark using a subset of the same data set. We’ll configure different enterprise Hadoop distributions, including Cloudera CDH and Hortonworks HDP, to work interactively on your Hadoop cluster with PySpark, Anaconda and a Jupyter Notebook.

In the remainder of this post, we'll:

1. Install Anaconda and the Jupyter Notebook on an existing Hadoop cluster.
2. Load the text/language data into HDFS on the cluster.
3. Configure PySpark to work with Anaconda and the Jupyter Notebook with different enterprise Hadoop distributions.
4. Perform distributed natural language processing on the data with the NLTK library from Anaconda.
5. Work locally with a subset of the data using Pandas and Bokeh for data analysis and interactive visualization.

### Provisioning Anaconda on a cluster

Because we’re installing Anaconda on an existing Hadoop cluster, we can follow the bare-metal cluster setup instructions in Anaconda for cluster management from a Windows, Mac, or Linux machine. We can install and configure conda on each node of the existing Hadoop cluster with a single command:

$ acluster create cluster-hadoop --profile cluster-hadoop

After a few minutes, we’ll have a centrally managed installation of conda across our Hadoop cluster in the default location of /opt/anaconda.

### Installing Anaconda packages on the cluster

Once we’ve provisioned conda on the cluster, we can install the packages from Anaconda that we’ll need for this example to perform language processing, data analysis and visualization:

$ acluster conda install nltk pandas bokeh

We’ll need to download the NLTK data on each node of the cluster. For convenience, we can do this using the distributed shell functionality in Anaconda for cluster management:

$ acluster cmd 'sudo /opt/anaconda/bin/python -m nltk.downloader -d /usr/share/nltk_data all'

In this post, we'll use a subset of the data set that contains comments from the reddit website from January 2015 to August 2015, which is about 242 GB on disk. This data set was made available in July 2015 in a reddit post. The data set is in JSON format (one comment per line) and consists of the comment body, author, subreddit, timestamp of creation and other fields.

Note that we could convert the data into different formats or load it into various query engines; however, since the focus of this blog post is using libraries with Anaconda, we will be working with the raw JSON data in PySpark.
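For orientation, a record in this one-comment-per-line layout can be parsed with the standard json module. The two sample lines below are invented, and the field names (body, author, subreddit, created_utc) are assumptions based on the description above rather than a confirmed schema.

```python
import json
from collections import Counter

# Two invented comments in one-JSON-object-per-line form.
raw = """\
{"author": "alice", "subreddit": "python", "created_utc": "1420070400", "body": "Dask is neat"}
{"author": "bob", "subreddit": "python", "created_utc": "1420070460", "body": "So is PySpark"}
"""

# Each non-empty line is an independent JSON document.
comments = [json.loads(line) for line in raw.splitlines() if line.strip()]

# Count comments per subreddit and crudely tokenize each body.
per_subreddit = Counter(c["subreddit"] for c in comments)
tokens = [c["body"].lower().split() for c in comments]

print(per_subreddit)
print(tokens)
```

Because every line stands alone, the files can be split at line boundaries and processed in parallel, which is what makes the raw JSON workable in PySpark without a conversion step.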

We’ll load the reddit comment data into HDFS from the head node. You can SSH into the head node by running the following command from the client machine:

$ acluster ssh

The remaining commands in this section will be executed on the head node. If it doesn’t already exist, we’ll need to create a user directory in HDFS and assign the appropriate permissions:

$ sudo -u hdfs hadoop fs -mkdir /user/ubuntu

$ sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu

We can then move the data by running the following command with valid AWS credentials, which will transfer the reddit comment data from the year 2015 (242 GB of JSON data) from a public Amazon S3 bucket into HDFS on the cluster:

$ hadoop distcp s3n://AWS_KEY:AWS_SECRET@blaze-data/reddit/json/2015/*.json /user/ubuntu/

Replace AWS_KEY and AWS_SECRET in the above command with valid Amazon AWS credentials.

To use Python from Anaconda along with PySpark, you can set the PYSPARK_PYTHON environment variable on a per-job basis along with the spark-submit command. If you’re using the Anaconda parcel for CDH, you can run a PySpark script (e.g., spark-job.py) using the following command:

$ PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit spark-job.py

If you’re using Anaconda for cluster management with Cloudera CDH or Hortonworks HDP, you can run the PySpark script using the following command (note the different path to Python):

$ PYSPARK_PYTHON=/opt/anaconda/bin/python spark-submit spark-job.py

Using the spark-submit command is a quick and easy way to verify that our PySpark script works in batch mode. However, it can be tedious to work with our analysis in a non-interactive manner as Java and Python logs scroll by.


Instead, we can use the Jupyter Notebook on our Hadoop cluster to work interactively with our data via Anaconda and PySpark.


Using Anaconda for cluster management, we can install Jupyter Notebook on the head node of the cluster with a single command, then open the notebook interface in our local web browser:

$ acluster install notebook

$ acluster open notebook

Once we’ve opened a new notebook, we’ll need to configure some environment variables for PySpark to work with Anaconda. The following sections include details on how to configure the environment variables for Anaconda to work with PySpark on Cloudera CDH and Hortonworks HDP.

#### Using the Anaconda Parcel with Cloudera CDH

If you’re using the Anaconda parcel with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and the Anaconda 4.0 parcel.

>>> import os

>>> import sys

>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"

>>> os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

#### Using Anaconda for cluster management with Cloudera CDH

If you’re using Anaconda for cluster management with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and Anaconda for cluster management 1.4.0.

>>> import os

>>> import sys

>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"

>>> os.environ["SPARK_HOME"] = "/opt/anaconda/parcels/CDH/lib/spark"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

#### Using Anaconda for cluster management with Hortonworks HDP

If you’re using Anaconda for cluster management with Hortonworks HDP, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Hortonworks HDP running Spark 1.6.0 and Anaconda for cluster management 1.4.0.

>>> import os

>>> import sys

>>> os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

### Initializing the SparkContext

After we’ve configured Anaconda to work with PySpark on our Hadoop cluster, we can initialize a SparkContext that we’ll use for distributed computations. In this example, we’ll be using the YARN resource manager in client mode:

>>> from pyspark import SparkConf

>>> from pyspark import SparkContext

>>> conf = SparkConf()

>>> conf.setMaster('yarn-client')

>>> conf.setAppName('anaconda-pyspark-language')

>>> sc = SparkContext(conf=conf)

Now that we’ve created a SparkContext, we can load the JSON reddit comment data into a Resilient Distributed Dataset (RDD) from PySpark:

>>> lines = sc.textFile("/user/ubuntu/*.json")

Next, we decode the JSON data and keep only the comments from the movies subreddit:

>>> import json

>>> data = lines.map(json.loads)

>>> movies = data.filter(lambda x: x['subreddit'] == 'movies')
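The decode-then-filter step can be tried out locally before running it on the cluster. The following is a minimal sketch in plain Python, with no Spark involved; the three sample lines and the names `raw_lines` and `movies_local` are made up for illustration and only mimic the one-comment-per-line layout of the real files.

```python
import json

# Hypothetical sample lines, shaped like the one-comment-per-line JSON files
raw_lines = [
    '{"subreddit": "movies", "author": "a1", "body": "Goonies"}',
    '{"subreddit": "python", "author": "a2", "body": "Use a virtualenv"}',
    '{"subreddit": "movies", "author": "a3", "body": "Great film"}',
]

# Same two steps as the RDD pipeline: decode each line, then keep r/movies
records = map(json.loads, raw_lines)
movies_local = [r for r in records if r["subreddit"] == "movies"]

print(len(movies_local))
```

On the cluster, the same two functions are applied partition by partition, so a quick local check like this can catch malformed records or a misspelled field name early.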

We can then persist the RDD in distributed memory across the cluster so that future computations and queries will be computed quickly from memory. Note that this operation only marks the RDD to be persisted; the data will be persisted in memory after the first computation is triggered:

>>> movies.persist()

We can count the total number of comments in the movies subreddit (about 2.9 million comments):

>>> movies.count()

2905085

We can inspect the first comment in the dataset, which shows fields for the author, comment body, creation time, subreddit, etc.:

>>> movies.take(1)

CPU times: user 8 ms, sys: 0 ns, total: 8 ms

Wall time: 113 ms

[{u'archived': False,

 u'author': u'kylionsfan',

 u'author_flair_css_class': None,

 u'author_flair_text': None,

 u'body': u'Goonies',

 u'controversiality': 0,

 u'created_utc': u'1420070402',

 u'distinguished': None,

 u'downs': 0,

 u'edited': False,

 u'gilded': 0,

 u'id': u'cnas90u',

 u'link_id': u't3_2qyjda',

 u'name': u't1_cnas90u',

 u'parent_id': u't3_2qyjda',

 u'retrieved_on': 1425124282,

 u'score': 1,

 u'score_hidden': False,

 u'subreddit': u'movies',

 u'subreddit_id': u't5_2qh3s',

 u'ups': 1}]

### Distributed Natural Language Processing

Now that we’ve filtered a subset of the data and loaded it into memory across the cluster, we can perform distributed natural language computations using Anaconda with PySpark.

First, we define a parse() function that imports the natural language toolkit (NLTK) from Anaconda and tags words in each comment with their corresponding part of speech. Then, we can map the parse() function to the movies RDD:

>>> def parse(record):

...    import nltk

...    tokens = nltk.word_tokenize(record["body"])

...    record["n_words"] = len(tokens)

...    record["pos"] = nltk.pos_tag(tokens)

...    return record

>>> movies2 = movies.map(parse)

Let’s take a look at the body of one of the comments:

>>> movies2.take(10)[6]['body']

u'Dawn of the Apes was such an incredible movie, it should be up there in my opinion.'

And the same comment with tagged parts of speech (e.g., nouns, verbs, prepositions):

>>> movies2.take(10)[6]['pos']

[(u'Dawn', 'NN'),

(u'of', 'IN'),

(u'the', 'DT'),

(u'Apes', 'NNP'),

(u'was', 'VBD'),

(u'such', 'JJ'),

(u'an', 'DT'),

(u'incredible', 'JJ'),

(u'movie', 'NN'),

(u',', ','),

(u'it', 'PRP'),

(u'should', 'MD'),

(u'be', 'VB'),

(u'up', 'RP'),

(u'there', 'RB'),

(u'in', 'IN'),

(u'my', 'PRP$'),

(u'opinion', 'NN'),

(u'.', '.')]

We can define a get_NN() function that extracts nouns from the records, filters stopwords, and removes non-words from the data set:

>>> def get_NN(record):

...    import re

...    from nltk.corpus import stopwords

...    all_pos = record["pos"]

...    ret = []

...    for pos in all_pos:

...        if pos[1] == "NN" \

...        and pos[0] not in stopwords.words('english') \

...        and re.search("^[0-9a-zA-Z]+$", pos[0]) is not None:

...            ret.append(pos[0])

...    return ret

>>> nouns = movies2.flatMap(get_NN)
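The filtering logic of get_NN() can be checked without NLTK or Spark. The sketch below is an illustration only: the small `STOPWORDS` set, the tagged tokens and the name `extract_nouns` are hypothetical stand-ins for the NLTK stopword list and the tagged records.

```python
import re

# Hypothetical stand-ins: a tiny stopword set and one already-tagged comment
STOPWORDS = {"the", "of", "it", "my"}
WORD_RE = re.compile(r"^[0-9a-zA-Z]+$")

def extract_nouns(tagged_tokens, stopwords=STOPWORDS):
    """Keep tokens tagged 'NN' that are alphanumeric and not stopwords."""
    return [
        token
        for token, tag in tagged_tokens
        if tag == "NN" and token not in stopwords and WORD_RE.search(token)
    ]

tagged = [("Dawn", "NN"), ("of", "IN"), ("the", "DT"), ("movie", "NN"),
          (",", ","), ("opinion", "NN"), ("it", "NN")]
nouns_local = extract_nouns(tagged)
print(nouns_local)
```

Here the stopwords live in a set that is built once, so each membership test is a constant-time lookup. Doing the same in get_NN(), by reading the stopword list once per record instead of once per token, is a possible optimization rather than something the original notebook does.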

We can then generate word counts for the nouns that we extracted from the dataset:

>>> counts = nouns.map(lambda word: (word, 1))

After we’ve done the heavy lifting, processing, filtering and cleaning on the text data using Anaconda and PySpark, we can collect the reduced word count results onto the head node.

>>> top_nouns = counts.countByKey()

>>> top_nouns = dict(top_nouns)
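To make the counting step concrete, here is a small local sketch of what countByKey() computes: the number of (key, value) pairs that share each key. It uses only the standard library, and the noun list is invented for illustration.

```python
from collections import Counter

# Hypothetical flattened noun list, like the output of flatMap(get_NN)
nouns_local = ["movie", "film", "movie", "time", "movie", "film"]

# (word, 1) pairs, mirroring nouns.map(lambda word: (word, 1))
pairs = [(word, 1) for word in nouns_local]

# countByKey() tallies how many pairs share each key
key_counts = Counter(key for key, _ in pairs)

top = key_counts.most_common(2)
print(top)
```

Because countByKey() brings the full dictionary of counts back to the driver, it suits cases like this one, where the number of distinct keys is small enough to fit in the memory of the head node.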

In the next section, we’ll continue our analysis on the head node of the cluster while working with familiar libraries in Anaconda, all in the same interactive Jupyter notebook.

### Local analysis with Pandas and Bokeh

Now that we’ve done the heavy lifting using Anaconda and PySpark across the cluster, we can work with the results as a dataframe in Pandas, where we can query and inspect the data as usual:

>>> import pandas as pd

>>> df = pd.DataFrame(list(top_nouns.items()), columns=['Noun', 'Count'])

Let’s sort the resulting word counts, and view the top 10 nouns by frequency:

>>> df = df.sort_values('Count', ascending=False)

>>> df_top_10 = df.head(10)

>>> df_top_10

| Noun | Count |
|---|---|
| movie | 539698 |
| film | 220366 |
| time | 157595 |
| way | 112752 |
| gt | 105313 |
| http | 92619 |
| something | 87835 |
| lot | 85573 |
| scene | 82229 |
| thing | 82101 |

Let’s generate a bar chart of the top 10 nouns using Pandas:

>>> %matplotlib inline

>>> df_top_10.plot(kind='bar', x='Noun')


Finally, we can use Bokeh to generate an interactive plot of the data:

>>> from bokeh.charts import Bar, show

>>> from bokeh.io import output_notebook

>>> from bokeh.charts.attributes import cat

>>> output_notebook()

>>> p = Bar(df_top_10,

...         label=cat(columns='Noun', sort=False),

...         values='Count',

...         title='Top N nouns in r/movies subreddit')

>>> show(p)


### Conclusion

In this post, we used Anaconda with PySpark to perform distributed natural language processing and computations on data stored in HDFS. We configured Anaconda and the Jupyter Notebook to work with PySpark on various enterprise Hadoop distributions (including Cloudera CDH and Hortonworks HDP), which allowed us to work interactively with Anaconda and the Hadoop cluster. This made it convenient to work with Anaconda for the distributed processing with PySpark, while reducing the data to a size that we could work with on a single machine, all in the same interactive notebook environment. The complete notebook for this example with Anaconda, PySpark, and NLTK can be viewed on Anaconda Cloud.

You can get started with Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:

$ conda install anaconda-client

$ anaconda login

$ conda install anaconda-cluster -c anaconda-cluster

If you’d like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at sales@continuum.io. The enterprise features of Anaconda, including the cluster management functionality and on-premises repository, are certified for use with Cloudera CDH 5.

If you’re running into memory errors, performance issues (related to JVM overhead or Python/Java serialization), problems translating your existing Python code to PySpark, or other limitations with PySpark, stay tuned for a future post about a parallel processing framework in pure Python that works with libraries in Anaconda and your existing Hadoop cluster, including HDFS and YARN.

### Matthieu Brucher

#### Analog modeling of a diode clipper (3a): Simulation

Now that we have a few methods, let’s try to simulate them. For both circuits, I’ll use the forward Euler, then the backward Euler and trapezoidal approximations, then I will show the results of changing the start estimate, and finish with the Newton-Raphson optimization. I haven’t checked (yet?) algorithms that don’t use the derivative, like the bisection or Brent algorithms. All graphs are done with x4 oversampling (although I also tried x8, x16 and x32).

# First diode clipper

Let’s start with the original equation:

$V_i - 2 R_1 I_s sinh(\frac{V_o}{nV_t}) - \int \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t}) - V_o = 0$

## Forward Euler

Let’s now figure out what to do with the integral by differentiating the equation:

$\frac{dV_o}{dt} = \frac{\frac{dV_i}{dt} - \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t})}{1 + \frac{2 I_s R_1}{nV_t} cosh(\frac{V_o}{nV_t})}$

So now we have the standard form that can be used in the usual way.
For the derivative of the input, I’ll always use the trapezoidal approximation, and for the output one, I’ll use the forward Euler, which leads to the “simple” equation:

$V_{on+1} = V_{on} + \frac{V_{in+1} - V_{in} - \frac{4 h I_s}{C_1} sinh(\frac{V_{on}}{nV_t})}{1 + \frac{2 I_s R_1}{nV_t} cosh(\frac{V_{on}}{nV_t})}$

## Backward Euler

For the backward Euler, I’ll start from the integral equation again and remove the time dependency:

$V_{in+1} - V_{in} - 2 R_1 I_s (sinh(\frac{V_{on+1}}{nV_t}) - sinh(\frac{V_{on}}{nV_t})) - \int^{t_{n+1}}_{t_n} \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t}) - V_{on+1} + V_{on} = 0$

Now the discretization becomes:

$V_{in+1} - V_{in} - 2 R_1 I_s (sinh(\frac{V_{on+1}}{nV_t}) - sinh(\frac{V_{on}}{nV_t})) - \frac{2 h I_s}{C_1} sinh(\frac{V_{on+1}}{nV_t}) - V_{on+1} + V_{on} = 0$

I didn’t use this equation for the Backward Euler because I would have had a dependency in the sinh term, so I would still have required the numerical methods to solve the equation.

## Trapezoidal rule

Here, we just need to change the discretization for a trapezoidal one:

$V_{in+1} - V_{in} - I_s sinh(\frac{V_{on+1}}{nV_t}) (\frac{h}{C_1} + 2 R_1) - I_s sinh(\frac{V_{on}}{nV_t}) (\frac{h}{C_1} - 2 R_1) - V_{on+1} + V_{on} = 0$

## Starting estimates

Starting from the different rules, we need to replace sinh(x):

• for the pivotal by $\frac{x}{x_0} sinh(x_0)$

• for the tangent rule by $\frac{x}{nV_t} cosh(\frac{x_0}{nV_t}) + y_0 - \frac{x_0}{nV_t} cosh(\frac{x_0}{nV_t})$

## Graphs

Let’s see now how all these optimizers compare (with an estimate of the next element being the last optimized value):

Numerical optimization comparison

Obviously, the Forward Euler method is definitely not good. Although its cost is on average 4 times lower, the accuracy is definitely not good enough. On the other hand, the other two methods give similar results (probably because I require quite a strong convergence, with less than 10e-8 difference between two iterations).
Now, how does the original estimate impact the results? I tried the Backward Euler to start, and the results are identical:

Original estimates comparison

To have a better picture, let’s turn down the number of iterations to 1 for all the estimates:

One step comparison

So all the estimates give a similar result. Comparing the number of iterations for the three estimates, the pivotal method gives the worst results, whereas the affine estimate lowers the number of iterations by one. Of course, there is a price to pay in the computation. So the obvious choice is to use the trapezoidal approximation with the affine starting point estimate, which is not my default choice in SimpleOverdriveFilter.

# To be continued

The post is getting longer than I thought, so let’s keep it there for now and the next post on the subject will tackle the other diode clipper circuit.

## April 11, 2016

### Continuum Analytics news

#### Data Science with Python at ODSC East

Posted Tuesday, April 12, 2016

By Sheamus McGovern, Open Data Science Conference Chair

At ODSC East, the most influential minds and institutions in data science will convene at the Boston Convention & Exhibition Center from May 20th to the 22nd to discuss and teach the newest and most exciting developments in data science.

As you know, the Python ecosystem is now one of the most important data science development environments available today. This is due, in large part, to the existence of a rich suite of user-facing data analysis libraries.

Powerful Python machine learning libraries like Scikit-learn, XGBoost and others bring sophisticated predictive analytics to the masses. The NLTK and Gensim libraries enable deep analysis of textual information in Python, and the Topik library provides a high-level interface to these and other natural language libraries, adding a new layer of usability.
The Pandas library has brought data analysis in Python to a new level by providing expressive data structures for quick and intuitive data manipulation and analysis. The notebook ecosystem in Python has also flourished with the development of the Jupyter, Rodeo and Beaker notebooks. The notebook interface is an increasingly popular way for data scientists to perform complex analyses and to convey and share those analyses and their results with colleagues and stakeholders.

Python is also host to a number of rich web-development frameworks that are used not only for building data science dashboards, but also for full-scale data science powered web-apps. Flask and Django lead the way in terms of the Python web-app development landscape, but Bottle and Pyramid are also quite popular. With Cython, code can approach speeds akin to that of C or C++, and new developments, like the Dask package, make computing on larger-than-memory datasets very easy. Visualization libraries, like Plot.ly and Bokeh, have brought rich, interactive and impactful data visualization tools to the fingertips of data analysts everywhere.

Anaconda has streamlined the use of many of these wildly popular open source data science packages by providing an easy way to install, manage and use Python libraries. With Anaconda, users no longer need to worry about tedious incompatibilities and library management across their development environments.

Several of the most influential Python developers and data scientists will be talking and teaching at ODSC East. Indeed, Peter Wang will be speaking at ODSC East. Peter is the co-founder and CTO at Continuum Analytics, as well as the mastermind behind the popular Bokeh visualization library and the Blaze ecosystem, which simplifies the analysis of Big Data with Python and Anaconda.
At ODSC East, there will be over 100 speakers, 20 workshops and 10 training sessions spanning seven conferences focused on Open Data Science, Disruptive Data Science, Big Data science, Data Visualization, Data Science for Good, Open Data and a Careers and Training conference. See below for a very small sampling of some of the powerful Python workshops and speakers we will have at ODSC East.

● Bayesian Statistics Made Simple - Allen Downey, Think Python

● Intro to Scikit learn for Machine Learning - Andreas Mueller, NYU Center for Data Science

● Parallelizing Data Science in Python with Dask - Matthew Rocklin, Continuum Analytics

● Interactive Viz of a Billion Points with Bokeh Datashader - Peter Wang, Continuum Analytics

## April 07, 2016

### Titus Brown

#### Bashing on monstrous sequencing collections

So, there's this fairly large collection of about 700 RNAseq samples, from 300 species in 40 or so phyla. It's called the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP), and was funded by the Moore Foundation as a truly field-wide collaboration to improve our reference collection for genes (and more). Back When, it was sequenced and assembled by the National Center for Genome Resources, and published in PLOS Biology (Keeling et al., 2014).

Partly because we think assembly has improved in the last few years, partly as an educational exercise, partly as an infrastructure exercise, partly as a demo, and partly just because we can, Lisa Cohen in my lab is starting to reassemble all of the data - starting with about 10%. She has some of the basic evaluations (mostly via transrate) posted, and before we pull the trigger on the rest of the assemblies, we're pausing to reflect and to think about what metrics to use, and what kinds of resources we plan to produce. (We are not lacking in ideas, but we might be lacking in good ideas, if you know what I mean.)
In particular, this exercise raises some interesting questions that we hope to dig into:

• what does a good transcriptome look like, and how could having 700 assemblies help us figure that out? (hint: distributions)

• what is a good canonical set of analyses for characterizing transcriptome assemblies?

• what products should we be making available for each assembly?

• what kind of data formatting makes it easiest for other bioinformaticians to build off of the compute we're doing?

• how should we distribute the workflow components? (Lisa really likes shell scripts but I've been lobbying for something more structured. 'make' doesn't really fit the bill here, though.)

• how do we "alert" the community if and when we come up with better assemblies? How do we merge assemblies between programs and efforts, and properly credit everyone involved?

Anyway, feedback welcome, here or on Lisa's post! We are happy to share methods, data, analyses, results, etc. etc.

--titus

p.s. Yes, that's right. I ask new grad students to start by assembling 700 transcriptomes. So? :)

### Martin Fitzpatrick

#### Why are numpy calculations not affected by the global interpreter lock?

Many numpy calculations are unaffected by the GIL, but not all.

In code that does not require the Python interpreter (e.g. C libraries), it is possible to explicitly release the GIL, allowing other code that depends on the interpreter to continue running. In the NumPy C codebase the macros NPY_BEGIN_THREADS and NPY_END_THREADS are used to delimit blocks of code that permit GIL release. You can see these in this search of the numpy source.

The NumPy C API documentation has more information on threading support. Note the additional macros NPY_BEGIN_THREADS_DESCR, NPY_END_THREADS_DESCR and NPY_BEGIN_THREADS_THRESHOLDED, which handle conditional GIL release, dependent on array dtypes and the size of loops.
Most core functions release the GIL - for example Universal Functions (ufunc) do so as described:

as long as no object arrays are involved, the Python Global Interpreter Lock (GIL) is released prior to calling the loops. It is re-acquired if necessary to handle error conditions.

With regard to your own code, the source code for NumPy is available. Check the functions you use (and the functions they call) for the above macros. Note also that the performance benefit is heavily dependent on how long the GIL is released - if your code is constantly dropping in/out of Python you won’t see much of an improvement.

The other option is to just test it. However, bear in mind that functions using the conditional GIL macros may exhibit different behaviour with small and large arrays. A test with a small dataset may therefore not be an accurate representation of performance for a larger task.

There is some additional information on parallel processing with numpy available on the official wiki and a useful post about the Python GIL in general over on Programmers.SE.

## April 06, 2016

### Continuum Analytics news

#### Anaconda 4.0 Release

Posted Wednesday, April 6, 2016

We are happy to announce that Anaconda 4.0 has been released, which includes the new Anaconda Navigator.

Did you notice we skipped from release 2.5 to 4.0? Sharp eyes! The team decided to move up to the 4.0 release number to reduce confusion with common Python versions.

Anaconda Navigator is a desktop graphical user interface included in Anaconda that allows you to launch applications and easily manage conda packages, environments and channels without the need to use command line commands. It is available for Windows, OS X and Linux. For those familiar with the Anaconda Launcher, Anaconda Navigator has replaced Launcher. If you are already using Anaconda Cloud to host private packages, you can access them easily by signing in with your Anaconda Cloud account.
There are four main components in Anaconda Navigator; each one can be selected by clicking the corresponding tab in the left-hand column:

• Home, where you can install, upgrade, and launch applications.

• Environments, which allows you to manage channels, environments and packages.

• Learning, which shows a long list of learning resources in several categories: webinars, documentation, videos and training.

• Community, where you can connect to other users through events, forums and social media.

If you already have Anaconda installed, update to Anaconda 4.0 by using conda:

conda update conda

conda install anaconda=4.0

The full list of changes, fixes and updates for Anaconda v4.0 can be found in the changelog.

We’d very much appreciate your feedback on the latest release, especially the new Anaconda Navigator. Please submit comments or issues through our anaconda-issues GitHub repo.

## April 05, 2016

### Enthought

#### Just Released: PyXLL v 3.0 (Python in Excel). New Real Time Data Stream Capabilities, Excel Ribbon Integration, and More.

Download a free 30 day trial of PyXLL and try it with your own data.

Since PyXLL was first released back in 2010 it has grown hugely in popularity and is used by businesses in many different sectors. The original motivation for PyXLL was to be able to use all the best bits of Excel […]

### Matthieu Brucher

#### Analog modeling of a diode clipper (2): Discretization

Let’s start with the two equations we got from the last post and see what we can do with usual/academic tools to solve them (I will tackle nodal and ZDF tools later in this series).

# Euler and trapezoidal approximation

The usual tools start with a specific form:

$\dot{y} = f(y)$

I’ll work with the second clipper whose equation is of this form:

$\frac{dV_o}{dt} = \frac{V_i - V_o}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t}) = f(V_o)$

## Forward Euler

The simplest way of computing the derivative term is to use the following rule, with h, the inverse of the sampling frequency:

$V_{on+1} = V_{on} + h f(V_{on})$

The nice thing about this rule is that it is easy to compute. The main drawback is that the result may not be accurate and stable enough (let’s keep this for the next post, with actual derivations).
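As a rough sketch of this rule, the following Python applies forward Euler to the second clipper equation. The component values (`R1`, `C1`, `IS`, `N_VT`), the 48 kHz base rate and the helper names are assumptions chosen for illustration; they are not values taken from this series.

```python
import math

# Assumed component values, for illustration only (not from the post)
R1 = 2200.0            # resistor, ohms
C1 = 10e-9             # capacitor, farads
IS = 1e-14             # diode saturation current, amperes
N_VT = 1.24 * 26e-3    # ideality factor times thermal voltage, volts

def f(vo, vi):
    """Right-hand side dVo/dt of the second clipper equation."""
    return (vi - vo) / (R1 * C1) - 2.0 * IS / C1 * math.sinh(vo / N_VT)

def forward_euler(vi_samples, h, vo0=0.0):
    """Apply V_o[n+1] = V_o[n] + h * f(V_o[n]) sample by sample.

    out[k] is the output one step after input sample k.
    """
    vo = vo0
    out = []
    for vi in vi_samples:
        vo = vo + h * f(vo, vi)
        out.append(vo)
    return out

h = 1.0 / (4 * 48000)  # x4 oversampling of an assumed 48 kHz rate
vi = [0.1 * math.sin(2 * math.pi * 1000 * n * h) for n in range(192)]
vo = forward_euler(vi, h)
print(max(vo))
```

The input amplitude is kept small on purpose. With larger inputs the diode term grows exponentially, the equation becomes stiff, and an explicit rule like this one may lose accuracy or become unstable at this step size.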

## Backward Euler

Instead of using the past to compute the new sample, we can use the future, which leads to

$V_{on+1} = V_{on} + h f(V_{on+1})$

As the result is present on both sides, solving the problem is not simple. In this case, we can even say that the equation has no closed-form solution (due to the sinh term), and thus no analytical solution. The only way is to use numerical methods like Brent or Newton-Raphson to solve the equation.

## Trapezoidal approximation

Another option is to combine both rules to get a better approximation of the derivative term:

$V_{on+1} = V_{on} + \frac{h}{2}(f(V_{on}) + f(V_{on+1}))$

We still need the numerical methods to solve the clipper equation with this method, but like the Backward Euler method, this one is said to be A-stable, which is a mandatory condition when solving stiff systems (or systems that have a bad condition number). For a one variable system, the condition number is of course 1…

## Other approximations

There are other ways of approximating this derivative term. The most used one is the trapezoidal method, but there are others, like all the linear multistep methods (which actually encompass the first three).

# Numerical methods

Let’s try to analyse a few numerical methods. If we use the trapezoidal approximation, then the following function needs to be considered:

$g(V_{on+1}) = V_{on+1} - V_{on} - \frac{h}{2}(f(V_{on}) + f(V_{on+1}))$

The goal is to find a value where the function is zero, called a root of the function.

## Bisection method

This method is simple to implement. We start from two points, one on either side of the root. Then we take the middle of the interval, check the sign of the function there, keep the original point whose sign differs, and repeat until we get close enough to the root.

What is interesting with this method is that it can be vectorized easily by checking several values in the interval instead of just one.
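A minimal sketch of the bisection loop, in plain (non-vectorized) Python. The function `toy_g` is a made-up residual that only shares the shape of the clipper equations (a linear term plus a sinh term); it is not the actual g() from this post.

```python
import math

def bisect_root(func, lo, hi, tol=1e-10, max_iter=200):
    """Find a root of func in [lo, hi]; func(lo) and func(hi) must differ in sign."""
    f_lo, f_hi = func(lo), func(hi)
    if f_lo * f_hi > 0:
        raise ValueError("root is not bracketed")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if f_mid == 0.0 or (hi - lo) < tol:
            return mid
        # Keep the endpoint whose sign differs from the midpoint's
        if f_lo * f_mid < 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Toy residual: linear term plus sinh, like the discretized clipper equations
def toy_g(x):
    return x + 0.5 * math.sinh(x) - 1.0

root = bisect_root(toy_g, 0.0, 1.0)
print(root, toy_g(root))
```

Each iteration halves the bracket, so the number of iterations depends only on the initial interval width and the tolerance, not on how nonlinear the function is.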

## Newton-Raphson

This numerical method requires the derivative function of g(). Then we can start from the original starting point and iterate this series:

$x_{n+1} = x_n - \frac{g(x_n)}{g'(x_n)}$

For those who are used to optimizing cost functions and know about the Newton method, this is exactly the same thing. To optimize a cost function, we need to find a zero of its derivative. So if g() is this derivative, we end up with the Newton method to minimize (or maximize) a cost function.

That being said, although the rate of convergence of the Newton-Raphson method is quadratic, it may not converge at all. Convergence is only guaranteed when we start close enough to the root we are looking for and when some conditions hold that are usually quite complex to check (see the wikipedia page).
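To illustrate, here is a sketch of one trapezoidal step of the second clipper solved with Newton-Raphson, starting from the previous output. As before, the component values and function names are assumptions made for this example, not the values or code used in this series.

```python
import math

# Assumed component values, for illustration only (not from the post)
R1, C1 = 2200.0, 10e-9
IS, N_VT = 1e-14, 1.24 * 26e-3

def f(vo, vi):
    """Right-hand side dVo/dt of the second clipper equation."""
    return (vi - vo) / (R1 * C1) - 2.0 * IS / C1 * math.sinh(vo / N_VT)

def df_dvo(vo):
    """Derivative of f with respect to Vo."""
    return -1.0 / (R1 * C1) - 2.0 * IS / (C1 * N_VT) * math.cosh(vo / N_VT)

def trapezoidal_step(vo_prev, vi_prev, vi_next, h, tol=1e-10, max_iter=50):
    """Solve g(x) = x - vo_prev - h/2 * (f(vo_prev) + f(x)) = 0 for x."""
    f_prev = f(vo_prev, vi_prev)
    x = vo_prev  # starting point: the result of the last optimization
    for _ in range(max_iter):
        g = x - vo_prev - 0.5 * h * (f_prev + f(x, vi_next))
        dg = 1.0 - 0.5 * h * df_dvo(x)
        step = g / dg
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton-Raphson did not converge")

def simulate(vi_samples, h):
    """Run the trapezoidal rule over a whole input signal."""
    vo, vi_prev, out = 0.0, 0.0, []
    for vi in vi_samples:
        vo = trapezoidal_step(vo, vi_prev, vi, h)
        vi_prev = vi
        out.append(vo)
    return out

h = 1.0 / (4 * 48000)  # x4 oversampling of an assumed 48 kHz rate
vi = [2.0 * math.sin(2 * math.pi * 1000 * n * h) for n in range(192)]
vo = simulate(vi, h)
print(max(vo))
```

Raising an error when the iteration budget is exhausted is a design choice for this sketch; a real-time filter would more likely fall back to the last iterate.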

## Starting point

There are several ways of starting the methods.

The first one is easy enough: just use the result of the last optimization.

The second one is a little bit more complex: approximate all the complex functions (like sinh) by their tangent and solve the resulting polynomial.

The third one is derived from the second one and is called the pivotal method (or mystran method). Instead of using the tangent, we use the linear function that crosses the origin and the last point. The idea here is that it can be more stable than the tangent method (consider doing this for the hyperbolic tangent, where the tangent-based estimate could end up quite far away).
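The two linearizations can be written down directly for sinh. In this sketch, `x0` is the last known point and the function names are hypothetical; the example only compares the two straight lines, it does not solve the clipper equation.

```python
import math

def tangent_sinh(x, x0):
    """Tangent line of sinh at x0, evaluated at x."""
    return math.sinh(x0) + (x - x0) * math.cosh(x0)

def pivotal_sinh(x, x0):
    """Line through the origin and (x0, sinh(x0)), evaluated at x."""
    if x0 == 0.0:
        return x  # limit of sinh(x0) / x0 when x0 goes to 0
    return x * math.sinh(x0) / x0

x0 = 2.0
for x in (0.0, 2.0, 2.5):
    print(x, math.sinh(x), tangent_sinh(x, x0), pivotal_sinh(x, x0))
```

Both lines agree with sinh at `x0`. The pivotal line always passes through the origin, like sinh itself, while the tangent line generally does not, which is one way to see why the two estimates can behave differently far from `x0`.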

# Conclusion

Of course, there are other numerical methods that I haven’t spoken about. Any can be tried and used. Please do so and report your results!

Let’s see how the ones I’ve shown behave in the next post.

## April 04, 2016

### Continuum Analytics news

#### Anaconda Powers TaxBrain to Transform Washington Through Transparency, Access and Collaboration

Posted Monday, April 4, 2016

Continuum Analytics and Open Source Policy Center Leverage the Power of Open Source to Build Vital Policy Forecasting Models with Anaconda

AUSTIN, TX—April 4, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading open source analytics platform powered by Python, today announced that Anaconda is powering the American Enterprise Institute’s (AEI) Open Source Policy Center (OSPC) TaxBrain initiative. TaxBrain is a web-based application that lets users simulate and study the effect of tax policy reforms using open source economic models. TaxBrain provides transparent analysis for policy makers and the public, ultimately creating a more democratic and scientific platform to analyze economic policy.

“OSPC’s mission is to empower the public to contribute to government policymaking through open source methods and technology, making policy analysis more transparent, trustworthy and collaborative,” said Matt Jensen, founder and managing director of the Open Source Policy Center at the American Enterprise Institute. “TaxBrain is OSPC’s first product, and with Anaconda it is already improving tax policy by making the policy analysis process more democratic and scientific. By leveraging the power of open source, we are able to provide policy makers, journalists and the general public with the information they need to impact and change policy for the better.”

TaxBrain is made possible by a community of economists, data scientists, software developers, and policy experts who are motivated by a shared belief that public policy should be guided by open scientific inquiry, rather than proprietary analysis. The community also believes that the analysis of public policy should be freely available to everyone, rather than just to a select group of those in power.

“The TaxBrain initiative is only the beginning of a much larger movement to use open source approaches in policy and government,” said Travis Oliphant, CEO and co-founder of Continuum Analytics. “TaxBrain is the perfect example of how Anaconda can inspire people to harness the power of data science to enable positive changes. With Anaconda, the OSPC is able to empower a growing community with the superpowers necessary to promote change and democratic policy reform.”

Anaconda has allowed TaxBrain to tap a vast network of outside contributors in the PyData community to accelerate the total number of open source economic models. Contributions from the PyData community come quickly and—because of Anaconda—get large performance gains from Numba, the Python compiler included in Anaconda, and are easy to integrate into TaxBrain. These contributions can be used in other applications and are hosted on the Anaconda Cloud (anaconda.org/ospc).

AEI is a nonprofit, nonpartisan public policy research organization that works to expand liberty, increase individual opportunity, and strengthen free enterprise.

Continuum Analytics is the creator and driving force behind Anaconda, the leading, modern open source analytics platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 2.25M downloads annually and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team––data scientists, developers, DevOps, data engineers and business analysts––to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.


## April 02, 2016

### NeuralEnsemble

#### EU Human Brain Project Releases Platforms to the Public

"Geneva, 30 March 2016 — The Human Brain Project (HBP) is pleased to announce the release of initial versions of its six Information and Communications Technology (ICT) Platforms to users outside the Project. These Platforms are designed to help the scientific community to accelerate progress in neuroscience, medicine, and computing.

[...]

The six HBP Platforms are:
• The Neuroinformatics Platform: registration, search, analysis of neuroscience data.
• The Brain Simulation Platform: reconstruction and simulation of the brain.
• The High Performance Computing Platform: computing and storage facilities to run complex simulations and analyse large data sets.
• The Medical Informatics Platform: searching of real patient data to understand similarities and differences among brain diseases.
• The Neuromorphic Computing Platform: access to computer systems that emulate brain microcircuits and apply principles similar to the way the brain learns.
• The Neurorobotics Platform: testing of virtual models of the brain by connecting them to simulated robot bodies and environments.
All the Platforms can be accessed via the HBP Collaboratory, a web portal where users can also find guidelines, tutorials and information on training seminars. Please note that users will need to register to access the Platforms and that some of the Platform resources have capacity limits."

... More in the official press release here.

The HBP held an online release event on 30 March:

Prof. Felix Schürmann (EPFL-BBP, Geneva), Dr. Eilif Muller (EPFL-BBP, Geneva), and Prof. Idan Segev (HUJI, Jerusalem) present an overview of the mission, tools, capabilities and science of the EU Human Brain Project (HBP) Brain Simulation Platform:

A publicly accessible forum for the BSP is here:
https://forum.humanbrainproject.eu/c/bsp
and for community models
https://forum.humanbrainproject.eu/c/community-models
and for community models of hippocampus in particular
https://forum.humanbrainproject.eu/c/community-models/hippocampus

## April 01, 2016

### Enthought

#### Virtual Core: CT, Photo, and Well Log Co-visualization

Enthought is pleased to announce Virtual Core 1.8.  Virtual Core automates aspects of core description for geologists, drastically reducing the time and effort required for core description, and its unified visualization interface displays cleansed whole-core CT data alongside core photographs and well logs.  It provides tools for geoscientists to analyze core data and extract features from […]

#### Canopy Geoscience: Python-Based Analysis Environment for Geoscience Data

Today we officially release Canopy Geoscience 0.10.0, our Python-based analysis environment for geoscience data. Canopy Geoscience integrates data I/O, visualization, and programming, in an easy-to-use environment. Canopy Geoscience is tightly integrated with Enthought Canopy’s Python distribution, giving you access to hundreds of high-performance scientific libraries to extract information from your data. The Canopy Geoscience environment […]

## March 31, 2016

### Continuum Analytics news

#### Why Every CEO Needs To Understand Data Science

Posted Thursday, March 31, 2016

Tech culture has perpetuated the myth that data science is a sort of magic; something that only those with exceptional math skills, deep technical know-how and industry knowledge can understand or act on. While it’s true that math skills and technical knowledge are required to effectively extract insights from data, it’s far from magic. Given a little time and effort, anyone can become familiar with the basic concepts.

As a CEO, you don’t need to understand every technical detail, but it’s very important to have a good grasp of the entire process behind extracting useful insights from your data. Click on the full article below to read the five big-picture steps you must take to ensure you understand data science (and to ensure your company is gaining actionable insights throughout the process).

## March 30, 2016

### Titus Brown

#### A grant proposal: A workshop on dockerized notebook computing

I'm writing a proposal to the Sloan Foundation for about \$20k to support a workshop to hack on mybinder. Comments solicited. Note, it's, umm, due today ;).

(I know the section on "major related work" is weak. I could use some help there.)

If you're interested in participating and don't mind being named in the proposal, drop me an e-mail or leave me a note in a comment.

Summary:

We propose to host a hackfest (cooperative hackathon) workshop to enhance and extend the functionality of the mybinder notebook computing platform. Mybinder provides a fully hosted solution for executing Jupyter notebooks based in GitHub repositories; it isolates execution and provides configurability through the use of Docker containers. We would like to extend mybinder to support a broader range of data science tools - specifically, RStudio and other tools in the R ecosystem - as well as additional cloud hosting infrastructure and version control systems. A key aspect of this proposal is to brainstorm and prototype support for credentials so that private resources can be used to source and execute binders (on e.g. AWS accounts and private repositories). We believe this will broaden interest and use of mybinder and similar resources, and ultimately drive increased adoption of fully specified and executable data narratives.

What is the subject, and why is it important?

Fully specified and perfectly repeatable computational data analyses have long been a goal of open scientists - after all, if we can eliminate the dull parts of reproducibility, we can get on with the more exciting bits of arguing about significance, interpretation and meaning. There is also the hope that fully open and repeatable computational methods can spur reuse and remixing of these methods, and accelerate the process of computational science. We have been asymptotically approaching this in science for decades, helped by configuration and versioning tools used in open source development and devops. The emergence of fully open source virtual machine hosting, cloud computing, and (most recently) container computing means that a detailed host configuration can be shared and instantiated on demand; together with flexible interactive Web notebooks such as RStudio and Jupyter Notebook, and public hosting of source code, we can now provide code, specify the execution environment, execute a workflow, and give scientists (and others) an interactive data analysis environment in which to explore the results.

Fully specified data narratives are already impacting science and journalism, by providing a common platform for data-driven discussion. We and others are using them in education, as well, for teaching and training in data science, statistics, and other fields. Jupyter Notebooks and RMarkdown are increasingly used to write books and communicate data- and compute-based analyses. And, while the technology is still young, the interactive widgets in Jupyter make it possible to communicate these analyses to communities that are not coders. At the moment, we can’t say where this will all lead, but it is heading towards an exciting transformation of how we publish, work with, collaborate around, and explore computing.

mybinder and similar projects provide a low barrier-to-entry way to publish and then execute Jupyter notebooks in a fully customizable environment, where dependencies and software requirements can be specified in a simple, standard way tied directly to the project. For example, we have been able to use mybinder to provision notebooks for a classroom of 30 people in under 30 seconds, with only 5 minutes of setup and preparation. Inadvertent experiments on Twitter have shown us that the current infrastructure can expand to handle hundreds of simultaneous users. In effect, mybinder is a major step closer to helping us realize the promise of ubiquitous and frictionless computational narratives.

What is the major related work in this field?

I am interested in (a) potentially anonymous execution of full workflows, (b) within a customizable compute environment, (c) with a robust, open source software infrastructure supporting the layer between the execution platform and the interactive environment.

There are a variety of hosting platforms for Jupyter Notebooks, but I am only aware of one that offers completely anonymous execution of notebooks - this is the tmpnb service provided by Rackspace, used to support a Nature demo of the notebook. Even more than mybinder, tmpnb enables an ecosystem of services because it lets users frictionlessly “spin up” an execution environment - for a powerful demo of this being used to support static execution, see https://betatim.github.io/posts/really-interactive-posts/. However, tmpnb doesn’t allow flexible configuration of the underlying execution environment, which prevents it from being used for more complex interactions in the same way as mybinder.

More generally, there are many projects that seek to make deployment and execution of Jupyter Notebooks straightforward. This includes JupyterHub, everware, thebe, Wakari, SageMathCloud, and Google Drive. Apart from everware (which has significant overlap with mybinder in design) these other platforms are primarily designed around delivery of notebooks and collaboration within them, and do not provide the author-specified customization of the compute environment provided by mybinder via Docker containers. That having been said, all of these are robust players within the Jupyter ecosystem and are building out tools and approaches that mybinder can take advantage of (and vice versa).

We have already discussed the proposed workshop informally with people from Jupyter, everware, and thebe, and anticipate inviting at least one team member from each project (as well as tmpnb).

Outside of the Jupyter ecosystem, the R world has a collection of software that I’m mostly unfamiliar with. This includes RStudio Server, an interactive execution environment for R that is accessed remotely over a Web interface, and Shiny, which allows users to serve R-based analyses to the Web. These compare well with Jupyter Notebook in features and functionality. One goal of this workshop is to provide a common execution system to support these in the same way that mybinder supports Jupyter now, and to find out which R applications are the most appropriate ones to target. We will invite one or more members of the rOpenSci team to the workshop for exactly this purpose.

Why is the proposer qualified?

My major qualifications for hosting the workshop are as follows:

• Known (and “good”) actor in the open source world, with decades of dedication to open source and open science principles and community interaction.
• Official Jupyter evangelist, on good terms with everyone (so far as I know).
• Neutral player with respect to all of these projects.
• Teacher and trainer and designer of workshop materials for Jupyter notebooks, Docker, reproducible science, version control, and cloud computing.
• Affiliated with Software Carpentry and Data Carpentry, to help with delivery of training materials.
• Interest and some experience in fostering diverse communities (primarily from connections with Python Software Foundation and Software Carpentry).
• Technically capable of programming my way out of a paper bag.
• Located in California, to which people will enthusiastically travel in the winter months.
• One of the members and discussants of the ever pub #openscienceprize proposal, which explored related topics of executable publications in some detail (but was then rejected).

What is the approach being taken?

The approach is to run a cooperative hackathon/hackfest/workshop targeting a few specific objectives, but with flexibility to expand or change focus as determined by the participants. The objectives are essentially as listed in my mybinder blog post (http://ivory.idyll.org/blog/2016-mybinder.html):

• hack on mybinder and develop APIs and tools to connect mybinder to other hosting platforms, both commercial (AWS, Azure, etc.) and academic (e.g. XSEDE/TACC);
• connect mybinder to other versioning sites, including bitbucket and gitlab;
• brainstorm and hack on ways to connect credentials to mybinder to support private repositories and for-pay compute;
• identify missing links and technologies that are needed to more fully realize the promise of mybinder and Jupyter notebook;
• identify overlaps and complementarity with existing projects that we can make use of;
• provide more integrated support for docker hub (and private hub) based images;
• brainstorm around blockers that prevent mybinder from being used for more data-intensive workflows.

About half of the invitations will be to projects that are involved in this area already (listed above, together with at least two people from the Freeman Lab, who develop mybinder). I also anticipate inviting at least one librarian (to bring the archivist and data librarian perspectives in) and one journalist (perhaps Jeffrey Perkel, who has written on these topics several times). The other half will be opened to applications from the open source and open science communities.

All outputs from the workshop will be made available under an Open Source license through github (which is where the mybinder source is currently hosted) or another hosting platform. We will also provide a livestream of workshop presentations and discussions so that the larger community can participate.

What will be the output from the project?

In addition to source code, demonstrations, proofs of concept, etc., I anticipate several blog posts from different perspectives. If we can identify an enthusiastic journalist we could also get an article out targeted at a broader audience. I also expect to develop (and deliver) training materials around any new functionality that emerges from this workshop.

What is the justification for the amount of money requested?

We anticipate fully supporting travel, room, and board for 15 people from this grant, although this number may increase or decrease depending on costs. We will also provide snacks and one restaurant dinner. No compute costs or anything else is requested - we can support the remainder of the workshop hosting entirely out of our existing resources.

What are the other sources of support?

I expect to supplement travel and provide compute as needed out of my existing Moore DDD Investigator funding and my startup funds. However, no other support explicitly for this project is being requested.

## March 29, 2016

### Matthieu Brucher

#### Announcement: Audio ToolKit moves to its own website

I’ve decided to create a real space for Audio ToolKit. The idea is to make it more visible, with a consistent message to the users.

In addition to this move, this blog has moved to a subdomain there (as you may have noticed), and so has the Audio ToolKit documentation.

I’ve updated Audio Toolkit to version 1.2.0:
* Added SecondOrderSVF filters from cytomic with Python wrappers
* Implemented a LowPassReverbFilter with Python wrappers
* Added Python wrappers to AllPassReverbFilter
* Distortion filters optimization
* Bunch of fixes (Linux compilation, calls…)

I’ve also updated the SD1 emulation to version 2.0.0. The sound is now closer to the actual pedal, thanks to a better modeling of the circuit. It also means that you won’t get exactly the same sound with this new release, so pay attention when you update!

The supported formats are:

• VST2 (32bits/64bits on Windows, 64bits on OS X)
• VST3 (32bits/64bits on Windows, 64bits on OS X)
• Audio Unit (64bits, OS X)

The files, the previous plugins, and the source code can all be downloaded from SourceForge.

## March 27, 2016

### Titus Brown

#### Reproducibility, repeatability, and a workshop in Norway

This March, Andreas Hejnol invited me to give a talk in Bergen, Norway, and as part of the trip I arranged to also give a trial workshop on "computational reproducibility" at the University in Oslo, where my friend & colleague Lex Nederbragt works.

The basic idea of this workshop followed from recent conversations with Jonah Duckles (Executive Director, Software Carpentry) and Tracy Teal (Executive Director, Data Carpentry), where we'd discussed the observation that many experienced computational biologists had chosen a fairly restricted set of tools for building highly reproducible scientific workflows. Since Software Carpentry and Data Carpentry taught most of these tools, I thought that we might start putting together a "next steps" lesson showing how to efficiently and effectively combine these tools for fun and profit and better science.

The tool categories are as follows: version control (git or hg), a build system to coordinate running a bunch of scripts (make or snakemake, typically), a literate data analysis and graphing system to make figures (R and RMarkdown, or Python and Jupyter), and some form of execution environment (virtualenv, virtual machines, cloud machines, and/or Docker, typically). There are many different choices here - I've only listed the ones I run across regularly - and everyone has their own favorites, but those are the categories I see.
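
To make the build-system category a bit more concrete, here is a minimal Python sketch of the core idea behind tools like make: walk a dependency graph in order and rebuild any target that is older than one of its inputs. This is not how make or snakemake are implemented; the function `plan_rebuild`, the file names, and the integer "timestamps" are all made up for illustration.

```python
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_rebuild(deps, mtimes):
    """Return the stale targets in an order that respects dependencies.

    deps maps each target to the set of files it is built from.
    mtimes maps existing files to a (hypothetical) modification time.
    """
    newest = dict(mtimes)
    stale = []
    # static_order() yields every file after all of its prerequisites.
    for node in TopologicalSorter(deps).static_order():
        prereqs = deps.get(node, ())
        if not prereqs:
            continue  # a plain source file, nothing to build
        latest_input = max(newest[p] for p in prereqs)
        if node not in newest or newest[node] < latest_input:
            stale.append(node)
            # A rebuilt target is newer than anything that existed before.
            newest[node] = math.inf
    return stale


# Hypothetical pipeline: data -> counts -> figure.
deps = {
    "counts.txt": {"data.txt", "count.py"},
    "figure.png": {"counts.txt", "plot.py"},
}
mtimes = {"data.txt": 5, "count.py": 1, "plot.py": 1,
          "counts.txt": 3, "figure.png": 9}

print(plan_rebuild(deps, mtimes))
# ['counts.txt', 'figure.png']
```

Because `data.txt` is newer than `counts.txt`, the counts are rebuilt, and that in turn forces the figure to be rebuilt even though it looked up to date.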

These tools generally work together in some fashion like this:

In the end, the 1-day workshop description I sent Lex was:

Reproducibility of computational research can be achieved by connecting
together tools that many people already know.  This workshop will walk
attendees through combining RMarkdown/Jupyter, git (version control),
and make (workflow automation) for building highly reproducible analyses
of biological data.  We will also introduce Docker as a method for
specifying your computational environment, and demonstrate mybinder.org
for posting computational analyses.

The workshop will be interactive and hands-on, with many opportunities for
questions and discussion.

We suggest that attendees have some experience with the UNIX shell, either
Python or R, and git - a Software or Data Carpentry workshop will suffice.

People will need an installation of Docker on their laptop (or access to the
Amazon Cloud).


From this, you can probably infer that what I was planning on teaching was how to use a local Docker container to combine git, make, and Jupyter Notebook to do a simple analysis.

As the time approached (well, on the way to Oslo :) I started to think concretely about this, and I found myself blocking on two issues:

First, I had never tried to get a whole classroom of people to run Docker before. I've run several Docker workshops, but always in circumstances where I had more backup (either AWS, or experienced users willing to fail a lot). How, in a single day, could I get a whole classroom of people through Docker, git, make, and Jupyter Notebook? Especially when the Jupyter docker container required a large download and a decent machine and would probably break on many laptops?

Second, I realized I was developing a distaste for the term 'reproducibility', because of the confusion around what it meant. Computational people tend to talk about reproducibility in one sense - 'is the bug reproducible?' - while scientists tend to talk about reproducibility in the larger sense of scientific reproducibility - 'do other labs see the same thing I saw?' You can't really teach scientific reproducibility in a hands-on way, but it is what scientists are interested in; while computational reproducibility is useful for several reasons, but doesn't have an obvious connection to scientific reproducibility.

Luckily, I'd already come across a solution to the first issue the previous week, when I ran a workshop at UC Davis on Jupyter Notebook that relied 100% on the mybinder service - literally, no local install needed! We just ran everything on Google Compute Engine, on the Freeman Lab's dime. It worked pretty well, and I thought it would work here, too. So I resolved to do the first 80% or more of the workshop in the mybinder container, making use of Jupyter's built-in Web editor and terminal to build Makefiles and post things to git. Then, given time, I could segue into Docker and show how to build a Docker container that could run the full git repository, including both make and the Jupyter notebook we wrote as part of the analysis pipeline.

The second issue was harder to resolve, because I wanted to bring things down to a really concrete level and then discuss them from there. What I ended up doing was writing a moderately lengthy introduction on my perspective, in which I further confused the issue by using the term 'repeatability' for completely automated analyses that could be run exactly as written by anyone. You can read more about it here. (Tip o' the hat to Victoria Stodden for several of the arguments I wrote up, as well as a pointer to the term 'repeatability'.)

At the actual workshop, we started with a discussion about the goals and techniques of repeatability in computational work. This occupied us for about an hour, and involved about half of the class; there were some very experienced (and very passionate) scientists in the room, which made it a great learning experience for everyone involved, including me! We discussed how different scientific domains thought differently about repeatability, reproducibility, and publishing methods, and tentatively reached the solid conclusion that this was a very complicated area of science ;). However, the discussion served its purpose, I think: no one was under any illusions that I was trying to solve the reproducibility problem with the workshop, and everyone understood that I was simply showing how to combine tools to build a perfectly repeatable workflow.

We then moved on to a walkthrough of Jupyter Notebook, followed by the first two parts of the make tutorial. We took our resulting Makefile, our scripts, and our data, and committed them to git and pushed them to github (see my repo). Then we took a lunch break.

In the afternoon, we built a Jupyter Notebook that did some silly graphing. (Note to self: word clouds are cute but probably not the most interesting thing to show to scientists! If I run this again, I'll probably do something like analyze Zipf's law graphically and then do a log-log fit.) We added that to the git repo, and then pushed that to github, and then I showed how to use mybinder to spin the repo up in its own execution environment.
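
As a rough idea of what the Zipf's law variant could look like, here is a small sketch of the log-log fit (without the plotting step). It is not the workshop notebook; the function name `zipf_fit` and the synthetic word list are assumptions made for illustration, and the fit is an ordinary least-squares line through (log rank, log count).

```python
import math
from collections import Counter


def zipf_fit(words):
    """Fit log(count) = intercept + slope * log(rank) by least squares."""
    counts = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic text whose word counts are exactly 120 / rank.
words = []
for rank in range(1, 6):
    words.extend([f"word{rank}"] * (120 // rank))

slope, intercept = zipf_fit(words)
print(round(slope, 6))
# -1.0
```

A slope close to -1 on real text is the classic Zipf signature; in a notebook, the same `xs` and `ys` would be the natural thing to plot.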

Finally, for the last ~hour, I sped ahead and demoed how to use docker-machine from my laptop to spin up a docker host on AWS, construct a basic Dockerfile starting from a base of jupyter/notebook, and then run the repo on a new container using that Dockerfile.

Throughout, we had a lot of discussion and (up until the Docker bit) I think everyone followed along pretty well.

In the end, I think the workshop went pretty well - so far, at least, 5/5 survey responders (of about 15 attendees) said it was a valuable use of their time.

After I left at 3pm to fly up to Bergen for my talk, Tracy Teal went through RMarkdown and knitr, starting from the work-in-progress Data Carpentry reproducibility lesson. (I didn't see that so I'll leave Lex or Tracy to talk about it.)

What would I change next time?

• I'm not sure if the Jupyter Notebook walkthrough was important. It seemed a bit tedious to me, but maybe that was because it was the second time in two weeks I was teaching it?
• I shortchanged make a bit, but still got the essential bits across (the dependency graph, and the basic Makefile format).
• I would definitely have liked to get people more hands-on experience with Docker.
• I would change the Jupyter notebook analysis to be a bit more science-y, with some graphing and fitting. It doesn't really matter if it's a bit more complicated, since we're copy/pasting, but I think it would be more relevant to the scientists.
• I would try to more organically introduce RMarkdown as a substitute for the Jupyter bit.

Overall, I'm quite happy with the whole thing, and mybinder continues to work astonishingly well for me.

-titus

### Filipe Saraiva

#### Workshop de Software Livre 2016 – call for papers and tools

The call for papers and the call for tools for WSL – Workshop de Software Livre (Workshop on Free Software), the academic conference held together with FISL – Fórum Internacional de Software Livre (International Free Software Forum), are open!

WSL publishes scientific papers on several topics of interest for free and open source software communities, like social dynamics, management, development processes, motivations of contributors communities, adoption and case studies, legal and economic aspects, social and historical studies, and more.

This edition of WSL has a specific call for tools to publish papers describing software. This specific call ensures the software described was peer-reviewed, is consistent with the principles of FLOSS, and the source code will be preserved and accessible for a long time period.

All accepted and presented papers will be published in the WSL open access repository. This year we are working hard to provide an ISSN and DOIs for the publications.

The deadline is April 10. Papers can be submitted in Portuguese, English or Spanish.

## March 24, 2016

### Fabian Pedregosa

#### Lightning v0.1

Announcing the first public release of lightning, a library for large-scale linear classification, regression and ranking in Python. The library was started a couple of years ago by Mathieu Blondel, who also contributed the vast majority of the source code. I recently joined its development and decided it was about time for a v0.1!

Prebuilt conda packages are available for all operating systems (thank god for AppVeyor). More information is available on lightning's website.

### Continuum Analytics news

#### DyND Callables: Speed and Flexibility

Posted Thursday, March 24, 2016

## Introduction

We've been working hard to improve DyND in a wide variety of ways over the past few months. While there is still a lot of churn in our codebase, now is a good time to show a few basic examples of the great functionality that's already there. The library is available on GitHub at https://github.com/libdynd/libdynd.

Today I want to focus on DyND's callable objects. Much of the code in this post is still experimental and subject to change. Keep that in mind when considering where to use it.

All the examples here will be in C++14 unless otherwise noted. The build configuration should be set up to indicate that C++14, the DyND headers, and the DyND shared libraries should be used. Output from a line that prints something will be shown directly in the source files as comments. DyND also has a Python interface, so several examples in Python will also be included.

## Getting Started

DyND's callables are, at the most basic level, functions that operate on arrays. At the very lowest level, a callable can access all of the data, type-defined metadata (e.g. stride information), and metadata for the arrays passed to it as arguments. This makes it possible to use callables for functionality like multiple dispatch, views based on stride manipulation, reductions, and broadcasting. The simplest case is using a callable to wrap a non-broadcasting non-dispatched function call so that it can be used on scalar arrays.

Here's an example of how to do that:

#include <iostream>
// Main header for DyND arrays:
#include <dynd/array.hpp>
#include <dynd/callable.hpp>

using namespace dynd;

// Write a function to turn into a DyND callable.
double f(double a, double b) {
    return a * (a - b);
}

// Make the callable.
nd::callable f_callable = nd::callable(f);

// Main function to show how this works:
int main() {
    // Initialize two arrays containing scalar values.
    nd::array a = 1.;
    nd::array b = 2.;

    // Print the dynamic type signature of the callable f_callable.
    std::cout << f_callable << std::endl;
    // <callable <(float64, float64) -> float64> at 000001879424CF60>

    // Call the callable and print its output.
    std::cout << f_callable(a, b) << std::endl;
    // array(-1,
    //       type="float64")
}


The constructor for dynd::nd::callable does most of the work here. Using some interesting templating mechanisms internally, it is able to infer the argument types and return type for the function, select the corresponding DyND types, and form a DyND type that represents an analogous function call. The result is a callable object that wraps a pointer to the function f and knows all of the type information about the pointer it is wrapping. This callable can only be used with input arrays that have types that match the types for the original function's arguments.

The extra type information contained in this callable is "(float64, float64) -> float64", as can be seen when the callable is printed. The syntax here comes from the datashape data description system—the same type system used by Blaze, Odo, and several other libraries.

One key thing to notice here is that the callable created now does its type checking dynamically rather than at compile time. DyND has its own system of types that is used to represent data and the functions that operate on it at runtime. While this does have some runtime cost, dynamic type checking removes the requirement that a C++ compiler verify the types for every operation. The dynamic nature of the DyND type system makes it possible to write code that operates in a generic way on both builtin and user-defined types in both static and dynamic languages. I'll leave discussion of the finer details of the DyND type system for another day though.
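
To illustrate the general idea of runtime type checking (and only the idea), here is a small pure-Python sketch. It does not use DyND and does not reflect DyND's internals: `CheckedCallable` is a hypothetical class that reads a function's annotations once, stores a signature, and checks the argument types on every call instead of relying on a compiler.

```python
import inspect


class CheckedCallable:
    """Wrap a function and check argument types when it is called."""

    def __init__(self, func):
        hints = dict(func.__annotations__)
        self.return_type = hints.pop("return")
        # Keep the argument types in positional order.
        names = inspect.signature(func).parameters
        self.arg_types = tuple(hints[name] for name in names)
        self.func = func

    def __repr__(self):
        args = ", ".join(t.__name__ for t in self.arg_types)
        return f"<callable <({args}) -> {self.return_type.__name__}>>"

    def __call__(self, *args):
        if len(args) != len(self.arg_types):
            raise TypeError(f"expected {len(self.arg_types)} arguments")
        # The check happens here, at call time, for every call.
        for value, expected in zip(args, self.arg_types):
            if type(value) is not expected:
                raise TypeError(
                    f"expected {expected.__name__}, got {type(value).__name__}"
                )
        return self.func(*args)


def f(a: float, b: float) -> float:
    return a * (a - b)


f_checked = CheckedCallable(f)
print(f_checked)
# <callable <(float, float) -> float>>
print(f_checked(1.0, 2.0))
# -1.0
```

Calling `f_checked(1, 2.0)` raises a `TypeError` because the first argument is an `int`, which mirrors the restriction that the wrapped callable only accepts inputs matching the original argument types.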

DyND has other functions that make it possible to add additional semantics to a callable. These are higher-order functions (functions that operate on other functions), and they are used on existing callables rather than function pointers. The types for these functions are patterns that can be matched against a variety of different argument types.

Things like array broadcasting, reductions, and multiple dispatch are all currently available. In the case of broadcasting and reductions, the new callable calls the wrapped function many times and handles the desired iteration structure over the arrays itself. In the case of multiple dispatch, different implementations of a function can be called based on the types of the inputs. DyND's multiple dispatch semantics are currently under revision, so I'll just show broadcasting and reductions here.
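
The "wrapper handles the iteration" idea can be sketched in plain Python on nested lists. This is an illustration rather than DyND code: `py_elwise` and `py_reduction` are hypothetical stand-ins, and the broadcasting rule used here (align trailing dimensions, stretch a length-1 dimension to match) is an assumption modeled on NumPy-style semantics.

```python
from functools import reduce


def depth(x):
    """Nesting depth of a (possibly ragged) nested list; scalars have depth 0."""
    if isinstance(x, list):
        return 1 + (depth(x[0]) if x else 0)
    return 0


def py_elwise(func):
    """Lift a scalar function of two arguments to nested lists."""
    def broadcast(a, b):
        da, db = depth(a), depth(b)
        if da == 0 and db == 0:
            return func(a, b)
        if da > db:  # a has extra leading dimensions
            return [broadcast(x, b) for x in a]
        if db > da:  # b has extra leading dimensions
            return [broadcast(a, y) for y in b]
        if len(a) == len(b):
            return [broadcast(x, y) for x, y in zip(a, b)]
        if len(a) == 1:  # stretch a along this dimension
            return [broadcast(a[0], y) for y in b]
        if len(b) == 1:  # stretch b along this dimension
            return [broadcast(x, b[0]) for x in a]
        raise ValueError("shapes cannot be broadcast together")
    return broadcast


def flatten(a):
    """Yield the scalars of a nested list in order."""
    if isinstance(a, list):
        for item in a:
            yield from flatten(item)
    else:
        yield a


def py_reduction(func):
    """Lift a two-argument function to a reduction over all items."""
    return lambda a: reduce(func, flatten(a))


myfunc = py_elwise(lambda x, y: x * (x - y))
total = py_reduction(lambda x, y: x + y)

a = [[1., 2.], [3., 4.]]
b = [5., 6.]
c = [[9., 10.], [11.]]  # ragged second dimension

print(myfunc(a, b))
# [[-4.0, -8.0], [-6.0, -8.0]]
print(myfunc(a, c))
# [[-8.0, -16.0], [-24.0, -28.0]]
print(total(a))
# 10.0
```

In both wrappers the scalar function knows nothing about arrays; the higher-order function decides how many times to call it and in what order.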

DyND provides broadcasting through the function dynd::nd::functional::elwise. It follows broadcasting semantics similar to those followed by NumPy's generalized universal functions—though it is, in many ways, more general. The following example shows how to use elwise to create a callable that follows broadcasting semantics:

#include <iostream>

#include <dynd/array.hpp>
#include <dynd/callable.hpp>
#include <dynd/func/elwise.hpp>

using namespace dynd;

double myfunc_core(double a, double b) {
    return a * (a - b);
}

nd::callable myfunc = nd::functional::elwise(nd::callable(myfunc_core));

int main() {
    // Initialize some arrays to demonstrate broadcasting semantics.
    // Use brace initialization from C++11.
    nd::array a{{1., 2.}, {3., 4.}};
    nd::array b{5., 6.};
    // Create an additional array with a ragged second dimension as well.
    nd::array c{{9., 10.}, {11.}};

    // Print the dynamic type signature of the callable.
    std::cout << myfunc << std::endl;
    // <callable <(Dims... * float64, Dims... * float64) -> Dims... * float64>
    //  at 000001C223FC5BE0>

    // Call the callable and print its output.
    // Broadcast along the rows of a.
    std::cout << myfunc(a, b) << std::endl;
    // array([[-4, -8], [-6, -8]],
    //       type="2 * 2 * float64")

    // Broadcast the second row of c across the second row of a.
    std::cout << myfunc(a, c) << std::endl;
    // array([[ -8, -16], [-24, -28]],
    //       type="2 * 2 * float64")
}


A similar function can be constructed in Python using DyND's Python bindings and Python 3's function type annotations. If Numba is installed, it is used to get JIT-compiled code that has performance relatively close to the speed of the code generated by the C++ compiler.

from dynd import nd, ndt

@nd.functional.elwise
def myfunc(a: ndt.float64, b: ndt.float64) -> ndt.float64:
    return a * (a - b)
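
The broadcasting rules used in the C++ example can be sketched in plain Python over nested lists. This is only an illustration of the semantics, not how DyND implements them: the `depth` and `elwise` helpers below are hypothetical, assume non-empty lists, and ignore everything DyND does with types and memory layout.

```python
def depth(x):
    """Number of nested list levels (0 for a scalar)."""
    return 1 + depth(x[0]) if isinstance(x, list) else 0


def elwise(func):
    """Lift a two-argument scalar function to nested lists."""
    def apply(a, b):
        da, db = depth(a), depth(b)
        if da == 0 and db == 0:
            return func(a, b)
        # The argument with fewer dimensions is reused for every item
        # along the leading dimension of the other argument.
        if da > db:
            return [apply(x, b) for x in a]
        if db > da:
            return [apply(a, y) for y in b]
        # Same depth: stretch a length-1 dimension to match the other side.
        if len(a) == 1 and len(b) != 1:
            a = a * len(b)
        elif len(b) == 1 and len(a) != 1:
            b = b * len(a)
        if len(a) != len(b):
            raise ValueError("shapes cannot be broadcast together")
        return [apply(x, y) for x, y in zip(a, b)]
    return apply


myfunc = elwise(lambda a, b: a * (a - b))

a = [[1., 2.], [3., 4.]]
b = [5., 6.]
c = [[9., 10.], [11.]]

print(myfunc(a, b))  # [[-4.0, -8.0], [-6.0, -8.0]]
print(myfunc(a, c))  # [[-8.0, -16.0], [-24.0, -28.0]]
```

The two results match the arrays printed by the C++ example, including the ragged case where the single item in the second row of `c` is stretched across the second row of `a`.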


## Reductions

Reductions are formed from functions that take two inputs and produce a single output. Examples of reductions include taking the sum, max, min, and product of the items in an array. Here we'll work with a reduction that takes the maximum of the absolute values of the items in an array. In DyND this can be implemented by using nd::functional::reduction on a callable that takes two floating point inputs and returns the maximum of their absolute values. Here's an example:

// Include <algorithm> to get std::max.
#include <algorithm>
// Include <cmath> to get std::abs.
#include <cmath>
#include <iostream>

#include <dynd/array.hpp>
#include <dynd/callable.hpp>
#include <dynd/func/reduction.hpp>

using namespace dynd;

// Wrap the function as a callable.
// Then use dynd::nd::functional::reduction to make a reduction from it.
// This time just wrap a C++ lambda function rather than a pointer
// to a different function.
nd::callable inf_norm = nd::functional::reduction(nd::callable(
    [](double a, double b) { return std::max(std::abs(a), std::abs(b)); }));

// Demonstrate the reduction working along both axes simultaneously.
int main() {
    nd::array a{{1., 2.}, {3., 4.}};

    // Take the maximum absolute value over the whole array.
    std::cout << inf_norm(a) << std::endl;
    // array(4,
    //       type="float64")
}


Again, in Python, it is relatively easy to create a similar callable.

from dynd import nd, ndt

@nd.functional.reduction
def inf_norm(a: ndt.float64, b: ndt.float64) -> ndt.float64:
    return max(abs(a), abs(b))


The type for the reduction callable inf_norm is a bit longer. It is (Dims... * float64, axes: ?Fixed * int32, identity: ?float64, keepdims: ?bool) -> Dims... * float64. This signature represents a callable that accepts a single input array and has several optional keyword arguments. In Python, passing keyword arguments to callables works the same as it would for any other function. Currently, in C++, initializer lists mapping strings to values are used since the names of the keyword arguments are not necessarily known at compile time.
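The role of the two-argument kernel and of the `identity` keyword can be sketched with `functools.reduce` from the Python standard library. This is an analogy only: the `flatten` and `reduction` helpers are hypothetical, they always reduce over every axis, and they do not model the `axes` or `keepdims` arguments.

```python
from functools import reduce


def flatten(items):
    """Yield the scalars of an arbitrarily nested list in order."""
    for item in items:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


def reduction(kernel):
    """Hypothetical helper: fold a two-argument kernel over all items."""
    def apply(array, identity=None):
        values = flatten(array)
        if identity is None:
            return reduce(kernel, values)
        return reduce(kernel, values, identity)
    return apply


inf_norm = reduction(lambda a, b: max(abs(a), abs(b)))

print(inf_norm([[1., 2.], [3., 4.]]))  # 4.0
print(inf_norm([-3.]))                 # -3.0: the kernel never runs
print(inf_norm([-3.], identity=0.))    # 3.0
```

In this sketch, a single-item input without an identity is returned unchanged because the kernel is never applied, which is one reason a reduction built from a pairwise kernel may want an explicit identity value.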

## Exporting Callables to Python

The fact that DyND callables are C++ objects with a single C++ type makes it easy to wrap them for use in Python. This is done using the wrappers for the callable class already built in to DyND's Python bindings. Using DyND in Cython merits a discussion of its own, so I'll only include a minimal example here.

This Cython code in particular is still using experimental interfaces. The import structure and function names here are very likely to change.

The first thing needed is a header that creates the desired callable. Since this header will be included only once, in a single Cython-based module, include guards are not needed.

// inf_norm_reduction.hpp
#include <algorithm>
#include <cmath>

#include <dynd/array.hpp>
#include <dynd/callable.hpp>
#include <dynd/func/reduction.hpp>

static dynd::nd::callable inf_norm =
    dynd::nd::functional::reduction(dynd::nd::callable(
        [](double a, double b) { return std::max(std::abs(a), std::abs(b)); }));


The callable can now be exposed to Python through Cython. Some work still needs to be done in DyND's Python bindings to simplify the system-specific configuration for linking extensions like this to the DyND libraries. For simplicity, I'll just show the commented Cython distutils directives that can be used to build this file on 64-bit Windows with a libdynd built and installed from source in the default location. Similar configurations can be put together for other systems.

# py_inf_norm.pyx
# distutils: include_dirs = "c:/Program Files/libdynd/include"
# distutils: library_dirs = "c:/Program Files/libdynd/lib"
# distutils: libraries = ["libdynd", "libdyndt"]

from dynd import nd, ndt

from dynd.cpp.callable cimport callable as cpp_callable
from dynd.nd.callable cimport dynd_nd_callable_from_cpp

cdef extern from "inf_norm_reduction.hpp" nogil:
    # Declare the C++ variable "inf_norm" defined in the header so that
    # Cython can refer to it by the same name.
    cpp_callable inf_norm

py_inf_norm = dynd_nd_callable_from_cpp(inf_norm)


To build the extension I used the following setup file and ran it with the command python setup.py build_ext --inplace.

# setup.py
# This is a fairly standard setup script for a minimal Cython module.
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules=cythonize("py_inf_norm.pyx", language='c++'))


## A Short Benchmark

elwise and reduction are not heavily optimized yet, but a quick comparison using IPython's timeit magic suggests that DyND is already competitive with NumPy for this reduction:

In [1]: import numpy as np

In [2]: from py_inf_norm import py_inf_norm

In [3]: a = np.random.rand(10000, 10000)

In [4]: %timeit py_inf_norm(a)
1 loop, best of 3: 231 ms per loop

In [5]: %timeit np.linalg.norm(a.ravel(), ord=np.inf)
1 loop, best of 3: 393 ms per loop


These are just a few examples of the many things you can do with DyND's callables. For more information take a look at our GitHub repo as well as libdynd.org. We'll be adding a lot more functionality, documentation, and examples in the coming months.

## March 23, 2016

### Mark Fenner

#### SVD Computation Capstone

At long last, I’ve gotten to my original goal: writing a relatively easy to understand version of your plain, vanilla SVD computation. I’m going to spend just a few minutes glossing (literally) over the theory behind the code and then dive into some implementation that builds on the previous posts. Thanks for reading! Words of […]

## March 22, 2016

### Matthieu Brucher

#### Analog modeling of a diode clipper (1): Circuits

I published an emulation of the SD1 pedal a few years ago, but haven’t touched analog modeling since. There are lots of different methods to model a circuit, and they all have different advantages and drawbacks. So I’ve decided to start from scratch again, using two different diode clippers, going from the continuous equations to different numerical solutions in a series of blog posts here.

# First clipper

Let’s start with the first circuit, which I implemented originally in Audio Toolkit.

Diode clipper 1

It consists of a resistor, a capacitor and antiparallel diodes. What is interesting about this circuit is that for a constant (DC) input, the steady-state output is actually zero.

$V_i - 2 R_1 I_s \sinh\left(\frac{V_o}{nV_t}\right) - \int \frac{2 I_s}{C_1} \sinh\left(\frac{V_o}{nV_t}\right) dt - V_o = 0$

# Second clipper

The second circuit is a variation of the first one:

Diode Clipper 2

More or less, it’s a first-order low-pass filter that is clipped with antiparallel diodes. The first consequence is that, for a constant (DC) input, the output is non-zero.

$\frac{dV_o}{dt} = \frac{V_i - V_o}{R_1 C_1} - \frac{2 I_s}{C_1} \sinh\left(\frac{V_o}{nV_t}\right)$

The two equations are quite different. Where the first one has an integral, the second one uses a derivative. This should be interesting to discretize and compare.
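As a rough preview of what discretizing the second equation could look like, here is a minimal Python sketch that uses backward Euler with a Newton iteration at each time step. The scheme is chosen only for illustration and is not necessarily the one used in Audio Toolkit. All component values (`R1`, `C1`, `I_S`, `N`, `V_T`) and the sample rate are assumptions, and the plain Newton iteration is only expected to behave for moderate input levels of around a volt.

```python
import math

# Hypothetical component values, chosen only for illustration.
R1 = 2200.0   # resistance, ohms
C1 = 10e-9    # capacitance, farads
I_S = 1e-14   # diode saturation current, amperes
N = 1.24      # diode ideality factor
V_T = 26e-3   # thermal voltage, volts


def clipper2_step(v_prev, v_in, h, tol=1e-12, max_iter=50):
    """Advance the second clipper by one backward Euler step of size h."""
    v = v_prev  # initial guess for Newton's method
    for _ in range(max_iter):
        x = v / (N * V_T)
        # Residual of v - v_prev - h * dVo/dt evaluated at the new sample.
        g = v - v_prev - h * ((v_in - v) / (R1 * C1)
                              - 2.0 * I_S / C1 * math.sinh(x))
        # Derivative of the residual with respect to v.
        dg = 1.0 + h * (1.0 / (R1 * C1)
                        + 2.0 * I_S / (C1 * N * V_T) * math.cosh(x))
        step = g / dg
        v -= step
        if abs(step) < tol:
            break
    return v


def simulate(inputs, sample_rate=48000.0):
    """Run the clipper over a sequence of input samples, starting at rest."""
    h = 1.0 / sample_rate
    v, out = 0.0, []
    for v_in in inputs:
        v = clipper2_step(v, v_in, h)
        out.append(v)
    return out


# A constant 1 V input settles to a non-zero, clipped output.
settled = simulate([1.0] * 200)[-1]
print(settled)
```

With a constant input the output settles at the voltage where the resistor current equals the diode current, which is the non-zero steady state mentioned above for this circuit.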

# Conclusion

The equations are simple enough that we can try different numerical methods on them. They are still too complex to have a closed-form analytical solution, so we have to use more or less complex numerical algorithms to get an approximation of the result.

And we will start working on this in a future post.

## March 15, 2016

### Mark Fenner

#### Householder Bidiagonalization

We’re going to make use of Householder reflections to turn a generic matrix into a bidiagonal matrix. I’ve laid a lot of foundation for these topics in the posts I just linked. Check them out! Onward. Reducing a Generic Matrix to a Bidiagonal Matrix Using Householder Reflections Why would we want to do this? Well, […]

## March 12, 2016

### Mark Fenner

#### Givens Rotations and the Case of the Blemished Bidiagonal Matrix

Last time, we looked at using Givens rotations to perform a QR factorization of a matrix. Today, we’re going to be a little more selective and use Givens rotations to walk values off the edge of a special class of matrices Old Friends Here are few friends that we introduced last time. I updated the […]

## March 10, 2016

### William Stein

#### Open source is now ready to compete with Mathematica for use in the classroom

When I think about what makes SageMath different, one of the most fundamental things is that it was created by people who use it every day.  It was created by people doing research math, by people teaching math at universities, and by computer programmers and engineers using it for research.  It was created by people who really understand computational problems because we live them.  We understand the needs of math research, teaching courses, and managing an open source project that users can contribute to and customize to work for their own unique needs.

The tools we were using, like Mathematica, are clunky, very expensive, and just don't do everything we need.  And worst of all, they are closed source software, meaning that you can't even see how they work, and can't modify them to do what you really need.  For teaching math, professors get bogged down scheduling computer labs and arranging for their students to buy and install expensive software.

So I started SageMath as an open source project at Harvard in 2004, to solve the problem that other math software is expensive, closed source, and limited in functionality, and to create a powerful tool for the students in my classes. It wasn't a project that was intended initially as something to be used by hundreds of thousands of people. But as I got into the project and as more professors and students started contributing to the project, I could clearly see that these weren't just problems that pissed me off, they were problems that made everyone angry.

The scope of SageMath rapidly expanded. Our mission evolved to create a free open source serious competitor to Mathematica and similar closed software that the mathematics community was collectively spending hundreds of millions of dollars on every year. After a decade of work by over 500 contributors, we made huge progress.

But installing SageMath was more difficult than ever.  It was at that point that I decided I needed to do something so that this groundbreaking software that people desperately needed could be shared with the world.

So I created SageMathCloud, which is an extremely powerful web-based collaborative way for people to easily use SageMath and other open source software such as LaTeX, R, and Jupyter notebooks in their teaching and research. I created SageMathCloud based on nearly two decades of experience using math software in the classroom and online, at Harvard, UC San Diego, and University of Washington.

SageMathCloud is commercial grade, hosted in Google's cloud, and very large classes are using it heavily right now.  It solves the installation problem by avoiding it altogether.  It is entirely open source.

Open source is now ready to directly compete with Mathematica for use in the classroom.  They told us we could never make something good enough for mass adoption, but we have made something even better.  For the first time, we're making it possible for you to easily use Python and R in your teaching instead of Mathematica; these are industry standard mainstream open source programming languages with strong support from Google, Microsoft and other industry leaders.   For the first time, we're making it possible for you to collaborate in real time and manage your course online using the same cutting edge software used by elite mathematicians at the best universities in the world.

A huge community in academia and in industry are all working together to make open source math software better at a breathtaking pace, and the traditional closed development model just can't keep up.

## March 08, 2016

### Matthieu Brucher

#### Announcement: Audio TK 1.1.0

This is mainly a bug fix release. A nasty bug on increasing processing sizes would corrupt the input data and thus change the results. It is advised to upgrade to this release as soon as possible.

Changelog:

1.1.0
* Fix a really nasty bug when changing processing sizes
* Implemented a basic AllPassFilter (algorithmic reverb)

## March 07, 2016

### Continuum Analytics news

#### Continuum Analytics Launches Anaconda Skills Accelerator Program to Turn Professionals Into Data Scientists ASAP

Posted Monday, March 7, 2016

Professional Development Residency Program Propels Scientists, Physicists, Mathematicians and Engineers into Data Science Careers

AUSTIN, TX - March 7, 2016 - Continuum Analytics, the creator and driving force behind Anaconda, the leading modern open source analytics platform powered by Python, today launched its Anaconda Skills Accelerator Program (ASAP). The new professional development residency program, led by a team of world class experts, is highly selective and brings together scientists, physicists, mathematicians and engineers to tackle challenging real world projects. Participants will use advanced technology and pragmatic techniques across a broad set of technical areas including data management, analytic modeling, visualization, high performance computing on multi-core CPUs and GPUs, vectorization and modern distributed computing techniques such as Spark and Dask.

“One of the greatest challenges affecting our industry is the disconnect between the data science needs and expectations of enterprises and the current skills of today’s scientists, mathematicians and engineers,” said Travis Oliphant, CEO and co-founder of Continuum Analytics. “Fostering true data science experts requires more than traditional training programs. Continuum Analytics’ ASAP empowers professionals with the necessary technology and hands on experience to tackle big data challenges and solve the data problems that are inside every organization today.”

Participants emerge from this residency program as Certified Anaconda Professionals (CAP), enabling them to manage challenging data science and data engineering problems in business, science, engineering, data analytics, data integration and machine learning.

• Consulting and placement opportunities with Continuum Analytics clients

• Membership to the CAP Alumni Association

• Individual subscription to premium Anaconda packages usually only available as part of Anaconda Workgroup or Anaconda Enterprise

• Complimentary one year subscription to priority Anaconda support (only available for companies who put their teams of three or more through the program)

• A 50 percent discounted pass to the annual Anaconda conference

Continuum Analytics is the creator and driving force behind Anaconda, the leading, modern open source analytics platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 2.25M downloads annually and growing, Anaconda is trusted by the world’s leading businesses across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team – data scientists, developers, DevOps, data engineers, and business analysts – to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services. Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

###

## BluePyOpt

The BlueBrain Python Optimisation Library (BluePyOpt) is an extensible framework for data-driven model parameter optimisation that wraps and standardises several existing open-source tools. It simplifies the task of creating and sharing these optimisations, and the associated techniques and knowledge. This is achieved by abstracting the optimisation and evaluation tasks into various reusable and flexible discrete elements according to established best-practices. Further, BluePyOpt provides methods for setting up both small- and large-scale optimisations on a variety of platforms, ranging from laptops to Linux clusters and cloud-based compute infrastructures.

The code is available here:
https://github.com/BlueBrain/BluePyOpt
A preprint to the paper is available here:
http://arxiv.org/abs/1603.00500

## March 04, 2016

### Continuum Analytics news

#### Open Data Science Is the David Beckham of Your Data Team

Posted Friday, March 4, 2016

If data scientists were professional athletes, they’d be ditching their swim caps and donning cleats right about now.

Traditional data science teams function like swim teams. Although all members strive for one goal — winning the meet — each works individually, concentrating on his or her isolated lane or heat. As each player finishes his heat, the team’s score is tallied. If enough team members hit their individual goals, then the team succeeds. But team members’ efforts are isolated and desynchronized, and some members might pull off incredible times, while others lag behind in their heats.

Today, the industry is moving toward open data science (ODS), which enables data scientists to function more like a soccer team: Team members aren’t restricted to their own lanes but are instead free to move about the field. Even goalies can score in soccer, and ODS developers are similarly encouraged to contribute wherever their skills intersect with development challenges. No team members are relegated to “second-string” or niche roles: In a given project, developers might contribute to model building, while domain experts like quants offer insights about code structure or visualization. As with soccer, the pace is fast, and the process is engaging and fun.

### Why ODS?

Data is the indisputable king of the modern economy, but too many data scientists still function like swimmers, each working with his own set of tools to manage heaps of data. To work as a true team, data scientists (and their tools) must function together.

ODS is the revolution rising to meet this challenge. Instead of forcing data scientists to settle on a single language and proprietary toolset, open data science champions inclusion. Just as ODS encourages data scientists to function as one, it also recognizes the potential of open source tools — for data, analytics and computation — to form a connected, collaborative ecosystem.

Developers, analytics and domain experts adhering to ODS principles gain these key advantages over traditional and proprietary approaches:

• Availability: Open source tools offer many packages and approaches within an ecosystem, which makes it faster to build solutions.
• Innovation: The lack of vendor lock-in means the developer community is continuously updating tools, which is ideal for collaborative workflows and quick-to-fail projects.
• Interoperability: Rather than pigeonholing developers into a single language or set of tools, ODS embraces a rich ecosystem of compatible open source tools.
• Transparency: Open data science tools facilitate transparency between teams, emphasizing innovations like the Jupyter/IPython notebook that allows developers to easily share code and visualizations.

### ODS Is Making Analytics More Precise

Over the past few years — largely thanks to the growth of ODS — the pace of innovation has quickened in the analytics field. Now, researchers and academics immediately release their algorithms to the open source community, which has empowered data scientists using these tools to become ever more granular in their analyses.

Because open data science has provided more ways to build, deploy and consume analytics, businesses can know their customers more personally than ever before.

Previously, a sales analysis might have yielded insights like “Women over 50 living in middle-class households are most likely to buy this product.” Now, these analyses are shedding ever-greater light on buyer personas. Today, an analyst might tell marketers, “Stay-at-home moms in suburban areas who tend to shop online between 3 p.m. and 6 p.m. are more likely to buy the premium product than the entry-level one.”

Notice the difference? The more precise the analysis is, the greater its use to business professionals. Information about where, how and when buyers purchase a product is invaluable to marketers.

ODS has also broadened data scientists’ options for visualization. Let’s say a telecommunications company needs to illustrate rotational churn in a way that shows which customers are likely to leave over a period of time. Perhaps the company is using Salesforce, but Salesforce doesn’t offer the rotational churn analysis the business is looking for. If the business’ data scientist is using ODS tools, then that data scientist can create a model for the rotational churn and embed it into Salesforce. The analysis can include rich, contextual visualization that clearly illustrates customers’ rotational churn patterns.

### Data Science Isn’t One-Size-Fits-All

Just like there’s no one right way to run a business or market a product, there’s no one right way to do data science.

For instance, one data scientist might employ a visual component framework, while another accomplishes the same task with a command line interface or integrated language development environment. Business analysts and data engineers in the data science team may prefer spreadsheets and visual exploration.

Thanks to this flexibility, ODS has become the foundation of modern predictive analytics and one platform has emerged to manage its tools successfully: Anaconda.

Anaconda is the leading open source analytics ecosystem and full-stack ODS platform for Python, R, Scala and many other data science languages. It adeptly manages packages, dependencies and environments for multiple languages.

Data science teams using Anaconda can share models and results through Jupyter/IPython notebooks and can utilize an enormous trove of libraries for analytics and visualization. Furthermore, Anaconda supercharges Python to deliver high performance analytics for data science teams that want to scale their prototypes up and out without worrying about incurring delays when moving to production.

With Anaconda, data science teams decrease project times by better managing and illustrating data sets and offering clearer insights to business professionals. Get your data science team out of the pool by making the switch to ODS and Anaconda. Before you know it, your data scientists will be scoring goals and on their way to the World Cup.

If you are interested in learning more about this topic and are attending the Gartner BI & Analytics Summit in Grapevine, TX, join me in our session on Why Open Data Science Matters, Tuesday, March 15th from 10:45-11:30 a.m. CT.

## March 03, 2016

### Mark Fenner

#### Givens Rotations and QR

Today I want to talk about Givens rotations. Givens rotations are a generalization of the rotation matrix you might remember from high school trig class. Instead of rotating in the plane of a 2D matrix, we can rotate in any plane of a larger dimension matrix. We’ll use these rotations to selectively place zeros in […]

## March 02, 2016

### Continuum Analytics news

#### Introducing Constructor 1.0

Posted Wednesday, March 2, 2016

I am excited to announce version 1.0 of the constructor project. Constructor combines a collection of conda packages into a standalone installer similar to the Anaconda installer. Using constructor you can create installers for all the platforms conda supports, namely Linux, Mac OS X and Windows.

• The Linux and Mac installers are self extracting bash scripts. They consist of a bash header, and a tar archive. Besides providing some help options, and an install dialog, the bash header extracts the tar archive, which in turn consists of conda packages.
• The Windows installer is an executable (.exe), which when executed opens up a dialog box, which walks the user through the install process and installs the conda packages contained within the executable. This is all achieved by relying on the NSIS (Nullsoft Scriptable Install System) under the cover.

This is the first public release of this project, which was previously proprietary and known as "cas-installer".

As constructor relies on conda, it needs to be installed into the root conda environment:

conda install -n root constructor


Constructor builds an installer for the current platform by default, and can also build an installer for other platforms, although Windows installers must be created on Windows and all Unix installers must be created on some Unix platform.

The constructor command takes an installer specification directory as its argument. This directory needs to contain a file construct.yaml, which specifies information like the name and version of the installer, the conda channels to pull packages from, the conda packages included in the installer, and so on. The complete list of keys in this file can be found in the documentation. The directory may also contain some optional files such as a license file and image files for the Windows installer.
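As a rough illustration, a minimal construct.yaml might look like the following. The installer name, version number and package list here are made up for this sketch, and the channel URL is only an assumption about where the packages would come from; the documentation remains the authoritative reference for the available keys.

```yaml
# construct.yaml: hypothetical minimal installer specification
name: MyInstaller        # made-up installer name
version: 1.0.0           # made-up installer version

# conda channels to pull packages from (assumed URL)
channels:
  - http://repo.continuum.io/pkgs/free/

# conda packages to include in the installer
specs:
  - python
  - conda
  - numpy
```

With this file saved in a directory, passing that directory to the constructor command should build an installer for the current platform.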

We created a documented example to demonstrate how to build installers for Linux, OS X and Windows that are similar to Anaconda installers, but significantly smaller.

Have fun!

## February 28, 2016

### Titus Brown

#### An e-mail from Seven Bridges Genomics re patents and patent applications

Preface from Titus: this is an e-mail written by Deniz Kural of Seven Bridges Genomics in response to concerns and accusations about their patents and patent applications on genomics workflow engines and graph genome analysis techniques. It was sent to a closed list initially, and I asked Deniz if we could publicize it; he agreed. Note that I removed the messages to which he was replying, given the closed nature of the original list, and Deniz lightly edited the e-mail to remove some personal details and fix discontinuity.

Here are some references to the Twitter conversation: Variant reference graphs patented!?, looks like Seven Bridges is trying to patent Galaxy or Taverna?

Dear All,

Firstly, thank you for the considered discussion and civility on this thread. In this email, I've tried to cover why SBG has a patent practice & next steps with GA4GH; and what the patents cover / prior art concerns.

Purpose of SBG's patents:

Developed economies have a complicated relationship to patents, which are one of the few instruments underpinning the wealth these knowledge economies create. However, science requires unfettered freedom to build upon previous ideas, and open communication. So what to do?

Seven Bridges was founded in 2009, with our name a nod to graph theory (Euler). We avoided patents until 2013, and having been made aware of various industrial and academic institutions pursuing bioinformatics and cloud patents, have decided to pursue our own patents as an active, operating company in a field with some publicly traded entities aggressively prosecuting their IP claims, as a protective measure. The patents listed in there have been submitted before the existence of the GA4GH mailing list. Some of the government funding we receive compels us to obtain IP and measures our success on this criterion. We would welcome patent reform, and would rather not spend our funds on patents. We are significantly larger than a lab, but tiny compared to our established software/biotech competitors, and ultimately can't match their legal funds. We'd also love to see the demise of patent trolls. Thus, it is not our intention to limit scientific discussion / publication.

As an example: We've released the source code of our pipeline editor (unrelated to graph genomes) more than a year ago under GPL, and last month decided to also start using an Apache license to make the patent rights of the user more explicit. We'd like to preserve an open space for common formats and standards - common workflow language, and graph genome exchange formats (more on this below). Open source / Apache is a start, but not a universal panacea.

Many of the suggestions on this thread regarding patents and licenses would be a positive step forward, and we would welcome a clarification of GA4GH policies, including overall governance. Although I had the opportunity to discuss our patents & answer some questions on previous phone calls, I also wish that we'd had a wider conversation sooner.

What's patented re: Graph Genomes / SBG patent process + prior art

We have an obligation to submit prior art, and precisely for the reasons outlined in this thread. Likewise, it does us no good to obtain and maintain patents that can't realistically hold when challenged. I've tried to explain the content of our patents and a way to incorporate community-submitted prior art below.

None of our patent applications are on "genome graphs" per se - i.e. representing variants or sequence data as a graph genome -- we truly believe that having a common, open representation or specification of graph genomes is needed. This is also the basis of our interest and participation in this group. Indeed, in my first presentation to this group about two summers ago, I've presented an edge graph representation, and a coordinate system that goes along with it (which counts as a public disclosure) thus taking it off from what can be patented. SBG's tools must accept and write out formats commonly used and adopted by the community to gain traction and adoption & use open and free APIs for input and output.

We've pursued patents for improvements that go into our toolchain (aligners, variant callers, etc.). A good portion, while it may not be obvious at first, is tied to producing efficient / fast implementations that have a practical benefit. Another group relates to application patents, around not having restrictions on applying graph genomes to various specific domains. Please see below re: prior art for a specific application or concern.

It is an interesting challenge for external counsel to become a graph genome expert and write patents in this area intelligently. We realized, after a full cycle of this, that external counsel often produces lower quality applications due to lack of context and limited time investment, resulting in a lot of resources spent downstream. Thus we've been pursuing measures to improve the quality and substance of our process, including hiring in-house counsel last year.

Our patent process (essentially a chore) now works as follows: Our attorneys sit in on our weekly R&D meetings, and then follow up if anything novel is presented. Currently we have about 45 PhDs and 100 engineers, with 20+ R&D members working on graph-related issues alone, and the IP activity is thus comparable to institutions of similar size. We try not to spend engineering time on detailed patent work - time better spent building tools and products. Our scientists are not trained in patent novelty vs scientific novelty, and thus tend to be conservative with disclosures.

Thus, our researchers write a short disclosure to our attorney, who then writes the claims (as a lawyer would), and goes back and forth with external entities until a specific set of claims is finalized / approved. For reasons beyond my understanding, the patent system seems to encourage starting with a broad set of claims and working with the patent examiner to make them more specific to exactly what the engineer built.

Our attorneys are obliged to pay attention to prior art. We circulate papers - including pioneering graph genome work, but also work from the highly related fields of genome assembly and transcriptomics. It's worth noting that many of our submissions are in the "application" stage, and indeed may be thrown out. The claims of a final submission, after the review steps, often look very different, as outlined above. We also have the USPTO review and point out previous work. That said, more prior art is better, and welcome.

Thus, if a person trained in reading patent claims (or anyone really) feels like we've overlooked work from 2013 (or any of the submission dates) & before, we'd like to hear from you on this issue. It is even better if you can name which claims and applications it relates to, and which sections of the paper. We'll happily submit the evidence to the USPTO and remove or revise those claims. Please keep in mind that sometimes we need to wait for the USPTO to throw it out, depending on which stage we're at. Please do not hesitate to write to ip@sbgenomics.com, which goes directly to our in-house counsel.

Likewise, we'd like to have an open channel of communication in general, so if you have any other questions or concerns on related issues, please email us.

Best, Deniz

(Titus sez: Deniz is on Twitter at @denizkural.)