## June 28, 2016

### Matthieu Brucher

#### Analog modeling: Triode circuit

When I started reviewing the diode clippers, the goal was to end up modeling a triode simple preamp. Thanks to Ivan Cohen from musical entropy, I’ve finally managed to drive the proper equation system to model this specific type of preamp.

#### Schematics

Let’s have a look at the circuit:

Triode simple model

There are several things to notice:

• We need to have equations of the triode based on the voltage on its bounds
• There is a non-null steady state, meaning there is current in the circuit when there is no input

For the equations, I’ve used once again Ivan Cohen’s work available in his papers (the modified Koren’s equations), available in Audio Toolkit.

So, for the second point, we need to compute the steady state of the circuit. This can be achieved by putting the input to the ground voltage and remove the capacitors. Once this is done, we can have the final equations of the system in $y$ (for voltage of the plate, the grid, and finally the cathode):

$F = \begin{matrix} y(0) - V_{Bias} + I_p * R_p \\ I_g * R_g + y(1) \\ y(2) - (I_g + I_p) * R_k \end{matrix}$

The Jacobian:

$J= \begin{matrix} 1 + R_p * \frac{dI_p}{V_{pk}} && R_p * \frac{dI_p}{V_{gk}} && -R_p * (\frac{dI_p}{V_{pk}} + \frac{dI_p}{V_{gk}}) \\ \frac{dI_g}{V_{pk}} * R_g && 1 + R_g * \frac{dI_g}{V_{gk}} && -R_g * (\frac{dI_g}{V_{gk}} + \frac{dI_g}{V_{pk}}) \\ -(\frac{dI_p}{V_{pk}} + Ib_Vce) * R_k && -(\frac{dI_c}{V_{gk}} + \frac{dI_g}{V_{gk}}) * R_k && 1 + (\frac{dI_p}{V_{gk}} + \frac{dI_p}{V_{pk}} + \frac{dI_g}{V_{gk}} + \frac{dI_g}{V_{pk}}) * R_k \end{matrix}$

With this system, we can run a Newton Raphson optimizer to find the proper stable state of the system. It may require lots of iterations, but this is not a problem: it’s done once at the beginning, and then we will use the next system for computing the new state when we input a signal.

#### Transient state

As in the previous analog modeling posts, I’m using the SVF/DK-method to simplify the ODE (to remove the derivative dependency, turning the ODE in a non linear system). So there are two systems to solve. The first one is the ODE with traditional Newton Raphson optimizer (from $x$, we want to compute $y=\begin{matrix} V_k \\ V_(out) - V_p \\ V_p \\ V_g \end{matrix}$):

$F = \begin{matrix} I_b + I_c + i{ckeq} - y(0) * (1/R_k + 2*C_k/dt) \\ i_{coeq} + (y(1) + y(2)) / R_o + y(1) * 2*C_o/dt \\ (y(2) - V_{Bias}) / R_p + (I_p + (y(1) + y(2)) / R_o) \\ (y(3) - x(i)) / R_g + I_g \end{matrix}$

Which makes the Jacobian:

$J= \begin{matrix} -(\frac{dI_g}{V_{gk}} + \frac{dI_p}{V_{gk}} + \frac{dI_g}{V_{pk}} + \frac{dI_p}{V_{pk}}) - 1/R_k + 2*C_k/dt && 0 && (\frac{dI_g}{V_{pk}} + \frac{dI_p}{V_{pk}}) && (\frac{dI_g}{V_{gk}} + \frac{dI_p}{V_{gk}}) \\ 0 && 1/R_o + 2*C_o/dt && 1/R_o && 0 \\ -(\frac{dI_p}{V_{gk}} + \frac{dI_p}{V_{pk}}) && 1/R_o && 1/R_p + 1/R_o + \frac{dI_p}{V_{pk}} && \frac{dI_p}{V_{gk}} \\ -(\frac{dI_g}{V_{gk}} + \frac{dI_g}{V_{pk}}) && 0 && \frac{dI_g}{V_{pk}} && \frac{dI_g}{V_{gk}} + 1/R_g \end{matrix}$

Once more, this system can be optimized with a classic NR optimizer, this time in a few iterations (3 to 5, depending on the oversampling, the input signal…)

The updates for ickeq and icoeq are:

$\begin{matrix} i{ckeq} \\ i_{coeq} \end{matrix} = \begin{matrix} 4 * C_k / dt * y(1) - i{ckeq} \\ -4 * C_o / dt * y(2) - i_{coeq} \end{matrix}$

Of course, we need a start value for the steady state. It is quite simple as the previous state is the same as the new one which makes the equations:

$\begin{matrix} i{ckeq} \\ i_{coeq} \end{matrix} = \begin{matrix} 2 * C_k / dt * y(1) \\ -2 * C_o / dt * y(2) \end{matrix}$

#### Result

Tube behavior for a 100Hz signal (10V)

You can spot that in the beginning of the sinusoid, the output signal is not stable, the average moves down with time. This is to be expected as the tube obviously compresses the negative side of the sinusoid, while it almost chops the positive side after 30 to 40V. The non symmetric behavior is what gives the warmth to the tube. The even harmonics are also clear with a sine sweep of the system:

Sine sweep on the triode circuit

Strangely enough, even if the signal seems quite distorted, really harmonics are not strong (compared to SD1 or TS9). This makes me believe that it is a reason for the love of the user for tube preamps.

#### Conclusion

There are a few new developments here compared to my previous posts (SD1 and TS9). The first is the fact that we have a more complex system, with several capacitors that need to be individually simulated. This leads to a vectorial implementation of the NR algorithm.

The second one is the requirement to compute a steady state that is not zero everywhere. Without this, the first iterations of the system would be more than chaotic, not realistic and hard on the CPU. Definitely not what you would want!

Based on these two developments, it is now possible to easily develop more complex circuit modeling. Even if the cost here is high (due to the complex triode equations), the conjunction of the NR algorithm with the DK method makes it doable to have real-time simulation.

## June 27, 2016

### Continuum Analytics news

#### Continuum Analytics Unveils Anaconda Mosaic to Make Enterprise Data Transformations Portable for Heterogeneous Data

Posted Tuesday, June 28, 2016

Empowers data scientists and analysts to explore, visualize, and catalog data and transformations for disparate data sources enabling data portability and faster time-to-insight

AUSTIN, TX—June 28, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced the availability of Anaconda Mosaic. With the ability to easily create and catalog transformations against heterogeneous data stores, Anaconda Mosaic empowers data scientists, quants and business analysts to interactively explore, visualize, and transform larger-than-memory datasets to more quickly discover new insights.

Enterprise data architecture is becoming increasingly complex. Data stores have a relatively short half life and data is being shifted to new data stores - NoSQL, SQL, flat files - at a higher frequency. In order for organizations to find insights from the data they must first find existing transformations and rewrite the transformations for the new data store. This creates delays in getting the insights from the data. Continuum Analytics’ Anaconda Mosaic enables organizations to quickly explore, visualize, and redeploy transformations based on pandas and SQL without rewriting the transformations while maintaining governance by tracking data lineage and provenance.

“Through the course of daily operations, businesses accumulate huge amounts of data that gets locked away in legacy databases and flat file repositories. The transformations that made the data usable for analysis gets lost, buried or simply forgotten,” said Michele Chambers, Executive Vice President Anaconda Business Unit & CMO at Continuum Analytics. “Our mission is for Anaconda Mosaic to unlock the mystery of this dark data, making it accessible for businesses to quickly redeploy to new data stores without any refactoring so enterprises can reap the analytic insight and value almost instantly. By eliminating refactoring of transformations, enterprises dramatically speed up their time-to-value, without having to to spend lengthy cycles on the refactoring process.”

Some of the key features of Anaconda Mosaic include:

• Visually explore your data. Mosaic provides built-in visualizations for large heterogeneous datasets that makes it easy for data scientists and business analysts to accurately understand data including anomalies.
• Instantly get portable transformations. Create transformations with the expression builder to catalog data sources and transformations. Execute the transformation against heterogeneous data stores while tracking data lineage and provenance. When data stores changes, simply deploy the transformations and quickly get the data transformed and ready for analysis.
• Write once, compute anywhere. For maximum efficiency, Mosaic translates transformations and orchestrates computation execution on the data backend, minimizing the costly movement of data across the network and taking full advantage of the built-in highly optimized code featured in the data backend. Users can access data in multiple data stores with the same code without rewriting queries or analytic pipelines.
• Harvest large flat file repositories in place: Mosaic combines flat files, adds derived data and filters for performance easily. This allows users to describe the structure of their data in large flat file repositories and uses that description in data discovery, visualization, and transformations, saving the user from writing tedious ETL code.  Mosaic ensures that the data loaded is only what is necessary to compute the transformation, which can lead to significant memory and performance gains.

Continuum Analytics is hosting a webinar which will take attendees through how to use Mosaic to simplify transformations and get to faster insights on June 30. Please register here

Mosaic is available to current Anaconda Enterprise subscribers; to find out more about Anaconda Mosaic, get in touch

Continuum Analytics is the creator and driving force behind Anaconda, the leading, Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team – data scientists, developers, DevOps, data engineers and business analysts – to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

###

Media Contact:
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

#### Anaconda Fusion: A Portal to Open Data Science for Excel

Posted Monday, June 27, 2016

## AnacondaFusion_stacked_RGB.png

Excel has been business analysts’ go-to program for years. It works well, and its familiarity makes it the currency of the realm for many applications.

But, in a bold new world of predictive analytics and Big Data, Excel feels cut off from the latest technologies and limited in the scope of what it can actually take on.

Fortunately for analysts across the business world, a new tool has arrived to change the game — Anaconda Fusion.

## A New Dimension of Analytics

The interdimensional portal has been a staple of classic science fiction for decades. Characters step into a hole in space and emerge instantly in an entirely different setting — one with exciting new opportunities and challenges.

Now, Data Science has a portal of its own. The latest version of Anaconda Fusion, an Open Data Science (ODS) integration for Microsoft Excel, links the familiar world of spreadsheets (and the business analysts that thrive there) to the “alternate dimension” of Open Data Science that is reinventing analytics.

With Anaconda Fusion and other tools from Anaconda, business analysts and data scientists can share work — like charts, tables, formulas and insights — across Excel and ODS languages such as Python easily, erasing the partition that once divided them.

Jupyter (formerly IPython) is a popular approach to sharing across the scientific computing community, with notebooks combining  code, visualizations and comments all in one document. With Anaconda Enterprise Notebooks, this is now available under a governed environment, providing the collaborative locking, version control, notebook differencing and searching needed to operate in the enterprise. Since Anaconda Fusion, like the entire Anaconda ecosystem, integrates seamlessly with Anaconda Enterprise Notebooks, businesses can finally empower Excel gurus to collaborate effectively with the entire Data Science team.

Now, business analysts can exploit the ease and brilliance of Python libraries without having to write any code. Packages such as scikit-learn and pandas drive machine learning initiatives, enabling predictive analytics and data transformations, while plotting libraries, like Bokeh, provide rich interactive visualizations.

With Anaconda Fusion, these tools are available within the familiar Excel environment—without the need to know Python. Contextually-relevant visualizations generated from Python functions are easily embedded into spreadsheets, giving business analysts the ability to make sense of, manipulate and easily interpret data scientists’ work.

## A Meeting of Two Cultures

Anaconda Fusion is connecting two cultures from across the business spectrum, and the end result creates enormous benefits for everyone.

Business analysts can leverage the power, flexibility and transparency of Python for data science using the Excel they are already comfortable with. This enables functionality far beyond Excel, but also can teach business analysts to use Python in the most natural way: gradually, on the job, as needed and in a manner that is relevant to their context. Given that the world is moving more and more toward using Python as a lingua franca for analytics, this benefit is key.

On the other side of the spectrum, Python-using data scientists can now expose data models or interactive graphics in a well-managed way, sharing them effectively with Excel users. Previously, sharing meant sending static images or files, but with Anaconda Fusion, Excel workbooks can now include a user interface to models and interactive graphics, eliminating the clunky overhead of creating and sending files.

It’s hard to overstate how powerful this unification can be. When two cultures learn to communicate more effectively, it results in a cross-pollination of ideas. New insights are generated, and synergistic effects occur.

## The Right Tools

The days of overloaded workarounds are over. With Anaconda Fusion, complex and opaque Excel macros can now be replaced with the transparent and powerful functions that Python users already know and love.

The Python programming community places a high premium on readability and clarity. Maybe that’s part of why it has emerged as the fourth most popular programming language used today. Those traits are now available within the familiar framework of Excel.

Because Python plays so well with web technologies, it’s also simple to transform pools of data into shareable interactive graphics — in fact, it's almost trivially easy. Simply email a web link to anyone, and they will have a beautiful graphics interface powered by live data. This is true even for the most computationally intense cases — Big Data, image recognition, automatic translation and other domains. This is transformative for the enterprise.

## Jump Into the Portal

The glowing interdimensional portal of Anaconda Fusion has arrived, and enterprises can jump in right way. It’s a great time to unite the experience and astuteness of business analysts with the power and flexibility of Python-powered analytics.

## June 23, 2016

### Enthought

#### 5 Simple Steps to Create a Real-Time Twitter Feed in Excel using Python and PyXLL

PyXLL 3.0 introduced a new, simpler, way of streaming real time data to Excel from Python. Excel has had support for real time data (RTD) for a long time, but it requires a certain knowledge of COM to get it to work. With the new RTD features in PyXLL 3.0 it is now a lot […]

## June 21, 2016

### Enthought

#### AAPG 2016 Conference Technical Presentation: Unlocking Whole Core CT Data for Advanced Description and Analysis

Microscale Imaging for Unconventional Plays Track Technical Presentation: Unlocking Whole Core CT Data for Advanced Description and Analysis American Association of Petroleum Geophysicists (AAPG) 2016 Annual Convention and Exposition Technical Presentation Tuesday June 21st at 4:15 PM, Hall B, Room 2, BMO Centre, Calgary Presented by: Brendon Hall, Geoscience Applications Engineer, Enthought, and Andrew Govert, Geologist, […]

### Matthieu Brucher

#### Audio Toolkit: Transient splitter

After my transient shaper, some people told me it would be nice to have a splitter: split the signal in two tracks, one with the transient, another with the sustain. For instance, it would be interesting to apply a different distortion on both signals.

So for instance this is what could happen for a simple signal. The sustain signal is not completely shut off, and there can be a smooth transition between he two signals (thanks to the smoothness parameter). Of course, the final signals have to sum back to the original signal.

How a transient splitter would work

I may end up doing a stereo version (with M/S capabilities) for the splitter, but maybe also another one with some distortion algorithms before everything is summed up again.

Let me know what you think about these idea.

## June 19, 2016

### Filipe Saraiva

In the end of May, ~20 gearheads from different countries of Latin America were together in Rio de Janeiro working in several fronts of the KDE. This is our ‘multiple projects sprint’ named LaKademy!

Like all previous editions of LaKademy, this year I worked hard in Cantor; unlike all previous editions, this year I did some work in new projects to be released in some point in the future. So, let’s see my report of LaKademy 2016.

## Cantor

LaKademy is very important to Cantor development because during the sprint I can to focus and work hard to implement great features to the software. In past editions I started the Python 2 backend development, ported Cantor to Qt5/KF5, drop kdelibs4support, and more.

This year is the first LaKademy after I got the maintainer status of Cantor and, more amazing, it is the first edition where I was not the only developer working in Cantor: we had a team working in different parts of the project.

My main work was to perform a heavy bug triage in Cantor, closing old bugs and confirming some of them. In addition I could to fix several bugs like the LaTeX rendering and the crash after close the window for Sage backend, or the fix for plot commands for Octave backend.

My second work was to help the others developers working in Cantor, I was very happy to work with different LaKademy attendees in the software. I helped Fernando Telles, my SoK 2015 student, to fix the support for Sage backend for Sage version > 7.2. Wagner Reck was working in a new backend for Root, the scientific programming framework developed by CERN. Rafael Gomes created a Docker image to Cantor in order to make easy the environment configuration, build, and code contribution for new developers. He wants to use it in other KDE software and I am really excited to see Cantor as the first software in this experiment.

Other relevant work was some discussions with other developers about the selection of an “official” technology to create backends for Cantor. Currently Cantor has backends developed in several ways: some of them use C/C++ APIs, others use Q/KProcess, others use DBus… you can think about how to maintain all these backends is a work for crazy humans.

I did not select the official technology yet. Both DBus and Q/KProcess has advantages and disadvantages (DBus is a more ‘elegant’ solution but bring Cantor to other OS can be more easy if we use Q/KProcess)… well, I will wait for the new DBus-based Julia backend, in development by our GSoC 2016 student, to make decision about which solution to use.

From left to right: Ronny, Fernando, Ícaro, and me ;)

## New projects: Sprat and Leibniz (non-official names)

This year I could to work in some new projects to be released in the future. Their provisional names are Sprat and Leibniz.

Sprat is a text editor to write drafts of scientific papers. The scientific text follows some patterns of sentences and communication figures. Think about “A approach based in genetic algorithm was applied to the travel salesman problem”: it is easy to identify the pattern in that text. Linguistics has worked in this theme and it is possible to classify sentences based in the communication objective to be reached for a sentence. Sprat will allow to the user to navigate in a set of sentences and select them to create drafts of scientific papers. I intent to release Sprat this year, so please wait for more news soon.

Leibniz is Cantor without worksheets. Sometimes you want just to run your mathematical method, your scientific script, and some related computer programs, without to put explanations, figures, or videos in the terminal. In KDE world we have amazing technologies to allow us to develop a “Matlab-like” interface (KonsolePart, KTextEditor, QWidgets, and plugins) to all kind of scientific programming languages like Octave, Python, Scilab, R… just running these programs in KonsolePart we have access to syntax highlighting, tab completion… I would like to have a software like this so I started the development. I decided to develop a new software and not a new view to Cantor because I think the source code of Leibniz will be small and more easy to maintain.

So, if you are excited with some of them, let me know in comments below and wait a few months for more news!

During LaKademy we had our promo meeting, an entire morning to discuss KDE promo actions in Latin America. KDE will have a day of activities at FISL and we are excited to make amazing KDE 20th birthday parties in the main free software events in Brazil. We also evaluated and discussed the continuation of some interesting activities like Engrenagem (our videocast series) and new projects like demo videos for KDE applications.

In that meeting we also decided the city to host LaKademy 2017: Belo Horizonte! We expect to have a incredible year with KDE activities in Latin America to be evaluated in our next promo meeting.

## Conclusion: “O KDE na América Latina continua lindo“

This edition of LaKademy had strong and dedicated work by all attendees in several fronts of KDE, but we had some moments to stay together and consolidate our community and friendship. Unfortunately we did not have time to explore Rio de Janeiro (it was my first time in the city) but I had good impressions of the city and their people. I intent to go back to there, maybe this year yet.

The best part of to be a member of a community like KDE is to make friends for the life, people with you like to share beers and food while chat about anything. This is amazing for me and I found it in KDE. <3

Thank you KDE and see you soon in next LaKademy!

## June 17, 2016

### Continuum Analytics news

#### Anaconda and Docker - Better Together for Reproducible Data Science

Posted Monday, June 20, 2016

Anaconda integrates with many different providers and platforms to give you access to the data science libraries you love on the services you use, including Amazon Web Services, Microsoft Azure, and Cloudera CDH. Today we’re excited to announce our new partnership with Docker.

As part of the announcements at DockerCon this week, Anaconda images will be featured in the new Docker Store, including Anaconda and Miniconda images based on Python 2 and Python 3. These freely available Anaconda images for Docker are now verified, will be featured in the Docker Store when it launches, are being regularly scanned for security vulnerabilities and are available from the ContinuumIO organization on Docker Hub.

## anaconda_docker_1.png

The Anaconda images for Docker make it easy to get started with Anaconda on any platform, and provide a flexible starting point for developing or deploying data science workflows with more than 100 of the most popular Open Data Science packages for Python and R, including data analysis, visualization, optimization, machine learning, text processing and more.

Whether you’re a developer, data scientist, or devops engineer, Anaconda and Docker can provide your entire data science team with a scalable, deployable and reproducible Open Data Science platform.

## Use Cases with Anaconda and Docker

Anaconda and Docker are a great combination to empower your development, testing and deployment workflows with Open Data Science tools, including Python and R. Our users often ask whether they should be using Anaconda or Docker for data science development and deployment workflows. We suggest using both - they’re better together!

Anaconda’s sandboxed environments and Docker’s containerization complement each other to give you portable Open Data Science functionality when you need it - whether you’re working on a single machine, across a data science team or on a cluster.

## anaconda_docker_2.png

Here are a few different ways that Anaconda and Docker make a great combination for data science development and deployment scenarios:

1) Quick and easy deployments with Anaconda

Anaconda and Docker can be used to quickly reproduce data science environments across different platforms. With a single command, you can quickly spin up a Docker container with Anaconda (and optionally with a Jupyter Notebook) and have access to 720+ of the most popular packages for Open Data Science, including Python and R.

2) Reproducible build and test environments with Anaconda

At Continuum, we’re using Docker to build packages and libraries for Anaconda. The build images are available from the ContinuumIO organization on Docker Hub (e.g., conda-builder-linux and centos5_gcc5_base). We also use Docker with continuous integration services, such as Travis CI, for automated testing of projects across different platforms and configurations (e.g., Dask.distributed and hdfs3).

Within the open-source Anaconda and conda community, Docker is also used for reproducible test and build environments. Conda-forge is a community-driven infrastructure for conda recipes that uses Docker with Travis CI and CircleCI to build, test and upload conda packages that include Python, R, C++ and Fortran libraries. The Docker images used in conda-forge are available from the conda-forge organization on Docker Hub.

3) Collaborative data science workflows with Anaconda

You can use Anaconda with Docker to build, containerize and share your data science applications with your team. Collaborative data science workflows with Anaconda and Docker make the transition from development to deployment as easy as sharing a Dockerfile and conda environment.

## anaconda_docker_3.png

Once you’ve containerized your data science applications, you can use container clustering systems, such as Kubernetes or Docker Swarm, when you’re ready to productionize, deploy and scale out your data science applications for many users.

## anaconda_docker_4.png

4) Endless combinations with Anaconda and Docker

The combined portability of Anaconda and flexibility of Docker enable a wide range of data science and analytics use cases.

A search for “Anaconda“ on Docker Hub shows many different ways that users are leveraging libraries from Anaconda with Docker, including turnkey deployments of Anaconda with Jupyter Notebooks; reproducible scientific research environments; and machine learning and deep learning applications with Anaconda, TensorFlow, Caffe and GPUs.

## Using Anaconda Images with Docker

There are many ways to get started using the Anaconda images with Docker. First, choose one of the Anaconda images for Docker based on your project requirements. The Anaconda images include the default packages listed here, and the Miniconda images include a minimal installation of Python and conda.

continuumio/anaconda (based on Python 2.7)
continuumio/anaconda3 (based on Python 3.5)
continuumio/miniconda (based on Python 2.7)
continuumio/miniconda3 (based on Python 3.5)

For example, we can use the continuumio/anaconda3 image, which can be pulled from the Docker repository:

$docker pull continuumio/anaconda3 Next, we can run the Anaconda image with Docker and start an interactive shell: $ docker run -i -t continuumio/anaconda3 /bin/bash

Once the Docker container is running, we can start an interactive Python shell, install additional conda packages or run Python applications.

Alternatively, we can start a Jupyter Notebook server with Anaconda from a Docker image:

$docker run -i -t -p 8888:8888 continuumio/anaconda3 /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8888 --no-browser" You can then view the Jupyter Notebook by opening http://localhost:8888 in your browser, or http://<DOCKER-MACHINE-IP>:8888 if you are using a Docker Machine VM. Once you are inside of the running notebook, you can import libraries from Anaconda, perform interactive computations and visualize your data. ## anaconda_docker_5.png ## Additional Resources for Anaconda and Docker Anaconda and Docker complement each other and make working with Open Data Science development and deployments easy and scalable. For collaborative workflows, Anaconda and Docker provide everyone on your data science team with access to scalable, deployable and reproducible Open Data Science. Get started with Anaconda with Docker by visiting ContinuumIO organization on Docker Hub. The Anaconda images will also be featured in the Docker Store when it launches. Interested in using Anaconda and Docker in your organization for Open Data Science development, reproducibility and deployments? Get in touch with us if you’d like to learn more about how Anaconda can empower your enterprise with Open Data Science, including an on-premise package repository, collaborative notebooks, cluster deployments and custom consulting/training solutions. ## June 14, 2016 ### Continuum Analytics news #### Orange Part II: Monte Carlo Simulation Posted Wednesday, June 22, 2016 For the blog post Orange Part I: Building Predictive Models, please click here. In this blogpost, we will explore the versatility of Orange through a Monte Carlo simulation of Apple’s stock price. For an explanation on Monte Carlo simulation for stocks, visit Investopedia Let’s take a look at our schema: ## 8 copy.png We start off by grabbing AAPL stock data off of Yahoo! Finance and loading it into our canvas. This gives us all of AAPL’s data starting from 1980, but we only want to look at relatively recent data. Fortunately, Orange comes with a variety of data management and preprocessing techniques. Here, we can use the “Purge Domain” widget to simply remove the excess data. After doing so, we can see what AAPL’s closing stock price is post-2008 through a scatter plot. ## 9.png In order to run our simulation, we need certain inputs, including daily returns. Our data does not come with AAPL’s daily returns, but, fortunately, daily returns can easily be calculated via pandas. We can save our current data, add on our daily returns and then load up the modified dataset back into our canvas. After saving the data to AAPL.tab, we run the following script on our data: https://anaconda.org/rahuljain/monte-carlo-with-orange/notebook. After doing so, we simply load up the new data. Here is what our daily returns look like: ## 10.png Now, we need to use our daily returns to run the Monte Carlo simulation. We can again use a Python script for this task; this time let’s use the built-in Python script widget. Note that we could have used the built-in widget for the last script as well, but we wanted to see how we could save/load our data within the canvas. For our Monte Carlo simulation, we will need four parameters: starting stock price, number of days we want to simulate and the standard deviation and mean of AAPL’s daily returns. We can find these inputs with the following script: ## 11.png We go ahead and run our simulation 1000 times with the starting stock price of$125.04. The script takes in our stock data and outputs a dataset containing 1000 price points 365 days later.
We can visualize these prices via a box plot and histograms:

## 14.png

With this simulated data, we can make various calculations; a common calculation is Value at Risk (VaR). Here, we can say with 99% confidence that our stock’s price will be above $116.41 in 365 days, so we are putting$8.63 (starting price - 116.41) at risk 99% of the time.

We have successfully built a monte carlo simulation via Orange; this task demonstrated how we can use Orange outside of its machine learning tools.

## Summary

These three demos in this Orange blogpost series showed how Orange users can quickly and intuitively work with data sets.  Because of component-based design and integration with Python, Orange should appeal to machine learning researchers for the speed of execution and ease of prototyping of new methods. Graphical user’s interface is provided through visual programming and a large toolbox of widgets that support interactive data exploration. Component-based design, both on the level of procedural and visual programming, ﬂexibility in combining components to design new machine learning methods and data mining applications and user-friendly environment are also the most signiﬁcant attributes of Orange and where Orange can make its biggest contribution to the community.

#### Orange Part I: Building Predictive Models

Posted Wednesday, June 15, 2016

In this blog series we will showcase Orange, an open source data visualization and data analysis tool, through two simple predictive models and a Monte Carlo Simulation.

## Introduction to Orange

Orange is a comprehensive, component-based framework for machine learning and data mining. It is intended for both experienced users and researchers in machine learning, who want to prototype new algorithms while reusing as much of the code as possible, and for those just entering the ﬁeld who can either write short Python scripts for data analysis or enjoy the powerful, easy-to-use visual programming environment. Orange includes a range of techniques, such as data management and preprocessing, supervised and unsupervised learning, performance analysis and a range of data and model visualization techniques.

Orange has a visual programming front-end for explorative data analysis and visualization called Orange Canvas. Orange Canvas is a visual, component-based programming approach that allows us to quickly explore and analyze data sets. Orange’s GUI is composed of widgets that communicate through channels; a set of connected widgets is called a schema. The creation of schemas is quick and flexible, because widgets are added on through a drag-and-drop method.

Orange can also be used as a Python library. Using the Orange library, it is easy to prototype state-of-the-art machine learning algorithms.

## Building a Simple Predictive Model in Orange

We start with two simple predictive models in the Orange canvas and their corresponding Jupyter notebooks.

First let’s take a look at our Simple Predictive Model- Part 1 notebook. Now, let’s recreate the model in the Orange Canvas. Here is the schema for predicting the results of the Iris data set via a classification tree in Orange:

## 1.png

Notice the toolbar on the left of the canvas- this is where the 100+ widgets can be found and dragged onto the canvas. Now, let’s take a look at how this simple schema works. The schema reads from left to right, with information flowing from widget to widget through the pipelines. After the Iris data set is loaded in, it can be viewed through a variety of widgets. Here, we chose to see the data in a simple data table and a scatter plot. When we click on those two widgets, we see the following:

## 3.png

With just three widgets, we already get a sense of the data we are working with. The scatter plot has an option to “Rank Projections,” determining the best way to view our data. In this case, having the scatter plot as “Petal Width vs Petal Length” allows us to immediately see a potential pattern in the width of a flower’s petal and the type of iris the flower is. Beyond scatter plots, there are a variety of different widgets to help us visualize our data in Orange.

Now, let’s look at how we built our predictive model. We simply connected the data to a Classification Tree widget and can view the tree through a Classification Tree Viewer widget.

## 4.png

We can see exactly how our predictive model works. Now, we connect our model and our data to the “Test and Score” and “Predictions” widgets. The Test and Score widget is one way of seeing how well our Classification Tree performs:

## 5.png

The Predictions widget predicts the type of iris flower given the input data. Instead of looking at a long list of these predictions, we can use a confusion matrix to see our predictions and their accuracy.

## 6.png

Thus, we see our model misclassified 3/150 data instances.

We have seen how quickly we can build and visualize a working predictive model in the Orange canvas. Now, let’s take a look at how the exact same model can once again be built via scripting in Orange, a Python 3 data mining library

## Building a Predictive Model with a Hold Out Test Set in Orange

In our second example of a predictive model, we make the model slightly more complicated by holding out a test set. By doing so, we can use separate datasets to train and test our model, thus helping to avoid overfitting. Here is the original notebook.

Now, let’s build the same predictive model in the Orange Canvas. The Orange Canvas will allow us to better visualize what we are building.

Orange Schema:

## 7.png

As you can tell, the difference between Part 1 and Part 2 is the Data Sampler widget. This widget randomly separates 30% of the data into the testing data set. Thus, we can build the same model, but more accurately test it using data the model has never seen before.

This example shows how easy it is to modify existing schemas. We simply introduced one new widget to vastly improve our model.

Now let’s look at the same model built via the Orange Python 3 library.

## Summary

In this blogpost, we have introduced Orange, an open source data visualization and data analysis tool, and presented two simple predictive models. In our next blogpost, we will instruct how to build a Monte Carlo Simulation done with Orange.

### Matthieu Brucher

#### Analog modeling: SD1 vs TS9

There are so many different distortion/overdrive/fuzz guitar pedals, and some have a better reputation than other. Two of them have a reputation of being closed (one copied on the other), and I already explained how one of these could be modeled (and I have a plugin with it!). So let’s work on comparing the SD1 and the TS9.

#### Global comparison

I won’t focus on the input and output stage, although they can play a role in the sound (especially since the output stage is the only difference between the TS9 and the TS808…).

Let’s have a look at the schematics:

SD1 annotated schematic

TS9 schematic

The global circuits seem similar, with similar functionalities. The overdrives are heavily closed, and actually without the asymmetry of the SD1, they would be identical (there are versions of SD1 with the 51p capacitor or a close enough value). The tone circuits have more differences, with the TS9 having an additional 10k resistor and a “missing” capacitor around the AOP. The values are also quite different, but based on a similar design.

Now, the input stages are different. SD1 has a higher input capacitor, but removes around 10% of the input signal compared to 1% for TS9 (not accounting for the guitar output impedance). Also there are two high pass filters on SD1 with the same cut frequency at 50Hz, whereas the TS9 has “only” one at 100Hz. They more or less end up being similar. For the output, the SD1 ditches 33% of the final signal before the output stage that also has a high pass filter at 20Hz and finally another one at 10Hz. The TS9 has also a 20Hz high pass but it is followed by another 1Hz high pass. All things considered, except for the overdrive and the tone circuits, there should be not audible difference on a guitar, but I wouldn’t advice either pedal for a bass guitar, the input stages are chopping off too much.

#### Overdrive circuit

The overdrive circuits are almost a match. The only difference is that the potentiometer has double resistance on the SD1 and there is 2 diodes in one path (the capacitor has no impact according to the LTSpice simulation I ran). And leads to exactly what I expected for similar drive value:

SD1 and TS9 behavior on a 100Hz signal

This is the behavior for all frequencies. The only difference is the slightly small voltage on the down part of the curve. This shows up more clearly on the spectrum:

SD1 sine sweep with an oversampling x4

TS9 sine sweep with an oversampling x4

To limite the noise in this case, I ran the sine sweep again, with an oversampling of x8. The difference with the additional even frequencies in SD1 is obvious.

SD1 sine sweep with an oversampling x8

TS9 sine sweep with an oversampling x8

#### Tone

The tone circuit is a nightmare to compute by hand. The issues are with the simplification of the potentiometer in the equations. I did it for the SD1 tone circuit, and as the TS9 is a little bit different, I had to start over (several years after solving SD1 :/).

I won’t display the equations here, the coefficients can be found in the pedal tone stack filters in Audio Toolkit. Suffice to say that TS9 can be a high pass filter whereas SD1 is definitely an EQ. The different behavior is obvious in the following pictures:

SD1 tone transfer function

TS9 tone transfer function

The transfer functions are different even if their analog circuit is quite similar. This is definitely the difference that people hear between SD1 and TS9.

#### Conclusion

The two pedals are quite similar when checking the circuit, and even if SD1 is labelled as an asymmetric overdrive, the actual sound different between the two pedals may be more related to the tone circuit than the overdrive.

Now that these filters are available in Audio Toolkit, it is easy to try different combinations!

## June 13, 2016

### Fernando Perez

#### In Memoriam, John D. Hunter III: 1968-2012

I just returned from the SciPy 2013 conference, whose organizers kindly invited me to deliver a keynote. For me this was a particularly difficult, yet meaningful edition of SciPy, my favorite conference. It was only a year ago that John Hunter, creator of matplotlib, had delivered his keynote shortly before being diagnosed with terminal colon cancer, from which he passed away on August 28, 2012 (if you haven't seen his talk, I strongly recommend it for its insights into scientific open source work).

On October 1st 2012, a memorial service was held at the University of Chicago's Rockefeller Chapel, the location of his PhD graduation. On that occasion I read a brief eulogy, but for obvious reasons only a few members from the SciPy community were able to attend. At this year's SciPy conference, Michael Droetboom (the new project leader for matplotlib) organized the first edition of the John Hunter Excellence in Plotting Contest, and before the awards ceremony I read a slightly edited version of the text I had delivered in Chicago (you can see the video here). I only made a few changes for brevity and to better suit the audience of the SciPy conference. I am reproducing it below.

I also went through my photo albums and found images I had of John. A memorial fund has been established in his honor to help with the education of his three daughers Clara, Ava and Rahel (Update: the fund was closed in late 2012 and its proceeds given to the family; moving forward, NumFOCUS sponsors the John Hunter Technology Fellowship, that anyone can make contributions to).

Dear friends and colleagues,

I used to tease John by telling him that he was the man I aspired to be when I grew up. I am not sure he knew how much I actually meant that. I first met him over email in 2002, when IPython was in its infancy and had rudimentary plotting support via Gnuplot. He sent me a patch to support a plotting syntax more akin to that of matlab, but I was buried in my effort to finish my PhD and couldn’t deal with his contribution for at least a few months. In the first example of what I later came to know as one of his signatures, he kindly replied and then simply routed around this blockage by single-handedly creating matplotlib. For him, building an entire new visualization library from scratch was the sensible solution: he was never one to be stopped by what many would consider an insurmountable obstacle.

Our first personal encounter was at SciPy 2004 at Caltech. I was immediately taken by his unique combination of generous spirit, sharp wit and technical prowess, and over the years I would grow to love him as a brother. John was a true scholar, equally at ease in a conversation about monetary policy, digital typography or the intricacies of C++ extensions in Python. But never once would you feel from him a hint of arrogance or condescension, something depressingly common in academia. John was driven only by the desire to work on interesting questions and to always engage others in a meaningful way, whether solving their problems, lifting their spirits or simply sharing a glass of wine. Beneath a surface of technical genius, there lied a kind, playful and fearless spirit, who was quietly comfortable in his own skin and let the power of his deeds speak for him.

Beyond the professional context, John had a rich world populated by the wonders of his family, his wife Miriam and his daughters Clara, Ava and Rahel. His love for his daughters knew no bounds, and yet I never once saw him clip their wings out of apprehension. They would be up on trees, dangling from monkeybars or riding their bikes, and he would always be watchful but encouraging of all their adventures. In doing so, he taught them to live like he did: without fear that anything could be too difficult or challenging to accomplish, and guided by the knowledge that small slips and failures were the natural price of being bold and never settling for the easy path.

A year ago in this same venue, John drew lessons from a decade’s worth of his own contributions to our community, from the vantage point of matplotlib. Ten years earlier at U. Chicago, his research on pediatric epilepsy required either expensive and proprietary tools or immature free ones. Along with a few similarly-minded folks, many of whom are in this room today, John believed in a future where science and education would be based on openly available software developed in a collaborative fashion. This could be seen as a fool’s errand, given that the competition consisted of products from companies with enormous budgets and well-entrenched positions in the marketplace. Yet a decade later, this vision is gradually becoming a reality. Today, the Scientific Python ecosystem powers everything from history-making astronomical discoveries to large financial modeling companies. Since all of this is freely available for anyone to use, it was possible for us to end up a few years ago in India, teaching students from distant rural colleges how to work with the same tools that NASA uses to analyze images from the Hubble Space Telescope. In recognition of the breadth and impact of his contributions, the Python Software Foundation awarded him posthumously the first installment of its highest distinction, the PSF Distinguished Service Award.

John’s legacy will be far-reaching. His work in scientific computing happened in a context of turmoil in how science and education are conducted, financed and made available to the public. I am absolutely convinced that in a few decades, historians of science will describe the period we are in right now as one of deep and significant transformations to the very structure of science. And in that process, the rise of free openly available tools plays a central role. John was on the front lines of this effort for a decade, and with his accomplishments he shone brighter than most.

John’s life was cut far, far too short. We will mourn him for time to come, and we will never stop missing him. But he set the bar high, and the best way in which we can honor his incredible legacy is by living up to his standards: uncompromising integrity, never-ending intellectual curiosity, and most importantly, unbounded generosity towards all who crossed his path. I know I will never grow up to be John Hunter, but I know I must never stop trying.

Fernando Pérez

June 27th 2013, SciPy Conference, Austin, Tx.

## June 09, 2016

#### ActivePapers: hello, world

ActivePapers is a technology developed by Konrad Hinsen to store code, data and documentation with several benefits: storage in a single HDF5 file, internal provenance tracking (what code created what data/figure, with a Make-like conditional execution) and a containerized execution environment.

Implementations for the JVM and for Python are provided by the author. In this article, I go over the first steps of creating an ActivePaper. Being a regular user of Python, I cover only this language.

### An overview of ActivePapers

First, a "statement of fact": An ActivePaper is a HDF5 file. That is, it is a binary, self-describing, structured and portable file whose content can be explored with generic tools provided by the HDF Group.

The ActivePapers project is developed by Konrad Hinsen as a vehicle for the publication of computational work. This description is a bit short and does not convey the depth that has gone into the design of ActivePapers, the ActivePapers paper will provide more information.

ActivePapers come, by design, with restrictions on the code that is executed. For instance, only Python code (in the Python implementation) can be used, with the scientific computing module NumPy. All data is accessed via the h5py module. The goals behind these design choices are related to security and to a good definition of the execution environment of the code.

### Creating an ActivePaper

The tutorial on the ActivePapers website start by looking at an existing ActivePaper. I'll go the other way around, as I found it more intuitive. Interactions with an ActivePaper are channeled by the aptool program (see the installation notes).

Currently, ActivePapers lack a "hello, world" program, so here is mine. ActivePapers work best when you dedicate a directory to a single ActivePaper. You may enter the following in a terminal:

mkdir hello_world_ap                 # create a new directory
cd hello_world_ap                    # visit it
aptool -p hello_world.ap create      # This lines create a new file "hello_world.ap"
mkdir code                           # create the "code" directory where you can
# write program that will be stored in the AP
echo "print 'hello, world'" > code/hello.py # create a program
aptool checkin -t calclet code/hello.py     # store the program in the AP


That's is, you have created an ActivePaper!

You can observe its content by issuing

aptool ls                            # inspect the AP


And execute it

aptool run hello                     # run the program in "code/hello.py"


This command looks into the ActivePapers file and not into the directories visible in the filesystem. The filesystem acts more like a staging area.

### A basic computation in ActivePapers

The "hello, world" program above did not perform a computation of any kind. An introductory example for science is the computation of the number $\pi$ by the Monte Carlo method.

I will now create a new ActivePaper (AP) but comment on the specific ways to define parameters, store data and create plots. The dependency on the plotting library matplotlib has to be given when creating the ActivePaper:

mkdir pi_ap
cd pi_ap
aptool -p pi.ap create -d matplotlib


To generate a repeatable result, I store the seed for the random number generator

aptool set seed 1780812262
aptool set N 10000


The line above store a data element in the AP that is of type integer. The value of seed can be accessed in the Python code of the AP.

I will create several programs to mimic the workflow of more complex problems: one to generate the data, one to analyze the data and one for generating a figure.

The first program is generate_random_numbers.py

import numpy as np
from activepapers.contents import data

seed = data['seed'][()]
N = data['N'][()]
np.random.seed(seed)
data['random_numbers'] = np.random.random(size=(N, 2))


Apart from importing the NumPy module, I have also imported the ActivePapers data

from activepapers.contents import data


data is a dict-like interface to the content of the ActivePaper and so only work in code that is checked in the ActivePaper and executed with aptool. data can be used to read values, such a the seed and number of samples, and to store data, such as the samples here.

The [()] returns the value of scalar datasets in HDF5. To have more information on this, see the dataset documentation of h5py.

The second program is compute_pi.py

import numpy as np
from activepapers.contents import data

xy = data['random_numbers'][...]
data['estimator'] = np.cumsum(radius_square < 1) * 4 / np.linspace(1, N, N)


And the third is plot_pi.py

import numpy as np
import matplotlib
matplotlib.use('PDF')
import matplotlib.pyplot as plt
from activepapers.contents import data, open_documentation

estimator = data['estimator']
N = len(estimator)
plt.plot(estimator)
plt.xlabel('Number of samples')
plt.ylabel(r'Estimation of $\pi$')
plt.savefig(open_documentation('pi_figure.pdf', 'w'))


Notice:

1. The setting of the PDF driver for matplotlib before importing matplotlib.pyplot.
2. The use of open_documentation. This function provides file descriptors that can read and write binary blurbs.

Now, you can checkin and run the code

aptool checkin -t calclet code/*.py
aptool run generate_random_numbers
aptool run compute_pi
aptool run plot_pi


### Concluding words

That's it, we have created an ActivePaper and ran code with it.

For fun: issue the command

aptool set seed 1780812263


(or any number of your choosing that is different from the previous one) and then

aptool update


ActivePapers handle dependencies! That's is, everything that depends on the seed will be updated. That include the random numbers, the estimator for pi and the figure. To see the update, check the creation times in the ActivePaper

aptool ls -l


It is good to know that ActivePapers have been used as companions to research articles! See Protein secondary-structure description with a coarse-grained model: code and datasets in ActivePapers format for instance.

You can have a look at the resulting files that I uploaded to Zenodo: doi:10.5281/zenodo.55268

### References

ActivePapers paper K. Hinsen, ActivePapers: a platform for publishing and archiving computer-aided research, F1000Research (2015), 3 289.

ActivePapers website The website for ActivePapers

## June 07, 2016

### Matthieu Brucher

#### Announcement: ATKTransientShaper 1.0.0

I’m happy to announce the release of a mono transient shaper based on the Audio Toolkit. They are available on Windows and OS X (min. 10.11) in different formats.

ATK Transient Shaper

The supported formats are:

• VST2 (32bits/64bits on Windows, 64bits on OS X)
• VST3 (32bits/64bits on Windows, 64bits on OS X)
• Audio Unit (64bits, OS X)

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

## June 03, 2016

### Continuum Analytics news

#### Anaconda Cloud Release v 2.18.0

Posted Friday, June 3, 2016

This is a quick note to let everyone know that we released a new version of Anaconda Cloud today - version 2.18.0 (and the underlying Anaconda Repository server software). It's a minor release, but has some useful new updates:

1. With the release of Pip 8.1.2, package downloads weren't working for some packages. This issue is now resolved. Additional details on this issue here.
2. We've moved our docs from docs.anaconda.org to docs.continuum.io with a new IA and new look & feel.
3. The platform's API now has documentation - available here - more work to do to refine this feature, but the basics are present for an often-requested addition.
4. Of course, the laundry list of bug fixes...

If you run into issues, let us know. Here's the best starting point to help direct issues.

-Team Anaconda

## June 02, 2016

### Continuum Analytics news

#### NAG and Continuum Analytics Partner to Provide Readily Accessible Numerical Algorithms

Posted Thursday, June 2, 2016

Improved Accessibility for NAG’s Mathematical and Statistical Routines for Python Data Scientists

Numerical Algorithms Group (NAG) and Continuum have partnered together to provide conda packages for the NAG Library for Python (nag4py), the Python bindings for the NAG C Library. Users wishing to use the NAG Library with Anaconda can now install the bindings with a simple command (conda install -c nag nag4py) or the Anaconda Navigator GUI.

For those of us who use Anaconda, the leading Open Data Science platform, for package management and virtual environments, this enhancement provides immediate access to the 1,500+ numerical algorithms in the NAG Library. It also means that you can automatically download any future NAG Library updates as they are published on the NAG channel in Anaconda Cloud.

To illustrate how to use the NAG Library for Python, I have created an IPython Notebook1 that demonstrates the use of NAG’s implementation of the PELT algorithm to identify the changepoints of a stock whose price history has been stored in a MongoDB database. Using the example of Volkswagen (VOW), you can clearly see that a changepoint occurred when the news about the recent emissions scandal broke. This is an unsurprising result in this case, but in general, it will not always be as clear when and where a changepoint occurs.

So far, conda packages for the NAG Library for Python have been made available for 64-bit Linux, Mac and Windows platforms. On Linux and Mac, a conda package for the NAG C Library will automatically be installed alongside the Python bindings, so no further configuration is necessary. A Windows conda package for the NAG C Library is coming soon. Until then, a separate installation of the NAG C Library is required. In all cases, the Python bindings require NumPy, so that will also be installed by conda if necessary.

Use of the NAG C Library requires a valid licence key, which is available here: www.nag.com. The NAG Library is also available for a 30-day trial.

1The IPython notebook requires Mark 25, which is currently available on Windows and Linux. The Mac version will be released over the summer.

#### Continuum Analytics Announces Inaugural AnacondaCON Conference

Posted Thursday, June 2, 2016

The brightest minds in Open Data Science will come together in Austin for two days of engaging sessions and panels from industry leaders and networking in February 2017

AUSTIN, Texas. – June 2, 2016 – Continuum Analytics, the creator and driving force behind Anaconda, the leading open source analytics platform powered by Python, today announced the inaugural Anaconda user conference, taking place from February 7-9, 2017 in Austin. AnacondaCON is a two-day event at the JW Marriot that brings together innovative enterprises that are on the journey to Open Data Science to capitalize on their growing treasure trove of data assets to create compelling business value for their enterprise.

From predictive analytics to deep learning, AnacondaCON will help attendees learn how to build data science applications to meet their needs. Attendees will be at varying stages from learning how to start their Open Data Science journey and accelerating it to sharing their experiences. The event will offer Open Data Science advocates an opportunity to engage in breakout sessions, hear from industry experts during keynote sessions, learn about case studies from subject matter experts and choose from specialized and focused sessions based on topic areas of interest.

“We connect regularly with Anaconda fans at many industry and community events worldwide. Now, we’re launching our first ever customer and user conference, AnacondaCON, for our growing and thriving enterprise community to have an informative gathering place to discover, share and engage with similar enterprises,” said Michele Chambers, VP of Products & CMO at Continuum Analytics. “The common thread that links these enterprises together is that they are all passionate about solving business and world problems and see Open Data Science as the answer. At AnacondaCON, they will connect and engage with the innovators and thought leaders behind the Open Data Science movement and learn more about industry trends, best practices and how to harness the power of Open Data Science to meet data-driven goals.”

Attend AnacondaCON

Continuum Analytics’ Anaconda is the leading open data science platform powered by Python. We put superpowers into the hands of people who are changing the world. Anaconda is trusted by leading businesses worldwide and across industries – financial services, government, health and life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda helps data science teams discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage open data science environments and harness the power of the latest open source analytic and technology innovations. Visit http://www.continuum.io.

## Neural Networks in PyMC3 estimated with Variational Inference

(c) 2016 by Thomas Wiecki

There are currently three big trends in machine learning: Probabilistic Programming, Deep Learning and "Big Data". Inside of PP, a lot of innovation is in making things scale using Variational Inference. In this blog post, I will show how to use Variational Inference in PyMC3 to fit a simple Bayesian Neural Network. I will also discuss how bridging Probabilistic Programming and Deep Learning can open up very interesting avenues to explore in future research.

### Probabilistic Programming at scale

Probabilistic Programming allows very flexible creation of custom probabilistic models and is mainly concerned with insight and learning from your data. The approach is inherently Bayesian so we can specify priors to inform and constrain our models and get uncertainty estimation in form of a posterior distribution. Using MCMC sampling algorithms we can draw samples from this posterior to very flexibly estimate these models. PyMC3 and Stan are the current state-of-the-art tools to consruct and estimate these models. One major drawback of sampling, however, is that it's often very slow, especially for high-dimensional models. That's why more recently, variational inference algorithms have been developed that are almost as flexible as MCMC but much faster. Instead of drawing samples from the posterior, these algorithms instead fit a distribution (e.g. normal) to the posterior turning a sampling problem into and optimization problem. ADVI -- Automatic Differentation Variational Inference -- is implemented in PyMC3 and Stan, as well as a new package called Edward which is mainly concerned with Variational Inference.

Unfortunately, when it comes traditional ML problems like classification or (non-linear) regression, Probabilistic Programming often plays second fiddle (in terms of accuracy and scalability) to more algorithmic approaches like ensemble learning (e.g. random forests or gradient boosted regression trees).

### Deep Learning

Now in its third renaissance, deep learning has been making headlines repeatadly by dominating almost any object recognition benchmark, kicking ass at Atari games, and beating the world-champion Lee Sedol at Go. From a statistical point, Neural Networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with AutoEncoders and in all sorts of other interesting ways (e.g. Recurrent Networks, or MDNs to estimate multimodal distributions). Why do they work so well? No one really knows as the statistical properties are still not fully understood.

A large part of the innoviation in deep learning is the ability to train these extremely complex models. This rests on several pillars:

• Speed: facilitating the GPU allowed for much faster processing.
• Software: frameworks like Theano and TensorFlow allow flexible creation of abstract models that can then be optimized and compiled to CPU or GPU.
• Learning algorithms: training on sub-sets of the data -- stochastic gradient descent -- allows us to train these models on massive amounts of data. Techniques like drop-out avoid overfitting.
• Architectural: A lot of innovation comes from changing the input layers, like for convolutional neural nets, or the output layers, like for MDNs.

### Bridging Deep Learning and Probabilistic Programming

On one hand we Probabilistic Programming which allows us to build rather small and focused models in a very principled and well-understood way to gain insight into our data; on the other hand we have deep learning which uses many heuristics to train huge and highly complex models that are amazing at prediction. Recent innovations in variational inference allow probabilistic programming to scale model complexity as well as data size. We are thus at the cusp of being able to combine these two approaches to hopefully unlock new innovations in Machine Learning. For more motivation, see also Dustin Tran's recent blog post.

While this would allow Probabilistic Programming to be applied to a much wider set of interesting problems, I believe this bridging also holds great promise for innovations in Deep Learning. Some ideas are:

• Uncertainty in predictions: As we will see below, the Bayesian Neural Network informs us about the uncertainty in its predictions. I think uncertainty is an underappreciated concept in Machine Learning as it's clearly important for real-world applications. But it could also be useful in training. For example, we could train the model specifically on samples it is most uncertain about.
• Uncertainty in representations: We also get uncertainty estimates of our weights which could inform us about the stability of the learned representations of the network.
• Regularization with priors: Weights are often L2-regularized to avoid overfitting, this very naturally becomes a Gaussian prior for the weight coefficients. We could, however, imagine all kinds of other priors, like spike-and-slab to enforce sparsity (this would be more like using the L1-norm).
• Transfer learning with informed priors: If we wanted to train a network on a new object recognition data set, we could bootstrap the learning by placing informed priors centered around weights retrieved from other pre-trained networks, like GoogLeNet.
• Hierarchical Neural Networks: A very powerful approach in Probabilistic Programming is hierarchical modeling that allows pooling of things that were learned on sub-groups to the overall population (see my tutorial on Hierarchical Linear Regression in PyMC3). Applied to Neural Networks, in hierarchical data sets, we could train individual neural nets to specialize on sub-groups while still being informed about representations of the overall population. For example, imagine a network trained to classify car models from pictures of cars. We could train a hierarchical neural network where a sub-neural network is trained to tell apart models from only a single manufacturer. The intuition being that all cars from a certain manufactures share certain similarities so it would make sense to train individual networks that specialize on brands. However, due to the individual networks being connected at a higher layer, they would still share information with the other specialized sub-networks about features that are useful to all brands. Interestingly, different layers of the network could be informed by various levels of the hierarchy -- e.g. early layers that extract visual lines could be identical in all sub-networks while the higher-order representations would be different. The hierarchical model would learn all that from the data.
• Other hybrid architectures: We can more freely build all kinds of neural networks. For example, Bayesian non-parametrics could be used to flexibly adjust the size and shape of the hidden layers to optimally scale the network architecture to the problem at hand during training. Currently, this requires costly hyper-parameter optimization and a lot of tribal knowledge.

## Bayesian Neural Networks in PyMC3¶

### Generating data

First, lets generate some toy data -- a simple binary classification problem that's not linearly separable.

In [1]:
%matplotlib inline
import pymc3 as pm
import theano.tensor as T
import theano
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn.cross_validation import train_test_split
from sklearn.datasets import make_moons

In [2]:
X, Y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X = scale(X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.5)

In [3]:
fig, ax = plt.subplots()
ax.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0')
ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r', label='Class 1')
sns.despine(); ax.legend()
ax.set(xlabel='X', ylabel='Y', title='Toy binary classification data set');


### Model specification

A neural network is quite simple. The basic unit is a perceptron which is nothing more than logistic regression. We use many of these in parallel and then stack them up to get hidden layers. Here we will use 2 hidden layers with 5 neurons each which is sufficient for such a simple problem.

In [17]:
# Trick: Turn inputs and outputs into shared variables.
# It's still the same thing, but we can later change the values of the shared variable
# (to switch in the test-data later) and pymc3 will just use the new data.
# Kind-of like a pointer we can redirect.
ann_input = theano.shared(X_train)
ann_output = theano.shared(Y_train)

n_hidden = 5

# Initialize random weights between each layer
init_1 = np.random.randn(X.shape[1], n_hidden)
init_2 = np.random.randn(n_hidden, n_hidden)
init_out = np.random.randn(n_hidden)

with pm.Model() as neural_network:
# Weights from input to hidden layer
weights_in_1 = pm.Normal('w_in_1', 0, sd=1,
shape=(X.shape[1], n_hidden),
testval=init_1)

# Weights from 1st to 2nd layer
weights_1_2 = pm.Normal('w_1_2', 0, sd=1,
shape=(n_hidden, n_hidden),
testval=init_2)

# Weights from hidden layer to output
weights_2_out = pm.Normal('w_2_out', 0, sd=1,
shape=(n_hidden,),
testval=init_out)

# Build neural-network using tanh activation function
act_1 = T.tanh(T.dot(ann_input,
weights_in_1))
act_2 = T.tanh(T.dot(act_1,
weights_1_2))
act_out = T.nnet.sigmoid(T.dot(act_2,
weights_2_out))

# Binary classification -> Bernoulli likelihood
out = pm.Bernoulli('out',
act_out,
observed=ann_output)


That's not so bad. The Normal priors help regularize the weights. Usually we would add a constant b to the inputs but I omitted it here to keep the code cleaner.

### Variational Inference: Scaling model complexity

We could now just run a MCMC sampler like NUTS which works pretty well in this case but as I already mentioned, this will become very slow as we scale our model up to deeper architectures with more layers.

Instead, we will use the brand-new ADVI variational inference algorithm which was recently added to PyMC3. This is much faster and will scale better. Note, that this is a mean-field approximation so we ignore correlations in the posterior.

In [34]:
%%time
with neural_network:
# Run ADVI which returns posterior means, standard deviations, and the evidence lower bound (ELBO)

Iteration 0 [0%]: ELBO = -368.86
Iteration 5000 [10%]: ELBO = -185.65
Iteration 10000 [20%]: ELBO = -197.23
Iteration 15000 [30%]: ELBO = -203.2
Iteration 20000 [40%]: ELBO = -192.46
Iteration 25000 [50%]: ELBO = -198.8
Iteration 30000 [60%]: ELBO = -183.39
Iteration 35000 [70%]: ELBO = -185.04
Iteration 40000 [80%]: ELBO = -187.56
Iteration 45000 [90%]: ELBO = -192.32
Finished [100%]: ELBO = -225.56
CPU times: user 36.3 s, sys: 60 ms, total: 36.4 s
Wall time: 37.2 s



< 40 seconds on my older laptop. That's pretty good considering that NUTS is having a really hard time. Further below we make this even faster. To make it really fly, we probably want to run the Neural Network on the GPU.

As samples are more convenient to work with, we can very quickly draw samples from the variational posterior using sample_vp() (this is just sampling from Normal distributions, so not at all the same like MCMC):

In [35]:
with neural_network:
trace = pm.variational.sample_vp(v_params, draws=5000)


Plotting the objective function (ELBO) we can see that the optimization slowly improves the fit over time.

In [36]:
plt.plot(v_params.elbo_vals)
plt.ylabel('ELBO')
plt.xlabel('iteration')

Out[36]:
<matplotlib.text.Text at 0x7fa5dae039b0>


Now that we trained our model, lets predict on the hold-out set using a posterior predictive check (PPC). We use sample_ppc() to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation).

In [7]:
# Replace shared variables with testing set
ann_input.set_value(X_test)
ann_output.set_value(Y_test)

# Creater posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)

# Use probability of > 0.5 to assume prediction of class 1
pred = ppc['out'].mean(axis=0) > 0.5

In [8]:
fig, ax = plt.subplots()
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
sns.despine()
ax.set(title='Predicted labels in testing set', xlabel='X', ylabel='Y');

In [9]:
print('Accuracy = {}%'.format((Y_test == pred).mean() * 100))

Accuracy = 94.19999999999999%



Hey, our neural network did all right!

## Lets look at what the classifier has learned

For this, we evaluate the class probability predictions on a grid over the whole input space.

In [10]:
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
dummy_out = np.ones(grid.shape[1], dtype=np.int8)

In [11]:
ann_input.set_value(grid_2d)
ann_output.set_value(dummy_out)

# Creater posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)


### Probability surface¶

In [26]:
cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(*grid, ppc['out'].mean(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Posterior predictive mean probability of class label = 0');


### Uncertainty in predicted value

So far, everything I showed we could have done with a non-Bayesian Neural Network. The mean of the posterior predictive for each class-label should be identical to maximum likelihood predicted values. However, we can also look at the standard deviation of the posterior predictive to get a sense for the uncertainty in our predictions. Here is what that looks like:

In [27]:
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(*grid, ppc['out'].std(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Uncertainty (posterior predictive standard deviation)');


We can see that very close to the decision boundary, our uncertainty as to which label to predict is highest. You can imagine that associating predictions with uncertainty is a critical property for many applications like health care. To further maximize accuracy, we might want to train the model primarily on samples from that high-uncertainty region.

## Mini-batch ADVI: Scaling data size

So far, we have trained our model on all data at once. Obviously this won't scale to something like ImageNet. Moreover, training on mini-batches of data (stochastic gradient descent) avoids local minima and can lead to faster convergence.

Fortunately, ADVI can be run on mini-batches as well. It just requires some setting up:

In [43]:
# Set back to original data to retrain
ann_input.set_value(X_train)
ann_output.set_value(Y_train)

# Tensors and RV that will be using mini-batches
minibatch_tensors = [ann_input, ann_output]
minibatch_RVs = [out]

# Generator that returns mini-batches in each iteration
def create_minibatch(data):
rng = np.random.RandomState(0)

while True:
# Return random data samples of set size 100 each iteration
ixs = rng.randint(len(data), size=50)
yield data[ixs]

minibatches = [
create_minibatch(X_train),
create_minibatch(Y_train),
]

total_size = len(Y_train)


While the above might look a bit daunting, I really like the design. Especially the fact that you define a generator allows for great flexibility. In principle, we could just pool from a database there and not have to keep all the data in RAM.

Lets pass those to advi_minibatch():

In [48]:
%%time
with neural_network:
n=50000, minibatch_tensors=minibatch_tensors,
minibatch_RVs=minibatch_RVs, minibatches=minibatches,
total_size=total_size, learning_rate=1e-2, epsilon=1.0
)

Iteration 0 [0%]: ELBO = -311.63
Iteration 5000 [10%]: ELBO = -162.34
Iteration 10000 [20%]: ELBO = -70.49
Iteration 15000 [30%]: ELBO = -153.64
Iteration 20000 [40%]: ELBO = -164.07
Iteration 25000 [50%]: ELBO = -135.05
Iteration 30000 [60%]: ELBO = -240.99
Iteration 35000 [70%]: ELBO = -111.71
Iteration 40000 [80%]: ELBO = -87.55
Iteration 45000 [90%]: ELBO = -97.5
Finished [100%]: ELBO = -75.31
CPU times: user 17.4 s, sys: 56 ms, total: 17.5 s
Wall time: 17.5 s


In [49]:
with neural_network:
trace = pm.variational.sample_vp(v_params, draws=5000)

In [50]:
plt.plot(v_params.elbo_vals)
plt.ylabel('ELBO')
plt.xlabel('iteration')
sns.despine()


As you can see, mini-batch ADVI's running time is much lower. It also seems to converge faster.

For fun, we can also look at the trace. The point is that we also get uncertainty of our Neural Network weights.

In [51]:
pm.traceplot(trace);


## Summary

Hopefully this blog post demonstrated a very powerful new inference algorithm available in PyMC3: ADVI. I also think bridging the gap between Probabilistic Programming and Deep Learning can open up many new avenues for innovation in this space, as discussed above. Specifically, a hierarchical neural network sounds pretty bad-ass. These are really exciting times.

## Next steps

Theano, which is used by PyMC3 as its computational backend, was mainly developed for estimating neural networks and there are great libraries like Lasagne that build on top of Theano to make construction of the most common neural network architectures easy. Ideally, we wouldn't have to build the models by hand as I did above, but use the convenient syntax of Lasagne to construct the architecture, define our priors, and run ADVI.

While we haven't successfully run PyMC3 on the GPU yet, it should be fairly straight forward (this is what Theano does after all) and further reduce the running time significantly. If you know some Theano, this would be a great area for contributions!

You might also argue that the above network isn't really deep, but note that we could easily extend it to have more layers, including convolutional ones to train on more challenging data sets.

I also presented some of this work at PyData London, view the video below:

## Acknowledgements

Taku Yoshioka did a lot of work on ADVI in PyMC3, including the mini-batch implementation as well as the sampling from the variational posterior. I'd also like to the thank the Stan guys (specifically Alp Kucukelbir and Daniel Lee) for deriving ADVI and teaching us about it. Thanks also to Chris Fonnesbeck, Andrew Campbell, Taku Yoshioka, and Peadar Coyle for useful comments on an earlier draft.

## May 31, 2016

### Continuum Analytics news

#### Technical Collaboration Expanding Anaconda Ecosystem

Posted Tuesday, May 31, 2016

Intel and Continuum Analytics Work Together to Extend the Power of Python-based Analytics Across the Enterprise

PYCON 2016, PORTLAND, Ore—May 31, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading open data science platform powered by Python, welcomes Intel into the Anaconda ecosystem. Intel has adopted the Anaconda packaging and distribution and is working with Continuum to provide interoperability.

By offering Anaconda as the foundational high-performance Python distribution, Intel is empowering enterprises to more quickly build open analytics applications that drive immediate business value. Organizations can now combine the power of the Intel® Math Kernel Library (MKL) and Anaconda’s Python-based data science to build the high performance analytic modeling and visualization applications required to compete in today’s data-driven economies.

“We have been working closely with Continuum Analytics to bring the capabilities of Anaconda to the Intel Distribution for Python. We include conda, making it easier to install conda packages and create conda environments. You now have easy access to the large and growing set of packages available on Anaconda Cloud,” said Robert Cohn, Engineering Director for Intel’s Scripting and Analysis Tools in his recently posted blog.

“We are in the midst of a computing revolution where intelligent data-driven decisions will drive our every move––in business and at home. To unleash the floodgates to value, we need to make data science fast, accessible and open to everyone,” said Michele Chambers, VP of Products & CMO at Continuum Analytics. “Python is the defacto data science language that everyone from elementary to graduate school is using because it’s so easy to get started and powerful enough to drive highly complex analytics. Anaconda turbo boosts analytics without adding any complexity.”

Without optimization, high-level languages like Python lack the performance needed to analyze increasingly large data sets. The platform includes packages and technology that are accessible to beginner Python developers and powerful enough to tackle data science projects for Big Data. Anaconda offers support for advanced analytics, numerical computing, just-in-time compilation, profiling, parallelism, interactive visualization, collaboration and other analytic needs. Customers have experienced up to 100X performance increases with Anaconda.

Anaconda Cloud is a package management service that makes it easy to find, access, store and share public and private notebooks, environments, conda and PyPI packages. The Anaconda Cloud also keeps up with updates made to the packages and environments being used. Users are able to build packages using the Anaconda client command line interface (CLI), then manually or automatically upload the packages to Anaconda Cloud to quickly share with others or access from anywhere. The Intel channel on Anaconda Cloud is where users can go to get optimized packages that Intel is providing.

“Companies like Intel, Microsoft and Cloudera are making Open Data Science more accessible to enterprises. We are mutually committed to ensuring customers get access to open and transparent technology advances,” said Travis Oliphant, CEO and co-founder at Continuum Analytics. “Our technical collaborations with Intel and Open Data Science members are expanding and fueling the next generation of high performance computing for data science. Customers can now leverage their Intel-powered computing clusters––with or without Hadoop––along with a supercharged Python distribution to propel their organizations forward and capitalize on their ever growing data assets.”

Anaconda also powers Python for Microsoft’s Azure ML platform and Continuum recently partnered with Cloudera on a certified Cloudera parcel.

Continuum Analytics is the creator and driving force behind Anaconda, the leading, open data science platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team – data scientists, developers, DevOps, data engineers and business analysts – to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

## May 29, 2016

### Matthieu Brucher

#### On modeling posts

I’m currently considering whether I should do more posts on preamps modeling or just keep implementing filters/plugins. Of course, it’s not one or the other, there are different options in this poll:

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.

So the idea is to ask my readers what they actually want. I can explain how the new triodes filters are implemented, how they behave, but I can also add new filters in Audio Toolkit (based on different preamp and amp stages, dedicated to guitars, bass, other instruments), try to optimize them, and finally I can include them in new plugins that could be used by users. Or I can do something completely different.

So if you have any ideas, feel free to say so!

## May 27, 2016

### Continuum Analytics news

#### Taking the Wheel: How Open Source is Driving Data Science

Posted Friday, May 27, 2016

The world is a big, exciting place—and thanks to cutting-edge technology, we now have amazing ways to explore its many facets. Today, self-driving cars, bullet trains and even private rocket ships allow humans to travel anywhere faster, more safely and more efficiently than ever before.

But technology's impact on our exploratory abilities isn't just limited to transportation: it's also revolutionizing how we navigate the Data Science landscape. More companies are moving toward Open Data Science and the open source technology that underlies it. As a result, we now have an amazing new fleet of vehicles for our data-related excursions.

We're no longer constrained to the single railroad track or state highway of a proprietary analytics product. We can use hundreds of freely available open source libraries for any need: web scraping, ingesting and cleaning data, visualization, predictive analytics, report generation, online integration and more. With these tools, any corner of the Data Science map—astrophysics, financial services, public policy, you name it—can be reached nimbly and efficiently.

But even in this climate of innovation, nobody can afford to completely abandon previous solutions and traditional approaches still remain viable. Fortunately, graceful interoperability is one of the hallmarks of Open Data Science. In appropriate scenarios, it accommodates the blending of legacy code or proprietary products with open source solutions. After all, sometimes taking the train is necessary and even preferable.

Regardless of which technology teams use, the open nature of Open Data Science allows you to travel across the data terrain in a way that is transparent and accessible for all participants.

Data Science in Overdrive

Let's take a look at six specific ways Open Data Science is propelling analytics for small and large teams.

1. Community. Open Data Science prioritizes inclusivity; community involvement is a big reason that open source software has boomed in recent years. Communities can test out new software faster and more thoroughly than any one vendor, accelerating innovation and remediation of any bugs.

Today, the open source software repository, GitHub, is home to more than 5 million open source projects and thousands of distinct communities. One such community is conda-forge, a community of developers that build infrastructure and packages for the conda package manager, a general, cross-platform and cross-language package manager with a large and growing number of data science packages available. Considering that Python is the most popular language in computer science classrooms at U.S. universities, open source communities will only continue to grow.

2. Innovation. The Open Data Science movement recognizes that no one software vendor has all the answers. Instead, it embraces the large—and growing—community of bright minds that are constantly working to build new solutions to age-old challenges.

Because of its adherence to free or low-cost technologies, non-restrictive licensing and shareable code, Open Data Science offers developers unparalleled flexibility to experiment and create innovative software.

One example of the innovation that is possible with Open Data Science is taxcalc, an Open Source Policy Modeling Center project publically available via TaxBrain. Using open source software, the project brought developers from around the globe together to create a new kind of tax policy analysis. This software has the computational power to process the equivalent of more than 120 million tax returns, yet is easy-to-use and accessible to private citizens, policy professionals and journalists alike.

3. Inclusiveness. The Open Data Science movement unites dozens of different technologies and languages under a single umbrella. Data science is a team sport and the Open Data Science movement recognizes that complex projects require a multitude of tools and approaches.

This is why Open Data Science brings together leading open source data science tools under a single roof. It welcomes languages ranging from Python and R to FORTRAN and it provides a common base for data scientists, business analysts and domain experts like economists or biologists.

What's more, it can integrate legacy code or enterprise projects with newly developed code, allowing teams to take the most expedient path to solve their challenges. For example, with the conda package management system, developers can create conda packages from legacy code, allowing integration into a custom analytics platform with newer open source code. In fact, libraries like SciPy already leverage highly optimized legacy FORTRAN code.

4. Visualizations. Visualization has come a long way in the last decade, but many visualization technologies have been focused on reporting and static dashboards. Open Data Science; however, has unveiled intelligent web apps that offer rich, browser-based interactive visualizations, such as those produced with Bokeh. Visualizations empower data scientists and business executives to explore their data, revealing subtle nuances and hidden patterns.

One visualization solution, Anaconda's Datashader library, is a Big Data visualizer that plays to the strengths of the human visual system. The Datashader library—alongside the Bokeh visualization library—offers a clever solution to the problem of plotting an enormous number of points in a relatively limited number of pixels.

Another choice for data scientists is the D3 Javascript library, which exploded the number of visual tools for data. With wrappers for Python and other languages, D3 has prompted a real renaissance in data visualization.

5. Deep Learning. One of the hottest branches of data science is deep learning, a sub-segment of machine learning based on algorithms that work to model data abstractions using a multitude of processing layers. Open source technology, such as that embraced by Open Data Science, is critical to its expansion and improvement.

Some of the new entrants to the field—all of which are now open source—are Google's TensorFlow project, the Caffe deep learning framework, Microsoft's Computational Network Toolkit (CNTK), Amazon's Deep Scalable Sparse Tensor Network Engine (DSSTNE), Facebook's Torch framework and Nervana's Neon. These products enter a field with many participants like Theano whose Lasagne extension allows easy construction of deep learning models and Berkley's Caffe, which is an open deep learning framework.

These are only some of the most interesting frameworks. There are many others, which is a testament to the lively and burgeoning Open Data Science community and its commitment to innovation and idea sharing allowing for even more future innovation.

6. Interoperability. Traditional, proprietary data science tools typically integrate well only with their own suite. They’re either closed to outside tools or provide inferior, slow methods of integration. Open Data Science, by contrast, rejects these restrictions, instead allowing diverse tools to cooperate and interact in every more closely connected ways.

For example, Anaconda, includes open source distributions of the Python and R languages, which interoperate very well together enabling data scientists to use the technologies that make sense for them. For example, a business analyst might start with Excel, then work with predictive models in R and later fire up Tableau for data visualizations. Interoperable tools speed analysis, eliminate the need for switching between multiple toolsets and improve collaboration.

It's clear that open source tools will lead the charge towards innovation in Data Science and many of the top technology companies are moving in this direction. IBM, Microsoft, Google, Facebook, Amazon and others are all joining the Open Data Science revolution, making their technology available with APIs and open source code. This benefits technology companies and individual developers, as it empowers a motivated user base to improve code, create new software and use existing technologies in new contexts.

That's the power of open source software and inclusive Open Data Science platforms like Anaconda. Thankfully, today's user-friendly languages—like Python—make joining this new future easier than ever.

If you're considering open source for your next data project, now’s the time to grab the wheel. Join the Open Data Science movement and shift your analyses into overdrive.

## May 26, 2016

### NeuralEnsemble

#### Updated Docker images for biological neuronal network simulations with Python

The NeuralEnsemble Docker images for biological neuronal network simulations with Python have been updated to contain NEST 2.10, NEURON 7.4, Brian 2.0rc1 and PyNN 0.8.1.

In addition, the default images (which are based on NeuroDebian Jessie) now use Python 3.4. Images with Python 2.7 and Brian 1.4 are also available (using the "py2" tag). There is also an image with older versions (NEST 2.2 and PyNN 0.7.5).

The images are intended as a quick way to get simulation projects up-and-running on Linux, OS X and Windows. They can be used for teaching or as the basis for reproducible research projects that can easily be shared with others.

The images are available on Docker Hub.

To quickly get started, once you have Docker installed, run

docker pull neuralensemble/simulation
docker run -i -t neuralensemble/simulation /bin/bash

For Python 2.7:

docker pull neuralensemble/simulation:py2

For older versions:

docker pull neuralensemble/pynn07

For ssh/X11 support, use the "simulationx" image instead of "simulation". Full instructions are available here.

If anyone would like to help out, or suggest other tools that should be installed, please contact me, or open a ticket on Github.

#### PyNN 0.8.1 released

Having forgotten to blog about the release of PyNN 0.8.0, here is an announcement of PyNN 0.8.1!

For all the API changes between PyNN 0.7 and 0.8 see the release notes for 0.8.0. The main change with PyNN 0.8.1 is support for NEST 2.10.

PyNN 0.8.1 can be installed with pip from PyPI.

### What is PyNN?

PyNN (pronounced 'pine' ) is a simulator-independent language for building neuronal network models.

In other words, you can write the code for a model once, using the PyNN API and the Python programming language, and then run it without modification on any simulator that PyNN supports (currently NEURON, NEST and Brian as well as the SpiNNaker and BrainScaleS neuromorphic hardware systems).

Even if you don't wish to run simulations on multiple simulators, you may benefit from writing your simulation code using PyNN's powerful, high-level interface. In this case, you can use any neuron or synapse model supported by your simulator, and are not restricted to the standard models.

The code is released under the CeCILL licence (GPL-compatible).

### Paul Ivanov

#### in transit

Standing impatient, platform teeming, almost noon
Robo voices read off final destinations
But one commuter's already at his
He reached for life's third rail

There is no why in the abyss
There's only closing credit hiss
The soundtrack's gone, he didn't miss
Reaching for life's third rail

We ride on, now, relieved and moving forward
Each our own lives roll forth, for now
But now is gone, for one among us
Who reached for life's third rail

We rock, to-fro, and reach each station
Weight shifting onto forward foot
Flesh, bone ground up in violent elation
And bloody rags, hours ago a well worn suit

I ride the escalator up and pensive
About what did and not occur today
Commuter glut, flow restricted
A crooked kink in public transport hose resolved.


## Early Experience with Clusters

My first real experience with cluster computing came in 1999 during my graduate school days at the Mayo Clinic.  These were wonderful times.   My advisor was Dr. James Greenleaf.   He was very patient with allowing me to pester a bunch of IT professionals throughout the hospital to collect their aging Mac Performa machines and build my own home-grown cluster.   He also let me use a bunch of space in his ultrasound lab to host the cluster for about 6 months.

#### Building my own cluster

The form-factor for those Mac machines really made it easy to stack them.   I ended up with 28 machines in two stacks with 14 machines in each stack (all plugged into a few power strips and a standard lab-quality outlet).  With the recent release of Yellow-Dog Linux, I wiped the standard OS from all the machines  and installed Linux on all those Macs to create a beautiful cluster of UNIX goodness I could really get excited about.   I called my system "The Orchard" and thought it would be difficult to come up with 28 different kinds of apple varieties to name each machine after.  It wasn't difficult. It turns out there are over 7,500 varieties of apples grown throughout the world.

 Me smiling alongside by smoothly humming "Orchard" of interconnected Macs

The reason I put this cluster together was to simulate Magnetic Resonance Elastography (MRE) which is a technique to visualize motion using Magnetic Resonance Imaging (MRI).  I wanted to simulate the Bloch equations with a classical model for how MRI images are produced.  The goal was to create a simulation model for the MRE experiment that I could then use to both understand the data and perhaps eventually use this model to determine material properties directly from the measurements using Bayesian inversion (ambitiously bypassing the standard sequential steps of inverse FFT and local-frequency estimation).

Now I just had to get all these machines to talk to each other, and then I would be poised to do anything.  I read up a bit on MPI, PVM, and anything else I could find about getting computers to talk to each other.  My unfamiliarity with the field left me puzzled as I tried to learn these frameworks in addition to figuring out how to solve my immediate problem.  Eventually, I just settled down with a trusted UNIX book by the late W. Richard Stevens.    This book explained how the internet works.   I learned enough about TCP/IP and sockets so that I could write my own C++ classes representing the model.  These classes communicated directly with each other over raw sockets.   While using sockets directly was perhaps not the best approach, it did work and helped me understand the internet so much better.  It also makes me appreciate projects like tornado and zmq that much more.

#### Lessons Learned

I ended up with a system that worked reasonably well, and I could simulate MRE to some manner of fidelity with about 2-6 hours of computation. This little project didn't end up being critical to my graduation path and so it was abandoned after about 6 months.  I still value what I learned about C++, how abstractions can ruin performance, how to guard against that, and how to get machines to communicate with each other.

Using Numeric, Python, and my recently-linked ODE library (early SciPy), I built a simpler version of the simulator that was actually faster on one machine than my cluster-version was in C++ on 20+ machines.  I certainly could have optimized the C++ code, but I could have also optimized the Python code.   The Python code took me about 4 days to write, the C++ code took me about 4 weeks.  This experience has markedly influenced my thinking for many years about both pre-mature parallelization and pre-mature use of C++ and other compiled languages.

Fast forward over a decade.   My computer efforts until 2012 were spent on sequential array-oriented programming, creating SciPy, writing NumPy, solving inverse problems, and watching a few parallel computing paradigms emerge while I worked on projects to provide for my family.  I didn't personally get to work on parallel computing problems during that time, though I always dreamed of going back and implementing this MRE simulator using a parallel construct with NumPy and SciPy directly.   When I needed to do the occassional parallel computing example during this intermediate period, I would either use IPython parallel or multi-processing.

## Parallel Plans at Continuum

In 2012, Peter Wang and I started Continuum, created PyData, and released Anaconda.   We also worked closely with members of the community to establish NumFOCUS as an independent organization.  In order to give NumFOCUS the attention it deserved, we hired the indefatigable Leah Silen and donated her time entirely to the non-profit so she could work with the community to grow PyData and the Open Data Science community and ecosystem.  It has been amazing to watch the community-based, organic, and independent growth of NumFOCUS.    It took effort and resources to jump-start,  but now it is moving along with a diverse community driving it.   It is a great organization to join and contribute effort to.

A huge reason we started Continuum was to bring the NumPy stack to parallel computing --- for both scale-up (many cores) and scale-out (many nodes).   We knew that we could not do this alone and it would require creating a company and rallying a community to pull it off.   We worked hard to establish PyData as a conference and concept and then transitioned the effort to the community through NumFOCUS to rally the community behind the long-term mission of enabling data-, quantitative-, and computational-scientists with open-source software.  To ensure everyone in the community could get the software they needed to do data science with Python quickly and painlessly, we also created Anaconda and made it freely available.

In addition to important community work, we knew that we would need to work alone on specific, hard problems to also move things forward.   As part of our goals in starting Continuum we wanted to significantly improve the status of Python in the JVM-centric Hadoop world.   Conda, Bokeh, Numba, and Blaze were the four technologies we started specifically related to our goals as a company beginning in 2012.   Each had a relationship to parallel computing including Hadoop.

Conda enables easy creation and replication of environments built around deep and complex software dependencies that often exist in the data-scientist workflow.   This is a problem on a single node --- it's an even bigger problem when you want that environment easily updated and replicated across a cluster.

Bokeh  allows visualization-centric applications backed by quantitative-science to be built easily in the browser --- by non web-developers.   With the release of Bokeh 0.11 it is extremely simple to create visualization-centric-web-applications and dashboards with simple Python scripts (or also R-scripts thanks to rBokeh).

With Bokeh, Python data scientists now have the power of both d3 and Shiny, all in one package. One of the driving use-cases of Bokeh was also easy visualization of large data.  Connecting the visualization pipeline with large-scale cluster processing was always a goal of the project.   Now, with datashader, this goal is now also being realized to visualize billions of points in seconds and display them in the browser.

Our scale-up computing efforts centered on the open-source Numba project as well as our Accelerate product.  Numba has made tremendous progress in the past couple of years, and is in production use in multiple places.   Many are taking advantage of numba.vectorize to create array-oriented solutions and program the GPU with ease.   The CUDA Python support in Numba makes it the easiest way to program the GPU that I'm aware of.  The CUDA simulator provided in Numba makes it much simpler to debug in Python the logic of CUDA-based GPU programming.  The addition of parallel-contexts to numba.vectorize mean that any many-core architecture can now be exploited in Python easily.   Early HSA support is also in Numba now meaning that Numba can be used to program novel hardware devices from many vendors.

### Summarizing Blaze

The ambitious Blaze project will require another blog-post to explain its history and progress well. I will only try to summarize the project and where it's heading.  Blaze came out of a combination of deep experience with industry problems in finance, oil&gas, and other quantitative domains that would benefit from a large-scale logical array solution that was easy to use and connected with the Python ecosystem.    We observed that the MapReduce engine of Hadoop was definitely not what was needed.  We were also aware of Spark and RDD's but felt that they too were also not general enough (nor flexible enough) for the demands of distributed array computing we encountered in those fields.

#### DyND, Datashape, and a vision for the future of Array-computing

After early work trying to extend the NumPy code itself led to struggles because of both the organic complexity of the code base and the stability needs of a mature project, the Blaze effort started with an effort to re-build the core functionality of NumPy and Pandas to fix some major warts of NumPy that had been on my mind for some time.   With Continuum support, Mark Wiebe decided to continue to develop a C++ library that could then be used by Python and any-other data-science language (DyND).   This necessitated defining a new data-description language (datashape) that generalizes NumPy's dtype to structures of arrays (column-oriented layout) as well as variable-length strings and categorical types.   This work continues today and is making rapid progress which I will leave to others to describe in more detail.  I do want to say, however, that dynd is implementing my "Pluribus" vision for the future of array-oriented computing in Python.   We are factoring the core capability into 3 distinct parts:  the type-system (or data-declaration system), a generalized function mechanism that can interact with any "typed" memory-view or "typed" buffer, and finally the container itself.   We are nearing release of a separated type-library and are working on a separate C-API to the generalized function mechanism.   This is where we are heading and it will allow maximum flexibility and re-use in the dynamic and growing world of Python and data-analysis.   The DyND project is worth checking out right now (if you have desires to contribute) as it has made rapid progress in the past 6 months.

As we worked on the distributed aspects of Blaze it centered on the realization that to scale array computing to many machines you fundamentally have to move code and not data.   To do this well means that how the computer actually sees and makes decisions about the data must be exposed.  This information is usually part of the type system that is hidden either inside the compiler, in the specifics of the data-base schema, or implied as part of the runtime.   To fundamentally solve the problem of moving code to data in a general way, a first-class and wide-spread data-description language must be created and made available.   Python users will recognize that a subset of this kind of information is contained in the struct module (the struct "format" strings), in the Python 3 extended buffer protocol definition (PEP 3118), and in NumPy's dtype system.   Extending these concepts to any language is the purpose of datashape.

In addition, run-times that understand this information and can execute instructions on variables that expose this information must be adapted or created for every system.  This is part of the motivation for DyND and why very soon the datashape system and its C++ type library will be released independently from the rest of DyND and Blaze.   This is fundamentally why DyND and datashape are such important projects to me.  I see in them the long-term path to massive code-reuse, the breaking down of data-silos that currently cause so much analytics algorithm duplication and lack of cooperation.

Simple algorithms from data-munging scripts to complex machine-learning solutions must currently be re-built for every-kind of data-silo unless there is a common way to actually functionally bring code to data.  Datashape and the type-library runtime from DyND (ndt) will allow this future to exist.   I am eager to see the Apache Arrow project succeed as well because it has related goals (though more narrowly defined).

The next step in this direction is an on-disk and in-memory data-fabric that allows data to exist in a distributed file-system or a shared-memory across a cluster with a pointer to the head of that data along with a data-shape description of how to interpret that pointer so that any language that can understand the bytes in that layout can be used to execute analytics on those bytes.  The C++ type run-time stands ready to support any language that wants to parse and understand data-shape-described pointers in this future data-fabric.

From one point of view, this DyND and data-fabric effort are a natural evolution of the efforts I started in 1998 that led to the creation of SciPy and NumPy.  We built a system that allows existing algorithms in C/C++ and Fortran to be applied to any data in Python.   The evolution of that effort will allow algorithms from many other languages to be applied to any data in memory across a cluster.

#### Blaze Expressions and Server

The key part of Blaze that is also important to mention is the notion of the Blaze server and user-facing Blaze expressions and functions.   This is now what Blaze the project actually entails --- while other aspects of Blaze have been pushed into their respective projects.  Functionally, the Blaze server allows the data-fabric concept on a machine or a cluster of machines to be exposed to the rest of the internet as a data-url (e.g. http://mydomain.com/catalog/datasource/slice).   This data-url can then be consumed as a variable in a Blaze expression --- first across entire organizations and then across the world.

This is the truly exciting part of Blaze that would enable all the data in the world to be as accessible as an already-loaded data-frame or array.  The logical expressions and transformations you can then write on those data to be your "logical computer" will then be translated at compute time to the actual run-time instructions as determined by the Blaze server which is mediating communication with various backends depending on where the data is actually located.   We are realizing this vision on many data-sets and a certain set of expressions already with a growing collection of backends.   It is allowing true "write-once and run anywhere" to be applied to data-transformations and queries and eventually data-analytics.     Currently, the data-scientists finds herself to be in a situation similar to the assembly programmer in the 1960s who had to know what machine the code would run on before writing the code.   Before beginning a data analytics task, you have to determine which data-silo the data is located in before tackling the task.  SQL has provided a database-agnostic layer for years, but it is too limiting for advanced analytics --- and user-defined functions are still database specific.

Continuum's support of blaze development is currently taking place as defined by our consulting customers as well as by the demands of our Anaconda platform and the feature-set of an exciting new product for the Anaconda Platform that will be discussed in the coming weeks and months. This new product will provide a simplified graphical user-experience on top of Blaze expressions, and Bokeh visualizations for rapidly connecting quantitative analysts to their data and allowing explorations that retain provenance and governance.  General availability is currently planned for August.

Blaze also spawned additional efforts around fast compressed storage of data (blz which formed the inspiration and initial basis for bcolz) and experiments with castra as well as a popular and straight-forward tool for quickly copying data from one data-silo kind to another (odo).

The most important development to come out of Blaze, however, will have tremendous impact in the short term well before the full Blaze vision is completed.  This project is Dask and I'm excited for what Dask will bring to the community in 2016.   It is helping us finally deliver on scaled-out NumPy / Pandas and making Anaconda a first-class citizen in Hadoop.

In 2014, Matthew Rocklin started working at Continuum on the Blaze team.   Matthew is the well-known author of many functional tools for Python.  He has a great blog you should read regularly.   His first contribution to Blaze was to adapt a multiple-dispatch system he had built which formed the foundation of both odo and Blaze.  He also worked with Andy Terrel and Phillip Cloud to clarify the Blaze library as a front-end to multiple backends like Spark, Impala, Mongo, and NumPy/Pandas.

With these steps taken, it was clear that the Blaze project needed its own first-class backend as well something that the community could rally around to ensure that Python remained a first-class participant in the scale-out conversation --- especially where systems that connected with Hadoop were being promoted.  Python should not ultimately be relegated to being a mere front-end system that scripts Spark or Hadoop --- unable to talk directly to the underlying data.    This is not how Python achieved its place as a de-facto data-science language.  Python should be able to access and execute on the data directly inside Hadoop.

Getting there took time.  The first version of dask was released in early 2015 and while distributed work-flows were envisioned, the first versions were focused on out-of-core work-flows --- allowing problem-sizes that were too big to fit in memory to be explored with simple pandas-like and numpy-like APIs.

When Matthew showed me his first version of dask, I was excited.  I loved three things about it:  1) It was simple and could, therefore, be used as a foundation for parallel PyData.  2) It leveraged already existing code and infrastructure in NumPy and Pandas.  3) It had very clean separation between collections like arrays and data-frames, the directed graph representation, and the schedulers that executed those graphs.   This was the missing piece we needed in the Blaze ecosystem.   I immediately directed people on the Blaze team to work with Matt Rocklin on Dask and asked Matt to work full-time on it.

He and the team made great progress and by summer of 2015 had a very nice out-of-core system working with two functioning parallel-schedulers (multi-processing and multi-threaded).  There was also a "synchronous" scheduler that could be used for debugging the graph and the system showed well enough throughout 2015 to start to be adopted by other projects (scikit-image and xarray).

In the summer of 2015, Matt began working on the distributed scheduler.  By fall of 2015, he had a very nice core system leveraging the hard work of the Python community.   He built the API around the concepts of asynchronous computing already being promoted in Python 3 (futures) and built dask.distributed on top of tornado.   The next several months were spent improving the scheduler by exposing it to as many work-flows as possible from computational-science, quantitative-science and computational-science.   By February of 2016, the system was ready to be used by a variety of people interested in distributed computing with Python.   This process continues today.

Using dask.dataframes and dask.arrays you can quickly build array- and table-based work-flows with a Pandas-like and NumPy-like syntax respectively that works on data sitting across a cluster.

Anaconda and the PyData ecosystem now had another solution for the scale-out problem --- one whose design and implementation was something I felt could be a default run-time backend for Blaze.  As a result, I could get motivated to support, market, and seek additional funding for this effort.  Continuum has received some DARPA funding under the XDATA program.  However, this money was now spread pretty thin among Bokeh, Numba, Blaze, and now Dask.

With the distributed scheduler basically working and beginning to improve, two problems remained with respect to Hadoop interoperability: 1) direct access to the data sitting in HDFS and 2) interaction with the resource schedulers running most Hadoop clusters (YARN or mesos).

To see how important the next developments are, it is useful to describe an anecdote from early on in our XDATA experience.  In the summer of 2013, when the DARPA XDATA program first kicked-off, the program organizers had reserved a large Hadoop cluster (which even had GPUs on some of the nodes).  They loaded many data sets onto the cluster and communicated about its existence to all of the teams who had gathered to collaborate on getting insights out of "Big Data."    However, a large number of the people collaborating were using Python, R, or C++.  To them the Hadoop cluster was inaccessible as there was very little they could use to interact with the data stored in HDFS (beyond some high-latency and low-bandwidth streaming approaches) and nothing they could do to interact with the scheduler directly (without writing Scala or Java code). The Hadoop cluster sat idle for most of the summer while teams scrambled to get their own hardware to run their code on and deliver their results.

This same situation we encountered in 2013 exists in many organizations today.  People have large Hadoop infrastructures, but are not connecting that infrastructure effectively to their data-scientists who are more comfortable in Python, R, or some-other high-level (non JVM language).

With dask working reasonably well, tackling this data-connection problem head on became an important part of our Anaconda for Hadoop story and so in December of 2015 we began two initiatives to connect Anaconda directly to Hadoop.   Getting data from HDFS turned out to be much easier than we had initially expected because of the hard-work of many others.    There had been quite a bit of work building a C++ interface to Hadoop at Pivotal that had culminated in a library called libhdfs3.   Continuum wrote a Python interface to that library quickly, and it now exists as the hdfs3 library under the Dask organization on Github.

The second project was a little more involved as we needed to integrate with YARN directly.   Continuum developers worked on this and produced a Python library that communicates directly to the YARN classes (using Scala) in order to allow the Python developer to control computing resources as well as spread files to the Hadoop cluster.   This project is called knit, and we expect to connect it to mesos and other cluster resource managers in the near future (if you would like to sponsor this effort, please get in touch with me).

Early releases of hdfs3 and knit were available by the end of February 2015.  At that time, these projects were joined with dask.distributed and the dask code-base into a new Github organization called Dask.   The graduation of Dask into its own organization signified an important milestone that dask was now ready for rapid improvement and growth alongside Spark as a first-class execution engine in the Hadoop ecosystem.

Our initial goals for Dask are to build enough examples, capability, and awareness so that every PySpark user tries Dask to see if it helps them.    We also want Dask to be a compatible and respected member of the growing Hadoop execution-framework community.   We are also seeking to enable Dask to be used by scientists of all kinds who have both array and table data stored on central file-systems and distributed file-systems outside of the Hadoop ecosystem.

### Anaconda as a first-class execution ecosystem for Hadoop

With Dask (including hdfs3 and knit), Anaconda is now able to participate on an equal footing with every other execution framework for Hadoop.  Because of the vast reaches of Anaconda Python and Anaconda R communities, this means that a lot of native code can now be integrated to Hadoop much more easily, and any company that has stored their data in HDFS or other distributed file system (like s3fs or gpfs) can now connect that data easily to the entire Python and/or R computing stack.

This is exciting news!    While we are cautious because these integrative technologies are still young, they are connected to and leveraging the very mature PyData ecosystem.    While benchmarks can be misleading, we have a few benchmarks that I believe accurately reflect the reality of what parallel and distributed Anaconda can do and how it relates to other Hadoop systems.  For array-based and table-based computing workflows, Dask will be 10x to 100x faster than an equivalent PySpark solution.   For applications where you are not using arrays or tables (i.e. word-count using a dask.bag), Dask is a little bit slower than a similar PySpark solution.  However, I would argue that Dask is much more Pythonic and easier to understand for someone who has learned Python.

It will be very interesting to see what the next year brings as more and more people realize what is now available to them in Anaconda.  The PyData crowd will now have instant access to cluster computing at a scale that has previously been accessible only by learning complicated new systems based on the JVM or paying an unfortunate performance penalty.   The Hadoop crowd will now have direct and optimized access to entire classes of algorithms from Python (and R) that they have not previously been used to.

It will take time for this news and these new capabilities to percolate, be tested, and find use-cases that resonate with the particular problems people actually encounter in practice.  I look forward to helping many of you take the leap into using Anaconda at scale in 2016.

We will be showing off aspects of the new technology at Strata in San Jose in the Continuum booth #1336 (look for Anaconda logo and mark).  We have already announced at a high-level some of the capabilities:   Peter and I will both be at Strata along with several of the talented people at Continuum.    If you are attending drop by and say hello.

We first came to Strata on behalf of Continuum in 2012 in Santa Clara.  We announced that we were going to bring you scaled-out NumPy.  We are now beginning to deliver on this promise with Dask.   We brought you scaled-up NumPy with Numba.   Blaze and Bokeh will continue to bring them together along with the rest of the larger data community to provide real insight on data --- where-ever it is stored.   Try out Dask and join the new scaled-out PyData story which is richer than ever before, has a larger community than ever before, and has a brighter future than ever before.

## May 24, 2016

### Filipe Saraiva

#### if (LaKademy 2016) goto Rio de Janeiro

Rio de Janeiro, the “Cidade Maravilhosa”, land of the eternal Summer. The sunlight here is always clear and hot, the sea is refreshing, the sand is comfortable. The people is happy, Rio de Janeiro has good music, food, the craziest parties of the world, and beautiful bodies having fun with beach games (do you know futevolei?).

But while Rio de Janeiro is boiling, some Gearheads based in Latin America will be working together in a cold and dark room in the city, attending to our “multi-area” sprint named Latin America Akademy – LaKademy 2016.

In my plans I have a lot of work to do in Cantor, including a strong triage in bugs and several tests with some IPC technologies. I would like to choose one to be the “official” technology to implement backends for Cantor. Cantor needs a IPC technology with good multiplatform support for the main desktop operating systems. I am think about DBus… do you have other suggestions or tips?

Other contributors also want to work in Cantor. Wagner wants to build and test the application in Windows and begin an implementation of a backend for a new programming language. Fernando, my SoK 2015 student, wants to fix the R backend. I will be very happy seeing these developers dirtying their hands in Cantor source code, so I will help them in those tasks.

During LaKademy I intent to present for the attendees some ideas and prototypes of two new software I am working. I expect to get some feedback and I will think about the next steps for them. Maybe I can submit them for new KDE projects… Well, let’s see.

Wait for more news from the cold and dark room of our LaKademy event in Rio de Janeiro.

### Fabian Pedregosa

#### Hyperparameter optimization with approximate gradient

Most machine learning models rely on at least one hyperparameter to control for model complexity. For example, logistic regression commonly relies on a regularization parameter that controls the amount of $\ell_2$ regularization. Similarly, kernel methods also have hyperparameters that control for properties of the kernel, such as the "width" parameter in the RBF kernel. The fundamental distinction between model parameters and hyperparameters is that, while model parameters are estimated by minimizing a goodness of fit with the training data, hyperparameters need to be estimated by other means (such as a cross-validation loss), as otherwise models with excessive would be selected, a phenomenon known as overfitting.

The main idea is to use an approximate gradient to optimize a cross-validation loss with respect to hyperparameters. A decreasing bound between the true gradient and the approximate gradient ensures that the method converges towards a local minima.

Fitting hyperparameters is essential to obtain models with good accuracy and computationally challenging. The existing most popular methods for fitting hyperparameters are based on either exhaustively exploring the whole hyperparameter space (grid search and random search) or on Bayesian optimization techniques that use previous function evaluations to guide the optimization procedure. The starting point of this work was a simple question: why are the procedures to estimate parameters and hyperparameters so different? Is it possible to use known and reliable methods such as gradient descent to fit not only parameters, but also hyperparameters?

Interestingly, I found out that this question had been answered a long time ago. Already in the 90s, Larsen et al. devised a method (described here and here) using gradient-descent to estimate the optimal value of $\ell_2$ regularization for neural networks. Shortly after, Y. Bengio also published a paper on this topic. Recently, there has been a renewed interest in gradient-based methods (see for example this paper by Maclaurin or a slightly earlier work by Justin Domke, and references within the paper).

One of the drawbacks of gradient-based optimization of hyperparameters, is that these depend on quantities that are costly to compute such as the exact value of the model parameters and the inverse of a Hessian matrix. The aim of this work is to relax some of these assumptions and provide a method that works when the quantities involved (such as model parameters) are known only approximately. In practice, what this means is that hyperparameters can be updated before model parameters have fully converged, which results in big computational gains.

For more details and results, please take a look at the paper. I'll be presenting at the International Conference on Machine Learning (ICML 2016)

### Titus Brown

#### Increasing postdoc pay

I just gave all of my postdocs a $10,000-a-year raise. My two current postdocs all got a$10k raise over their current salary, and the four postdocs coming on board over the next 6 months will start at $10k over the NIH base salary we pay them already. (This means that my starting postdocs will get something like$52k/year, plus benefits.)

I already pay most of my grad students more than I'm required to by UC Davis regulations. While I'm a pretty strong believer that graduate school is school, and that it's pretty good training (see Claus Wilke's extended discussion here), there's something to be said for enabling them to think more about their work and less about whether or not they can afford a slightly more comfortable life. (I pay all my grad students the same salary, independent of graduate program; see below.)

Why did I increase the postdoc salaries now? I've been thinking about it for a while, but the main reason that motivated me to do the paperwork was the change in US labor regulations. There's the cold-blooded calculation that, hey, I don't want to pay overtime; but if it were just that, I could have given smaller raises to my existing postdocs. A bigger factor is that I really don't want the postdocs to have to think about tracking their time. I also hope it will decrease postdoc turnover, which can be a real problem: it takes a lot of time to recruit a new person to the lab, and if it takes a year to recruit a postdoc and they leave sooner because the salary sucks, well, that's really a net loss to me.

More broadly, I view my job as flying cover for my employees. If they worry a little bit less because of a (let's face it) measly $10k, well, so much the better. A while ago I decided to pay all my postdocs on the same scale; there are some people who are good at negotiating and asking, and others who aren't, and it's baldly unfair to listen more to the former. (I've had people who pushed for a raise every 6 months; I've had other people who offered to pay me back by personal check when they were out sick for a week.) I'm also really uncomfortable trying to judge a person's personal need - sure, one postdoc may have a family, and another postdoc may look free as a bird and capable of living out of the office (which has also happened due to low pay...), but everyone's lives are more complicated than they appear, and it's not my place to get that involved. So paying everyone the same salary and explaining that up front reduces any friction that might arise there, I think. There's also the fact that I can afford it at the moment, between my startup and my Moore Foundation grant. The$10k/person increase means that I'm paying somewhere around $80k extra per year, once you include the increase in benefits -- basically, an entire additional postdoc's salary. But being the world's worst manager, I'm not sure how I will deal with a lab with 9 people in it; a 10th would probably not have helped there. So maybe it's not such a bad thing to avoid hiring one more person :). And in the future I will simply budget it into grants. (I do have one grant out in review at the moment where I underbudgeted; if I get it, I'll have to supplement that with my startup.) The interesting thing is that I didn't realize how large a salary many of my future postdocs were turning down. In order to justify the raise to the admins, I asked for information on other offers the postdocs had received - I'd heard that some of them had turned down larger salaries, but hadn't asked for details before. Two of my future postdocs had offers in the$80k range; another was leaving a postdoc that paid north of $60k (not uncommon in some fields) to come to my lab. I'm somewhat surprised (and frankly humbled) that they were planning to come to my lab before this raise; even with this raise, I'm not approaching what they'd already been offered! There are some downsides for the postdocs here (although I think they're pretty mild, all things considered). First, I won't have as much unbudgeted cash lying around, so supply and travel expenditures will be a bit more constrained. Second, I can't afford to keep them all on for quite as long now, so some postdoc jobs may end sooner than they otherwise would have. Third, if they want to transition to a new postdoc at UC Davis, they will probably have to find someone willing to pay them the extra money - it's very hard to lower someone's salary within an institution. (I don't expect this to be a problem, but it's an issue to consider.) There are also some downsides for me that I don't think my employees always appreciate, too. I worry a lot - like, an awful lot - about money. I'm deathly afraid of overpromising employment and having to lay off a postdoc before they have a next step, or, worse, move a grad student to a lot of TAing. So this salary increase puts me a little bit more on edge, and makes me think more about writing grants, and less about research and other, more pleasant things. I can't help but resent that a teensy bit. On the flip side, that is my job and all things considered I'm at a pretty awesome place in a pretty awesome gig so shrug. There may be other downsides I hadn't considered - there usually are ;) -- and upsides as well. I'll follow up if anything interesting happens. --titus ### Matthieu Brucher #### Announcement: Audio TK 1.3.0 ATK is updated to 1.3.0 with new features and optimizations. Download link: ATK 1.3.0 Changelog: 1.3.0 * Added a family of triode preamplification filters with Python wrappers (requires Eigen) * Added a class A NPN preamplification filter with Python wrappers (requires Eigen) * Added a buffer filter with Python wrappers * Added a new Diode clipper with trapezoidal rule with Python wrappers * Added a new version of the SD1 distortion with ZDF mode and Python wrappers 1.2.0 * Added SecondOrderSVF filters from cytomic with Python wrappers * Implemented a LowPassReverbFilter with Python wrappers * Added Python wrappers to AllPassReverbFilter * Distortion filters optimization * Bunch of fixes (Linux compil, calls…) ## May 20, 2016 ### Continuum Analytics news #### New Pip 8.1.2 Release Leaves Anaconda Cloud Broken - Fix in Progress Posted Friday, May 20, 2016 This is an important update for Anaconda Cloud users who are upgrading to the latest version of Pip. Due to changes in a recent release of Pip v8.1.2, Anaconda Cloud users that are installing packages from the PyPI channel where the package name contains a "." or "-" (period or hypen) will be unable to install those packages. The short-term fix is to downgrade Pip to v8.1.1. (The Pip 8.1.2 conda package has been removed from repo.continuum.io so it's not conda-installable currently because of this issue but will be restored to the repo as soon as this issue is resolved in Anaconda Cloud) We anticipate having an updated version of Anaconda Cloud released in the next 1-2 weeks to address this issue and allow users to upgrade to 8.1.2. An update to this post will be shared when it's resolved. To read more about the underlying nature of the issue, please refer to this issue: pypa/pip#3666 ## May 19, 2016 ### Gaël Varoquaux #### Better Python compressed persistence in joblib ## Problem setting: persistence for big data Joblib is a powerful Python package for management of computation: parallel computing, caching, and primitives for out-of-core computing. It is handy when working on so called big data, that can consume more than the available RAM (several GB nowadays). In such situations, objects in the working space must be persisted to disk, for out-of-core computing, distribution of jobs, or caching. An efficient strategy to write code dealing with big data is to rely on numpy arrays to hold large chunks of structured data. The code then handles objects or arbitrary containers (list, dict) with numpy arrays. For data management, joblib provides transparent disk persistence that is very efficient with such objects. The internal mechanism relies on specializing pickle to handle better numpy arrays. Recent improvements reduce vastly the memory overhead of data persistence. ### Limitations of the old implementation ❶ Dumping/loading persisted data with compression was a memory hog, because of internal copies of data, limiting the maximum size of usable data with compressed persistence: We see the increased memory usage during the calls to dump and load functions, profiled using the memory_profiler package with this gist ❷ Another drawback was that large numpy arrays (>10MB) contained in an arbitrary Python object were dumped in separate .npy file, increasing the load on the file system [1]: >>> import numpy as np >>> import joblib # joblib version: 0.9.4 >>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))] # 3 files are generated: >>> joblib.dump(obj, '/tmp/test.pkl', compress=True) ['/tmp/test.pkl', '/tmp/test.pkl_01.npy.z', '/tmp/test.pkl_02.npy.z'] >>> joblib.load('/tmp/test.pkl') [array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]  ## What’s new: compression, low memory… Memory usage is now stable: All numpy arrays are persisted in a single file: >>> import numpy as np >>> import joblib # joblib version: 0.10.0 (dev) >>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))] # only 1 file is generated: >>> joblib.dump(obj, '/tmp/test.pkl', compress=True) ['/tmp/test.pkl'] >>> joblib.load('/tmp/test.pkl') [array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]  Persistence in a file handle (ongoing work in a pull request) More compression formats are available Backward compatibility Existing joblib users can be reassured: the new version is still compatible with pickles generated by older versions (>= 0.8.4). You are encouraged to update (rebuild?) your cache if you want to take advantage of this new version. ## Benchmarks: speed and memory consumption Joblib strives to have minimum dependencies (only numpy) and to be agnostic to the input data. Hence the goals are to deal with any kind of data while trying to be as efficient as possible with numpy arrays. To illustrate the benefits and cost of the new persistence implementation, let’s now compare a real life use case (LFW dataset from scikit-learn) with different libraries: • Joblib, with 2 different versions, 0.9.4 and master (dev), • Pickle • Numpy The four first lines use non compressed persistence strategies, the last four use persistence with zlib/gzip [2] strategies. Code to reproduce the benchmarks is available on this gist. Speed: the results between joblib 0.9.4 and 0.10.0 (dev) are similar whereas numpy and pickle are clearly slower than joblib in both compressed and non compressed cases. Memory consumption: Without compression, old and new joblib versions are the same; with compression, the new joblib version is much better than the old one. Joblib clearly outperforms pickle and numpy in terms of memory consumption. This can be explained by the fact that numpy relies on pickle if the object is not a pure numpy array (a list or a dict with arrays for example), so in this case it inherits the memory drawbacks from pickle. When persisting pure numpy arrays (not tested here), numpy uses its internal save/load functions which are efficient in terms of speed and memory consumption. Disk used: results are as expected: non compressed files have the same size as the in-memory data; compressed files are smaller. Caveat Emptor: performance is data-dependent Different data compress more or less easily. Speed and disk used will vary depending on the data. Key considerations are: • Fraction of data in arrays: joblib is efficient if much of the data is contained in numpy arrays. The worst case scenario is something like a large dictionary of random numbers as keys and values. • Entropy of the data: an array fully of zeros will compress well and fast. A fully random array will compress slowly, and use a lot of disk. Real data is often somewhere in the middle. ## Extra improvements in compressed persistence ### New compression formats Joblib can use new compression formats based on Python standard library modules: zlib, gzip, bz2, lzma and xz (the last 2 are available for Python greater than 3.3). The compressor is selected automatically when the file name has an explicit extension: >>> joblib.dump(obj, '/tmp/test.pkl.z') # zlib ['/tmp/test.pkl.z'] >>> joblib.dump(obj, '/tmp/test.pkl.gz') # gzip ['/tmp/test.pkl.gz'] >>> joblib.dump(obj, '/tmp/test.pkl.bz2') # bz2 ['/tmp/test.pkl.bz2'] >>> joblib.dump(obj, '/tmp/test.pkl.lzma') # lzma ['/tmp/test.pkl.lzma'] >>> joblib.dump(obj, '/tmp/test.pkl.xz') # xz ['/tmp/test.pkl.xz']  One can tune the compression level, setting the compressor explicitly: >>> joblib.dump(obj, '/tmp/test.pkl.compressed', compress=('zlib', 6)) ['/tmp/test.pkl.compressed'] >>> joblib.dump(obj, '/tmp/test.compressed', compress=('lzma', 6)) ['/tmp/test.pkl.compressed']  On loading, joblib uses the magic number of the file to determine the right decompression method. This makes loading compressed pickle transparent: >>> joblib.load('/tmp/test.compressed') [array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]  Importantly, the generated compressed files use a standard compression file format: for instance, regular command line tools (zip/unzip, gzip/gunzip, bzip2, lzma, xz) can be used to compress/uncompress a pickled file generated with joblib. Joblib will be able to load cache compressed with those tools. Toward more and faster compression Specific compression strategies have been developped for fast compression, sometimes even faster than disk reads such as snappy , blosc, LZO or LZ4. With a file-like interface, they should be readily usable with joblib. In the benchmarks above, loading and dumping with compression is slower than without (though only by a factor of 3 for loading). These were done on a computer with an SSD, hence with very fast I/O. In a situation with slower I/O, as on a network drive, compression could save time. With faster compressors, compression will save time on most hardware. ### Compressed persistence into a file handle Now that everything is stored in a single file using standard compression formats, joblib can persist in an open file handle: >>> with open('/tmp/test.pkl', 'wb') as f: >>> joblib.dump(obj, f) ['/tmp/test.pkl'] >>> with open('/tmp/test.pkl', 'rb') as f: >>> print(joblib.load(f)) [array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]  This also works with compression file object available in the standard library, like gzip.GzipFile, bz2.Bz2File or lzma.LzmaFile: >>> import gzip >>> with gzip.GzipFile('/tmp/test.pkl.gz', 'wb') as f: >>> joblib.dump(data, f) ['/tmp/test.pkl.gz'] >>> with gzip.GzipFile('/tmp/test.pkl.gz', 'rb') as f: >>> print(joblib.load(f)) [array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]  Be sure that you use a decompressor matching the internal compression when loading with the above method. If unsure, simply use open, joblib will select the right decompressor: >>> with open('/tmp/test.pkl.gz', 'rb') as f: >>> print(joblib.load(f)) [array([[ 1., 1., ..., 1., 1.]], array([[ 0.47006195, 0.5436392 , ..., 0.1218267 , 0.48592789]])]  Towards dumping to elaborate stores Working with file handles opens the door to storing cache data in database blob or cloud storage such as Amazon S3, Amazon Glacier and Google Cloud Storage (for instance via the Python package boto). ## Implementation A Pickle Subclass: joblib relies on subclassing the Python Pickler/Unpickler [3]. These are state machines that walk the graph of nested objects (a dict may contain a list, that may contain…), creating a string representation of each object encountered. The new implementation proceeds as follows: • Pickling an arbitrary object: when an np.ndarray object is reached, instead of using the default pickling functions (__reduce__()), the joblib Pickler replaces in pickle stream the ndarray with a wrapper object containing all important array metadata (shape, dtype, flags). Then it writes the array content in the pickle file. Note that this step breaks the pickle compatibility. One benefit is that it enables using fast code for copyless handling of the numpy array. For compression, we pass chunks of the data to a compressor object (using the buffer protocol to avoid copies). • Unpickling from a file: when pickle reaches the array wrapper, as the object is in the pickle stream, the file handle is at the beginning of the array content. So at this point the Unpickler simply constructs an array based on the metadata contained in the wrapper and then fills the array buffer directly from the file. The object returned is the reconstructed array, the array wrapper being dropped. A benefit is that if the data is stored not compressed, the array can be directly memory mapped from the storage (the mmap_mode option of joblib.load). This technique allows joblib to pickle all objects in a single file but also to have memory-efficient dump and load. A fast compression stream: as the pickling refactoring opens the door to file objects usage, joblib is now able to persist data in any kind of file object: open, gzip.GzipFile, bz2.Bz2file and lzma.LzmaFile. For performance reason and usability, the new joblib version uses its own file object BinaryZlibFile for zlib compression. Compared to GzipFile, it disables crc computation, which bring a performance gain of 15%. Speed penalties of on-the-fly writes There’s also a small speed difference with dict/list objects between new/old joblib when using compression. The old version pickles the data inside a io.BytesIO buffer and then compress it in a row whereas the new version write “on the fly” compressed chunk of pickled data to the file. Because of this internal buffer the old implementation is not memory safe as it indeed copy the data in memory before compressing. The small speed difference was judged acceptable compared to this memory duplication. ## Conclusion and future work Memory copies were a limitation when caching on disk very large numpy arrays, e.g arrays with a size close to the available RAM on the computer. The problem was solved via intensive buffering and a lot of hacking on top of pickle and numpy. Unfortunately, our strategy has poor performance with big dictionaries or list compared to a cPickle, hence try to use numpy arrays in your internal data structures (note that something like scipy sparse matrices works well, as it builds on arrays). For the future, maybe numpy’s pickle methods could be improved and make a better use of 64-bit opcodes for large objects that were introduced in Python recently. Pickling using file handles is a first step toward pickling in sockets, enabling broadcasting of data between computing units on a network. This will be priceless with joblib’s new distributed backends. Other improvements will come from better compressor, making everything faster. Note The pull request was implemented by @aabadie. He thanks @lesteve, @ogrisel and @GaelVaroquaux for the valuable help, reviews and support.  [1] The load created by multiple files on the filesystem is particularly detrimental for network filesystems, as it triggers multiple requests and isn’t cache friendly.  [2] gzip is based on zlib with additional crc checks and a default compression level of 3.  [3] A drawback of subclassing the Python Pickler/Unpickler is that it is done for the pure-Python version, and not the “cPickle” version. The latter is much faster when dealing with a large number of Python objects. Once again, joblib is efficient when most of the data is represented as numpy arrays or subclasses. ## May 18, 2016 ### Martin Fitzpatrick #### Can I use setup.py to pack an app that requires PyQt5? ## Can I require PyQt5 via setup.py? In a word yes, as long as you restrict your support to PyQt5 and Python3. The requirements specified in setup.py are typically provided by requesting packages from the Python Package Index (PyPi). Until recently these packages were source only, meaning that an installation depending on PyQt5 would only work on a system where it was possible to build it from source. Building on Windows in particular requires quite a lot of set up, and this would therefore put your application out of reach for anyone unable or unwilling to do this. Note: As far as I am aware, it was never actually possible to build from source via PyPi. The standard approach was to download the source/binaries from Riverbank Software and build/install from there. This problem was solved by the introduction of Python Wheels which provide a means to install C extension packages without the need for compilation on the target system. This is achieved by platform-specific .whl files. Wheels for PyQt5 on Python3 are available on PyPi for multiple platforms, including MacOS X, Linux (any), Win32 and Win64 which should cover most uses. For example, this is the output when pip-installing PyQt5 on Python3 on a Mac: mfitzp@MacBook-Air ~$ pip3 install pyqt5
Collecting pyqt5
100% |████████████████████████████████| 73.2MB 2.5kB/s
Collecting sip (from pyqt5)
100% |████████████████████████████████| 49kB 1.8MB/s
Installing collected packages: sip, pyqt5
Successfully installed pyqt5-5.6 sip-4.18


To set PyQt5 as a dependency of your own package simply specify it as normal in your setup.py e.g. install_requires=['PyQt5']

## What’s the proper way of distributing a Python GUI application?

Here you have a few options. The above means that anyone with Python3 installed can now install your application using pip. However, this assumes that the end-user has Python and knows what pip is. If you are looking to distribute your application with a Windows installer, MacOSX ‘app’ bundle, or Linux package, you will need to use one of the tools dedicated to that purpose.

### Windows

• cx_Freeze is a cross-platform packager that can package Python applications for Windows, Mac and Linux. It works by analysing your project and freezing the required packages and subpackages. Success depends largely on which packages you depend on and their complexity/correctness.
• PyInstaller is another cross-platform packager that can package Python applications for Windows, Mac and Linux. This works in a similar way to cx_Freeze and will likely perform both better/worse depending on the packages.
• PyNSISt builds NSIS installer packages for Windows. This has the advantage of being very straightforward: it simply packages all the files together as-is, without ‘freezing’. The downside is that packages can end up very large and slower to install (but see the file-filter options). It now supports bundling of .whl files which will solve this in many cases. By far the easiest if you’re targeting Windows-only.

### MacOSX

• cx_Freeze see above.
• PyInstaller see above.
• Py2app creates .app bundles from the definition in your setup.py. Big advantage is the custom handlers that allow you to adjust packaging of troublesome packages. If you’re only targetting MacOSX this is probably your best option.

### Linux

• cx_Freeze see above.
• PyInstaller see above.
• stdeb build Debian-style packages from your setup.py definition.

Note: It is possible to write a very complex setup.py that allows you to build using one or more tools on different platforms, but I have usually ended up storing separate configs (e.g. setup-py2app.py) for clarity.

## May 17, 2016

### Matthieu Brucher

#### Audio Toolkit: Parameter smoothing

Audio Toolkit shines when the pipeline is fixed (filter-wise and parameter-wise). But in DAWs, automated parameters are often used, and to avoid glitches, it’s interesting to additionally smooth parameters of the pipeline. So let’s see how this can be efficiently achieved.

Although automation in a DAW would already smooth parameters and although some filters can have a heavy state (EQ, but also the dynamics filters, even with their threaded updates), it’s interesting to implement this pattern in some cases. So here it is:

// You need to setup memory, how the parameter is updated
// you need to setup max_interval_process, the number of samples before the next update
class ProcessingClass
{
double parameter_target;
double parameter_current;

int64_t interval_process;
public:
ProcessingClass()
:parameter_target(0), parameter_current(0), interval_process(0)
{}

void update()
{
parameter_current = parameter_current * (1 - memory) + parameter_target * memory;
interval_process = 0;
}

void process(double** in, double** out, int64_t size)
{
// Setup the input/outputs of the pipeline as usual

int64_t processed_size = 0;
do
{
// We can only process max_interval_process elements at a time, but if we already have some elements in the buffer,
// we need to take them into account.
int64_t size_to_process = std::min(max_interval_process - interval_process, size - processed_size);

pipeline_exit.process(size_to_process);

interval_process += size_to_process;
processed_size += size_to_process;
if(interval_process == max_interval_process)
{
update();
interval_process = 0;
}
}while(processed_size != size);
}
};

I’m considering that ProcessingClass has an Audio Toolkit pipeline and that it is embedded in a VST or AU plugin. A call to the parameter update function would update parameter_target and make a call to update(). During the call to process() where the plugin would do some processing, the snippet will cut the input and output arrays in chunks of max_interval_processing elements and call the pipeline for each chunk and then update the underlying parameters if required.

In this snippet, I’m calling update after the pipeline call, but I could also do it before the pipeline call and remove the call to the update function from the parameter change function. It’s a matter of taste.

## May 10, 2016

### Matthieu Brucher

#### Announcement: ATKAutoSwell 1.0.0

I’m happy to announce the release of a mono autoswell based on the Audio Toolkit. They are available on Windows and OS X (min. 10.8) in different formats.

This plugin applies a ratio to the global gain of a signal once it is higher than a given threshold. This means that contrary to a compressor where the power of the signal will never go lower than the threshold, for AutoSwell, it can.

ATKAutoSwell

The supported formats are:

• VST2 (32bits/64bits on Windows, 64bits on OS X)
• VST3 (32bits/64bits on Windows, 64bits on OS X)
• Audio Unit (64bits, OS X)

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

## May 09, 2016

### Enthought

#### Webinar: Fast Forward Through the “Dirty Work” of Data Analysis: New Python Data Import and Manipulation Tool Makes Short Work of Data Munging Drudgery

No matter whether you are a data scientist, quantitative analyst, or an engineer, whether you are evaluating consumer purchase behavior, stock portfolios, or design simulation results, your data analysis workflow probably looks a lot like this: Acquire > Wrangle > Analyze and Model > Share and Refine > Publish The problem is that often 50 to 80 […]

### Continuum Analytics news

#### Community Powered conda Packaging: conda-forge

Posted Monday, May 9, 2016

conda-forge is a community led collection of recipes, build infrastructure and packages for the conda package manager.

### The Problem

Historically, the scientific Python community has always wanted a cross-platform package manager that does not require elevated privileges, handles all types of packages, including compiled Python packages and non-Python packages, and generally lets Python be the awesome scientific toolbox of choice.

The conda package manager solved that problem, but in doing so has created new ones:

• How to get niche tools that are not packaged by Continuum in the “default” channel for Anaconda and Miniconda?
• Where should built packages be hosted?
• How should binaries be built to ensure they are compatible with other systems and with the packages from the default channel?

Continuum Analytics does its best to produce reliable conda packages on the default channel, but it can be difficult to keep pace with the many highly specialized communities and their often complex build requirements. The default channel is therefore occasionally out of date, built without particular features or, in some situations, even broken. In response, Continuum provided Anaconda Cloud as a platform for hosting conda packages. Many communities have created their own channels on Anaconda Cloud to provide a collection of reliable packages that they know will work for their users. This has improved the situation within these communities but has also led to duplication of effort, recipe fragmentation and some unstable environments when combining packages from different channels.

### The Solution

conda-forge is an effort towards unification of these fragmented users and communities. The conda-forge organization was created to be a transparent, open and community-led organization to centralize and standardize package building and recipe hosting, while improving distribution of the maintenance burden.

### What Exactly is conda-forge?

In a nutshell, conda-forge is a GitHub organization containing repositories of conda recipes and a framework of automation scripts for facilitating CI setup for these recipes. In its current implementation, free services from AppVeyor, CircleCI and Travis CI power the continuous build service on Windows, Linux and OS X, respectively. Each recipe is treated as its own repository. referred to as a feedstock, and is automatically built in a clean and repeatable way on each platform.

The built distributions are uploaded to the central repository at anaconda.org/conda-forge and can be installed with conda. For example, to install a conda-forge package into an existing conda environment:

$conda install --channel conda-forge <package-name> Or, to add conda-forge to your channels so that it is always searched: $ conda config --add channels conda-forge

### How Can I be Part of the conda-forge Community?

Users can contribute in a number of ways. These include reporting issues (as can be seen by this sample issue), updating any of our long list of existing recipes (as can be seen by this sample PR) or by adding new recipes to the collection.

Adding new recipes starts with a a PR to staged-recipes. The recipe will be built on Windows, Linux and OS X to ensure the package's builds and the recipe’s tests pass.The PR will also be reviewed by the community to assert the recipes are written in a clear and maintainable way. Once the recipe is ready it will be merged and a new feedstock repository will automatically be created for the recipe by the staged-recipes automation scripts. The feedstock repository has a team with commit rights automatically created using the GitHub handles listed in the recipe extra/recipe-maintainers field. The build and upload processes takes place in the feedstock and, once completed, the package will be available on the conda-forge channel.

A full example of this process can be seen with the “colorlog” package. A PR was raised at staged-recipes proposing the new recipe. It was then built and tested automatically, and, after some iteration, it was merged. Subsequently, the colorlog-feedstock repository was automatically generated with full write access for everybody listed in the recipe-maintainers section.

### Feedstock Model vs Single Repository Model

Many communities are familiar with the “single repository” model - repositories like github.com/conda/conda-recipes that contain many folders of conda recipes.  This model is not ideal for community maintenance, as it lacks granularity of permissions and struggles to scale beyond tens of recipes. With the feedstock model, in which there is one repo per recipe, each recipe has its own team of maintainers and its own CI. The conda-forge/feedstocks repository puts the recipes back into the more familiar single repository model for those workflows which require it.

### Technical Build Details

The build centralization of conda-forge has provided an opportunity to standardize the build tools used in the ecosystem. By itself, Anaconda Cloud imposes no constraints on build tools. This results in some packages working with only a subset of user systems due to platform incompatibilities. For example, packages built on newer Linux systems will often not run on older Linux systems due to glibc compatibility issues. By unifying and solving these problems, together we are improving the likelihood that any package from the conda-forge channel will be compatible with most user systems. Additionally, pooling knowledge has led to better centralized build tools and documentation than any single community had before. Some of this documentation is at https://github.com/conda-forge/staged-recipes/wiki/Helpful-conda-links

### What's Next?

Conda forge is growing rapidly (~60 contributors, ~400 packages, and >118000 downloads). With more community involvement, everyone benefits: package compatibility is improved, packages stay current and we have a larger pool of knowledge to tackle more difficult issues. We can all go get work done, instead of fighting packaging!

conda-forge is open, transparent, and growing quickly. We would love to see more communities joining the effort to solve improve software packaging for the scientific Python community.

## May 03, 2016

### Matthieu Brucher

#### Analog modeling of a diode clipper (4): DK-method

DK method is explained at large by David Ye in his thesis. It’s based around nodal analysis and was also extensively used by cytomic in his papers.

When analyzing a circuit form scratch, we need to replace all capacitors by an equivalent circuit and solve the equation with this modified circuit. Then, the equivalent currents need to be updated with the proper formula.

# What does the formula mean?

So this is the update formula:

$i_{eq_{n+1}} = \frac{C}{\Delta t}V_{n+1} - i_{eq_{n}}$

Let’s write it differently:

$i_{eq_{n+1}} + i_{eq_{n}} = C\frac{V_{n+1}}{\Delta t}$

If we consider $V_{n+1}$ as being a difference, then this is a derivative with a trapezoidal approximation. In conjunction to the original equation, this means that we have a system of several equations that are staggered. Actually $V_n$ has not the same time constraints than $i_{eq_n}$, it lags it by half a sample.

On the one hand, there are several reasons why this is good. Staggered systems are easier to write, and also if the conditions are respected, they are more accurate. For instance, for wave equations, using central difference instead of the staggered system leads to HF instabilities.

The issues on the other hand are that we do a linear update. If this is fine for the SD1 circuit, it is not the same for the two clippers here, as the amount of current in the condensator is a function of the diode function (not the case of the SD1 circuit, as only the input voltage impacts it). But, still, it’s a good approximation.

# Usage on the clippers

OK, let’s see how to apply this on the first clipper:
$V_{in} = V_{on} + I_s sinh(\frac{V_{on}}{nV_t})(\frac{h}{C_1} + 2 R_1)) - \frac{hI_{eq_n}}{2C_1}$

The time dependency is kept inside $I_{eq_n}$, and we don’t need the rest like for the trapezoidal rule:

$V_{in+1} - V_{in} - I_s sinh(\frac{V_{on+1}}{nV_t}) (\frac{h}{C_1} + 2 R_1) - I_s sinh(\frac{V_{on}}{nV_t}) (\frac{h}{C_1} - 2 R_1) - V_{on+1} + V_{on} = 0$

Quite obvious it is simpler! But actually the update rule is a little bit more complicated:

$I_{eq_{n+1}} = \frac{2 C_1}{h} (V_{in} - V_{on} - R_1 I_s sinh(\frac{V_{on}}{nV_t})) - I_{eq_n}$

Actually,a s we computed all the intermediate values, this comes at a cost of a few additions and multiplications, so it’s good.

Let’s try the second clipper:

$V_{in} = V_{on} (1 + \frac{2 R_1 C_1}{h}) + R_1 I_s sinh(\frac{V_{on}}{nV_t}) + I_{eq_n} R_1$

Compared to:

$V_{on+1} - V_{on} = h(\frac{V_{in+1}}{R_1 C_1} - \frac{V_{on+1} + V_{on}}{2 R_1 C_1} + \frac{I_s}{C_1}(sinh(\frac{V_{on+1}}{nV_t}) + sinh(\frac{V_{on}}{nV_t})))$

And in this case, the update formula is simple, as the tension on the condensator is the output voltage:

$I_{eq_{n+1}} = \frac{2 C_1}{h} V_{on} - I_{eq_n}$

Once again, the dependency is hidden inside $I_{eq_n}$ which means simpler and also faster optimization.

# Conclusion

Using the equivalent currents transformation is actually really easy to implement and it allows to simplify the function to optimize. It doesn’t change the function itself compared to the trapezoidal rule, because they are actually (in my opinion, I have done the actual math) two sides of the same coin.

I’ve applied this to the SD1 filter. The simplification in the equation also leads to an improvement in the computation time, but for low sampling rates the filter does not converge. But the higher the sampling rate, the better the improvement over the traditional trapezoidal rule.

## April 29, 2016

### Continuum Analytics news

#### Open Data Science: Bringing “Magic” to Modern Analytics

Posted Friday, April 29, 2016

Science fiction author Arthur C. Clarke once wrote, “any sufficiently advanced technology is indistinguishable from magic.”

We’re nearer than ever to that incomprehensible, magical future. Our gadgets understand our speech, driverless cars have made their debut and we’ll soon be viewing virtual worlds at home.

These “magical” technologies spring from a 21st-century spirit of innovation—but not only from big companies. Thanks to the Internet—and to the open source movement—companies of all sizes are able to spur advancements in science and technology.

In the past, our analytics tools were proprietary, product-oriented solutions. These were necessarily limited in flexibility and they locked customers into the slow innovation cycles and whims of vendors. These closed-source solutions forced a “one size fits all” approach to analytics with monolithic tools that did not offer easy customization for different needs.

Open Data Science has changed that. It offers innovative software—free of proprietary restrictions and tailorable for all varieties of data science teams—created in the transparent collaboration that is driving today’s tech boom.

The Magic 8-Ball of Automated Modeling

One of Open Data Science's most visible points of innovation is in the sphere of data science modeling.

Initially, models were created exclusively by statisticians and analysts for business professionals, but demand from the business sector for software that could do this job gave rise to automatic model fitting—often called “black box” analytics—in which analysts let software algorithmically generate models that fit data and create predictive models.

Such a system creates models, but much like a magic 8-ball, it offers its users answers without business explanations. Mysteries are fun for toys, but no business will bet on them. Quite understandably, no marketing manager or product manager wants to approach the CEO with predictions, only to be stumped when he asks how the manager arrived at them. As Clarke knew, it’s not really magic creating the models, it’s advanced technology and it too operates under assumptions that might or might not make sense for the business.

App Starters Means More Transparent Modeling

Today’s business professionals want faster time-to-value and are dazzled by advanced technologies like automated model fitting, but they also want to understand exactly how and why the work.

That’s why Continuum Analytics is hard at work on Open Data Science solutions including Anaconda App Starters, expected to debut later this year. App Starters are solution “templates” aimed to be a 60-80 percent data science solution that make it easy for businesses to have a starting point. App Starters serve the same purpose as the “black box”—faster time-to-value— but are not a “black box” in that it allows analysts to see exactly how the model was created and to tweak models as desired.

Because the App Starters are are based on Open Data Science, they don’t include proprietary restrictions that keep business professionals or data scientists in the dark regarding the analytics pipeline including the algorithms. It still provides the value of “automagically” creating models, but the details of how it does so are transparent and accessible to the team. With App Starters, business professionals will finally have confidence in the models they’re using to formulate business strategies, while getting faster time-to-value from their growing data.

Over time App Starters will get more sophisticated and will include recommendations—just like how Netflix offers up movie and tv show recommendations for your watching pleasure—that will learn and suggest algorithms and visualizations that best fit the data. Unlike “black boxes” the entire narrative as to why recommendations are offered will be available for the business analyst to learn and gain confidence in the recommendations. However, the business analyst can choose to use the recommendation, tweak the recommendation, use the template without recommendations or they could try tuning the suggested models to find a perfect fit. This type of innovation will further the advancement of sophisticated data science solutions that realize more business value, while instilling confidence in the solution.

Casting Spells with Anaconda

Although App Starters are about to shake up automated modeling, businesses require melding new ideas with tried-and-true solutions. In business analytics, for instance, tools like Microsoft Excel are a staple of the field and being able to integrate them with newer “magic” is highly desirable.

Fortunately, interoperability is one of the keystones of the Open Data Science philosophy and Anaconda provides a way to bridge the reliable old world with the magical new one. With Anaconda, analysts who are comfortable using Excel have an entry point into the world of predictive analytics from the comfort of their spreadsheets. By using the same familiar interface, analysts can access powerful Python libraries to apply cutting-edge analytics to their data. Anaconda recognizes that business analysts want to improve—not disrupt—a proven workflow.

Because Anaconda leverages the Python ecosystem, analysts using Anaconda will achieve powerful results. They might apply a formula to an Excel sheet with a million data rows to predict repeat customers or they may create beautiful, informative visualizations to show how sales have shifted to a new demographic after the company’s newest marketing campaign kicked off. With Anaconda, business analysts can continue using Excel as their main interface, while harnessing the newest “magic” available in the open source community.

Open Data Science for Wizards…and Apprentices

Open Data Science is an inclusive movement. Although open source languages like Python and R dominate data science and allow for the most advanced—and therefore “magical”—analytics technology available, the community is open to all levels of expertise.

Anaconda is a great way for business analysts, for example, to embark on the road toward advanced analytics. But solutions, like App Starters, give advanced wizards the algorithmic visibility to alter and improve models as they see fit.

Open Data Science gives us the “sufficiently advanced technology” that Arthur C. Clarke mentioned—but it puts the power of that magic in our hands.

## April 26, 2016

### Matthieu Brucher

#### Analog modeling of a diode clipper (3b): Simulation

Let’s dive directly inside the second diode clipper and follow exactly the same pattern.

# Second diode clipper

So first let’s remember the equation:

$\frac{dV_o}{dt} = \frac{V_i - V_o}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t})$

## Forward Euler

The forward Euler approximation is then:

$V_{on+1} = V_{on} + h(\frac{V_{in+1} - V_{on}}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_{on}}{nV_t}))$

## Backward Euler

Backward Euler approximation is now:

$V_{on+1} - V_{on} = h(\frac{V_{in+1} - V_{on+1}}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_{on+1}}{nV_t}))$

(The equations are definitely easier to derive…)

## Trapezoidal rule

And finally trapezoidal rule gives:

$V_{on+1} - V_{on} = h(\frac{V_{in+1}}{R_1 C_1} - \frac{V_{on+1} + V_{on}}{2 R_1 C_1} + \frac{I_s}{C_1}(sinh(\frac{V_{on+1}}{nV_t}) + sinh(\frac{V_{on}}{nV_t})))$

## Starting estimates

For the estimates, we use exactly the same methods as the previous clipper, so I won’t recall them.

## Graphs

Numerical optimization comparison

The first obvious change is that the forward Euler can give pretty good results. This makes me think I may have made a mistake in the previous circuit, but as I had to derive the equation before doing the approximation, this may be the reason.

For the original estimates, just like last time, the results are identical:

Original estimates comparison

OK, let’s compare the result of the first iteration with different original estimates:
One step comparison

All estimates give a similar result, but the affine estimates give a better estimate than linear which gives a far better result than the default/copying estimate.

# Conclusion

Just for fun, let’s display the difference between the two clippers:

Diode clippers comparison

Obviously, the second clipper is more symmetric than the first one and thus will create less harmonics (which is confirmed by a spectrogram), and this is also easier to optimize (the second clipper uses at least one less iteration than the first one).

All things considered, the Newton Raphson algorithm is always efficient, with around 3 or less iterations for these circuits. Trying bisection or something else may not be that interesting, except if you are heavily using SIMD instructions. In this case, the optimization may be faster because you have a similar number of iterations.

Original estimates done with the last optimized value always works great although affine estimates are usually faster. The tricky part is deriving the equation. And more often than not, you make mistakes when implementing them!

Next step: DK method…

## April 25, 2016

### Continuum Analytics news

#### Accelerate 2.2 Released!

Posted Monday, April 25, 2016

We're happy to announce the latest update to Accelerate with the release of version 2.2. This version of Accelerate adds compatibility with the recently released Numba 0.25, and also expands the Anaconda Platform in two new directions:

• Data profiling
• MKL-accelerated ufuncs

I'll discuss each of these in detail below.

### Data Profiling

We've built up quite a bit of experience over the years optimizing numerical Python code for our customers, and these projects follow some common patterns. First, the most important step in the optimization process is profiling a realistic test case. You can't improve what you can't measure, and profiling is critical to identify the true bottlenecks in an application. Even experienced developers are often surprised by profiling results when they see which functions are consuming the most time. Ensuring the test case is realistic (but not necessarily long) is also very important, as unit and functional tests for applications tend to use smaller, or differently shaped, input data sets. The scaling behavior of many algorithms is non-linear, so profiling with a very small input can give misleading results.

The second step in optimization is to consider alternative implementations for the critical functions identified in the first step, possibly adopting a different algorithm, parallelizing the calculation to make use of multiple cores or a GPU, or moving up a level to eliminate or batch unnecessary calls to the function. In this step of the process, we often found ourselves lacking a critical piece of information: what data types and sizes were being passed to this function? The best approach often depends on this information. Are these NumPy arrays or custom classes? Are the arrays large or small? 32-bit or 64-bit float? What dimensionality? Large arrays might benefit from GPU acceleration, but small arrays often require moving up the call stack in order to see if calculations can be batched.

Rather than having to manually modify the code to collect this data type information in an ad-hoc way, we've added a new profiling tool to Accelerate that can record this type information as a part of normal profiling. For lack of a better term, we're calling this "data profiling."

We collect this extra information using a modified version of the built-in Python profiling mechanism, and can display it using the standard pstats-style table:

ncalls  tottime percall cumtime percall filename:lineno(function)
300/100 0.01313 0.0001313 0.03036 0.0003036  linalg.py:532(cholesky(a:ndarray(dtype=float64, shape=(3, 3))))
200/100 0.004237 4.237e-05 0.007189 7.189e-05 linalg.py:139(_commonType())
200/100 0.003431 3.431e-05 0.005312 5.312e-05 linalg.py:106(_makearray(a:ndarray(dtype=float64, shape=(3, 3))))
400/200 0.002663 1.332e-05 0.002663 1.332e-05 linalg.py:111(isComplexType(t:type))
300/100 0.002185 2.185e-05 0.002185 2.185e-05 linalg.py:209(_assertNdSquareness())
200/100 0.001592 1.592e-05 0.001592 1.592e-05 linalg.py:124(_realType(t:type, default:NoneType))
200/100 0.00107 1.07e-05 0.00107 1.07e-05 linalg.py:198(_assertRankAtLeast2())
100 0.000162 1.62e-06 0.000162 1.62e-06 linalg.py:101(get_linalg_error_extobj(callback:function))

The recorded function signatures now include data types, and NumPy arrays also have dtype and shape information. In the above example, we've selected only the linear algebra calls from the execution of a PyMC model. Here we can clearly see the Cholesky decomposition is being done on 3x3 matrices, which would dictate our optimization strategy if cholesky was the bottleneck in the code (in this case, it is not).

We've also integrated the SnakeViz profile visualization tool into the Accelerate profiler, so you can easily collect and view profile information right inside your Jupyter notebooks:

## profiling.png

All it takes to profile a function and view it in a notebook is a few lines:

from accelerate import profiler

p = profiler.Profile()

p.run('my_function_to_profile()')

profiler.plot(p)

### MKL-Accelerated Ufuncs

MKL is perhaps best known for high performance, multi-threaded linear algebra functionality, but MKL also provides highly optimized math functions, like sin() and cos() for arrays. Anaconda already ships with the numexpr library, which is linked against MKL to provide fast array math support. However, we have future plans for Accelerate that go beyond what numexpr can provide, so in the latest release of Accelerate, we've exposed the MKL array math functions as NumPy ufuncs you can call directly.

For code that makes extensive use of special math functions on arrays with many thousands of elements, the performance speedup is quite amazing:

import numpy as np

from accelerate.mkl import ufuncs as mkl_ufuncs

def spherical_to_cartesian_numpy(r, theta, phi):

    cos_theta = np.cos(theta)

    sin_theta = np.sin(theta)

    cos_phi = np.cos(phi)

    sin_phi = np.sin(phi)

    x = r * sin_theta * cos_phi

    y = r * sin_theta * sin_phi

    z = r * cos_theta

def spherical_to_cartesian_mkl(r, theta, phi):

    cos_theta = mkl_ufuncs.cos(theta)

    sin_theta = mkl_ufuncs.sin(theta)

    cos_phi = mkl_ufuncs.cos(phi)

    sin_phi = mkl_ufuncs.sin(phi)

        x = r * sin_theta * cos_phi

    y = r * sin_theta * sin_phi

    z = r, cos_theta

        return x, y, z

 n = 100000

r, theta, phi = np.random.uniform(1, 10, n), np.random.uniform(0, np.pi, n), np.random.uniform(-np.pi, np.pi, n)

%timeit spherical_to_cartesian_numpy(r, theta, phi)

%timeit spherical_to_cartesian_mkl(r, theta, phi)

    100 loops, best of 3: 7.01 ms per loop

    1000 loops, best of 3: 978 µs per loop

A speedup of 7x is not bad for a 2.3 GHz quad core laptop CPU from 2012. In future releases, we are looking to expand and integrate this functionality further into the Anaconda Platform, so stay tuned!

### Summary

You can install Accelerate with conda and use it free for 30 days:

conda install accelerate

Try it out, and let us know what you think. Academic users can get a free subscription to Anaconda (including several useful tools, like Accelerate) by following these instructions. Contact sales@continuum.io to find out how to get a subscription to Anaconda at your organization.

## April 20, 2016

### Matthew Rocklin

#### Ad Hoc Distributed Random Forests

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

A screencast version of this post is available here: https://www.youtube.com/watch?v=FkPlEqB8AnE

## TL;DR.

Dask.distributed lets you submit individual tasks to the cluster. We use this ability combined with Scikit Learn to train and run a distributed random forest on distributed tabular NYC Taxi data.

Our machine learning model does not perform well, but we do learn how to execute ad-hoc computations easily.

## Motivation

In the past few posts we analyzed data on a cluster with Dask collections:

Often our computations don’t fit neatly into the bag, dataframe, or array abstractions. In these cases we want the flexibility of normal code with for loops, but still with the computational power of a cluster. With the dask.distributed task interface, we achieve something close to this.

## Application: Naive Distributed Random Forest Algorithm

As a motivating application we build a random forest algorithm from the ground up using the single-machine Scikit Learn library, and dask.distributed’s ability to quickly submit individual tasks to run on the cluster. Our algorithm will look like the following:

1. Pull data from some external source (S3) into several dataframes on the cluster
2. For each dataframe, create and train one RandomForestClassifier
3. Scatter single testing dataframe to all machines
4. For each RandomForestClassifier predict output on test dataframe
5. Aggregate independent predictions from each classifier together by a majority vote. To avoid bringing too much data to any one machine, perform this majority vote as a tree reduction.

## Data: NYC Taxi 2015

As in our blogpost on distributed dataframes we use the data on all NYC Taxi rides in 2015. This is around 20GB on disk and 60GB in RAM.

We predict the number of passengers in each cab given the other numeric columns like pickup and destination location, fare breakdown, distance, etc..

We do this first on a small bit of data on a single machine and then on the entire dataset on the cluster. Our cluster is composed of twelve m4.xlarges (4 cores, 15GB RAM each).

Disclaimer and Spoiler Alert: I am not an expert in machine learning. Our algorithm will perform very poorly. If you’re excited about machine learning you can stop reading here. However, if you’re interested in how to build distributed algorithms with Dask then you may want to read on, especially if you happen to know enough machine learning to improve upon my naive solution.

## API: submit, map, gather

We use a small number of dask.distributed functions to build our computation:

futures = executor.scatter(data)                     # scatter data
future = executor.submit(function, *args, **kwargs)  # submit single task
futures = executor.map(function, sequence)           # submit many tasks
results = executor.gather(futures)                   # gather results
executor.replicate(futures, n=number_of_replications)


In particular, functions like executor.submit(function, *args) let us send individual functions out to our cluster thousands of times a second. Because these functions consume their own results we can create complex workflows that stay entirely on the cluster and trust the distributed scheduler to move data around intelligently.

First we load data from Amazon S3. We use the s3.read_csv(..., collection=False) function to load 178 Pandas DataFrames on our cluster from CSV data on S3. We get back a list of Future objects that refer to these remote dataframes. The use of collection=False gives us this list of futures rather than a single cohesive Dask.dataframe object.

from distributed import Executor, s3
e = Executor('52.91.1.177:8786')

parse_dates=['tpep_pickup_datetime',
'tpep_dropoff_datetime'],
collection=False)
dfs = e.compute(dfs)


Each of these is a lightweight Future pointing to a pandas.DataFrame on the cluster.

>>> dfs[:5]
[<Future: status: finished, type: DataFrame, key: finalize-a06c3dd25769f434978fa27d5a4cf24b>,
<Future: status: finished, type: DataFrame, key: finalize-7dcb27364a8701f45cb02d2fe034728a>,
<Future: status: finished, type: DataFrame, key: finalize-b0dfe075000bd59c3a90bfdf89a990da>,
<Future: status: finished, type: DataFrame, key: finalize-1c9bb25cefa1b892fac9b48c0aef7e04>,
<Future: status: finished, type: DataFrame, key: finalize-c8254256b09ae287badca3cf6d9e3142>]


If we’re willing to wait a bit then we can pull data from any future back to our local process using the .result() method. We don’t want to do this too much though, data transfer can be expensive and we can’t hold the entire dataset in the memory of a single machine. Here we just bring back one of the dataframes:

>>> df = dfs[0].result()

VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2015-01-15 19:05:39 2015-01-15 19:23:42 1 1.59 -73.993896 40.750111 1 N -73.974785 40.750618 1 12.0 1.0 0.5 3.25 0 0.3 17.05
1 1 2015-01-10 20:33:38 2015-01-10 20:53:28 1 3.30 -74.001648 40.724243 1 N -73.994415 40.759109 1 14.5 0.5 0.5 2.00 0 0.3 17.80
2 1 2015-01-10 20:33:38 2015-01-10 20:43:41 1 1.80 -73.963341 40.802788 1 N -73.951820 40.824413 2 9.5 0.5 0.5 0.00 0 0.3 10.80
3 1 2015-01-10 20:33:39 2015-01-10 20:35:31 1 0.50 -74.009087 40.713818 1 N -74.004326 40.719986 2 3.5 0.5 0.5 0.00 0 0.3 4.80
4 1 2015-01-10 20:33:39 2015-01-10 20:52:58 1 3.00 -73.971176 40.762428 1 N -74.004181 40.742653 2 15.0 0.5 0.5 0.00 0 0.3 16.30

## Train on a single machine

To start lets go through the standard Scikit Learn fit/predict/score cycle with this small bit of data on a single machine.

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split

df_train, df_test = train_test_split(df)

columns = ['trip_distance', 'pickup_longitude', 'pickup_latitude',
'dropoff_longitude', 'dropoff_latitude', 'payment_type',
'fare_amount', 'mta_tax', 'tip_amount', 'tolls_amount']

est = RandomForestClassifier(n_estimators=4)
est.fit(df_train[columns], df_train.passenger_count)


This builds a RandomForestClassifer with four decision trees and then trains it against the numeric columns in the data, trying to predict the passenger_count column. It takes around 10 seconds to train on a single core. We now see how well we do on the holdout testing data:

>>> est.score(df_test[columns], df_test.passenger_count)
0.65808188654721012


This 65% accuracy is actually pretty poor. About 70% of the rides in NYC have a single passenger, so the model of “always guess one” would out-perform our fancy random forest.

>>> from sklearn.metrics import accuracy_score
>>> import numpy as np
>>> accuracy_score(df_test.passenger_count,
...                np.ones_like(df_test.passenger_count))
0.70669390028780987


This is where my ignorance in machine learning really kills us. There is likely a simple way to improve this. However, because I’m more interested in showing how to build distributed computations with Dask than in actually doing machine learning I’m going to go ahead with this naive approach. Spoiler alert: we’re going to do a lot of computation and still not beat the “always guess one” strategy.

## Fit across the cluster with executor.map

First we build a function that does just what we did before, builds a random forest and then trains it on a dataframe.

def fit(df):
est = RandomForestClassifier(n_estimators=4)
est.fit(df[columns], df.passenger_count)
return est


Second we call this function on all of our training dataframes on the cluster using the standard e.map(function, sequence) function. This sends out many small tasks for the cluster to run. We use all but the last dataframe for training data and hold out the last dataframe for testing. There are more principled ways to do this, but again we’re going to charge ahead here.

train = dfs[:-1]
test = dfs[-1]

estimators = e.map(fit, train)


This takes around two minutes to train on all of the 177 dataframes and now we have 177 independent estimators, each capable of guessing how many passengers a particular ride had. There is relatively little overhead in this computation.

## Predict on testing data

Recall that we kept separate a future, test, that points to a Pandas dataframe on the cluster that was not used to train any of our 177 estimators. We’re going to replicate this dataframe across all workers on the cluster and then ask each estimator to predict the number of passengers for each ride in this dataset.

e.replicate([test], n=48)

def predict(est, X):
return est.predict(X[columns])

predictions = [e.submit(predict, est, test) for est in estimators]


Here we used the executor.submit(function, *args, **kwrags) function in a list comprehension to individually launch many tasks. The scheduler determines when and where to run these tasks for optimal computation time and minimal data transfer. As with all functions, this returns futures that we can use to collect data if we want in the future.

Developers note: we explicitly replicate here in order to take advantage of efficient tree-broadcasting algorithms. This is purely a performance consideration, everything would have worked fine without this, but the explicit broadcast turns a 30s communication+computation into a 2s communication+computation.

## Aggregate predictions by majority vote

For each estimator we now have an independent prediction of the passenger counts for all of the rides in our test data. In other words for each ride we have 177 different opinions on how many passengers were in the cab. By averaging these opinions together we hope to achieve a more accurate consensus opinion.

For example, consider the first four prediction arrays:

>>> a_few_predictions = e.gather(predictions[:4])  # remote futures -> local arrays
>>> a_few_predictions
[array([1, 2, 1, ..., 2, 2, 1]),
array([1, 1, 1, ..., 1, 1, 1]),
array([2, 1, 1, ..., 1, 1, 1]),
array([1, 1, 1, ..., 1, 1, 1])]


For the first ride/column we see that three of the four predictions are for a single passenger while one prediction disagrees and is for two passengers. We create a consensus opinion by taking the mode of the stacked arrays:

from scipy.stats import mode
import numpy as np

def mymode(*arrays):
array = np.stack(arrays, axis=0)
return mode(array)[0][0]

>>> mymode(*a_few_predictions)
array([1, 1, 1, ..., 1, 1, 1])


And so when we average these four prediction arrays together we see that the majority opinion of one passenger dominates for all of the six rides visible here.

## Tree Reduction

We could call our mymode function on all of our predictions like this:

>>> mode_prediction = e.submit(mymode, *predictions)  # this doesn't scale well


Unfortunately this would move all of our results to a single machine to compute the mode there. This might swamp that single machine.

Instead we batch our predictions into groups of size 10, average each group, and then repeat the process with the smaller set of predictions until we have only one left. This sort of multi-step reduction is called a tree reduction. We can write it up with a couple nested loops and executor.submit. This is only an approximation of the mode, but it’s a much more scalable computation. This finishes in about 1.5 seconds.

from toolz import partition_all

while len(predictions) > 1:
predictions = [e.submit(mymode, *chunk)
for chunk in partition_all(10, predictions)]

result = e.gather(predictions)[0]

>>> result
array([1, 1, 1, ..., 1, 1, 1])


## Final Score

Finally, after completing all of our work on our cluster we can see how well our distributed random forest algorithm does.

>>> accuracy_score(result, test.result().passenger_count)
0.67061974451423045


Still worse than the naive “always guess one” strategy. This just goes to show that, no matter how sophisticated your Big Data solution is, there is no substitute for common sense and a little bit of domain expertise.

## What didn’t work

As always I’ll have a section like this that honestly says what doesn’t work well and what I would have done with more time.

• Clearly this would have benefited from more machine learning knowledge. What would have been a good approach for this problem?
• I’ve been thinking a bit about memory management of replicated data on the cluster. In this exercise we specifically replicated out the test data. Everything would have worked fine without this step but it would have been much slower as every worker gathered data from the single worker that originally had the test dataframe. Replicating data is great until you start filling up distributed RAM. It will be interesting to think of policies about when to start cleaning up redundant data and when to keep it around.
• Several people from both open source users and Continuum customers have asked about a general Dask library for machine learning, something akin to Spark’s MLlib. Ideally a future Dask.learn module would leverage Scikit-Learn in the same way that Dask.dataframe leverages Pandas. It’s not clear how to cleanly break up and parallelize Scikit-Learn algorithms.

## Conclusion

This blogpost gives a concrete example using basic task submission with executor.map and executor.submit to build a non-trivial computation. This approach is straightforward and not restrictive. Personally this interface excites me more than collections like Dask.dataframe; there is a lot of freedom in arbitrary task submission.

## April 19, 2016

### Matthieu Brucher

#### Book review: The Culture Map: Decoding How People Think and Get Things Done in a Global World

I work in an international company, and there are lots of people from different cultures around me, and with whom I need to interact. Out of the blue, it feels like it’s easy to work with all of them, I mean, how difficult could it be to work with them?

Actually, it’s easy, but sometimes interactions are intriguing and people do not react the way you expect them to react. And why is that? Lots of reasons, of course, but one of them is that they have a different culture and do not expect you to explicitly tell them what they did wrong (which is something I do. A lot).

#### Content and opinions

Enters Erin Meyer. She had to navigate between these cultures, as she’s American, married to a French guy, in France. In her book, she presents 8 scales, and each culture is placed differently on each scale.

I won’t enter in all the details of the different scales, but they are about all the different ways of people interacting with other people. Whether it’s about scheduling, feedback to decision-making, all cultures are different. And sometimes, even if the cultures are close geographically, they can be quite different on some scales. After all, they are all influenced by their short or long history, their philosophers…

All the scales are imaged with stories of Meyer’s experience in teaching them, stories from her students, and they are always spot on.

#### Conclusion

Of course, the book only tells you the differences, what to look for. It doesn’t educate you to do the right think. This takes practice, and it requires work.

Also it doesn’t solve all interaction problems. Everyone is different in one’s own culture (not even talking about people having several cultures…), on the left or on the right compared to the average of one’s culture on each scale. So you can’t sum up someone to a culture. But if you want to learn more about interacting with people, you already know that.

## April 18, 2016

### Continuum Analytics news

#### Conda + Spark

Posted Tuesday, April 19, 2016

In my previous post, I described different scenarios for bootstrapping Python on a multi-node cluster. I offered a general solution using Anaconda for cluster management and solution using a custom conda env deployed with Knit.

In a follow-up to that post, I was asked if the machinery in Knit would also work for Spark. Sure--of course! In fact, much of Knit's design comes from Spark's deploy codebase. Here, I am going to demonstrate how we can ship a Python environment, complete with desired dependencies, as part of a Spark job without installing Python on every node.

## Spark YARN Deploy

First, I want to briefly describe key points in Spark's YARN deploy methodologies. After negotiating which resources to provision with YARN's Resource Manager, Spark asks for a directory to be constructed on HDFS: /user/ubuntu/.sparkStaging/application_1460665326796_0065/ The directory will always be in the user's home, and the application ID issued by YARN is appended to the directory name (thinking about this now, perhaps this is obvious and straightforward to JAVA/JVM folks where bundling Uber JARs has long been the practice in traditional Map-Reduce jobs). In any case, Spark then uploads itself to the stagingDirectory, and when YARN provisions a container, the contents of the directory are pulled down and the spark-assembly jar is executed. If you are using PySpark or sparkR, a corresponding pyspark.zip and sparkr.zip will be found in the staging directory as well.

Occasionally, users see FileNotFoundException errors -- this can be caused by a few things: incorrect Spark Contexts, incorrect SPARK_HOME, and I have faint recollection that there was a packaging problem once where pyspark.zip or sparkr.zip was missing, or could not be created do to permissions? Anyway -- below is the output you will see when Spark works cleanly.

16/04/15 13:01:03 INFO Client: Uploading resource file:/opt/anaconda/share/spark-1.6.0/lib/spark-assembly-1.6.0-hadoop2.6.0.jar -> hdfs://ip-172-31-50-60:9000/user/ubuntu/.sparkStaging/application_1460665326796_0065/spark-assembly-1.6.0-hadoop2.6.0.jar

16/04/15 13:01:07 INFO Client: Uploading resource file:/opt/anaconda/share/spark-1.6.0/python/lib/pyspark.zip -> hdfs://ip-172-31-50-60:9000/user/ubuntu/.sparkStaging/application_1460665326796_0065/pyspark.zip

Not terribly exciting, but positive confirmation that Spark is uploading local files to HDFS.

## Bootstrap-Fu Redux

Most of what I described above is what the YARN framework allows developers to do -- it's more that Spark implements a YARN application than Spark doing magical things (and Knit as well!). If I were using Scala/Java, I would package up everything in a jar and use spark-submit -- Done!

Unfortunately, there's a little more work to be done for an Uber Python jar equivalent.

One of the killer features of conda is environment management. When conda creates a new environment, it uses hard-links when possible. Generally, this greatly reduces disk usage. But, if we move the directory to another machine, we're probably just moving a handful of hard-links and not the files themselves. Fortunately, we can tell conda: "No! Copy the files!"

For example:

conda create -p /home/ubuntu/dev --copy -y -q python=3 pandas scikit-learn

By using the --copy, we "install all packages using copies instead of hard or soft-linking." The headers in various files in the bin/ directory may have lines like #!/home/ubuntu/dev/bin/python. But, we don't need to be concerned about that -- we're not going to be using 2to3, idle, pip, etc. If we zipped up the environment, we could move this onto another machine of a similar OS type, execute Python, and we'd be able to load any library in the lib/python3.45/site-packages directory.

We're very close to our Uber Python jar -- now with a zipped conda directory in mind, let's proceed.

zip -r dev.zip dev

## Death by ENV Vars

We are going to need a handful of specific command line options and environment variables: Spark Yarn Configuration and Spark Environment Variables. We'll be using:

• PYSPARK_PYTHON: The Python binary Spark should use
• spark.yarn.appMasterEnv.PYSPARK_PYTHON (though this one could be wrong/unnecessary/only used for --master yarn-cluster)
• --archives: include local tgz/jar/zip in .sparkStaging directory and pull down into temporary YARN container

We'll also need a test script. The following is a reasonable test to prove which Python Spark is using -- we're writing a no-op function which returns Python's various paths it is using to find libraries

# test_spark.py

import os

import sys

from pyspark import SparkContext

from pyspark import SparkConf

conf = SparkConf()

conf.setAppName("get-hosts")

sc = SparkContext(conf=conf)

def noop(x):

    import socket

    import sys

    return socket.gethostname() + ' '.join(sys.path) + ' '.join(os.environ)

rdd = sc.parallelize(range(1000), 100)

hosts = rdd.map(noop).distinct().collect()

print(hosts)

And executing everything together:

 PYSPARK_PYTHON=./ANACONDA/dev/bin/python spark-submit \

 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ANACONDA/dev/bin/python \

 --master yarn-cluster \

 --archives /home/ubuntu/dev.zip#ANACONDA \

 /home/ubuntu/test_spark.py

We'll get the following output in the yarn logs:

It's a little hard to parse -- what should be noted are file paths like:

.../container_1460665326796_0070_01_000002/ANACONDA/dev/lib/python3.5/site- packages

This is demonstrating that Spark is using the unzipped directory in the YARN container. Ta-da!

## Thoughts

Okay, perhaps that's not super exciting, so let's zoom out again:

1. We create a zipped conda environment with dependencies: pandas, python=3,...
2. We successfully launched a Python Spark job without any Python binaries or libraries previously installed on the nodes.

There is an open JIRA ticket discussing the option of having Spark ingest a requirements.txt and building the Python environment as a preamble to a Spark job. This is also a fairly novel approach to the same end -- using Spark to bootstrap a runtime environment. It's even a bit more general, since the method described above relies on YARN. I first saw this strategy in use with streamparse. Similarly to the implementation in JIRA ticket, streamparse can ship a Python requirements.txt and construct a Python environment as part of a Streamparse Storm job!

## Rrrrrrrrrrrrrr

Oh, and R conda environments work as well...but it's more involved.

## Create/Munge R Env

First, it's pretty cool that conda can install and manage R environments. Again, we create a conda environment with R binaries and libraries

conda create -p /home/ubuntu/r_env --copy -y -q r-essentials -c r

R is not exactly relocatable so we need to munge a bit:

sed -i "s/home\/ubuntu/.r_env.zip/g" /home/ubuntu/r_env/bin/R

zip -r r_env.zip r_env

My R skills are at a below-novice level, so the following test script could probably be improved

# /home/ubuntu/test_spark.R

library(SparkR)

sc <- sparkR.init(appName="get-hosts-R")

noop <- function(x) {

  path <- toString(.libPaths())

  host <- toString(Sys.info()['nodename'])

  host_path <- toString(cbind(host,path))

  host_path

}

rdd <- SparkR:::parallelize(sc, 1:1000, 100)

hosts <- SparkR:::map(rdd, noop)

d_hosts <- SparkR:::distinct(hosts)

out <- SparkR:::collect(d_hosts)

print(out)

Execute (and the real death by options):

SPARKR_DRIVER_R=./r_env.zip/r_env/lib/R spark-submit --master yarn-cluster \

--conf spark.yarn.appMasterEnv.R_HOME=./r_env.zip/r_env/lib64/R \

--conf spark.yarn.appMasterEnv.RHOME=./r_env.zip/r_env \

--conf spark.yarn.appMasterEnv.R_SHARE_DIR=./r_env.zip/r_env/lib/R/share \

--conf spark.yarn.appMasterEnv.R_INCLUDE_DIR=./r_env.zip/r_env/lib/R/include \

--conf spark.executorEnv.R_HOME=./r_env.zip/r_env/lib64/R \

--conf spark.executorEnv.RHOME=./r_env.zip/r_env \

--conf spark.executorEnv.R_SHARE_DIR=./r_env.zip/r_env/lib/R/share \

--conf spark.executorEnv.R_INCLUDE_DIR=./r_env.zip/r_env/lib/R/include \

--conf  spark.r.command=./r_env.zip/r_env/bin/Rscript \

--archives r_env.zip \

/home/ubuntu/test_spark.R

Example output:

[1] "ip-172-31-50-59, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/230/sparkr.zip, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/229/r_env.zip/r_env/lib64/R/library"

[1] "ip-172-31-50-61, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/183/sparkr.zip, /var/lib/hadoop-yarn/data/1/yarn/local/usercache/ubuntu/filecache/182/r_env.zip/r_env/lib64/R/library"

This post is also published on Ben's website here.

Posted Monday, April 18, 2016

## Overview

Dask is a flexible open source parallel computation framework that lets you comfortably scale up and scale out your analytics. If you’re running into memory issues, storage limitations, or CPU boundaries on a single machine when using Pandas, NumPy, or other computations with Python, Dask can help you scale up on all of the cores on a single machine, or scale out on all of the cores and memory across your cluster.

Dask enables distributed computing in pure Python and complements the existing numerical and scientific computing capability within Anaconda. Dask works well on a single machine to make use of all of the cores on your laptop and process larger-than-memory data, and it scales up resiliently and elastically on clusters with hundreds of nodes.

Dask works natively from Python with data in different formats and storage systems, including the Hadoop Distributed File System (HDFS) and Amazon S3. Anaconda and Dask can work with your existing enterprise Hadoop distribution, including Cloudera CDH and Hortonworks HDP.

In this post, we’ll show you how you can use Anaconda with Dask for distributed computations and workflows, including distributed dataframes, arrays, text processing, and custom parallel workflows that can help you make the most of Anaconda and Dask on your cluster. We’ll work with Anaconda and Dask interactively from the Jupyter Notebook while the heavy computations are running on the cluster.

There are many different ways to get started with Anaconda and Dask on your Hadoop or HPC cluster, including manual setup via SSH; by integrating with resource managers such as YARN, SGE, or Slurm; launching instances on Amazon EC2; or by using the enterprise-ready Anaconda for cluster management.

Anaconda for cluster management makes it easy to install familiar packages from Anaconda (including NumPy, SciPy, Pandas, NLTK, scikit-learn, scikit-image, and access to 720+ more packages in Anaconda) and the Dask parallel processing framework on all of your bare-metal or cloud-based cluster nodes. You can provision centrally managed installations of Anaconda, Dask and the Jupyter notebook using two simple commands with Anaconda for cluster management:

$acluster create dask-cluster -p dask-cluster $ acluster install dask notebook

Additional features of Anaconda for cluster management include:

• Easily install Python and R packages across multiple cluster nodes
• Manage multiple conda environments across a cluster
• Push local conda environments to all cluster nodes
• Works on cloud-based and bare-metal clusters with existing Hadoop installations

Once you’ve installed Anaconda and Dask on your cluster, you can perform many types of distributed computations, including text processing (similar to Spark), distributed dataframes, distributed arrays, and custom parallel workflows. We’ll show some examples in the following sections.

## Distributed Text and Language Processing (Dask Bag)

Dask works well with standard computations such as text processing and natural language processing and with data in different formats and storage systems (e.g., HDFS, Amazon S3, local files). The Dask Bag collection is similar to other parallel frameworks and supports operations like  filter, count, fold, frequencies, pluck, and take, which are useful for working with a collection of Python objects such as text.

For example, we can use the natural language processing toolkit (NLTK) in Anaconda to perform distributed language processing on a Hadoop cluster, all while working interactively in a Jupyter notebook.

In this example, we'll use a subset of the data set that contains comments from the reddit website from January 2015 to August 2015, which is about 242 GB on disk. This data set was made available on July 2015 in a reddit post. The data set is in JSON format (one comment per line) and consists of the comment body, author, subreddit, timestamp of creation and other fields.

First, we import libraries from Dask and connect to the Dask distributed scheduler:

>>> import dask

>>> from distributed import Executor, hdfs, progress

>>> e = Executor('54.164.41.213:8786')

Next, we load 242 GB of JSON data from HDFS using pure Python:

>>> import json

>>> lines = hdfs.read_text('/user/ubuntu/RC_2015-*.json')

>>> js = lines.map(json.loads)

We can filter and load the data into distributed memory across the cluster:

>>> movies = js.filter(lambda d: 'movies' in d['subreddit'])

>>> movies = e.persist(movies)

Once we’ve loaded the data into distributed memory, we can import the NLTK library from Anaconda and construct stacked expressions to tokenize words, tag parts of speech, and filter out non-words from the dataset.

>>> import nltk

>>> pos = e.persist(movies.pluck('body')

...                       .map(nltk.word_tokenize)

...                       .map(nltk.pos_tag)

...                       .concat()

...                       .filter(lambda (word, pos): word.isalpha()))

In this example, we’ll generate a list of the top 10 proper nouns from the movies subreddit.

>>> f = e.compute(pos.filter(lambda (word, type): type == 'NNP')

...                  .pluck(0)

...                  .frequencies()

...                  .topk(10, lambda (word, count): count))

>>> f.result()

[(u'Marvel', 35452),

 (u'Star', 34849),

 (u'Batman', 31749),

 (u'Wars', 28875),

 (u'Man', 26423),

 (u'John', 25304),

 (u'Superman', 22476),

 (u'Hollywood', 19840),

 (u'Max', 19558),

 (u'CGI', 19304)]

Finally, we can use Bokeh to generate an interactive plot of the resulting data:

## Analysis with Distributed Dataframes (Dask DataFrame)

Dask allows you to work with familiar Pandas dataframe syntax on a single machine or on many nodes on a Hadoop or HPC cluster. You can work with data stored in different formats and storage systems (e.g., HDFS, Amazon S3, local files). The Dask DataFrame collection mimics the Pandas API, uses Pandas under the hood, and supports operations like head, groupby, value_counts, merge, and set_index.

For example, we can use the Dask to perform computations with dataframes on a Hadoop cluster with data stored in HDFS, all while working interactively in a Jupyter notebook.

First, we import libraries from Dask and connect to the Dask distributed scheduler:

>>> import dask

>>> from distributed import Executor, hdfs, progress, wait, s3

>>> e = Executor('54.164.41.213:8786')

Next, we’ll load the NYC taxi data in CSV format from HDFS using pure Python and persist the data in memory:

>>> df = hdfs.read_csv('/user/ubuntu/nyc/yellow_tripdata_2015-*.csv',

                       parse_dates=['tpep_pickup_datetime','tpep_dropoff_datetime'],

                       header='infer')

>>> df = e.persist(df)

We can perform familiar operations such as computing value counts on columns and statistical correlations:

>>> df.payment_type.value_counts().compute()

1    91574644

2    53864648

3      503070

4      170599

5          28

Name: payment_type, dtype: int64

>>> df2 = df.assign(payment_2=(df.payment_type == 2),

...                 no_tip=(df.tip_amount == 0))

>>> df2.astype(int).corr().compute()

           no_tip    payment_2

no_tip    1.000000    0.943123

payment_2    0.943123    1.000000

Dask runs entirely asynchronously, leaving us free to explore other cells in the notebook while computations happen in the background. Dask also handles all of the messy CSV schema handling for us automatically.

Finally, we can use Bokeh to generate an interactive plot of the resulting data:

## Numerical, Statistical and Scientific Computations with Distributed Arrays (Dask Array)

Dask works well with numerical and scientific computations on n-dimensional array data. The Dask Array collection mimics a subset of the NumPy API, uses NumPy under the hood, and supports operations like dot, flatten, max, mean, and std.

For example, we can use the Dask to perform computations with arrays on a cluster with global temperature/weather data stored in NetCDF format (like HDF5), all while working interactively in a Jupyter notebook. The data files contain measurements that were taken every six hours at every quarter degree latitude and longitude.

First, we import the netCDF4 library and point to the data files stored on disk:

>>> import netCDF4

>>> from glob import glob

>>> filenames = sorted(glob('2014-*.nc3'))

>>> t2m = [netCDF4.Dataset(fn).variables['t2m'] for fn in filenames]

>>> t2m[0]

<class 'netCDF4._netCDF4.Variable'>

int16 t2m(time, latitude, longitude)

    scale_factor: 0.00159734395579

    add_offset: 268.172358066

    _FillValue: -32767

    missing_value: -32767

    units: K

    long_name: 2 metre temperature

unlimited dimensions:

current shape = (4, 721, 1440)

filling off

We then import Dask and read in the data from the NumPy arrays:

>>> import dask.array as da

>>> xs = [da.from_array(t, chunks=t.shape) for t in t2m]

>>> x = da.concatenate(xs, axis=0)

We can then perform distributed computations on the cluster, such as computing the mean temperature, variance of the temperature over time, and normalized temperature. We can view the progress of the computations as they run on the cluster nodes and continue to work in other cells in the notebook:

>>> avg, std = da.compute(x.mean(axis=0), x.std(axis=0))

>>> z = (x - avg) / std

>>> progress(z)

We can plot the resulting normalized temperature using matplotlib:

We can also create interactive widgets in the notebook to interact with and visualize the data in real-time while the computations are running across the cluster:

## Creating Custom Parallel Workflows

When one of the standard Dask collections isn’t a good fit for your workflow, Dask gives you the flexibility to work with different file formats and custom parallel workflows. The Dask Imperative collection lets you wrap functions in existing Python code and run the computations on a single machine or across a cluster.

In this example, we have multiple files stored hierarchically in a custom file format (Feather for reading and writing Python and R dataframes on disk). We can build a custom workflow by wrapping the code with Dask Imperative and making use of the Feather library:

>>> import feather

>>> from dask import delayed

>>> from glob import glob

>>> import os

>>> lazy_dataframes = []

>>> for directory in glob('2016-*'):

...     for symbol in os.listdir(directory):

...         filename = os.path.join(directory, symbol)

...         df = delayed(feather.read_dataframe)(filename)

...         df = delayed(pd.DataFrame.assign)(df,

                    date=pd.Timestamp(directory),

                    symbol=symbol)

...         lazy_dataframes.append(df)

You can get started with Anaconda and Dask using Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:

$conda install anaconda-client -n root $ anaconda login

$conda install anaconda-cluster -c anaconda-cluster In addition to Anaconda subscriptions, there are many different ways that Continuum can help you get started with Anaconda and Dask to construct parallel workflows, parallelize your existing code, or integrate with your existing Hadoop or HPC cluster, including: • Architecture consulting and review • Manage Python packages and environments on a cluster • Develop custom package management solutions on existing clusters • Migrate and parallelize existing code with Python and Dask • Architect parallel workflows and data pipelines with Dask • Build proof of concepts and interactive applications with Dask • Custom product/OSS core development • Training on parallel development with Dask For more information about the above solutions, or if you’d like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at sales@continuum.io. ## April 17, 2016 ### Titus Brown #### MinHash signatures as ways to find samples, and collaborators? As I wrote last week my latest enthusiasm is MinHash sketches, applied (for the moment) to RNAseq data sets. Briefly, these are small "signatures" of data sets that can be used to compare data sets quickly. In the previous blog post, I talked a bit about their effectiveness and showed that (at least in my hands, and on a small data set of ~200 samples) I could use them to cluster RNAseq data sets by species. What I didn't highlight in that blog post is that they could potentially be used to find samples of interest as well as (maybe) collaborators. ## Finding samples of interest The "samples of interest" idea is pretty clear - supposed we had a collection of signatures from all the RNAseq in the the Sequence Read Archive? Then we could search the entire SRA for data sets that were "close" to ours, and then just use those to do transcriptome studies. It's not yet clear how well this might work for finding RNAseq data sets with similar expression patterns, but if you're working with non-model species, then it might be a good way to pick out all the data sets that you should use to generate a de novo assembly. More generally, as we get more and more data, finding relevant samples may get harder and harder. This kind of approach lets you search on sequence content, not annotations or metadata, which may be incomplete or inaccurate for all sorts of reasons. In support of this general idea, I have defined a provisional file format (in YAML) that can be used to transport around these signatures. It's rather minimal and fairly human readable - we would need to augment it with additional metadata fields for any serious use in databases(but see below for more discussion on that). Each record (and there can currently only be one record per signature file) can contain multiple different sketches, corresponding to different k-mer sizes used in generating the sketch. (For different sized sketches with the same k-mers, you just store the biggest one, because we're using bottom sketches so the bigger sketches properly include the smaller sketches.) If you want to play with some signatures, you can -- here's an executable binder with some examples of generating distance matrices between signatures, and plotting them. Note that by far the most time is spent in loading the signatures - the comparisons are super quick, and in any case could be sped up a lot by moving them from pure Python over to C. I've got a pile of all echinoderm SRA signatures already built, for those who are interested in looking at a collection -- look here. ## Finding collaborators Searching public databases is all well and good, and is a pretty cool application to enable with a few dozen lines of code. But I'm also interested in enabling the search of pre-publication data and doing matchmaking between potential collaborators. How could this work? Well, the interesting thing about these signatures is that they are irreversible signatures with a one-sided error (a match means something; no match means very little). This means that you can't learn much of anything about the original sample from the signature unless you have a matching sample, and even then all you know is the species and maybe something about the tissue/stage being sequenced. In turn, this means that it might be possible to convince people to publicly post signatures of pre-publication mRNAseq data sets. Why would they do this?? An underappreciated challenge in the non-model organism world is that building reference transcriptomes requires a lot of samples. Sure, you can go sequence just the tissues you're interested in, but you have to sequence deeply and broadly in order to generate good enough data to produce a good reference transcriptome so that you can interpret your own mRNAseq. In part because of this (as well as many other reasons), people are slow to publish on their mRNAseq - and, generally, data isn't made available pre-publication. What if you could go fishing for collaborators on building a reference transcriptome? Very few people are excited about just publishing a transcriptome (with some reason, when you see papers that publish 300), but those are really valuable building blocks for the field as a whole. So, suppose you had some RNAseq, and you wanted to find other people with RNAseq from the same organism, and there was this service where you could post your RNAseq signature and get notified when similar signatures were posted? You wouldn't need to do anything more than supply an e-mail address along with your signature, and if you're worried about leaking information about who you are, it's easy enough to make new e-mail addresses. I dunno. Seems interesting. Could work. Right? One fun point is that this could be a distributed service. The signatures are small enough (~1-2 kb) that you can post them on places like github, and then have aggregators that collect them. The only "centralized" service involved would be in searching all of them, and that's pretty lightweight in practice. Another fun point is that we already have a good way to communicate RNAseq for the limited purpose of transcrpiptome assembly -- diginorm. Abundance-normalized RNAseq is useless for doing expression analysis, and if you normalize a bunch of samples together you can't even figure out what the original tissue was. So, if you're worried about other people having access to your expression levels, you can simply normalize the data all together before handing it over. ## Further thoughts As I said in the first post, this was all nucleated by reading the mash and MetaPalette papers. In my review for MetaPalette, I suggested that they look at mash to see if MinHash signatures could be used to dramatically reduce their database size, and now that I actually understand MinHash a bit more, I think the answer is clearly yes. Which leads to another question - the Mash folk are clearly planning to use MinHash & mash to search assembled genomes, with a side helping of unassembled short and long reads. If we can all agree on an interchange format or three, why couldn't we just start generating public signatures of all the things, mRNAseq and genomic and metagenomic all? I see many, many uses, all somewhat dimly... (Lest anyone think I believe this to be a novel observation, clearly the Mash folk are well ahead of me here -- they undersold it in their paper, so I didn't notice until I re-read it with this in mind, but it's there :). Anyway, it seems like a great idea and we should totally do it. Who's in? What are the use cases? What do we need to do? Where is it going to break? --titus p.s. Thanks to Luiz Irber for some helpful discussion about YAML formats! ## April 14, 2016 ### Matthew Rocklin #### Fast Message Serialization This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project Very high performance isn’t about doing one thing well, it’s about doing nothing poorly. This week I optimized the inter-node communication protocol used by dask.distributed. It was a fun exercise in optimization that involved several different and unexpected components. I separately had to deal with Pickle, NumPy, Tornado, MsgPack, and compression libraries. This blogpost is not advertising any particular functionality, rather it’s a story of the problems I ran into when designing and optimizing a protocol to quickly send both very small and very large numeric data between machines on the Python stack. We care very strongly about both the many small messages case (thousands of 100 byte messages per second) and the very large messages case (100-1000 MB). This spans an interesting range of performance space. We end up with a protocol that costs around 5 microseconds in the small case and operates at 1-1.5 GB/s in the large case. ## Identify a Problem This came about as I was preparing a demo using dask.array on a distributed cluster for a Continuum webinar. I noticed that my computations were taking much longer than expected. The Web UI quickly pointed me to the fact that my machines were spending 10-20 seconds moving 30 MB chunks of numpy array data between them. This is very strange because I was on 100MB/s network, and so I expected these transfers to happen in more like 0.3s than 15s. The Web UI made this glaringly apparent, so my first lesson was how valuable visual profiling tools can be when they make performance issues glaringly obvious. Thanks here goes to the Bokeh developers who helped the development of the Dask real-time Web UI. ## Problem 1: Tornado’s sentinels Dask’s networking is built off of Tornado’s TCP IOStreams. There are two common ways to delineate messages on a socket, sentinel values that signal the end of a message, and prefixing a length before every message. Early on we tried both in Dask but found that prefixing a length before every message was slow. It turns out that this was because TCP sockets try to batch small messages to increase bandwidth. Turning this optimization off ended up being an effective and easy solution, see the TCP_NODELAY parameter. However, before we figured that out we used sentinels for a long time. Unfortunately Tornado does not handle sentinels well for large messages. At the receipt of every new message it reads through all buffered data to see if it can find the sentinel. This makes lots and lots of copies and reads through lots and lots of bytes. This isn’t a problem if your messages are a few kilobytes, as is common in web development, but it’s terrible if your messages are millions or billions of bytes long. Switching back to prefixing messages with lengths and turning off the no-delay optimization moved our bandwidth up from 3MB/s to 20MB/s per node. Thanks goes to Ben Darnell (main Tornado developer) for helping us to track this down. ## Problem 2: Memory Copies A nice machine can copy memory at 5 GB/s. If your network is only 100 MB/s then you can easily suffer several memory copies in your system without caring. This leads to code that looks like the following: socket.send(header + payload)  This code concatenates two bytestrings, header and payload before sending the result down a socket. If we cared deeply about avoiding memory copies then we might instead send these two separately: socket.send(header) socket.send(payload)  But who cares, right? At 5 GB/s copying memory is cheap! Unfortunately this breaks down under either of the following conditions 1. You are sloppy enough to do this multiple times 2. You find yourself on a machine with surprisingly low memory bandwidth, like 10 times slower, as is the case on some EC2 machines. Both of these were true for me but fortunately it’s usually straightforward to reduce the number of copies down to a small number (we got down to three), with moderate effort. ## Problem 3: Unwanted Compression Dask compresses all large messages with LZ4 or Snappy if they’re available. Unfortunately, if your data isn’t very compressible then this is mostly lost time. Doubly unforutnate is that you also have to decompress the data on the recipient side. Decompressing not-very-compressible data was surprisingly slow. Now we compress with the following policy: 1. If the message is less than 10kB, don’t bother 2. Pick out five 10kB samples of the data and compress those. If the result isn’t well compressed then don’t bother compressing the full payload. 3. Compress the full payload, if it doesn’t compress well then just send along the original to spare the receiver’s side from compressing. In this case we use cheap checks to guard against unwanted compression. We also avoid any cost at all for small messages, which we care about deeply. ## Problem 4: Cloudpickle is not as fast as Pickle This was surprising, because cloudpickle mostly defers to Pickle for the easy stuff, like NumPy arrays. In [1]: import numpy as np In [2]: data = np.random.randint(0, 255, dtype='u1', size=10000000) In [3]: import pickle, cloudpickle In [4]: %time len(pickle.dumps(data, protocol=-1)) CPU times: user 8.65 ms, sys: 8.42 ms, total: 17.1 ms Wall time: 16.9 ms Out[4]: 10000161 In [5]: %time len(cloudpickle.dumps(data, protocol=-1)) CPU times: user 20.6 ms, sys: 24.5 ms, total: 45.1 ms Wall time: 44.4 ms Out[5]: 10000161  But it turns out that cloudpickle is using the Python implementation, while pickle itself (or cPickle in Python 2) is using the compiled C implemenation. Fortunately this is easy to correct, and a quick typecheck on common large dataformats in Python (NumPy and Pandas) gets us this speed boost. ## Problem 5: Pickle is still slower than you’d expect Pickle runs at about half the speed of memcopy, which is what you’d expect from a protocol that is mostly just “serialize the dtype, strides, then tack on the data bytes”. There must be an extraneous memory copy in there. See issue 7544 ## Problem 6: MsgPack is bad at large bytestrings Dask serializes most messages with MsgPack, which is ordinarily very fast. Unfortunately the MsgPack spec doesn’t support bytestrings greater than 4GB (which do come up for us) and the Python implementations don’t pass through large bytestrings very efficiently. So we had to handle large bytestrings separately. Any message that contains bytestrings over 1MB in size will have them stripped out and sent along in a separate frame. This both avoids the MsgPack overhead and avoids a memory copy (we can send the bytes directly to the socket). ## Problem 7: Tornado makes a copy Sockets on Windows don’t accept payloads greater than 128kB in size. As a result Tornado chops up large messages into many small ones. On linux this memory copy is extraneous. It can be removed with a bit of logic within Tornado. I might do this in the moderate future. ## Results We serialize small messages in about 5 microseconds (thanks msgpack!) and move large bytes around in the cost of three memory copies (about 1-1.5 GB/s) which is generally faster than most networks in use. Here is a profile of sending and receiving a gigabyte-sized NumPy array of random values through to the same process over localhost (500 MB/s on my machine.)  381360 function calls (381323 primitive calls) in 1.451 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1 0.366 0.366 0.366 0.366 {built-in method dumps} 8 0.289 0.036 0.291 0.036 iostream.py:360(write) 15353 0.228 0.000 0.228 0.000 {method 'join' of 'bytes' objects} 15355 0.166 0.000 0.166 0.000 {method 'recv' of '_socket.socket' objects} 15362 0.156 0.000 0.398 0.000 iostream.py:1510(_merge_prefix) 7759 0.101 0.000 0.101 0.000 {method 'send' of '_socket.socket' objects} 17/14 0.026 0.002 0.686 0.049 gen.py:990(run) 15355 0.021 0.000 0.198 0.000 iostream.py:721(_read_to_buffer) 8 0.018 0.002 0.203 0.025 iostream.py:876(_consume) 91 0.017 0.000 0.335 0.004 iostream.py:827(_handle_write) 89 0.015 0.000 0.217 0.002 iostream.py:585(_read_to_buffer_loop) 122567 0.009 0.000 0.009 0.000 {built-in method len} 15355 0.008 0.000 0.173 0.000 iostream.py:1010(read_from_fd) 38369 0.004 0.000 0.004 0.000 {method 'append' of 'list' objects} 7759 0.004 0.000 0.104 0.000 iostream.py:1023(write_to_fd) 1 0.003 0.003 1.451 1.451 ioloop.py:746(start)  Dominant unwanted costs include the following: 1. 400ms: Pickling the NumPy array 2. 400ms: Bytestring handling within Tornado After this we’re just bound by pushing bytes down a wire. ## Conclusion Writing fast code isn’t about writing any one thing particularly well, it’s about mitigating everything that can get in your way. As you approch peak performance, previously minor flaws suddenly become your dominant bottleneck. Success here depends on frequent profiling and keeping your mind open to unexpected and surprising costs. ## April 13, 2016 ### Titus Brown #### Applying MinHash to cluster RNAseq samples (I gave a talk on this on Monday, April 11th - you can see the slides slides here, on figshare. This is a Reproducible Blog Post. You can regenerate all the figures and play with this software yourself on binder.) So, my latest enthusiasm is MinHash sketches. A few weeks back, I had the luck to be asked to review both the mash paper (preprint here) and the MetaPalette paper (preprint here). The mash paper made me learn about MinHash sketches, while the MetaPalette paper made some very nice points about shared k-mers and species identification. After reading, I got to thinking. I wondered to myself, hey, could I use MinHash signatures to cluster unassembled Illumina RNAseq samples? While the mash folk showed that MinHash could be applied to raw reads nicely, I guessed that the greater dynamic range of gene expression would cause problems - mainly because high-abundance transcripts would yield many, many erroneous k-mers. Conveniently, however, my lab has some not-so-secret sauce for dealing with this problem - would it work, here? I thought it might. Combined with all of this, my former grad student, Dr. Qingpeng Zhang (first author on the not-so-secret sauce, above) has some other still-unpublished work showing that the the first ~1m reads of metagenome samples can be used to cluster samples together. So, I reasoned, perhaps it would work well to stream the first million or so reads from the beginning of RNAseq samples through our error trimming approach, compute a MinHash signature, and then use that signature to identify the species from which the RNAseq was isolated (and perhaps even closely related samples). tl; dr? It seems to work, with some modifications. For everything below, I used a k-mer hash size of 32 and only chose read data sets with reads of length 72 or higher. (Here's a nice presentation on MinHash, via Luiz Irber.) ## MinHash is super easy to implement I implemented MinHash in only a few lines of Python; see the repository at https://github.com/dib-lab/sourmash/. The most relevant code is sourmash_lib.py. Here, I'm using a bottom sketch, and at the moment I'm building some of it on top of khmer, although I will probably remove that requirement soon. After lots of trial and error (some of it reported below), I settled on using a k-mer size of k=32, and a sketch size of 500. (You can go down to a sketch size of 100, but you lose resolution. Lower k-mer sizes have the expected effect of slowly decreasing resolution; odd k-mer sizes effectively halve the sketch size.) ## How fast is it, and how much memory does it use, and how big are the sketches? I haven't bothered benchmarking it, but • everything but the hash function itself is on Python; • on my 3 yro laptop it takes about 5 minutes to add 1m reads; • the memory usage of sourmash itself is negligible - error trimming the reads requires about 1 GB of RAM; • the sketches are tiny - less than a few kb - and the program is dominated by the Python overhead. So it's super fast and super lightweight. ## Do you need to error trim the reads? The figure below shows a dendrogram next to a distance matrix of 8 samples - four mouse samples, untrimmed, and the same four mouse samples, trimmed at low-abundance k-mers. (You can see the trimming command here, using khmer's trim-low-abund command.) The two house mouse samples are replicates, and they always cluster together. However, they are much further apart without trimming. The effect of trimming on the disease mouse samples (which are independent biological samples, I believe) is much less; it rearranges the tree a bit but it's not as convincing as with the trimming. So you seem to get better resolution when you error trim the reads, which is expected. The signal isn't as strong as I thought it'd be, though. Have to think about that; I'm surprised MinHash is that robust to errors! ## Species group together pretty robustly with only 1m reads How many reads do you need to use? If you're looking for species groupings, not that many -- 1m reads is enough to cluster mouse vs yeast separately. (Which is good, right? If that didn't work...) Approximately 1m reads turns out to work equally well for 200 echinoderm (sea urchin and sea star) samples, too. Here, I downloaded all 204 echinoderm HiSeq mRNAseq data sets from SRA, trimmed them as above, and computed the MinHash signatures, and then compared them all to each other. The blocks of similarity are all specific species, and all the species groups cluster properly, and none of them (with one exception) cluster with other species. This is also an impressive demonstration of the speed of MinHash - you can do all 204 samples against each other in about 10 seconds. Most of that time is spent loading my YAML format into memory; the actual comparison takes < 1s! (The whole notebook for making all of these figures takes less than 30 seconds to run, since the signatures are already there; check it out!) ## Species that do group together may actually belong together In the urchin clustering above, there's only one "confused" species grouping where one cluster contains more than one species - that's Patiria miniata and Patiria pectinifera, which are both bat stars. I posted this figure on Facebook and noted the grouping, and Dan Rokhsar pointed out that on Wikipedia, Patiria has been identified as a complex of three closely related species in the Pacific. So that's good - it seems like the only group that has cross-species clustering is, indeed, truly multi-species. ## You can sample any 1m reads and get pretty similar results In theory, FASTQ files from shotgun sequencing are perfectly random, so you should be able to pick any 1m reads you want - including the first 1m. In practice, of course, this is not true. How similar are different subsamples? Answer: quite similar. All seven 1m read subsamples (5 random, one from the middle, one from the end) are above 70% in similarity. ## (Very) widely divergent species don't cross-compare at all If you look at (say) yeast and mouse, there's simply no similarity there at all. 32-mer signatures are apparently very specific. (The graph below is kind of stupid. It's just looking at similarity between mouse and yeast data sets as you walk through the two data streams. It's 0.2% all the way.) ## Species samples get more similar (or stay the same) as you go through the stream What happens when you look at more than 1m reads? Do the streams get more or less similar? If you walk through two streams and update the MinHash signature regularly, you see either constant similarity or a general increase in similarity; in the mouse replicates, it's constant and high, and between disease mouse and house mouse, it grows as you step through the stream. (The inflection points are probably due to how we rearrange the reads during the abundance trimming. More investigation needed.) Yeast replicates also maintain high similarity through the data stream. ## What we're actually doing is mostly picking k-mers from the transcriptome (This is pretty much what we expected, but as my dad always said, "trust but verify.") The next question is, what are we actually seeing signatures of? For example, in the above mouse example, we see growing similarity between two mouse data sets as we step through the data stream. Is this because we're counting more sequencing artifacts as we look at more data, or is this because we're seeing true signal? To investigate, I calculated the MinHash signature of the mouse RNA RefSeq file, and then asked if the streams were getting closer to that as we walked through them. They are: So, it seems like part of what's happening here is that we are looking at the True Signature of the mouse transcriptome. Good to know. And that's it for today, folks. ## What can this be used for? So, it all seems to work pretty well - the mash folk are dead-on right, and this is a pretty awesome and simple way to look at sequences. Right now, my approach above seems like it's most useful for identifying what species some RNAseq is from. If we can do that, then we can start thinking about other uses. If we can't do that pretty robustly, then that's a problem ;). So that's where I started. It might be fun to run against portions of the SRA to identify mislabeled samples. Once we have the SRA digested, we can make that available to people who are looking for more samples from their species of interest; whether this is useful will depend. I'm guessing that it's not immediately useful, since the SRA species identification seem pretty decent. One simple idea is to simply run this on each new sample you get back from a sequencing facility. "Hey, this looks like Drosophila. ...did you intend to sequence Drosophila?" It won't work for identifying low-lying contamination that well, but it could identify mis-labeled samples pretty quickly. Tracy Teal suggested that this could be used in-house in large labs to find out if others in the lab have samples of interest to you. Hmm. More on that idea later. ## Some big remaining questions • Do samples actually cluster by expression similarity? Maybe - more work needed. • Can this be used to compare different metagenomes using raw reads? No, probably not very well. At least, the metagenomes I care about are too diverse; you will probably need a different strategy. I'm thinking about it. ## One last shoutout I pretty much reimplemented parts of mash; there's nothing particularly novel here, other than exploring it in my own code on public data :). So, thanks, mash authors! --titus ## April 12, 2016 ### Continuum Analytics news #### Using Anaconda with PySpark for Distributed Language Processing on a Hadoop Cluster Posted Tuesday, April 12, 2016 ### Overview Working with your favorite Python packages along with distributed PySpark jobs across a Hadoop cluster can be difficult due to tedious manual setup and configuration issues, which is a problem that becomes more painful as the number of nodes in your cluster increases. Anaconda makes it easy to manage packages (including Python, R and Scala) and their dependencies on an existing Hadoop cluster with PySpark, including data processing, machine learning, image processing and natural language processing. ## 1-pyspark-reddit-language.png In a previous post, we’ve demonstrated how you can use libraries in Anaconda to query and visualize 1.7 billion comments on a Hadoop cluster. In this post, we’ll use Anaconda to perform distributed natural language processing with PySpark using a subset of the same data set. We’ll configure different enterprise Hadoop distributions, including Cloudera CDH and Hortonworks HDP, to work interactively on your Hadoop cluster with PySpark, Anaconda and a Jupyter Notebook. ## 2-pyspark-reddit-language.png In the remainder of this post, we'll: 1. Install Anaconda and the Jupyter Notebook on an existing Hadoop cluster. 2. Load the text/language data into HDFS on the cluster. 3. Configure PySpark to work with Anaconda and the Jupyter Notebook with different enterprise Hadoop distributions. 4. Perform distributed natural language processing on the data with the NLTK library from Anaconda. 5. Work locally with a subset of the data using Pandas and Bokeh for data analysis and interactive visualization. ### Provisioning Anaconda on a cluster Because we’re installing Anaconda on an existing Hadoop cluster, we can follow the bare-metal cluster setup instructions in Anaconda for cluster management from a Windows, Mac, or Linux machine. We can install and configure conda on each node of the existing Hadoop cluster with a single command: $ acluster create cluster-hadoop --profile cluster-hadoop

After a few minutes, we’ll have a centrally managed installation of conda across our Hadoop cluster in the default location of /opt/anaconda.

### Installing Anaconda packages on the cluster

Once we’ve provisioned conda on the cluster, we can install the packages from Anaconda that we’ll need for this example to perform language processing, data analysis and visualization:

$acluster conda install nltk pandas bokeh We’ll need to download the NLTK data on each node of the cluster. For convenience, we can do this using the distributed shell functionality in Anaconda for cluster management: $ acluster cmd 'sudo /opt/anaconda/bin/python -m nltk.downloader -d /usr/share/nltk_data all'

In this post, we'll use a subset of the data set that contains comments from the reddit website from January 2015 to August 2015, which is about 242 GB on disk. This data set was made available on July 2015 in a reddit post. The data set is in JSON format (one comment per line) and consists of the comment body, author, subreddit, timestamp of creation and other fields.

Note that we could convert the data into different formats or load it into various query engines; however, since the focus of this blog post is using libraries with Anaconda, we will be working with the raw JSON data in PySpark.

We’ll load the reddit comment data into HDFS from the head node. You can SSH into the head node by running the following command from the client machine:

$acluster ssh The remaining commands in this section will be executed on the head node. If it doesn’t already exist, we’ll need to create a user directory in HDFS and assign the appropriate permissions: $ sudo -u hdfs hadoop fs -mkdir /user/ubuntu

$sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu We can then move the data by running the following command with valid AWS credentials, which will transfer the reddit comment data from the year 2015 (242 GB of JSON data) from a public Amazon S3 bucket into HDFS on the cluster: $ hadoop distcp s3n://AWS_KEY:AWS_SECRET@blaze-data/reddit/json/2015/*.json /user/ubuntu/

Replace AWS_KEY and AWS_SECRET in the above command with valid Amazon AWS credentials.

To use Python from Anaconda along with PySpark, you can set the PYSPARK_PYTHON environment variable on a per-job basis along with the spark-submit command. If you’re using the Anaconda parcel for CDH, you can run a PySpark script (e.g., spark-job.py) using the following command:

$PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit spark-job.py If you’re using Anaconda for cluster management with Cloudera CDH or Hortonworks HDP, you can run the PySpark script using the following command (note the different path to Python): $ PYSPARK_PYTHON=/opt/anaconda/bin/python spark-submit spark-job.py

Using the spark-submit command is a quick and easy way to verify that our PySpark script works in batch mode. However, it can be tedious to work with our analysis in a non-interactive manner as Java and Python logs scroll by.

## 4-pyspark-reddit-language.png

Instead, we can use the Jupyter Notebook on our Hadoop cluster to work interactively with our data via Anaconda and PySpark.

## 3-pyspark-reddit-language.png

Using Anaconda for cluster management, we can install Jupyter Notebook on the head node of the cluster with a single command, then open the notebook interface in our local web browser:

$acluster install notebook $ acluster open notebook

Once we’ve opened a new notebook, we’ll need to configure some environment variables for PySpark to work with Anaconda. The following sections include details on how to configure the environment variables for Anaconda to work with PySpark on Cloudera CDH and Hortonworks HDP.

#### Using the Anaconda Parcel with Cloudera CDH

If you’re using the Anaconda parcel with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and the Anaconda 4.0 parcel.

>>> import os

>>> import sys

>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"

>>> os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

#### Using Anaconda for cluster management with Cloudera CDH

If you’re using Anaconda for cluster management with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and Anaconda for cluster management 1.4.0.

>>> import os

>>> import sys

>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"

>>> os.environ["SPARK_HOME"] = "/opt/anaconda/parcels/CDH/lib/spark"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

#### Using Anaconda for cluster management with Hortonworks HDP

If you’re using Anaconda for cluster management with Hortonworks HDP, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Hortonworks HDP running Spark 1.6.0 and Anaconda for cluster management 1.4.0.

>>> import os

>>> import sys

>>> os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"

>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")

>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

### Initializing the SparkContext

After we’ve configured Anaconda to work with PySpark on our Hadoop cluster, we can initialize a SparkContext that we’ll use for distributed computations. In this example, we’ll be using the YARN resource manager in client mode:

>>> from pyspark import SparkConf

>>> from pyspark import SparkContext

>>> conf = SparkConf()

>>> conf.setMaster('yarn-client')

>>> conf.setAppName('anaconda-pyspark-language')

>>> sc = SparkContext(conf=conf)

Now that we’ve created a SparkContext, we can load the JSON reddit comment data into a Resilient Distributed Dataset (RDD) from PySpark:

>>> lines = sc.textFile("/user/ubuntu/*.json")

Next, we decode the JSON data and decide that we want to filter comments from the movies subreddit:

>>> import json

>>> data = lines.map(json.loads)

>>> movies = data.filter(lambda x: x['subreddit'] == 'movies')

We can then persist the RDD in distributed memory across the cluster so that future computations and queries will be computed quickly from memory. Note that this operation only marks the RDD to be persisted; the data will be persisted in memory after the first computation is triggered:

>>> movies.persist()

We can count the total number of comments in the movies subreddit (about 2.9 million comments):

>>> movies.count()

2905085

We can inspect the first comment in the dataset, which shows fields for the author, comment body, creation time, subreddit, etc.:

>>> movies.take(1)

CPU times: user 8 ms, sys: 0 ns, total: 8 msWall time: 113 ms

[{u'archived': False,

 u'author': u'kylionsfan',

 u'author_flair_css_class': None,

 u'author_flair_text': None,

 u'body': u'Goonies',

 u'controversiality': 0,

 u'created_utc': u'1420070402',

 u'distinguished': None,

 u'downs': 0,

 u'edited': False,

 u'gilded': 0,

 u'id': u'cnas90u',

 u'link_id': u't3_2qyjda',

 u'name': u't1_cnas90u',

 u'parent_id': u't3_2qyjda',

 u'retrieved_on': 1425124282,

 u'score': 1,

 u'score_hidden': False,

 u'subreddit': u'movies',

 u'subreddit_id': u't5_2qh3s',

 u'ups': 1}]

### Distributed Natural Language Processing

Now that we’ve filtered a subset of the data and loaded it into memory across the cluster, we can perform distributed natural language computations using Anaconda with PySpark.

First, we define a parse() function that imports the natural language toolkit (NLTK) from Anaconda and tags words in each comment with their corresponding part of speech. Then, we can map the parse() function to the movies RDD:

>>> def parse(record):

...    import nltk

...    tokens = nltk.word_tokenize(record["body"])

...    record["n_words"] = len(tokens)

...    record["pos"] = nltk.pos_tag(tokens)

...    return record

>>> movies2 = movies.map(parse)

Let’s take a look at the body of one of the comments:

>>> movies2.take(10)[6]['body']

u'Dawn of the Apes was such an incredible movie, it should be up there in my opinion.'

And the same comment with tagged parts of speech (e.g., nouns, verbs, prepositions):

>>> movies2.take(10)[6]['pos']

[(u'Dawn', 'NN'),

(u'of', 'IN'),

(u'the', 'DT'),

(u'Apes', 'NNP'),

(u'was', 'VBD'),

(u'such', 'JJ'),

(u'an', 'DT'),

(u'incredible', 'JJ'),

(u'movie', 'NN'),

(u',', ','),

(u'it', 'PRP'),

(u'should', 'MD'),

(u'be', 'VB'),

(u'up', 'RP'),

(u'there', 'RB'),

(u'in', 'IN'),

(u'my', 'PRP$'), (u'opinion', 'NN'), (u'.', '.')]  We can define a get_NN() function that extracts nouns from the records, filters stopwords, and removes non-words from the data set: >>> def get_NN(record): ... import re ... from nltk.corpus import stopwords ... all_pos = record["pos"] ... ret = [] ... for pos in all_pos: ... if pos[1] == "NN" \ ... and pos[0] not in stopwords.words('english') \ ... and re.search("^[0-9a-zA-Z]+$", pos[0]) is not None:

...            ret.append(pos[0])

...    return ret

>>> nouns = movies2.flatMap(get_NN)

We can then generate word counts for the nouns that we extracted from the dataset:

>>> counts = nouns.map(lambda word: (word, 1))

After we’ve done the heavy lifting, processing, filtering and cleaning on the text data using Anaconda and PySpark, we can collect the reduced word count results onto the head node.

>>> top_nouns = counts.countByKey()

>>> top_nouns = dict(top_nouns)

In the next section, we’ll continue our analysis on the head node of the cluster while working with familiar libraries in Anaconda, all in the same interactive Jupyter notebook.

### Local analysis with Pandas and Bokeh

Now that we’ve done the heavy lifting using Anaconda and PySpark across the cluster, we can work with the results as a dataframe in Pandas, where we can query and inspect the data as usual:

>>> import pandas as pd

>>> df = pd.DataFrame(top_nouns.items(), columns=['Noun', 'Count'])

Let’s sort the resulting word counts, and view the top 10 nouns by frequency:

>>> df = df.sort_values('Count', ascending=False)

>>> df_top_10 = df.head(10)

>>> df_top_10

 Noun Count movie 539698 film 220366 time 157595 way 112752 gt 105313 http 92619 something 87835 lot 85573 scene 82229 thing 82101

Let’s generate a bar chart of the top 10 nouns using Pandas:

>>> %matplotlib inline

>>> df_top_10.plot(kind='bar', x=df_top_10['Noun'])

 5-pyspark-reddit-language.png 
 

Finally, we can use Bokeh to generate an interactive plot of the data:

>>> from bokeh.charts import Bar, show

>>> from bokeh.io import output_notebook

>>> from bokeh.charts.attributes import cat

>>> output_notebook()

>>> p = Bar(df_top_10,

...         label=cat(columns='Noun', sort=False),

...         values='Count',

...         title='Top N nouns in r/movies subreddit')

>>> show(p)

 6-pyspark-reddit-language.png 
 

### Conclusion

In this post, we used Anaconda with PySpark to perform distributed natural language processing and computations on data stored in HDFS. We configured Anaconda and the Jupyter Notebook to work with PySpark on various enterprise Hadoop distributions (including Cloudera CDH and Hortonworks HDP), which allowed us to work interactively with Anaconda and the Hadoop cluster. This made it convenient to work with Anaconda for the distributed processing with PySpark, while reducing the data to a size that we could work with on a single machine, all in the same interactive notebook environment. The complete notebook for this example with Anaconda, PySpark, and NLTK can be viewed on Anaconda Cloud.

You can get started with Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:

$conda install anaconda-client $ anaconda login

$conda install anaconda-cluster -c anaconda-cluster If you’d like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at sales@continuum.io. The enterprise features of Anaconda, including the cluster management functionality and on-premises repository, are certified for use with Cloudera CDH 5. If you’re running into memory errors, performance issues (related to JVM overhead or Python/Java serialization), problems translating your existing Python code to PySpark, or other limitations with PySpark, stay tuned for a future post about a parallel processing framework in pure Python that works with libraries in Anaconda and your existing Hadoop cluster, including HDFS and YARN. ### Matthieu Brucher #### Analog modeling of a diode clipper (3a): Simulation Now that we have a few methods, let’s try to simulate them. For both circuits, I’ll use the forward Euler, then backward Euler and trapezoidal approximations, then I will show the results of changing the start estimate and then finish by the Newton Raphson optimization. I haven’t checked (yet?) algorithms that don’t use the derivative like the bisection or Brent algorithm. All graphs are done with a x4 oversampling (although I also tried x8, x16 and x32). # First diode clipper Let’s start with the original equation: $V_i - 2 R_1 I_s sinh(V_o/nV_t) - \int \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t}) - V_o = 0$ ## Forward Euler Let’s now figure out what to do with the integral by deriving the equation: $\frac{dV_o}{dt} = \frac{\frac{dV_i}{dt} - \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t})}{1 + \frac{2 I_s R_1}{nV_t} cosh(\frac{V_o}{nV_t})}$ So now we have the standard form that can used the usual way. For the derivative of the input, I’ll always use the trapezoidal approximation, and then for the output one, I’ll use the forward Euler which leads to the “simple” equation: $V_{on+1} = V_{on} + \frac{V_{in+1} - V_{in} - \frac{4 h I_s}{C_1} sinh(\frac{V_{on}}{nV_t})}{1 + \frac{2 I_s R_1}{nV_t} cosh(\frac{V_{on}}{nV_t})}$ ## Backward Euler For the backward Euler, I’ll start from the integral equation again and remove the time dependency: $V_{in+1} - V_{in} - 2 R_1 I_s (sinh(\frac{V_{on+1}}{nV_t}) - sinh(\frac{V_{on}}{nV_t})) - \int^{t_{n+1}}_{t_n} \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t}) - V_{on+1} + V_{on} = 0$ Now the discretization becomes: $V_{in+1} - V_{in} - 2 R_1 I_s (sinh(\frac{V_{on+1}}{nV_t}) - sinh(\frac{V_{on}}{nV_t})) - \frac{2 h I_s}{C_1} sinh(\frac{V_{on+1}}{nV_t}) - V_{on+1} + V_{on} = 0$ I didn’t use this equation for the Backward Euler because I would have had a dependency in the sinh term, so I would still have required the numerical methods to solve the equation. ## Trapezoidal rule Here, we just need to change the discretization for a trapezoidal one: $V_{in+1} - V_{in} - I_s sinh(\frac{V_{on+1}}{nV_t}) (\frac{h}{C_1} + 2 R_1) - I_s sinh(\frac{V_{on}}{nV_t}) (\frac{h}{C_1} - 2 R_1) - V_{on+1} + V_{on} = 0$ ## Starting estimates Starting from the different rules, we need to replace sinh(x): • for the pivotal by $\frac{x}{x_0} sinh(x_0)$ • for the tangent rule by $\frac{x}{nV_t} cosh(\frac{x_0}{nV_t}) + y_0 - \frac{x_0}{nV_t} cosh(\frac{x_0}{nV_t})$ ## Graphs Let’s see now how all these optimizers compare (with an estimate of the next element being the last optimized value): Numerical optimization comparison Obviously, the Forward Euler method is definitely not good. Although is on average 4 times lower, the accuracy is definitely not good enough. On the other end, the other two methods give similar results (probably because I try to achieve a convergence quite strong, with less that 10e-8 difference between two iterations). Now, how does the original estimate impact the results? I tried the Backward Euler to start, and the results are identical: Original estimates comparison To have a better picture, let’s turn down the number of iterations to 1 for all the estimates: One step comparison So all the estimates give a similar result. By comparing the number of iterations with the three estimates, the pivotal method gives the worst results, whereas the affine estimates lowers the number of iterations by one. Of course, there is a price to pay in the computation. So the obvious choice is to use trapezoidal approximation with affine starting point estimate, which is not my default choice in SimpleOverdriveFilter. # To be continued The post is getting longer than I thought, so let’s keep it there for now and the next post on the subject will tackle the other diode clipper circuit. ## April 11, 2016 ### Continuum Analytics news #### Data Science with Python at ODSC East Posted Tuesday, April 12, 2016 By, Sheamus McGovern, Open Data Science Conference Chair At ODSC East, the most influential minds and institutions in data science will convene at the Boston Convention & Exhibition Center from May 20th to the 22nd to discuss and teach the newest and most exciting developments in data science. As you know, the Python ecosystem is now one of the most important data science development environments available today. This is due, in large part, to the existence of a rich suite of user-facing data analysis libraries. Powerful Python machine learning libraries like Scikit-learn, XGBoost and others bring sophisticated predictive analytics to the masses. The NLTK and Gensim libraries enable deep analysis of textual information in Python and the Topik library provides a high-level interface to these and other, natural language libraries, adding a new layer of usability. The Pandas library has brought data analysis in Python to a new level by providing expressive data structures for quick and intuitive data manipulation and analysis. The notebook ecosystem in Python has also flourished with the development of the Jupyter, Rodeo and Beaker notebooks. The notebook interface is an increasingly popular way for data scientists to perform complex analyses that serve the purpose of conveying and sharing analyses and their results to colleagues and to stakeholders. Python is also host to a number of rich web-development frameworks that are used not only for building data science dash boards, but also for full-scale data science powered web-apps. Flask and Django lead the way in terms of the Python web-app development landscape, but Bottle and Pyramid are also quite popular. With Cython, code can approach speeds akin to that of C or C++ and new developments, like the Dask package, to make computing on larger-than-memory datasets very easy. Visualization libraries, like Plot.ly and Bokeh, have brought rich, interactive and impactful data visualization tools to the fingertips of data analysts everywhere. Anaconda has streamlined the use of many of these wildly popular open source data science packages by providing an easy way to install, manage and use Python libraries. With Anaconda, users no longer need to worry about tedious incompatibilities and library management across their development environments. Several of the most influential Python developers and data scientists will be talking and teaching at ODSC East. Indeed, Peter Wang will be speaking ODSC East. Peter is the co-founder and CTO at Continuum Analytics, as well as the mastermind behind the popular Bokeh visualization library, the Blaze ecosystem, which simplifies the the analysis of Big Data with Python and Anaconda. At ODSC East, there will be over 100 speakers, 20 workshops and 10 training sessions spanning seven conferences that focused on Open Data Science, Disruptive Data Science, Big Data science, Data Visualization, Data Science for Good, Open Data and a Careers and Training conference. See below for very small sampling of some of the powerful Python workshops and speakers we will have at ODSC East. ●Bayesian Statistics Made Simple - Allen Downey, Think Python ●Intro to Scikit learn for Machine Learning - Andreas Mueller, NYU Center for Data Science ●Parallelizing Data Science in Python with Dask - Matthew Rocklin, Continuum Analytics ●Interactive Viz of a Billion Points with Bokeh Datashader – Peter Wang, Continuum Analytics ## April 07, 2016 ### Titus Brown #### Bashing on monstrous sequencing collections So, there's this fairly large collection of about 700 RNAseq samples, from 300 species in 40 or so phyla. It's called the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP), and was funded by the Moore Foundation as a truly field-wide collaboration to improve our reference collection for genes (and more). Back When, it was sequenced and assembled by the National Center for Genome Resources, and published in PLOS Biology (Keeling et al., 2014). Partly because we think assembly has improved in the last few years, partly as an educational exercise, partly as an infrastructure exercise, partly as a demo, and partly just because we can, Lisa Cohen in my lab is starting to reassemble all of the data - starting with about 10%. She has some of the basic evaluations (mostly via transrate) posted, and before we pull the trigger on the rest of the assemblies, we're pausing to reflect and to think about what metrics to use, and what kinds of resources we plan to produce. (We are not lacking in ideas, but we might be lacking in good ideas, if you know what I mean.) In particular, this exercise raises some interesting questions that we hope to dig into: • what does a good transcriptome look like, and how could having 700 assemblies help us figure that out? (hint: distributions) • what is a good canonical set of analyses for characterizing transcriptome assemblies? • what products should we be making available for each assembly? • what kind of data formatting makes it easiest for other bioinformaticians to build off of the compute we're doing? • how should we distribute the workflow components? (Lisa really likes shell scripts but I've been lobbying for something more structured. 'make' doesn't really fit the bill here, though.) • how do we "alert" the community if and when we come up with better assemblies? How do we merge assemblies between programs and efforts, and properly credit everyone involved? Anyway, feedback welcome, here or on Lisa's post! We are happy to share methods, data, analyses, results, etc. etc. --titus p.s. Yes, that's right. I ask new grad students to start by assemblying 700 transcriptomes. So? :) ### Martin Fitzpatrick #### Why are numpy calculations not affected by the global interpreter lock? Many numpy calculations are unaffected by the GIL, but not all. While in code that does not require the Python interpreter (e.g. C libraries) it is possible to specifically release the GIL - allowing other code that depends on the interpreter to continue running. In the Numpy C codebase the macros NPY_BEGIN_THREADS and NPY_END_THREADS are used to delimit blocks of code that permit GIL release. You can see these in this search of the numpy source. The NumPy C API documentation has more information on threading support. Note the additional macros NPY_BEGIN_THREADS_DESCR, NPY_END_THREADS_DESCR and NPY_BEGIN_THREADS_THRESHOLDED which handle conditional GIL release, dependent on array dtypes and the size of loops. Most core functions release the GIL - for example Universal Functions (ufunc) do so as described: as long as no object arrays are involved, the Python Global Interpreter Lock (GIL) is released prior to calling the loops. It is re-acquired if necessary to handle error conditions. With regard to your own code, the source code for NumPy is available. Check the functions you use (and the functions they call) for the above macros. Note also that the performance benefit is heavily dependent on how long the GIL is released - if your code is constantly dropping in/out of Python you won’t see much of an improvement. The other option is to just test it. However, bear in mind that functions using the conditional GIL macros may exhibit different behaviour with small and large arrays. A test with a small dataset may therefore not be an accurate representation of performance for a larger task. There is some additional information on parallel processing with numpy available on the official wiki and a useful post about the Python GIL in general over on Programmers.SE. ## April 06, 2016 ### Continuum Analytics news #### Anaconda 4.0 Release Posted Wednesday, April 6, 2016 We are happy to announce that Anaconda 4.0 has been released, which includes the new Anaconda Navigator. Did you notice we skipped from release 2.5 to 4.0? Sharp eyes! The team decided to move up to 4.0 release number to reduce confusion with common Python versions. Anaconda Navigator is a desktop graphical user interface included in Anaconda that allows you to launch applications and easily manage conda packages, environments and channels without the need to use command line commands. It is available for Windows, OS X and Linux. For those familiar with the Anaconda Launcher, Anaconda Navigator has replaced Launcher. If you are already using Anaconda Cloud to host private packages, you can access them easily by signing in with your Anaconda Cloud account. ## navigator-home.png There are four main components in Anaconda Navigator, each one can be selected by clicking the corresponding tab on the left-hand column: • Home where you can install, upgrade, and launch applications • Environments allows you to manage channels, environments and packages. • Learning shows a long list of learning resources in several categories: webinars, documentation, videos and training. • Community where you can connect to other users through events, forums and social media. If you already have Anaconda installed, update to Anaconda 4.0 by using conda: conda update conda conda install anaconda=4.0 The full list of changes, fixes and updates for Anaconda v4.0 can be found in the changelog. We’d very much appreciate your feedback on the latest release, especially the new Anaconda Navigator. Please submit comments or issues through our anaconda-issues GitHub repo. ## April 05, 2016 ### Enthought #### Just Released: PyXLL v 3.0 (Python in Excel). New Real Time Data Stream Capabilities, Excel Ribbon Integration, and More. Download a free 30 day trial of PyXLL and try it with your own data. Since PyXLL was first released back in 2010 it has grown hugely in popularity and is used by businesses in many different sectors. The original motivation for PyXLL was to be able to use all the best bits of Excel […] ### Matthieu Brucher #### Analog modeling of a diode clipper (2): Discretization Let’s start with the two equations we got from the last post and see what we can do with usual/academic tools to solve them (I will tackle nodal and ZDF tools later in this series). # Euler and trapezoidal approximation The usual tools start with a specific form:$$\dot{y} = f(y)$

I’ll work with the second clipper whose equation is of this form:

$\frac{dV_o}{dt} = \frac{V_i - V_o}{R_1 C_1} - \frac{2 I_s}{C_1} sinh(\frac{V_o}{nV_t}) = f(V_o)$

## Forward Euler

The simplest way of computing the derivative term is to use the following rule, with h, the inverse of the sampling frequency:

$V_{on+1} = V_{on} + h f(V_{on})$

The nice thing about this rule is that it is easy to compute. The main drawback is that the result may not be accurate and stable enough (let’s keep this for the next post, with actual derivations).

## Backward Euler

Instead of using the past to compute the new sample, we can use the future, which leads to

$V_{on+1} = V_{on} + h f(V_{on+1})$

As the result is present on both sides, solving the problem is not simple. In this case, we can even say that the equation has no close form solution (due to the sinh term), and thus no analytical solution. The only way is to use numerical methods like Brent or Newton Raphson to solve the equation.

## Trapezoidal approximation

Another solution is to combine both solution to have a better approximation of the derivative term:

$V_{on+1} = V_{on} + \frac{h}{2}(f(V_{on}) + f(V_{on+1}))$

We still need the numerical methods to solve the clipper equation with this method, but like the Backward Euler method, this one is said to be A-stable, which is a mandatory condition when solving stiff systems (or systems that have a bad condition number). For a one variable system, the condition number is of course 1…

## Other approximations

There are different other ways of approximating this derivative term. The most used one is the trapezoidal methods, but there are others like all the linear multistep methods (that actually encompass the first three).

# Numerical methods

Let’s try to analyse a few numerical methods. If we used trapezoidal approximation, then the following function needs to be considered:

$g(V_{on+1}) = V_{on+1} - V_{on} - \frac{h}{2}(f(V_{on}) + f(V_{on+1}))$

The goal is to find a value where the function is zero, called a root of the function.

## Bisection method

This method is simple to implement, from two starting points, on either side of the root. Then, we take the middle of the interval, check the sign of it and keep the original point that has a different sign, and keep on until we get close enough of the root.

What is interesting with this method is that it can be vectorized easily by checking several values in the interval instead of just one.

## Newton-Raphson

This numerical method requires the derivative function of g(). Then we can start from the original starting point and iterate this series:

$x_{n+1} = x_n - \frac{g(x_n)}{g'(x_n)}$

For those who are used to optimize cost function and know about the Newton method, it is exactly the same as this one. To optimize a cost function, we need to find a zero of the derivative function. So if g() is this derivative function, then we end up on the Newton method to minimize (or maximize) a cost function.

That being said, if the rate of convergence is quadratic for the Newton-Raphson method, it may not converge. The only way to achieve convergence is to be close enough to the root we are looking for and have some conditions that are usually quite complex to check (see the wikipedia page).

## Starting point

The are several ways of starting the methods.

The first one is any enough: just use the result of the last optimization.

The second one is a little bit more complex: approximate all the complex functions (like sinh) by their tangent and solve the resulting polynomial.

The third one is derived from the second one and called pivotal method/mystran method. Instead of using the tangent, we use the linear function that crosses the origin and the last point. The idea here is that it can be more stable that the tangent method (consider doing this for the hyperbolic tangent, the resulting result could be quite far).

# Conclusion

Of course, there are other numerical methods that I haven’t spoken about. Any can be tried and used. Please do so and report your results!

Let’s see how the ones I’ve shown behave in the next post.

## April 04, 2016

### Continuum Analytics news

#### Anaconda Powers TaxBrain to Transform Washington Through Transparency, Access and Collaboration

Posted Monday, April 4, 2016

Continuum Analytics and Open Source Policy Center Leverage the Power of Open Source to Build Vital Policy Forecasting Models with Anaconda

AUSTIN, TX—April 4, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading open source analytics platform powered by Python, today announced that Anaconda is powering the American Enterprise Institute’s (AEI) Open Source Policy Center (OSPC) TaxBrain initiative. TaxBrain is a web-based application that lets users simulate and study the effect of tax policy reforms using open source economic models. TaxBrain provides transparent analysis for policy makers and the public, ultimately creating a more democratic and scientific platform to analyze economic policy.

“OSPC’s mission is to empower the public to contribute to government policymaking through open source methods and technology, making policy analysis more transparent, trustworthy and collaborative,” said Matt Jensen, founder and managing director of the Open Source Policy Center at the American Enterprise Institute. “TaxBrain is OSPC’s first product, and with Anaconda it is already improving tax policy by making the policy analysis process more democratic and scientific. By leveraging the power of open source, we are able to provide policy makers, journalists and the general public with the information they need to impact and change policy for the better.”

TaxBrain is made possible by a community of economists, data scientists, software developers, and policy experts who are motivated by a shared belief that public policy should be guided by open scientific inquiry, rather than proprietary analysis. The community also believes that the analysis of public policy should be freely available to everyone, rather than just to a select group of those in power.

“The TaxBrain initiative is only the beginning of a much larger movement to use open source approaches in policy and government,” said Travis Oliphant, CEO and co-founder of Continuum Analytics. “TaxBrain is the perfect example of how Anaconda can inspire people to harness the power of data science to enable positive changes. With Anaconda, the OSPC is able to empower a growing community with the superpowers necessary to promote change and democratic policy reform.”

Anaconda has allowed TaxBrain to tap a vast network of outside contributors in the PyData community to accelerate the total number of open source economic models. Contributions from the PyData community come quickly and—because of Anaconda—get large performance gains from Numba, the Python compiler included in Anaconda, and are easy to integrate into TaxBrain. These contributions can be used in other applications and are hosted on the Anaconda Cloud (anaconda.org/ospc).

AEI is a nonprofit, nonpartisan public policy research organization that works to expand liberty, increase individual opportunity, and strengthen free enterprise.

Continuum Analytics is the creator and driving force behind Anaconda, the leading, modern open source analytics platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 2.25M downloads annually and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team––data scientists, developers, DevOps, data engineers and business analysts––to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

###

## April 02, 2016

### NeuralEnsemble

#### EU Human Brain Project Releases Platforms to the Public

"Geneva, 30 March 2016 — The Human Brain Project (HBP) is pleased to announce the release of initial versions of its six Information and Communications Technology (ICT) Platforms to users outside the Project. These Platforms are designed to help the scientific community to accelerate progress in neuroscience, medicine, and computing.

[...]

The six HBP Platforms are:
• The Neuroinformatics Platform: registration, search, analysis of neuroscience data.
• The Brain Simulation Platform: reconstruction and simulation of the brain.
• The High Performance Computing Platform: computing and storage facilities to run complex simulations and analyse large data sets.
• The Medical Informatics Platform: searching of real patient data to understand similarities and differences among brain diseases.
• The Neuromorphic Computing Platform: access to computer systems that emulate brain microcircuits and apply principles similar to the way the brain learns.
• The Neurorobotics Platform: testing of virtual models of the brain by connecting them to simulated robot bodies and environments.
All the Platforms can be accessed via the HBP Collaboratory, a web portal where users can also find guidelines, tutorials and information on training seminars. Please note that users will need to register to access the Platforms and that some of the Platform resources have capacity limits."

... More in the official press release here.

The HBP held an online release event on 30 March:

Prof. Felix Schürmann (EPFL-BBP, Geneva), Dr. Eilif Muller (EPFL-BBP, Geneva), and Prof. Idan Segev (HUJI, Jerusalem) present an overview of the mission, tools, capabilities and science of the EU Human Brain Project (HBP) Brain Simulation Platform:

A publicly accessible forum for the BSP is here:
https://forum.humanbrainproject.eu/c/bsp
and for community models
https://forum.humanbrainproject.eu/c/community-models
and for community models of hippocampus in particular
https://forum.humanbrainproject.eu/c/community-models/hippocampus

## April 01, 2016

### Enthought

#### The Latest Features in Virtual Core: CT Scan, Photo, and Well Log Co-visualization

Enthought is pleased to announce Virtual Core 1.8.  Virtual Core automates aspects of core description for geologists, drastically reducing the time and effort required for core description, and its unified visualization interface displays cleansed whole-core CT data alongside core photographs and well logs.  It provides tools for geoscientists to analyze core data and extract features from […]

#### Canopy Geoscience: Python-Based Analysis Environment for Geoscience Data

Today we officially release Canopy Geoscience 0.10.0, our Python-based analysis environment for geoscience data. Canopy Geoscience integrates data I/O, visualization, and programming, in an easy-to-use environment. Canopy Geoscience is tightly integrated with Enthought Canopy’s Python distribution, giving you access to hundreds of high-performance scientific libraries to extract information from your data. The Canopy Geoscience environment […]

## March 31, 2016

### Continuum Analytics news

#### Why Every CEO Needs To Understand Data Science

Posted Thursday, March 31, 2016

Tech culture has perpetuated the myth that data science is a sort of magic; something that only those with exceptional math skills, deep technical know-how and industry knowledge can understand or act on. While it’s true that math skills and technical knowledge are required to effectively extract insights from data, it’s far from magic. Given a little time and effort, anyone can become familiar with the basic concepts.

As a CEO, you don’t need to understand every technical detail, but it’s very important to have a good grasp of the entire process behind extracting useful insights from your data. Click on the full article below to read the five big-picture steps you must take to ensure you understand data science (and to ensure your company is gaining actionable insights throughout the process).