## April 17, 2018

### Continuum Analytics

I had the great honor and pleasure of presenting the first tutorial at AnacondaCon 2018, on machine learning with scikit-learn. I spoke to a full room of about 120 enthusiastic data scientists and aspiring data scientists. I would like to thank my colleagues at Anaconda, Inc. who did such a wonderful job of organizing this …

### Matthieu Brucher

#### Book review: C++17 Quick Syntax Reference: A Pocket Guide to the Language, APIs and Library

I work on a day-to-day basis on a big project that has many developers with different levels of C++ experience. Scott Meyers wrote a wonderful book on modern C++ (that I still need to review one day, especially since there is a new Effective Modern C++), but it is not for beginners. So I’m looking for that rare book with modern C++ and an explanation of good practices.

#### Discussion

Let’s cut to the chase right away. It’s not this book. This book is bad. Very bad. At its core, it’s supposed to be about the syntax, but even in a book about syntax, you can still teach the good approach, can’t you?

A few examples. Templates are tackled in one of the last chapters, and so are classes. Then, almost from the beginning, the book tells people to use using namespace std. Is there anything more to add?

Yes, there is. New and delete are tackled, then the array version comes much later, and I’m not even talking about smart pointers. They are addressed, but so late that people will think it is still fine to start without them. Yes, talk about new/delete, but RIGHT AWAY, say that they should use std::unique_ptr, std::shared_ptr and the make_* versions. It’s supposed to be about C++17, and in C++17, we avoid new/delete. OK, it is mentioned, but in 2 lines after several chapters of bad practices.

Range-based for loops are introduced badly as well: for(auto& i: l) std::cout << i << std::endl; Why? Why the &? Why can’t you explain the purpose of this instead of waiting additional chapters and not even talking about when you use pass by value, pass by ref or pass by const ref?

I’m still trying to figure out why, if it is supposed to be a syntax book, the author still tackles smart pointers. And tuples. Why not the rest of the standard library?

A good C++ book should start by presenting templates as soon as possible, the standard library and the good practices. Yes, it’s tough, but that’s why not everyone should write a C++ book.

#### Conclusion

The book is supposed to be about the syntax. But it lacks the good practices, with no reference to the C++ core guidelines. In the end, you still need to read another book to learn Modern C++ (hint…).

## April 16, 2018

### Continuum Analytics

#### AnacondaCON 2018 Recap: An Exploration of Modern Data Science

Last year’s inaugural AnacondaCON was a major milestone for our company. Our goal was to create a conference that highlights all the different ways people are using data science and predictive analytics, and reflects the passionate and eclectic nature of our growing Python community. When over 400 people descended upon Austin to connect with peers …

## April 12, 2018

### Continuum Analytics

#### What You Missed on Day Three of AnacondaCON 2018

And that’s a wrap! Yesterday was the third and final day of AnacondaCON 2018, and what a ride it’s been. Read some highlights from what you missed, and stay tuned for our comprehensive AnacondaCON 2018 recap, coming soon! Improving Your Anaconda Distribution User Experience Anaconda Product Manager Crystal Soja presented a roadmap of upcoming plans …

### Randy Olson

#### Traveling salesman portrait in Python

Last week, Antonio S. Chinchón made an interesting post showing how to create a traveling salesman portrait in R. Essentially, the idea is to sample a bunch of dark pixels in an image, solve the well-known traveling salesman problem for

## April 11, 2018

### Continuum Analytics

#### What You Missed on Day Two of AnacondaCON 2018

What a day! On Tuesday we got started bright and early, then partied our way into the night. Here are some highlights from Day Two of AnacondaCON 2018. Opening Keynote: John Kim John Kim, President of HomeAway, kicked things off for us with a personal, touching keynote on Love in the Age of Machine Learning. …

## April 10, 2018

### Matthieu Brucher

#### Book review: LLVM Cookbook

After the book on LLVM core libraries, I want to have a look at the cookbook.

#### Discussion

The idea was that once I had a broad view of LLVM, I could try to apply some recipes for what I wanted to do. Let’s just say that I was deeply mistaken.

First, the two authors have a very different way of writing code. One of them is… rubbish. I don’t think there is another way of saying this, but this is C++, and the guy writes C++ code as if it was C code, no class, with static states, without the override keyword. If such a guy is a professional developer, I’m sorry but I’m very scared about anything he would write professionally.

The second guy is better (he uses override, for instance, so it’s very disturbing to see both styles in the same book). It’s just too bad that the code he writes seems to be just showing things existing in LLVM, but no real recipes (OK, I’m exaggerating, there are a few such examples, but the majority is “execute that command to see how LLVM does this”, and just doing “this” doesn’t have any relevance in the big picture).

I suppose the only relevant and interesting parts are the first few recipes that are focused on reusing LLVM parts for a custom language. The rest is basically explanations of the later stages in a compiler. Basically what you would get from my previous review, without the explanations…

#### Conclusion

Have you ever read a recipe book that will explain how to prepare your kitchen for cooking instead of actually cooking recipes? This book is like that. You might learn how to use LLVM commands, but not LLVM libraries. Avoid.

## April 09, 2018

### Continuum Analytics

#### What You Missed on Day One of AnacondaCON 2018

And we’re off! Day One of AnacondaCON 2018 is officially in the books, y’all. For those of you who couldn’t make the trek to Texas, here are some highlights from what you missed today. “Why are they shooting at us?” “They’re the IT team!” The festivities kicked off this morning with a movie trailer for deep learning …

#### Introducing the Anaconda Data Science Certification Program

There is strong demand today for data science skills across all sectors of the economy. Organizations worldwide are actively looking to recruit qualified data scientists and improve the skills of their existing teams. Individuals are looking to stand out from the competition and differentiate themselves in a growing marketplace. As the creators of the world’s …

#### Anaconda Debuts Data Science Certification Program

Certification to Standardize Data Science Skill Set among Employers and Professionals AnacondaCON, Austin, TX—April 9, 2018 — Anaconda, the most popular Python data science platform provider, today introduced the Anaconda Data Science Certification, giving data scientists a way to verify their proficiency and organizations an independent standard for qualifying current and prospective data science experts. “The …

## April 03, 2018

### Matthieu Brucher

#### Book review: Getting Started with LLVM Core Libraries

LLVM has always intrigued me. Actually, I always thought about one day writing a compiler. But it was more a challenge than a requirement for any of my work, private or professional, so I never dived into it. The design of LLVM was also very well thought out, and probably close to something I would have liked to create.

So now the easiest is just to use LLVM for the different goals I want to achieve. I recently had to write clang-tidy rules, and I also want to perhaps create a JIT for Audio Toolkit and the modeling libraries. So lots of reasons to look at LLVM.

#### Discussion

The book more or less goes from C/C++ parsing to code generation.

Of course, the first chapters are about setting everything up. The book mainly uses Makefiles, which are not an option anymore in current LLVM versions. But it does provide the equivalent CMake version, so it is fine. Also the structure of the projects has not changed, so everything still works. Of course, lots of projects have also matured over time (lld, libcxx…), so when you read that something is not yet production ready, check online (if you can find the information, I have to say that LLVM communication is very bad, just look at release notes to get an idea!).

The third chapter tackles LLVM design. That’s what I liked with LLVM, the modular design, but it can also be scary because you can build more or less anything, and the APIs do evolve with time. But the chapter does reassure me, and helps in understanding the philosophy.

Then, in the fourth chapter, we start working through the clang pipeline by starting with all the steps between the C/C++ code and LLVM Intermediate Representation. The AST and interaction with it are very well presented with the different stages required to generate the IR. The missing bit may be explaining why the AST is so important to have, why LLVM people had to create a new intermediate representation for this front-end.

The fifth chapter is about everything we can do on the IR. I left the chapter still hungry for more. OK, the IR phases can evolve the graph, but it feels like not enough here. How does the matching actually work? This is where you can see that the book is for beginners and not for intermediate or advanced users. Also it made me realize that there is no way I can generate IR directly for my projects, I would go from a C++ AST to IR to the JIT…

After working on the IR, of course, we get to code generation and the different tools in LLVM to generate either byte code or machine code and everything in-between. Lots of time is devoted to explaining that this phase is very costly, as we go from something quite generic to something definitely not generic, and this part was very instructive.

The seventh chapter was strange. It spent lots of time talking about a part of LLVM that was about to be removed from LLVM, the “old” JIT framework. I suppose at the time the new one was too new and some people still had to understand the old one. I still felt it was a little bit a waste of space.

Cross-compilation is tackled after that, and more precisely the fact that you may not need to do anything. This is also where one can see the limits of LLVM. To get the proper backends, you need to get the gcc toolchain. I think this is still something people do today. Even for clang 6, I actually compiled it against a gcc 7 set so that I don’t have to rebuild all the C++ third-party libraries. Also the ARM backend seemed to be broken for a long time, so that’s also not very great for trust!

The last two chapters tackle tools made with clang. The first one is the static analyzer, and I have to say that I didn’t even know it existed. There are tools with it that allow you to generate HTML reports, and I liked that. But when I tried to use them with CMake, they just broke (scan-build). There is a chapter about libclang and clang-tidy, which is probably my reference now. Something that wasn’t done in 2014 is that the static analyzer rules are now integrated inside clang-tidy, it’s just that it can’t build HTML reports out of them. Is it really mandatory? It gives a better view of static code issues (whereas the other rules are geared towards sugar-coating).

The book ends very quickly in a small paragraph at the end of the libclang chapter. Very disturbing.

#### Conclusion

Despite the age of the book and the changes that went inside LLVM (clang-modernize is now part of clang-tidy, DragonEgg is… I don’t know where it went), the book seems to stay very much current (clang is still the main front-end). I would have liked more examples on clang AST matchers, but I suppose it requires a full cookbook, and the audience may not be that big. Still, I’m looking forward to using the different bits to write a JIT and C++ output for electronic modeling/SPICE.

## March 30, 2018

### Continuum Analytics

#### Improved Security & Performance in Anaconda Distribution 5

We announced the release of Anaconda Distribution 5 back in October 2017, but we’re only now catching up with a blog post on the security and performance implications of that release.  Improving security and enabling new language features were our primary goals, but we also reaped some performance improvements along the way. This blog post …

## March 28, 2018

### Continuum Analytics

#### Anaconda Community Survey

If you’re an Anaconda user, we’d love to hear from you! Please complete our short survey below, or by clicking on this link. As an extra incentive when you complete the survey you can enter a drawing to win a Sonos One Smart Speaker with Amazon Alexa.

## March 25, 2018

### Leonardo Uieda

#### The future of Fatiando a Terra

I started developing the Fatiando a Terra Python library in 2010. Since then, many other open-source Python libraries for geophysics have appeared, each with unique capabilities. In this post, I'll explore where I think Fatiando fits in this larger ecosystem and how we can better fill our niche.

## What is Fatiando a Terra?

Fatiando is a Python library for modeling and inversion in geophysics. It's composed of different subpackages:

• fatiando.gridder: functions for dealing with spatial data. It's mostly used to generate point scatters or coordinate arrays for regular grids. Both are required as inputs for modeling or creating synthetic datasets.
• fatiando.mesher: classes that represent geometric objects (polygons, prisms, spheres, etc) and regular meshes. These classes are used to define the geometry and physical properties of our models. They are often the inputs for gravity and magnetic modeling functions.
• fatiando.vis: utilities for plotting data using matplotlib and 3D models using Mayavi. Mostly deprecated but there is a lot of useful code for displaying fatiando.mesher elements in Mayavi.
• fatiando.inversion: classes for solving inverse problems. The idea is that the user needs only to implement the forward problem (the forward function and the Jacobian matrix) and the classes take care of the rest. Ideally, this would form the basis for all inversions in Fatiando.
• fatiando.datasets: functions for loading data from common file formats and loading sample datasets packaged with Fatiando.
• fatiando.seismic: functions and classes for modeling seismic data and some basic inversions. Mostly toy problems.
• fatiando.geothermal: geothermal modeling functions. Has a single module for modeling how temperature perturbations at the surface propagate down into the Earth.
• fatiando.gravmag: functions for gravity and magnetic processing, modeling, and inversion. By far the most developed package, though some components have lagged behind.
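
The idea behind fatiando.inversion (the user supplies only the forward function and the Jacobian, and a generic class does the fitting) can be sketched in a few lines. This is a toy illustration, not Fatiando's actual API: the names `ToyInversion`, `StraightLine`, and `solve` are hypothetical, and the solver is a plain Gauss-Newton loop written with the standard library only.

```python
def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of the column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


class ToyInversion:
    """Hypothetical base class: subclasses implement predict() and jacobian()."""

    def predict(self, p):
        raise NotImplementedError

    def jacobian(self, p):
        raise NotImplementedError

    def fit(self, data, initial, iterations=5):
        p = list(initial)
        n = len(p)
        for _ in range(iterations):
            residuals = [d - f for d, f in zip(data, self.predict(p))]
            jac = self.jacobian(p)
            # Gauss-Newton step from the normal equations (J^T J) step = J^T r.
            jtj = [[sum(row[i] * row[j] for row in jac) for j in range(n)]
                   for i in range(n)]
            jtr = [sum(row[i] * r for row, r in zip(jac, residuals))
                   for i in range(n)]
            step = solve(jtj, jtr)
            p = [pi + si for pi, si in zip(p, step)]
        self.p_ = p  # estimated parameters
        return self


class StraightLine(ToyInversion):
    """Forward problem d = a * x + b; only the physics lives here."""

    def __init__(self, x):
        self.x = x

    def predict(self, p):
        a, b = p
        return [a * xi + b for xi in self.x]

    def jacobian(self, p):
        return [[xi, 1.0] for xi in self.x]


x = [0.0, 1.0, 2.0, 3.0]
data = [1.0, 3.0, 5.0, 7.0]  # generated with a = 2, b = 1
line = StraightLine(x).fit(data, initial=[0.0, 0.0])
```

The subclass contains nothing about optimization, which is the separation the package aims for; regularization and other solvers would live in the base class.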

## Fatiando's niche

We set out with the goal of modeling the whole Earth using all geophysical methods. Humble, right? Turns out this is extremely hard and way beyond what a couple of grad students can do in a couple of years. Back then, there were very few Python geophysical modeling libraries. A decade later, the ecosystem has expanded. The five currently ongoing projects of which I'm aware are (let me know in the comments if I missed any):

• PyGMI: GUI + library for 3D modeling of gravity and magnetic data.
• SimPEG: Forward modeling and inversion library based on the finite volume method.
• pyGIMLi: Forward modeling and inversion library based on the finite element and finite volume methods.
• Bruges: Modeling and processing for seismic and petrophysics.
• Pyrocko: A collection of tools and libraries, mostly for seismology.

The two projects that are most similar to us (SimPEG and pyGIMLi) implement flexible partial differential equation solvers that they use to run all forward modeling calculations. This makes a lot of sense because it gives them a unified framework to model most geophysical methods. It is the most sensible approach to build joint inversions of multiple geophysical datasets. However, there are some inverse problems that don't fit this paradigm, like inverting Moho relief from gravity data and some non-conventional inversion algorithms (see the animation below).

It's no coincidence that Fatiando mostly contains the tools needed to implement this type of inverse problem (i.e., analytical solutions for the gravity and magnetic fields of geometric objects). This is precisely the type of research that we do at the PINGA lab. We also develop processing methods for gravity and magnetics.

The niche I see for Fatiando is in gravity and magnetic methods, particularly using these analytical solutions. The processing functions are an important feature because there are hardly any open-source alternatives out there to commercial software like Oasis Montaj and Intrepid.

## The current state

Fatiando has grown over the years as I slowly learned how to develop and maintain an open-source Python project. As a result, the codebase is littered with the bad choices that I made along the way. The most urgent problems that need to be fixed are:

• Python 3 support. It's no longer a huge sacrifice to make the switch because all of our dependencies are supported. Actually, some of them don't even support Python 2 anymore. Supporting both versions is a bit of a pain and it's not worth it. The conda environments also make using multiple versions of Python easy. We should just migrate to Python 3 only and be done with it.
• Test coverage is sparse and a lot of code is not maintained. There is a lot of old code in Fatiando that was included before I learned how to write good tests. As a result, those parts have few or no tests and are largely unused. They might be broken right now and I would have no way of knowing. We should only include code that we are willing to use and maintain.
• Too many "toy problems". Mostly in the seismic package. They are useful for teaching and I don't think we need to delete all of it. But we have to be careful how we advertise these features. They shouldn't be packaged with well-tested and robust production code.
• A single package. The meshing, inversion, and gridding code is not really dependent on the rest of Fatiando. There is no reason why they can't be standalone projects. This modularity might help lower the barrier for other projects to use them. Installing can still be easy by using fatiando as a metapackage (like Jupyter).

## A way forward

The best way forward for Fatiando that I can see is to become an ecosystem of specialized tools and libraries, rather than a single Python package. Having things in separate libraries allows us to better indicate what is robust and professional and what is experimental or meant as a teaching tool. In particular, the meshing library has some overlap with discretize and we should be considering a merger of our projects. Separating what we have into a library will help us articulate the requirements of Fatiando so that we can see if a merger is beneficial. We can also include experimental libraries (like fatiando.seismic.wavefd) and CLI or GUI programs as independent projects.

This is how I envision the Fatiando ecosystem in the future (I have already started working on some of these projects):

• fatiando: A metapackage that can be used to install the whole stack (like the jupyter package).
• deeplook: the inversion package. Should define a scikit-learn like interface and provide all of the standard tools (regularization, optimization, etc).
• geometric: the geometric objects and meshes. Includes an optional way of plotting them on Mayavi and matplotlib. The way physical properties are handled needs to be redesigned and meshes need to support slicing and fancy indexing.
• verde: the gridding package. It will include some new Green's functions based interpolation on which I've been working. Should also include the functions for calculating derivatives that are currently in fatiando.gravmag.transform.
• harmonica: the gravity and magnetic methods package. Will port over most of the code from fatiando.gravmag.
• sismica: a package for seismics and seismology. For now, will include some of the toy examples from the fatiando.seismic.
• wavefd: the experimental 2D FD wave propagation code (useful for teaching but I don't trust it enough for research).
• moulder: GUI for 2D gravity and magnetic modeling.

All of these packages will be tied together in the fatiando Github organization and the fatiando.org website, which will include instructions for installing the entire stack. The website will also link to individual packages (as is done right now for the subpackages) and any other project in the fatiando umbrella. Members of the organization will be free to create new repositories and we'll provide a template for doing so.

The requirements and goals for these new packages are:

• All code will be Python 3 only.
• All docstrings will use the numpy style.
• Each package will have its own docs page with tutorials, API reference, install instructions, changelog, and gallery. They will share a common template and a simple theme.
• All repos will include a Code of Conduct and Contributing Guide.
• All main packages will have a comprehensive test suite. Anything not tested or experimental will be moved to separate packages. Full test coverage (or as much as possible) will be a requirement for merging a contribution.

This is how I think we could implement this:

1. Release Fatiando 0.6 with what we currently have in the master branch along with a note that this will be the last release to support Python 2.7.
2. Create a package template repository with the shared infrastructure (setup.py, docs, continuous integration configuration, Makefile, testing, etc).
3. Start repositories for each of the packages listed above.
4. Specify clear goals for each package and an example of how we want the API to look.
5. Focus on redesigning the inversion package first. This is the basis for many other packages.
6. Slowly copy over code from fatiando/fatiando while ensuring that everything is tested and documented.

## Help!

The goal of all these changes is to make Fatiando better for users and developers by making the code more robust and well documented. I'm curious to know what the Python geophysics community thinks about all of this. Do I have it all wrong? What should be done differently? And most importantly, would you like to help?

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!

## March 23, 2018

### Titus Brown

#### Pydoit, snakemake, and workflows-as-applications

Ever since Camille Scott, a grad student in the lab, developed the dammit transcriptome annotator, I've been intrigued by the design decision she made. Dammit runs a lot of other software, and Camille made the brilliant decision to avoid having dammit coordinate the execution of the dependent software itself - instead, she wrapped dammit around doit, a workflow library in Python.

doit, like other somewhat related systems such as make, makeflow, and snakemake, specifies workflows in a declarative manner: "to reach such and such a target result, you need these intermediate results", and so on - effectively laying out a directed acyclic graph of dependencies. As part of this, these systems coordinate the execution of the commands needed to produce all of the results. And, because they have insight into the structure of the dependencies, they can do clever things like execute them in parallel, on multiple nodes, restarting failed jobs, etc. etc.
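
To make the "directed acyclic graph of dependencies" concrete, here is a minimal sketch of how an engine can turn declared dependencies into an execution order. It is not how doit or snakemake are implemented; the function `execution_order` and the file names are made up for illustration.

```python
def execution_order(dependencies, target):
    """Return the tasks needed for `target`, dependencies first (depth-first)."""
    order, done, in_progress = [], set(), set()

    def visit(task):
        if task in done:
            return
        if task in in_progress:
            raise ValueError(f"dependency cycle at {task!r}")
        in_progress.add(task)
        for dep in dependencies.get(task, ()):
            visit(dep)
        in_progress.discard(task)
        done.add(task)
        order.append(task)

    visit(target)
    return order


# Declarative description: "to get X you need these intermediate results".
workflow = {
    "report.txt": ["counts.txt", "annotations.txt"],
    "counts.txt": ["reads.fa"],
    "annotations.txt": ["reads.fa"],
}

order = execution_order(workflow, "report.txt")
print(order)
```

Because the engine sees the whole graph, it can also tell that counts.txt and annotations.txt do not depend on each other and could run in parallel, which is where the clever scheduling comes from.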

By using doit, Camille was able to set up the dependency graph for the final annotated transcriptome and could then delegate the execution to the pydoit library. I myself have written many a spaghetti ball of shell commands in my time, and I was impressed with the separation of workflow logic from execution details achieved by dammit.

Now, I was all set to use doit myself for some projects, but in the meantime my lab fell under the sway of my other CS grad student, Luiz Irber, who had been slowly converting people in the lab over to snakemake without me really noticing.

It turns out that snakemake is much easier to dig into than doit, and between that and Luiz's wealth of knowledge (and inexorable persuasion), I ended up implementing the spacegraphcats application workflow in snakemake. And I've been pretty happy with that so far, after a few months of working with it. (More on spacegraphcats at some future point.)

Now, my lab does a lot of workflow-y stuff, because we're a bioinformatics group and bioinformatics is all about running other people's software on other people's data (which is about as much fun as it sounds, but we get by). So when yet another project, the dahak metagenomics project, decided to use snakemake to specify its workflows, I requested a command-line interface in the same style as spacegraphcats - but with a few extra fun twists. I wrote up a quick example in 2018-snakemake-cli, which shows a simple way to combine workflow specification with parameter specification. From the 2018-snakemake-cli README, we use run to execute snakemake workflows:

./run <workflow_file> <parameters_file>


e.g.

rm -f hello.txt
./run workflow-hello params-amy


creates hello.txt with "hello amy" in it, while

rm -f hello.txt
./run workflow-hello params-beth


creates hello.txt with "hello beth" in it.

Here, the workflow file workflow-hello.json specifies the target hello.txt, while the parameters file params-amy parameterizes the workflow with the name "amy".

Likewise,

rm -f goodbye.txt
./run workflow-goodbye params-beth


will put goodbye beth in goodbye.txt.

All workflows use the same set of Snakemake rules in Snakefile.
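
The split between "what to build" and "with which parameters" can be mimicked in plain Python. This is only a toy stand-in: the real files are read by snakemake, and the dictionary layouts and the `run` function below are assumptions made for illustration, not the contents of the 2018-snakemake-cli repository.

```python
# Shared rules, playing the role of the single Snakefile (hypothetical layout).
RULES = {
    "hello.txt": "hello {name}",
    "goodbye.txt": "goodbye {name}",
}


def run(workflow, params):
    """Combine a workflow (which target) with parameters (how to fill it in)."""
    target = workflow["target"]
    content = RULES[target].format(**params)
    return target, content


# Stand-ins for the workflow and parameter files.
workflow_hello = {"target": "hello.txt"}
workflow_goodbye = {"target": "goodbye.txt"}
params_amy = {"name": "amy"}
params_beth = {"name": "beth"}

print(run(workflow_hello, params_amy))
print(run(workflow_goodbye, params_beth))
```

Any workflow can be paired with any parameter set, which is the point of keeping the two files separate.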

...and this is now being implemented for dahak in dahak-taco. (Warning: the dahak repos have become self aware and are replicating.)

Anyway, to bring this back around to the beginning:

I really like the idea of specifying workflows in a dedicated workflow engine, and then building an application around that. It means we don't have to worry about executing commands, we can tap into a large existing support community, we can make use of more powerful abstractions in our own code, and as the workflow system expands its functionality we can take advantage of it automatically. For example, snakemake seems to interface well with biocontainers and has support for Kubernetes which are both things we intend to make use of in the future. It also (in theory) makes the application much more extensible and hackable vs the traditional "I wrote my own shell command management foo" stuff I used to do.

--titus

## March 21, 2018

### Matthew Rocklin

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.17.2. This is a minor release with new features and stability improvements. This blogpost outlines notable changes since the 0.17.0 release on February 12th.

conda install dask


or pip install from PyPI:

pip install dask[complete] --upgrade


Full changelogs are available here:

Some notable changes follow:

Tornado is a popular framework for concurrent network programming that Dask relies on heavily. Tornado recently released a major version update that included both some major features for Dask as well as a couple of bugs.

The new IOStream.read_into method allows Dask communications (or anyone using this API) to move large datasets more efficiently over the network with fewer copies. This enables Dask to take advantage of high performance networking available on modern super-computers. On the Cheyenne system, where we tested this, we were able to get the full 3GB/s bandwidth available through the Infiniband network with this change (when using a few worker processes).
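
The saving comes from reading straight into memory that is already allocated, instead of building intermediate bytes objects and joining them. The sketch below shows the same pattern with the standard library's `readinto`, as an analogue only: it does not use Tornado, and the helper `read_exactly_into` is a made-up name.

```python
import io


def read_exactly_into(stream, buffer):
    """Fill `buffer` from `stream` without creating intermediate bytes objects."""
    view = memoryview(buffer)
    filled = 0
    while filled < len(view):
        # readinto writes directly into the slice of the preallocated buffer.
        n = stream.readinto(view[filled:])
        if not n:
            raise EOFError("stream ended before the buffer was full")
        filled += n
    return filled


source = io.BytesIO(b"x" * 1024)   # stands in for a network stream
target = bytearray(1024)           # allocated once, reused for the payload
count = read_exactly_into(source, target)
```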

Many thanks to Antoine Pitrou and Ben Darnell for their efforts on this.

At the same time there were some unforeseen issues in the update to Tornado 5.0. More pervasive use of bytearrays over bytes caused issues with compression libraries like Snappy and Python 2 that were not expecting these types. There is a brief window in distributed.__version__ == 1.21.3 that enables this functionality if Tornado 5.0 is present but will misbehave if Snappy is also present.

### HTTP File System

Dask leverages a file-system-like protocol for access to remote data. This is what makes commands like the following work:

import dask.dataframe as dd



We have now added http and https file systems for reading data directly from web servers. These also support random access if the web server supports range queries.

df = dd.read_parquet('https://...')


As with S3, HDFS, GCS, … you can also use these tools outside of Dask development. Here we read the first twenty bytes of the Pandas license:

from dask.bytes.http import HTTPFileSystem
http = HTTPFileSystem()

b'BSD 3-Clause License'


Thanks to Martin Durant who did this work and manages Dask’s byte handling generally. See remote data documentation for more information.
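
For readers curious what "range queries" means at the HTTP level, here is a small standard-library sketch that builds a request for a byte range. It is not Dask's implementation; the helper name `range_request` and the example.com URL are placeholders.

```python
from urllib.request import Request


def range_request(url, start, length):
    """Build a GET request for `length` bytes starting at offset `start`."""
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


# Ask for the first twenty bytes of a (placeholder) remote file.
request = range_request("https://example.com/LICENSE", 0, 20)
print(request.get_header("Range"))
```

Sending the request with `urllib.request.urlopen(request)` returns status 206 (Partial Content) when the server honors the range; a server without range support typically answers 200 with the whole file, in which case random access is not available.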

### Fixed a correctness bug in Dask dataframe’s shuffle

We identified and resolved a correctness bug in dask.dataframe’s shuffle that resulted in some rows being dropped during complex operations like joins and groupby-applies with many partitions.

### Cluster super-class and intelligent adaptive deployments

There are many Python subprojects that help you deploy Dask on different cluster resource managers like Yarn, SGE, Kubernetes, PBS, and more. These have all converged to have more-or-less the same API that we have now combined into a consistent interface that downstream projects can inherit from in distributed.deploy.Cluster.

Now that we have a consistent interface we have started to invest more in improving the interface and intelligence of these systems as a group. This includes both pleasant IPython widgets like the following:

as well as improved logic around adaptive deployments. Adaptive deployments allow clusters to scale themselves automatically based on current workload. If you have recently submitted a lot of work the scheduler will estimate its duration and ask for an appropriate number of workers to finish the computation quickly. When the computation has finished the scheduler will release the workers back to the system to free up resources.

The logic here has improved substantially including the following:

• The scheduler estimates computation duration and asks for workers appropriately
• There is some additional delay in giving back workers to avoid hysteresis, or cases where we repeatedly ask for and return workers
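
The two points above can be illustrated with a toy model. This is a sketch of the general idea only, not the logic in distributed: the class `ToyAdaptive`, the function `target_workers`, and all parameter names and defaults are assumptions for the example.

```python
import math


def target_workers(pending_seconds, target_duration=5.0, minimum=0, maximum=10):
    """Workers needed to finish the estimated work within `target_duration`."""
    wanted = math.ceil(pending_seconds / target_duration)
    return max(minimum, min(maximum, wanted))


class ToyAdaptive:
    """Scale up immediately, scale down only after several consecutive checks."""

    def __init__(self, wait_count=3):
        self.workers = 0
        self.wait_count = wait_count
        self._low_checks = 0

    def update(self, pending_seconds):
        wanted = target_workers(pending_seconds)
        if wanted > self.workers:
            self.workers = wanted          # more work: ask for workers right away
            self._low_checks = 0
        elif wanted < self.workers:
            self._low_checks += 1          # less work: wait before giving workers back
            if self._low_checks >= self.wait_count:
                self.workers = wanted
                self._low_checks = 0
        else:
            self._low_checks = 0
        return self.workers


cluster = ToyAdaptive()
# Estimated seconds of pending work observed at successive checks.
history = [cluster.update(load) for load in [40, 0, 0, 40, 0, 0, 0]]
print(history)
```

The brief lull after the first burst does not release any workers, so the second burst is served without a new request; workers are only returned once the cluster has been quiet for three checks in a row.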

Some news from related projects:

• A new project, dask-jobqueue, was started to handle launching Dask clusters on traditional batch queuing systems like PBS, SLURM, SGE, TORQUE, etc. This project grew out of the Pangeo collaboration.
• A Dask Helm chart has been added to Helm’s stable channel

## Acknowledgements

The following people contributed to the dask/dask repository since the 0.17.0 release on February 12th:

• Anderson Banihirwe
• Dan Collins
• Dieter Weber
• Gabriele Lanaro
• John Kirkham
• James Bourbeau
• Julien Lhermitte
• Matthew Rocklin
• Martin Durant
• Max Epstein
• okkez
• Pangeran Bottor
• Rich Postelnik
• Scott M. Edenbaum
• Simon Perkins
• Thrasibule
• Tom Augspurger
• Tor E Hagemann
• Uwe L. Korn
• Wes Roach

The following people contributed to the dask/distributed repository since the 1.21.0 release on February 12th:

• Alexander Ford
• Andy Jones
• Antoine Pitrou
• Brett Naul
• Joe Hamman
• John Kirkham
• Loïc Estève
• Matthew Rocklin
• Matti Lyra
• Sven Kreiss
• Thrasibule
• Tom Augspurger

## March 20, 2018

### Fabian Pedregosa

#### Notes on the Frank-Wolfe algorithm, Part I

$$\def\xx{\boldsymbol x} \def\yy{\boldsymbol y} \def\ss{\boldsymbol s} \def\dd{\boldsymbol d} \DeclareMathOperator*{\argmin}{{arg\,min}} \DeclareMathOperator*{\minimize}{{minimize}} \DeclareMathOperator*{\diam}{{diam}}$$

This blog post is the first in a series discussing different theoretical and practical aspects of the Frank-Wolfe algorithm.

### Matthieu Brucher

#### Writing custom checks for clang-tidy

I started taking a heavier interest in clang-tidy a few months ago, as I was looking at static analyzers. I found at the time that it was quite complicated to work with clang's internal AST. It is a wonderful tool, but it is also a very complex one. Thankfully, the cfe-dev mailing list is full of nice people.

I also started my journey in the LLVM/clang land with the help of this blog post.

# Quick setup

The previous blog post does a great job of explaining how to set up a build:

git clone http://llvm.org/git/llvm.git
cd llvm/tools/
git clone http://llvm.org/git/clang.git
cd clang/tools/
git clone https://github.com/mbrucher/clang-tools-extra extra

cd ../../../
mkdir build && cd build/
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
make check-clang-tools

A new checker can be created with the following command line:

./add_new_check.py misc catch-by-const-ref

A new folder can easily be created manually, and each checker consists of two sections:

1. A matcher that will select AST sections
2. A checker that will add additional checks on top of the matcher, like macro or file

Now let’s try to implement two rules:

• The first one will check that we catch exceptions by const ref (a good practice)
• The second will detect specific functions (C functions that now have C++ replacements, or functions that should be replaced by safer ones).

# A simple matcher for catching by const ref

The best reference for the matchers in clang is (unfortunately) the doxygen for the ast_matchers namespace. As you can see, it is quite difficult to navigate, but it’s not as complicated as it looks. What we want to match is:

• A variable declaration
• Inside a catch statement
• That is a reference
• But not const

My first attempt was to match all catch statements. This is easily done with cxxCatchStmt. The issue was that, once I had done that, I could not check whether the variable underneath was declared const or not. So instead, I asked the cfe-dev people for help.

So let’s start over again. This is what we need:

• variable declaration is matched with varDecl (for a type VarDecl, notice the case difference)
• inside a catch statement is matched by isExceptionVariable
• a reference type is matched by references
• and the const aspect is matched by isConstQualified

varDecl is an instance of VariadicDynCastAllOfMatcher that matches VarDecl. It can take several parameters. The first parameter will be isExceptionVariable. The second will describe the type we are looking for: hasType(references(qualType(unless(isConstQualified())))). If you unroll this matcher, we are looking for a reference to a qualified type that is not (unless) const qualified.

The result is then:

void CatchByConstRefCheck::registerMatchers(MatchFinder *Finder) {
  // This is a C++ only check thus we register the matchers only for C++
  if (!getLangOpts().CPlusPlus)
    return;

  Finder->addMatcher(varDecl(isExceptionVariable(),hasType(references(qualType(unless(isConstQualified()))))).bind("catch"), this);
}

Now that we have a good matcher, the checker is easy to write. We want to warn for all these variables, and we can even easily propose a fix.

void CatchByConstRefCheck::check(const MatchFinder::MatchResult &Result) {

  const VarDecl* varCatch = Result.Nodes.getNodeAs<VarDecl>("catch");

  const char *diagMsgCatchReference = "catch handler catches by non const reference; "
                                        "catching by const-reference may be more efficient";

  // Emit a diagnostic: the matcher only selects non-const references,
  // and the fix-it inserts "const " in front of the declaration.
  diag(varCatch->getLocStart(), diagMsgCatchReference)
    << FixItHint::CreateInsertion(varCatch->getLocStart(), "const ");
}

Of course, I’ve written a few examples that are tested by the clang testing framework (make check-clang-tools).

# Using check options for matching deprecated functions

Now, for a second rule, I wanted to detect some C functions that have a C++ equivalent. For instance, exp() should be replaced by std::exp(), or fabs() by std::abs(). As the list can be different for different projects (and as you may want to replace some functions with others), the check reads it from options.

When using options, there are two things to do: first, get the options in the constructor, and second, save them in a storeOptions call:

DetectCFunctionsCheck::DetectCFunctionsCheck(StringRef Name, ClangTidyContext *Context)
    : ClangTidyCheck(Name, Context),
      stdNamespaceFunctions(Options.get("stdNamespaceFunctions", "floor,exp")),
      functionsToChange(Options.get("functionsToChange", "fabs>std::abs"))
{
    parseStdFunctions();
    parseFunctionToChange();
}

void DetectCFunctionsCheck::storeOptions(ClangTidyOptions::OptionMap &Opts)
{
  Options.store(Opts, "stdNamespaceFunctions", stdNamespaceFunctions);
  Options.store(Opts, "functionsToChange", functionsToChange);
}

I have two calls here to parse the option strings. They are in charge of splitting them at ‘,’ and then, for the replacement functions, splitting each entry at ‘>’. Here, the default options are very simple, and they are easy to change.
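
The two parsing helpers are not shown in this post. As a rough illustration of the splitting described above, here is a sketch in Python rather than the C++ of the actual check; the function names `parse_std_functions` and `parse_functions_to_change` are hypothetical, and the error handling is my own assumption.

```python
def parse_std_functions(option):
    """Split a comma-separated option such as "floor,exp" into a set of names."""
    return {name.strip() for name in option.split(",") if name.strip()}


def parse_functions_to_change(option):
    """Split comma-separated "old>new" pairs into a dict mapping old to new."""
    mapping = {}
    for pair in option.split(","):
        if not pair.strip():
            continue  # tolerate empty entries such as a trailing comma
        old, sep, new = pair.partition(">")
        if not sep or not old.strip() or not new.strip():
            raise ValueError("expected 'old>new', got %r" % pair)
        mapping[old.strip()] = new.strip()
    return mapping
```

With the default option values from the constructor, the first helper would yield the names `floor` and `exp`, and the second would map `fabs` to `std::abs`.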

The matchers are very similar:

void DetectCFunctionsCheck::registerMatchers(MatchFinder *Finder) {
    // Should check if there are duplicates.
    // One matcher per function name; the name is also used as the bind ID.
    for(const auto& fun: stdNamespaceFunctionsSet)
    {
      Finder->addMatcher(callExpr(callee(functionDecl(allOf(hasName(fun), unless(cxxMethodDecl()), hasParent(translationUnitDecl()))))).bind(fun), this);
    }
    for(const auto& fun: functionsToChangeMap)
    {
      Finder->addMatcher(callExpr(callee(functionDecl(allOf(hasName(fun.first), unless(cxxMethodDecl()), hasParent(translationUnitDecl()))))).bind(fun.first), this);
    }
}

So we select call expressions whose callee is a function with the required name, that is not a method, and whose declaration has the translation unit as its parent, a.k.a. the global namespace (one could use a namespace matcher here if the function was to be in a namespace). Then the check is very easy to write, and so is the fix-it hint:

void DetectCFunctionsCheck::check(const MatchFinder::MatchResult &Result) {

    for(const auto& fun: stdNamespaceFunctionsSet)
    {
        const CallExpr* call = Result.Nodes.getNodeAs<CallExpr>(fun);
        if(call)
        {
            diag(call->getLocStart(), "this function has a corresponding std version. Consider using it (std::" + fun + ")")
                << FixItHint::CreateInsertion(call->getLocStart(), "std::");
        }
    }
    for(const auto& fun: functionsToChangeMap)
    {
        const CallExpr* call = Result.Nodes.getNodeAs<CallExpr>(fun.first);
        if(call)
        {
            auto start = call->getLocStart();
            diag(start, "this function has a better version. Consider using it (" + fun.second + ")")
                << FixItHint::CreateReplacement(SourceRange(start, start.getLocWithOffset(fun.first.size() - 1)), fun.second);
        }
    }
}

It would be easy to add a third category, for instance for C unsafe functions, but I don’t need this for now.

I have additional functional tests as well in the repository.

# Conclusion

I like writing rules, as clang-tidy is very powerful. Unfortunately, it is sometimes difficult to figure out what query you want to write. Although clang-query helps on this a lot, it is not very nice to use (there is no history of previous rules, you can’t go back on the same line…). I suppose dumping the AST helps as you can figure out what is the match you really want.

These two rules are available on github.

### Matthew Rocklin

#### Summer Student Projects 2018

Around this time of year students look for Summer projects. Often they get internships at potential future employers. Sometimes they become more engaged in open source software.

This blogpost contains some projects that I think are appropriate for a summer student in a computational field. They reflect my biases (which, assuming you read my blog, you’re ok with) and are by no means comprehensive of opportunities within the Scientific Python ecosystem. To be perfectly clear I’m only providing ideas and context here, I offer neither funding nor mentorship.

### Criteria for a good project

1. Is well defined and tightly scoped to reduce uncertainty about what a successful outcome looks like, and to reduce the necessity for high-level advising
2. Is calibrated so that an industrious student can complete it in a few months
3. It’s useful, but also peripheral. It has value to the ecosystem but is not critical enough that a core dev is likely to complete the task in the next few months, or be overly picky about the implementation.
4. It’s interesting, and is likely to stimulate thought within the student
5. It teaches valuable skills that will help the student in a future job search
6. It can lead to future work, if the student makes a strong connection

The projects listed here target someone who already has decent knowledge of the fundamentals of the PyData or SciPy ecosystem (numpy, pandas, general understanding of programming, etc.). They are somewhat focused around Dask and other projects that I personally work on.

### Distributed GPU NDArrays with CuPy, Bohrium, or other

Dask array coordinates many NumPy arrays to operate in parallel. It includes all of the parallel algorithms, leaving the in-memory implementation to the NumPy chunks.

But the chunk arrays don’t actually have to be NumPy arrays, they just have to look similar enough to NumPy arrays to fool Dask Array. We’ve done this before with sparse arrays which implement a subset of the numpy.ndarray API, but with sparse storage, and it has worked nicely.

There are a few GPU NDArray projects out there that satisfy much of the NumPy interface, such as CuPy and Bohrium.

It would be valuable to do the same thing with Dask Array and these GPU libraries. This might give us a decent general purpose distributed GPU array relatively cheaply. This would engage the following:

1. Knowledge of GPUs and performance implications of using them
2. NumPy protocols (assuming that the GPU library will still need some changes to make it fully compatible)
3. Distributed performance, focusing on bandwidths between various parts of the architecture
4. Profiling and benchmarking

### Use Numba and Dask for Numerical Simulations

While Python is very popular in data analytics it has been less successful in hard-core numeric algorithms and simulation, which are typically done in C++/Fortran and MPI. This is because Python is perceived to be too slow for serious numerical computing.

Yet with recent advances in Numba for fast in-core computing and Dask for parallel computing things may be changing. Certainly fine-tuned C++/Fortran + MPI can out-perform Numba and Dask, but by how much? If the answer is only 10% or so then it could be that the lower barrier to entry of Numba, or the dynamic scaling of Dask, can make them competitive in fields where Python has not previously had a major impact.

For which kinds of problems is a dynamic JITted language almost-as-good as C++/MPI? For which kinds of problems is the dynamic nature of these tools valuable, either due to more rapid development, greater flexibility in accepting community created modules, dynamic load balancing, or other reasons?

This project would require the student to come in with an understanding of their own field, the kinds of computational problems that are relevant there, and an understanding of the performance characteristics that might make dynamic systems tolerable. They would learn about optimization and profiling, and would characterize the relevant costs of dynamic languages in a slightly more modern era.

### Blocked Numerical Linear Algebra

Dask arrays contain some algorithms for blocked linear algebra, like least squares, QR, LU, Cholesky, etc., but no particular attention has been paid to them.

It would be interesting to investigate the performance of these algorithms and compare them to proper distributed BLAS/LAPACK implementations. This will very likely lead to opportunities to improve the algorithms and possibly some of Dask’s internal machinery.

Someone with an understanding of R’s or Julia’s networking stack could adapt Dask’s distributed scheduler for those languages. Recall that the dask.distributed network consists of a central scheduler, many distributed workers, and one or more user-facing clients. Currently these are all written in Python and only really useful from that language.

Making this system useful in another language would require rewriting the client and worker code, but would not require rewriting the scheduler code, which is intentionally language agnostic. Fortunately the client and worker are both relatively simple codebases (relative to the scheduler at least) and minimal implementations could probably be written in around 1-2k lines each.

This would not provide the high-level collections like dask.array or dask.dataframe, but would provide all of the distributed networking, load balancing, resilience, etc. that is necessary to build a distributed computing stack. It would also allow others to come later and build the high level collections that would be appropriate for that language (presumably R and Julia user communities don’t want exactly Pandas-style dataframe semantics anyway).

This is discussed further in dask/distributed #586 and has actually been partially implemented in Julia in the Invenia project.

This would require some knowledge of network programming and, ideally, async programming in either R or Julia.

### High-Level NumPy Optimizations

Projects like Numpy and Dask array compute what the user says, even if a more efficient solution exists.

(x + 1)[:5]  # what user said

x[:5] + 1    # faster and equivalent solution


It would be useful to have a project that exactly copies the Numpy API, but constructs a symbolic representation of that computation instead of performing work. This would enable a few important use cases that we’ve seen arise recently. These include both applications from just analyzing the symbolic representation and also applications from changing it to a more optimal form:

1. You could analyze this representation and warn users about intermediate stages that require a lot of RAM or compute time
2. You could suggest ideal chunking patterns based on the full computation
3. You could communicate this computation over the network to a remote server to perform the computation
4. You could visualize the computation to help users or students understand what they’re computing
5. You could manipulate the representation into more efficient forms (such as what is shown above)

The first part of this would be to construct a class that behaves like a Numpy array but constructs a symbolic tree representation instead. This would be similar to Sympy, Theano, Tensorflow, Blaze.expr or similar projects, but it would have much smaller scope and would not be at all creative in designing new APIs. I suspect that you could bootstrap this project quickly using systems like dask.array, which already do all of the shape and dtype computations for you. This is also a good opportunity to connect to recent and ongoing work in Numpy to establish protocols that allow other array libraries (like this one) to work smoothly with existing Numpy code.

Then you would start to build some of the analyses listed above on top of this representation. Some of these are harder than others to do robustly, but presumably they would get easier in time.
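
To make the idea concrete, here is a very small sketch of such a symbolic class, together with the slicing rewrite from the example at the top of this section. Everything in it (`Expr`, `optimize`, the tuple-based tree) is a hypothetical design for illustration, not an existing library; it supports only addition and basic slicing, and it assumes that any operand that is not an `Expr` is a Python scalar.

```python
def _fmt_index(index):
    """Render a slice as start:stop (the step is ignored in this sketch)."""
    if isinstance(index, slice):
        start = "" if index.start is None else index.start
        stop = "" if index.stop is None else index.stop
        return "%s:%s" % (start, stop)
    return repr(index)


class Expr:
    """Node in a symbolic expression tree; records operations, computes nothing."""

    def __init__(self, op, *args):
        self.op = op
        self.args = args

    def __add__(self, other):
        return Expr("add", self, other)

    def __getitem__(self, index):
        return Expr("getitem", self, index)

    def __repr__(self):
        if self.op == "symbol":
            return self.args[0]
        if self.op == "add":
            return "(%r + %r)" % self.args
        return "%r[%s]" % (self.args[0], _fmt_index(self.args[1]))


def optimize(expr):
    """Rewrite (a + scalar)[index] into a[index] + scalar, recursively."""
    if not isinstance(expr, Expr) or expr.op == "symbol":
        return expr
    args = tuple(optimize(a) for a in expr.args)
    if expr.op == "getitem":
        inner, index = args
        # Assumption: a non-Expr right operand is a scalar, so slicing commutes.
        if (isinstance(inner, Expr) and inner.op == "add"
                and not isinstance(inner.args[1], Expr)):
            left, scalar = inner.args
            return Expr("add", optimize(Expr("getitem", left, index)), scalar)
    return Expr(expr.op, *args)


x = Expr("symbol", "x")
said = (x + 1)[:5]
print(said)            # (x + 1)[:5]
print(optimize(said))  # (x[:5] + 1)
```

A real project would also have to track shapes and dtypes, which is where borrowing from dask.array, as suggested above, could save a lot of work.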

## March 19, 2018

### Continuum Analytics

#### Anaconda included in Gartner’s 2018 Magic Quadrant for Data Science and Machine Learning Platforms

Gartner recently released its 2018 Magic Quadrant for Data Science and Machine Learning Platforms, featuring Anaconda for the first time. For those unfamiliar with the process, vendors complete an extensive survey (150+ questions) and submit financial data and customer references to Gartner for evaluation. There’s a qualification bar based on revenue and customer traction, and …

## March 18, 2018

### Titus Brown

#### My approach to community building and coordination

For the last three months I've been knee-deep (neck deep? thoroughly underwater?) in the #CommonsPilot project, where I have been funded to take on a community coordinating role.

Someone asked today about how my coordination approach dovetails with the immense complexity of the project, and I put together the following answer, which I liked, and am now sharing with y'all.

I am trying to use the following process:

• put in place some loose guidelines (we'll be using this platform for e-mail, that project for documents, etc.);

• define a community code of conduct, and start things off the way you want to continue with communication;

• devise some simple on-boarding to connect people with guidelines;

• watch carefully to see what communication avenues are actually being used / work well (slack has been a success, google calendar... not so much); fine tune accordingly (we're now using groups.io for both mailing lists and calendars);

• then, observe & extract the emergent "Desire Paths" and bring them into the on-boarding docs ;

• layer more structure on as need becomes apparent, but don't do it too early, because each layer of structure acts as a straitjacket on the project and limits flexibility and adaptation;

• use training and in-person / 1-1 meetings to inculcate culture and process;

• iterate!

This process has emerged from my participation in open source projects over the last 30 years, as well as from watching Greg Wilson grow Software Carpentry from scratch over a decade. It is as different from the way academics think about building collaborations as the Carpentry teaching method is from "sage on a stage"-style teaching :).

--titus

p.s. I just started taking scuba lessons (PADI Open Water cert), which may have been a subconscious reaction to this #CommonsPilot thing... "breathe deeply and slowly. don't panic. and when your air runs out? bail to the surface." :)

## March 15, 2018

### Continuum Analytics

#### March 2018 Kubernetes Security Vulnerabilities and Anaconda Enterprise

The Anaconda team tracks security vulnerabilities and CVEs via the National Vulnerability Database (NVD) on an ongoing basis. Our team is committed to the security of Anaconda Enterprise by making updates available in a timely manner in response to security vulnerabilities and similar incidents. Two security vulnerabilities (CVE-2017-1002101 and CVE-2017-1002102) were recently identified in Kubernetes, which …

### Leonardo Uieda

#### A template for reproducible papers

At the PINGA lab, we have been experimenting with ways to increase the reproducibility of our research by publishing the git repositories that accompany our papers. You can find them on our Github organization. I've synthesized the experience of the last 4 years into a template in the pinga-lab/paper-template repository.

The template reflects the tools we've been using and the type of research that we do:

• Most papers are proposing a new methodology rather than the analysis of a dataset.
• There is always an application to a dataset to show the method works. We can't always publish the data but we include it in the repository whenever we can.
• All papers include an implementation of the proposed method.
• Our code is usually written in Python and executed in Jupyter notebooks.
• The focus of the paper is usually on the methodology, not the code. As such, the code is more of a proof-of-concept than a full blown application or library.
• The paper itself is written in LaTeX with the source usually included in the repository.

This certainly won't fit everyone's needs but I hope that you can at least use a few bits and pieces for inspiration. Of course, the template code is open-source (BSD license) and you are free to reuse it however you like. The template includes a sample application to climate change data, complete with a Python package, automated tests, an analysis notebook, a notebook that generates the paper figure, raw data, and a LaTeX text. Everything, from compilation to building the final PDF, can be done with a single make command.

We've been using different versions of this template for a few years and I've been tweaking it to address some of the difficulties we encountered along the way.

• Running experiments in Jupyter notebooks can get messy when people aren't diligent about the execution order. It can be hard to remember to "Reset and run all" before using the results.
• The execution was done manually so you had to remember and document in what order the notebooks need to be run.
• Experimental parameters (e.g., number of data points, inversion parameters, model configuration) were copied into the text manually. This sometimes led to values getting out of sync between the notebooks and paper.
• We only had integration tests implemented in notebooks. More often than not, the checks were visual and not automated. I think a big reason for this is the lack of experience in writing tests within the group and setting up all of the testing infrastructure (mainly how to use pytest and what kind of test to perform).

The latest update addresses all of these pain points. The main features of the new template are:

• Uses Makefiles to automate the workflow. You can build and test the software, generate results and figures, and compile the PDF with a single make command.
• A Makefile for building the manuscript PDF with extra rules for running proselint, counting words, and opening the PDF.
• A starter conda environment for managing dependencies and making sure everyone gets the same version of the dependencies.
• A Makefile for building the Python package, testing it with pytest, running static code checks (flake8 and pylint), and generating results and figures from the notebooks.
• The code Makefile can run the notebooks using jupyter nbconvert to guarantee that the notebooks are executed in sequential order (top to bottom). I would love to use nbflow but the SCons requirement puts me off a bit. make works fine and the basic syntax is easier to understand.
• An example of using code to write experimental parameters in a .tex file. The file defines new variables that are used in the main text. This guarantees that the values cited in the text are the ones that you actually used to produce the results.

This last feature is my favorite. For example, the notebook code/notebooks/estimate-hawaii-trend.ipynb has the following code:

tex = r"""
% Generated by code/notebooks/estimate-hawaii-trend.ipynb
\newcommand{{\HawaiiLinearCoef}}{{{linear:.3f} C}}
\newcommand{{\HawaiiAngularCoef}}{{{angular:.3f} C/year}}
""".format(linear=trend.linear_coef, angular=trend.angular_coef)

with open('../../manuscript/hawaii_trend.tex', 'w') as f:
    f.write(tex)


It defines the LaTeX commands \HawaiiLinearCoef and \HawaiiAngularCoef that can be used in the paper to insert the values estimated by the Python code. The commands are saved to a .tex file that can be included in the main manuscript.tex. Since this file is generated by the code, the values are guaranteed to be up-to-date.
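
The doubled braces in the template are easy to misread. In `str.format`, `{{` and `}}` produce literal braces, so `{{{linear:.3f} C}}` is a literal `{`, then the formatted value, then ` C}`. Here is a small self-contained check of that behavior; the coefficient value is made up for illustration and is not a result from the template.

```python
template = r"\newcommand{{\HawaiiLinearCoef}}{{{linear:.3f} C}}"

# 0.0123 is a made-up value for illustration only.
line = template.format(linear=0.0123)
print(line)  # \newcommand{\HawaiiLinearCoef}{0.012 C}
```

The raw string prefix (`r"..."`) matters as well: without it, Python would interpret `\n` in `\newcommand` as a newline.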

If you want to use the template to start a new project:

1. Create a new git repository:

mkdir mypaper
cd mypaper
git init

2. Pull in the template code:

git pull https://github.com/pinga-lab/paper-template.git master

3. Create a new repository on Github.

4. Push the template code to Github:

git remote add origin https://github.com/USER/REPOSITORY.git
git push -u origin master

5. Follow the instructions in the README.md.

Alternatively, you can use the "Import repository" option on Github.

I hope that this template will be useful to people outside of our lab. There is definitely still room for improvement and I'm looking forward to trying it out on my next project.

What other features would you like to see in the template? Let me know in the comments (or better yet, submit a pull request). I'd love to know about your experiences and workflows for computational papers.

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!

## March 13, 2018

### Matthieu Brucher

#### Compile last Audio Toolkit on Bela with Clang

More than a year ago, I started playing with the Bela board. At the time, I had issues compiling Audio ToolKit with clang. The issue was that the gcc shipped with the Debian image the BeagleBoard used was too old and didn’t fully support C++11. The one that ships now is GCC 6, which is even C++14 compliant. Meaning that everything is available to build Audio Toolkit with Python support.

# Setting up Bela

Starting from a fresh image, the main things to install are:

• Boost
• FFTW
• Python and Numpy

With the BeagleBoard, it’s as easy as one apt-get line:

apt-get install python3-dev python3-numpy python3-scipy python3-nose libfftw3-dev libboost-dev libboost-system1.62-dev libboost-test1.62-dev

And one line to install nosetests:

pip3 install nosetests

After all the dependent packages are downloaded, the git repository can be cloned and its submodules fetched from inside the clone:

git clone https://github.com/mbrucher/AudioTK.git
cd AudioTK
git submodule init
git submodule update

# Direct build

Let’s create a new folder named AudioTK-build on the same level as AudioTK. We can then run cmake from it:

cmake ../AudioTK

Once this is done, let’s build ATK:

make

It will take some time, but then we can run the tests. First, we want to export the building Python path to be able to test in place before the install step.
export PYTHONPATH=/root/local/src/AudioTK-build/Python/:$PYTHONPATH

And then:

make test

The result should look something like this:

Test project /root/local/src/AudioTK-build
Start 1: Adaptive
1/23 Test #1: Adaptive ......................... Passed 9.54 sec
Start 2: Core
2/23 Test #2: Core ............................. Passed 481.68 sec
Start 3: Delay
3/23 Test #3: Delay ............................ Passed 168.64 sec
Start 4: Distortion
4/23 Test #4: Distortion ....................... Passed 0.50 sec
Start 5: Dynamic
5/23 Test #5: Dynamic .......................... Passed 23.90 sec
Start 6: EQ
6/23 Test #6: EQ ............................... Passed 158.12 sec
Start 7: IO
7/23 Test #7: IO ............................... Passed 0.66 sec
Start 8: Mock
8/23 Test #8: Mock ............................. Passed 2.98 sec
Start 9: Preamplifier
9/23 Test #9: Preamplifier ..................... Passed 1.45 sec
Start 10: PyAdaptive
10/23 Test #10: PyAdaptive ....................... Passed 10.51 sec
Start 11: PyCore
11/23 Test #11: PyCore ........................... Passed 6.26 sec
Start 12: PyDelay
12/23 Test #12: PyDelay .......................... Passed 5.78 sec
Start 13: PyDistortion
13/23 Test #13: PyDistortion ..................... Passed 7.13 sec
Start 14: PyDynamic
14/23 Test #14: PyDynamic ........................ Passed 16.39 sec
Start 15: PyEQ
15/23 Test #15: PyEQ ............................. Passed 9.83 sec
Start 16: PyPreamplifier
16/23 Test #16: PyPreamplifier ................... Passed 13.20 sec
Start 17: PyReverberation
17/23 Test #17: PyReverberation .................. Passed 4.45 sec
Start 18: PySpecial
18/23 Test #18: PySpecial ........................ Passed 4.41 sec
Start 19: PyTools
19/23 Test #19: PyTools .......................... Passed 5.79 sec
Start 20: Reverberation
20/23 Test #20: Reverberation .................... Passed 12.18 sec
Start 21: Special
21/23 Test #21: Special .......................... Passed 1295.59 sec
Start 22: Tools
22/23 Test #22: Tools ............................ Passed 13.43 sec
Start 23: Utility
23/23 Test #23: Utility .......................... Passed 0.13 sec

100% tests passed, 0 tests failed out of 23

Obviously, this is very slow. It’s more or less 50 times slower than the same on my old MacBook Pro!

# Conclusion

Clearly I’m not in a place where I can use ATK on the BeagleBoard. While looking at the assembler code, it seems that almost no Neon instructions were generated. So the next entry in this series will tackle optimizing ATK on ARM!

## March 09, 2018

### Leonardo Uieda

#### Podcasts in my playlist (2018 edition)

Last year, I posted my podcast playlist in response to a similar post by John Leeman (of Don't Panic Geocast fame). In a recent episode (maybe episode 158), John asked listeners for an updated list of recommendations. Here are mine. I'll start with the new additions since last year, then the ones that stayed with me throughout 2017, and finally the ones that I'm looking to get started this year.

New additions:

• Gastropod: A podcast that "looks at food through the lens of science and history". In each episode, the hosts dive deep into the science behind a type of food/process/ingredient and how it became what it is today. One of my favorite episodes is about koji, the fungus behind sake, miso, shoyu, and more.
• The Unmade Podcast: A podcast about insane ideas for new podcasts. Very meta and silly but a fun way to pass the time and get a few laughs.
• We Martians: A podcast all about the science and exploration of Mars. Just listened to a few episodes but I'm enjoying it so far.

The survivors:

• Undersampled Radio: Geeky and fun interviews, mostly about geo/science/technology.
• Don't Panic Geocast: All things geoscience (sometimes with very interesting guests).
• Hello Internet: A light conversation between two friends who make science videos on YouTube with a surprisingly common discussion of flags.
• Imaginary Worlds: "A show about how we create them and why we suspend our disbelief". Still one of my favorites.
• Invisibilia: A series about the forces that shape our lives. Also one of my favorites.
• Talk Python To Me: The title pretty much says it all.
• Radiolab: Interesting stories about all sorts of topics. Very high quality production.

The ones I haven't tried yet:

• Fieldwork Diaries: Interviews with scientists about their field experiences.
• In Defense of Plants: I'm curious to learn more about the weird world of botany.
• Ologies: Each episode is about a different field of knowledge. I think it's based on an idea from the Unmade Podcast.
• The Truth: "Movies for your ears".

That's it for my list. Do you have any recommendations? Comments? Leave one below or let me know on Twitter @leouieda.

## March 06, 2018

### Continuum Analytics

#### Anaconda Repository Changes Afoot

In August 2017, Continuum Analytics announced it is now Anaconda, Inc. Here at Anaconda, we are all excited about the change, and have spent the last several months switching everything over to the Anaconda name. One of the last big changes we need to make is to switch our default conda repository from https://repo.continuum.io to …

### Travis Oliphant

#### Reflections on Anaconda as I start a new chapter with Quansight

Leaving the company you founded is always a tough decision and a tough process that involves many people. It requires a series of potentially emotional "crucial-conversations." It is actually not that uncommon in venture-backed companies for one or more of the original founders to leave at some point. There is a decent article on the topic here: https://hbswk.hbs.edu/item/the-founding-ceos-dilemma-stay-or-go.
Still it is extremely difficult to let go. You live and breathe the company you start. Years of working to connect as many people as possible to the dream gives you a feeling of "ownership" and connection that no stock certificate can replace.

Starting a company is a lot of work. It takes a lot of effort. There are many decisions to make and many voices to incorporate. Hiring, firing, raising money, engaging customers, engaging employees, planning projects, organizing events, and aligning a pastiche of personalities while staying relevant in a rapidly evolving technology jungle is difficult.

As a founder over 40 with modest means, I had a family of 6 children who relied on me. That family had teenage children who needed my attention and pre-school and elementary-school children that I could not simply leave only in the hands of my wife. I look back and sometimes wonder how we pulled it off. The truth probably lies in the time we borrowed: time from exercise, time from sleep, time from vacations, and time from family. I'd like to say that this dissonance against "work-life-harmony" was always a bad choice, but honestly, I don't see how I could have made too many different choices and still have created Anaconda.

Several things drove me. I could not let the people associated with the company down. I would not lose the money for those that invested in us. I could not let down the people who worked their tails off to build, manage, document, market, and sell the technology and products that we produced. Furthermore, I would not let the community of customers and users down that had enabled us to continue to thrive. The only way you succeed as a founder is through your customers being served by the efforts of those who surround you. It is only the efforts of the talented people who joined us in our journey that have allowed Anaconda to succeed so far. It is critical to stay focused on what is in the best interests of those people.
Permit me to use the name Continuum to describe the angel-funded and bootstrapped early-stage company that Peter and I founded in 2012, and Anaconda to describe the venture-backed company that Continuum became. (This is the company we called Continuum 2.0 internally, which really got started in the summer of 2015 after we raised the first tranche of $22 million from VCs.)

Back in 2012, Peter and I knew a few things: 1) we had to connect Python to the Big Data movement; 2) we needed to help the scientific programmer or data-scientist developer build visualization-based applications quickly on the web; and 3) we needed to scale the stack of code around the PyData community to bigger hardware and multiple machines. We had big visions of an interconnected data-web, distributed schedulers, and data-structures that traversed the internet and could be analyzed across the cloud with simple Python scripts. We talked and talked about these things and grew misty-eyed in our enthusiasm for the potential of what was possible if we just built the right technology and sold just the right product to fund it.

We knew that we wanted to build a product-company -- though we didn't know exactly what those products would be at the outset.  We had some ideas, only portions of which actually worked out.  I knew how to run a consulting and training company around Python and open-source. Because of this, I felt comfortable raising money from family members. While consulting companies are not "high-growth" they can make real returns for investors. I was pretty confident that I would not lose their money.

We raised $2.25 million from a few dozen investors consisting of Peter's family, my family, and a host of third-parties from our mutual networks. Peter's family was critical to this early stage because they basically "led the early round" and ensured that we could get off the ground. After they put their money in the bank, we could finish raising the rest of the seed round, which took about 6 months to finish.

It is interesting (and somewhat embarrassing and so not detailed here) to go back and look at what products we thought we would be making. Some of the technologies we ended up building (like Excel integration, Numba, Bokeh, and Dask) were reflected in those early product dreams. However, the real products and commercial success that Anaconda has had so far bear only a vague resemblance to what we thought we would do.

Building a Python distribution was the last thing on our minds. I had been building Python distributions since I released SciPy in 2001. As I have often repeated, SciPy was actually the first Python distribution masquerading as a library. The single biggest effort in releasing SciPy was building the binary installers and making sure everything compiled well. With Fortran compilers still more scarce than they should be, it can still be difficult to compile and build SciPy. Fortunately, with conda, conda-forge, and Anaconda, along with the emergence of wheels, almost nobody needs to build SciPy anymore. It is so easy today to get started with a data-science project and get all the software you need to do amazing work fast. You still have to work to maintain your dependencies and keep that workflow reproducible. But, I'm so happy that Anaconda makes that relatively straightforward today.

This was only possible because General Catalyst and BuildGroup joined us in the journey in the spring of 2015 to really grow the Anaconda story.
Their investment allowed us to 1) convert to a serious product-company from a bootstrapped consulting company with a few small products and 2) continue to invest heavily in conda, conda-forge, and Anaconda.

There is nothing like real-world experience as a teacher, and the challenge of converting to a serious product company was a tremendous experience that taught me a great deal. I'm grateful to all the people who brought their best to the company and taught me every day. It was a privilege and an honor to be a part of their success. I am grateful for their patience with me as my "learning experiences" often led to real struggles for them.

There are many lasting learnings that I look forward to applying in future endeavors. The one that deserves mention in this post, however, is that building enterprise software that helps open-source communities should be done by selling a complementary product to the open-source. The "open-core" model does not work as well. I'm a firm believer that there will always be software to sell, but infrastructure should be and will be open-source --- sustained vibrantly by the companies that depend on it. Joel Spolsky has written about complementary products before. You should read his exposition.

Early on at Anaconda, Peter and I decided to be a board-led company. This board, which includes Peter and me, has the final say in company leadership and made the important decision to transition Anaconda from being founder-led to being led by a more experienced CEO. After this transition and through multiple conversations over many months, we all concluded that the best course of action that would maximize my energy and passion while also allowing Anaconda to focus on its next chapter would be for me to spin-out of Anaconda and start a new services and open-source company where I could pursue a broader mission.

This new company is Quansight (short for Quantitative Insight).
Our place-holder homepage is at http://www.quansight.com and we are @quansightai on Twitter. I'm excited to tell you more about the company in future blog-posts and announcements. A few paragraphs will suffice for now.

Our overall mission is to develop people, build technology, and discover products to empower people with knowledge and data to solve the world’s most challenging problems. We are doing that currently by connecting organizations sustainably with open source communities to solve their hardest problems by enabling teams to transparently apply science to their data.

One of the things we are doing is to help companies get started with AI and ML by applying the entire PyData stack to the fundamental data organization, data visualization, and model management problem that is required for practical success with ML and AI in business. We also help companies generally improve their data-science practice by leveraging all the power of the Python, PyData, and related ecosystems.

We are also hard at work on the sustainability problem by continuing the tradition we started at Continuum Analytics of building successful and sustainable open-source "practices" that synchronize company needs with open-source technology development. We have some innovative business approaches to this that we will be announcing in the coming weeks and months.

I'm excited that we have several devs working hard to help bring JupyterLab to 1.0 this year along with a vibrant community. There are many exciting extensions to this remarkable platform that remain to be written. We also expect to continue to contribute to the PyViz activities that continue to explode in the Python ecosystem as visualization is a critical first step to understanding and using any data you care about.

Finally, Stefan Krah has joined us at Quansight.
Stefan is an award-winning Python core developer who has been steadily working over the past 18 months on a small but powerful collection of projects collectively called Plures. These will be more broadly available in the next few months and published under the xnd brand. Xnd is a generic container concept in C with a Python binding that together with its siblings ndtypes and gumath allows building flexible array-computing pipelines over many kinds of data-types.

This technology will serve to underlie any array-computing framework and be a glue between machine-learning and data-science frameworks of all kinds. Our plan is to use this tool to help reduce the data and computational silos that currently exist across the open-source ecosystem.

There is still much to work on and many more technologies to emerge. It's an exciting time to work in machine learning, data-science, and scientific computing. I'm thrilled that I continue to get the opportunity to be part of it. Let me know if you'd like to be a part of our journey.

## February 28, 2018

### Matthew Rocklin

#### Craft Minimal Bug Reports

Following up on a post on supporting users in open source, this post lists some suggestions on how to ask a maintainer to help you with a problem.

You don’t have to follow these suggestions. They are optional. They make it more likely that a project maintainer will spend time helping you. It’s important to remember that their willingness to support you for free is optional too.

Crafting minimal bug reports is essential for the life and maintenance of community-driven open source projects. Doing this well is an incredible service to the community.

## Minimal Complete Verifiable Examples

I strongly recommend following Stack Overflow’s guidelines on Minimal Complete Verifiable Examples.
I’ll include brief highlights here:

… code should be …

• Minimal – Use as little code as possible that still produces the same problem
• Complete – Provide all parts needed to reproduce the problem
• Verifiable – Test the code you’re about to provide to make sure it reproduces the problem

Let’s be clear, this is hard and takes time. As a question-asker I find that creating an MCVE often takes 10-30 minutes for a simple problem. Fortunately this work is usually straightforward, even if I don’t know very much about the package I’m having trouble with. Most of the work to create a minimal example is about removing all of the code that was specific to my application, and as the question-asker I am probably the most qualified person to do that.

When answering questions I often point people to StackOverflow’s MCVE document. They sometimes come back with a better-but-not-yet-minimal example. This post clarifies a few common issues.

As a running example I’m going to use Pandas dataframe problems.

## Don’t post data

You shouldn’t post the file that you’re working with. Instead, try to see if you can reproduce the problem with just a few lines of data rather than the whole thing.

Having to download a file, unzip it, etc. makes it much less likely that someone will actually run your example in their free time.

### Don’t

I’ve uploaded my data to Dropbox and you can get it here: my-data.csv.gz

    import pandas as pd
    df = pd.read_csv('my-data.csv.gz')

### Do

You should be able to copy-paste the following to get enough of my data to cause the problem:

    import pandas as pd
    df = pd.DataFrame({'account-start': ['2017-02-03', '2017-03-03', '2017-01-01'],
                       'client': ['Alice Anders', 'Bob Baker', 'Charlie Chaplin'],
                       'balance': [-1432.32, 10.43, 30000.00],
                       'db-id': [1234, 2424, 251],
                       'proxy-id': [525, 1525, 2542],
                       'rank': [52, 525, 32],
                       ...
                       })

## Actually don’t include your data at all

Actually, your data probably has lots of information that is very specific to your application.
Your eyes gloss over it but a maintainer doesn’t know what is relevant and what isn’t, so it will take them time to digest it if you include it. Instead see if you can reproduce your same failure with artificial or random data.

### Don’t

Here is enough of my data to reproduce the problem

    import pandas as pd
    df = pd.DataFrame({'account-start': ['2017-02-03', '2017-03-03', '2017-01-01'],
                       'client': ['Alice Anders', 'Bob Baker', 'Charlie Chaplin'],
                       'balance': [-1432.32, 10.43, 30000.00],
                       'db-id': [1234, 2424, 251],
                       'proxy-id': [525, 1525, 2542],
                       'rank': [52, 525, 32],
                       ...
                       })

### Do

My actual problem is about finding the best ranked employee over a certain time period, but we can reproduce the problem with this simpler dataset. Notice that the dates are out of order in this data (2000-01-02 comes after 2000-01-03). I found that this was critical to reproducing the error.

    import pandas as pd
    df = pd.DataFrame({'account-start': ['2000-01-01', '2000-01-03', '2000-01-02'],
                       'db-id': [1, 2, 3],
                       'name': ['Alice', 'Bob', 'Charlie']})

As we shrink down our example problem we often discover a lot about what causes the problem. This discovery is valuable and something that only the question-asker is capable of doing efficiently.

## See how small you can make things

To make it even easier, see how small you can make your data. For example if working with tabular data (like Pandas), then how many columns do you actually need to reproduce the failure? How many rows do you actually need to reproduce the failure? Do the columns need to be named as you have them now or could they be just “A” and “B” or descriptive of the types within?

### Do

    import pandas as pd
    df = pd.DataFrame({'datetime': ['2000-01-03', '2000-01-02'],
                       'id': [1, 2]})

## Remove unnecessary steps

Is every line in your example absolutely necessary to reproduce the error? If you’re able to delete a line of code then please do.
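If your example is a short script whose lines can be run one after another, the "delete a line and check that it still fails" loop can even be mechanized. The sketch below is only an illustration, not something taken from Pandas or any other library: the helper names `minimize_steps` and `raises_zero_division` and the toy script are hypothetical, and the greedy strategy is a much-simplified cousin of delta debugging. It uses `exec`, so only try it on code you trust.

```python
def minimize_steps(steps, reproduces):
    """Greedily drop steps for as long as `reproduces(steps)` stays true."""
    kept = list(steps)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if reproduces(candidate):
            kept = candidate  # step i was not needed; re-check the same index
        else:
            i += 1            # step i is needed; move on to the next one
    return kept


def raises_zero_division(lines):
    """Hypothetical check: does running these lines raise ZeroDivisionError?"""
    namespace = {}
    try:
        exec("\n".join(lines), namespace)
    except ZeroDivisionError:
        return True
    except Exception:
        return False  # a different failure is not the bug we are chasing
    return False


# A toy script with one step that is irrelevant to the failure.
script = [
    "values = [3, 1, 2]",
    "values = sorted(values)",
    "total = sum(values)",
    "count = 0",
    "mean = total / count",
]

minimal = minimize_steps(script, raises_zero_division)
print("\n".join(minimal))
```

Here the sorting step is dropped because the script fails the same way without it. Note that the check looks for the same kind of error rather than any error, so a deletion that merely breaks the script in a new way is rejected.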
Because you already understand your problem you are much more efficient at doing this than the maintainer is. They probably know more about the tool, but you know more about your code.

### Don’t

The groupby step below is raising a warning that I don’t understand

    df = pd.DataFrame(...)
    df = df[df.value > 0]
    df = df.fillna(0)
    df.groupby(df.x).y.mean()  # <-- this produces the error

### Do

The groupby step below is raising a warning that I don’t understand

    df = pd.DataFrame(...)
    df.groupby(df.x).y.mean()  # <-- this produces the error

## Use Syntax Highlighting

When using Github you can enclose code blocks in triple-backticks (the character on the top-left of your keyboard on US-standard QWERTY keyboards). It looks like this:

    ```python
    x = 1
    ```

## Provide complete tracebacks

You know all of that stuff between your code and the exception that is hard to make sense of? You should include it.

### Don’t

    I get a ZeroDivisionError from the following code:

    ```python
    def div(x, y):
        return x / y

    div(1, 0)
    ```

### Do

    I get a ZeroDivisionError from the following code:

    ```python
    def div(x, y):
        return x / y

    div(1, 0)
    ```

    ```python-traceback
    ZeroDivisionError                         Traceback (most recent call last)
    <ipython-input-4-7b96263abbfa> in <module>()
    ----> 1 div(1, 0)

    <ipython-input-3-7685f97b4ce5> in div(x, y)
          1 def div(x, y):
    ----> 2     return x / y
          3

    ZeroDivisionError: division by zero
    ```

If the traceback is long that’s ok. If you really want to be clean you can put it in <details> brackets.

    I get a ZeroDivisionError from the following code:

    ```python
    def div(x, y):
        return x / y

    div(1, 0)
    ```

    ### Traceback

    <details>

    ```python
    ZeroDivisionError                         Traceback (most recent call last)
    <ipython-input-4-7b96263abbfa> in <module>()
    ----> 1 div(1, 0)

    <ipython-input-3-7685f97b4ce5> in div(x, y)
          1 def div(x, y):
    ----> 2     return x / y
          3

    ZeroDivisionError: division by zero
    ```

    </details>

### Ask Questions in Public Places

When raising issues you often have a few possible locations:

1. GitHub issue tracker
2. Stack Overflow
3. Project mailing list
4. Project Chat room
5. E-mail maintainers directly (never do this)

Different projects handle this differently, but they usually have a page on their documentation about where to go for help. This is often labeled “Community”, “Support” or “Where to ask for help”. Here are the recommendations from the Pandas community.

Generally it’s good to ask questions where many maintainers can see your question and help, and where other users can find your question and answer if they encounter a similar bug in the future.

While your goal may be to solve your problem, the maintainer’s goal is likely to create a record of how to solve problems like yours. This helps many more users who will have a similar problem in the future, see your well-crafted bug report, and learn from the resulting conversation.

### My personal preferences

• For user questions like “What is the right way to do X?” I prefer Stack Overflow.
• For bug reports like “I did X, I’m pretty confident that it should work, but I get this error” I prefer Github issues
• For general chit-chat I prefer Gitter, though actually, I personally spend almost no time in gitter because it isn’t easily searchable by future users. If you’ve asked me a question in Gitter I will almost certainly not respond to it, except to direct you to github, stack overflow, or this blogpost.
• I only like personal e-mail if someone is proposing to fund or seriously support the project in some way

But again, different projects do this differently and have different policies. You should check the documentation of the project you’re dealing with to learn how they like to support users.

## February 27, 2018

### Continuum Analytics

#### Introducing Microsoft R Open as Default R for Anaconda Distribution

Although Anaconda, Inc. is best known as the creator of the world’s most popular Python data science platform, for many years we also have been creating conda packages for R.
In September 2017, we announced a partnership with Microsoft that included bringing Microsoft R Open (MRO) to Anaconda users as our default R. We are …

## February 26, 2018

### Titus Brown

#### Assessment report for ANGUS 2017

Note: This is an invited blog post by Dr. Karen Word on our 2017 sequence analysis workshop, ANGUS. Our 2018 schedule has now been posted!

# DIBSI Assessment: preliminary ANGUS report

There are two sorts of stories I find myself telling based on my own experience talking to people & reading through things:

1) This workshop catered to learners of many abilities, and no group was categorically dissatisfied. Satisfaction for these different groups was based on different outcomes, e.g. for novices building a mental model & confidence was considered success; for practitioners validating practices and acquiring new tools, tricks, and contacts was more likely to constitute success.

2) From our perspective, a desirable outcome would be the creation of a community within which learners can support each other in solving problems to minimize the strain on experts. Learners largely did not seem to be seeking this kind of community -- even where they valued socialization, it was rarely with an eye towards technical support. The exception seemed to be active practitioners who had already been seeking support within their own community and were able to understand other learners as a resource for help.

### Basic demographics:

Note: demographics were collected in the pre-assessment only. Individuals who did not take the pre-assessment are not represented here.

The "other" category above was selected mainly by postdocs, but also includes DVM, MD, BS-holding professional, and one "Assistant professor." This is a similar mixture to those cited under "PhD-holding professional" hence with few exceptions these categories could be combined (yellow & green) indicating similar representation of advanced degree-holders relative to graduate students.
### A few response summaries:

People were overwhelmingly satisfied with the workshop:

Curiously, even learners who indicated that they left with "no ability" to install and run genome assembly software still largely felt that their needs had been met:

Other metrics are similarly supportive. 89% of respondents indicated that they learned what they hoped to learn, and 96% say they would recommend the workshop to colleagues.

We asked two open-response content questions on the pre- and post-assessments. One of them was as follows:

Suppose that you are using Illumina to sequence DNA from a mouse sample that should have genetic differences from the mouse reference genome. Discuss one or more approaches you would take to analyze the data, as well as your expected sensitivity to SNPs and indels. Include in your discussion how much you will miss, and how much you find that will be wrong.

This word cloud shows the most common terms from pre-assessment responses to the question above, scrubbed of most terminology present in the question:

vs. Post-assessment responses to same, reflecting common details acquired:

The figure below shows pre vs. post self-evaluation of ability in various relevant subject areas. Substantial changes are evident in all categories except Python programming, a skill that is not directly taught in the workshop.

The figure below shows more specific responses to "I know how to" or "I understand" questions on a Likert-type scale. Progress is evident in all categories, again with python scripting ("Qscriptpy") taking up the rear. The greatest progress is in knowing what relevant tools do.

### A few quotes:

In response to "please comment on the extent to which you learned what you expected to learn":

Success as defined by a learner who does not plan to analyze their own data:

1. I wanted to understand the generic pipeline for analyzing RNA sequencing data- SUCCESS
2. I wanted to learn the fundamental skills needed to use tools in the shell/R- SUCCESS
3. I wanted to feel comfortable talking with bioinformaticists about how they performed analyses, so I can be sure that they did what I need them to do- SUCCESS

Additionally, I received a great introduction to the concepts of "open science" and "making stuff actually reproducible," things which are not emphasized in my program!

Success for learner who has data that have previously been analyzed, but wants to become more independent:

I came with zero to none programming skills and had no idea how to get started with analyzing next-gen sequencing data. I was sitting on RNA-Seq datasets that someone else had analyzed for me and felt so powerless and overwhelmed. Since attending the workshop, I was able to get my own Jetstream allocation, move my data into the new server, do genome assembly --> Diff gene expression all on my own, while generating more than decent plots using R. I feel immensely relieved that I have all the basic resources I need to get started on exploring Bioinformatics and all the possible research directions that can open up as a result of this. I also met some freaking awesome people, helping build a network of skilled Bioinformaticians I can reach out to if I ever need advice/help. For someone coming from an obscure univ which doesn't have any bioinformaticians at all, this is huge. Thank you DIBSI organizers!!

Another learner who wants to analyze their own data and came in with some computational skill:

I work with non model organism and this workshop turned out to be great to answer my queries and how to tackle genome and rna seq data analysis of such organisms. Also, the knowledge of tools used would further help me perform assembly and analysis of my own data set.
I feel way more confident in running the tools for analysis after participating in this workshop In response to "Please comment on how your understanding of computational science changed": The perspective of an advanced learner: This workshop pushed me from being an adequate programmer who can google things and piece together a decent pipeline to get the necessary data, to understanding what's going on under the hood at each step and analyze data quality much better. And another: I had some prior experience of bioinformatics tools used to analysis rna seq and DNA seq data, but this workshop further helped me with understanding new tools such as R studio , markdown and writing own scripts. Whereas from people reporting lower computational skills, one benefit was gaining focus on what was important or learning what they did not need to know: My initial thought was that I had to obtain a year long computational course in order to be able to assess genetic data. With this course I learned that all I need is the tools specific to proceed with the process of mapping, and analyzing data. I Was also impressed that with working on the cloud instance there is no need for a huge memory computer. And many people spoke of reducing a barrier of fear for computational processes and/or seeking help: I have a base understanding now of what forms the data comes in and how to appropriately prepare the reads for downstream analyses. There are many steps in the pipeline and many options for programs. I do feel that I have more confidence to try to tackle analysis of my own data and to write scripts. Accessing the cloud or shell used to be very intimidating for me, but I feel more comfortable now. Most importantly, I have gained enough knowledge to actually be able to ask the appropriate questions of my systems admin and colleagues that are more computationally inclined when I get stuck on certain analyses. 
### Complaints & Suggestions:

A request for consistent embedded pipeline visualization recurs in various responses (this was discussed in Tigers room, possibly explaining this specific trend). Other learners referred to wanting more "big picture" introductions or wanting to know "why" things were being done. Several people would also like to get more practice with the tools or more opportunity to work with their own data. One person observed that we target computational novices well but not necessarily genetics novices.

There were two individuals who indicated that they would not recommend this workshop to colleagues. One of them indicated that they do not have colleagues who work with NGS data. The other indicated objections to instructional style and a sense that the workshop was "too casual" and lacked organization. However, elsewhere they seem almost to be responding on behalf of less experienced learners, stating: "I learn some good tricks and got some good tricks. If this was my first workshop, I will felt lost after the first week."

### A few pertinent recommendations:

I suggest articulating plans for advance embedding of formative assessment in the curriculum since:

• Formative assessment can functionally provide practice with the tools in small ways -- while not providing exactly what is requested, this may resolve the feeling that learners have not "played" with the tools.
• Formative assessment can also raise "why" questions to prompt discussions of broader connectivity as necessary

It would be relatively straightforward to provide a pipeline roadmap and a vocabulary list (or glossary) for computational and genetic terms and abbreviations. (I suggest having these on paper to avoid adding to overcrowded screens)

We had very few people who reported lacking background in genetics, but we also did not survey for this directly.
We should consider whether this is something we plan to address, and consider adding language in the course description if it is not.

Finally, regarding assessment in future years, I recommend that we more directly inquire as to the goals that attendees have coming into the workshop. I also suggest that we ask about ways in which their home community expects to rely upon their training. Given that success appears to take such different forms for different learners, this would help us to more precisely assess the extent to which we are meeting those varying needs. Our plans for retrospective surveys and interviews will also help tease apart the impact that these different kinds of experiences have on careers and communities in the long term.

### numfocus

#### Bloomberg Supports Jupyter as NumFOCUS Platinum Sponsor

## February 23, 2018

### Continuum Analytics

#### Harness the Power of Data Science at AnacondaCON 2018!

Last spring, Anaconda celebrated the inaugural AnacondaCON, where over 400 people descended upon Austin to connect with peers and thought leaders within the Python data science community. This year’s event promises to be even bigger and better! Taking place in Austin on April 8-11, AnacondaCON 2018 is shaping up to be one of the hottest …

## February 20, 2018

### Matthieu Brucher

#### Announcement: ATKAutoSwell 2.0.0

I’m happy to announce the update of ATK Auto Swell based on the Audio Toolkit and JUCE. They are available on Windows (AVX compatible processors) and OS X (min. 10.9, SSE4.2) in different formats.

This plugin requires the universal runtime on Windows, which is automatically deployed with Windows update (see this discussion on the JUCE forum). If you don’t have it installed, please check Microsoft website.

ATK Auto Swell 2.0.0

The supported formats are:

• VST2 (32bits/64bits on Windows, 32/64bits on OS X)
• VST3 (32bits/64bits on Windows, 32/64bits on OS X)
• Audio Unit (32/64bits, OS X)

Direct link for ATKGuitarPreamp.
The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

## February 16, 2018

### numfocus

#### StanCon 2018

The post StanCon 2018 appeared first on NumFOCUS.

## February 15, 2018

### numfocus

#### Applications Are Open for Google Summer of Code with NumFOCUS!

### Continuum Analytics

#### VS Code in Anaconda Distribution 5.1

A few months ago, Anaconda, Inc., creator of the world’s most popular Python data science platform, announced a partnership with Microsoft that included providing Anaconda Distribution users easy access to Visual Studio Code (VS Code). We are pleased to announce that, with the February 15th release of Anaconda Distribution 5.1, this goal is now a …

### Titus Brown

#### Do software and data products advance biology more than papers?

There are many outputs from our lab and our collaborators - off the top of my head, the big ones are:

• papers and preprints
• software
• data sets
• blog posts and tweets
• talk slides and videos
• grant proposal text
• training materials and tutorials
• trainees (core lab members, rotation students, people who attend our workshops, etc)

Traditionally, only the first (papers) and some small part of the last (trainees who get a PhD or do a postdoc in the lab) are explicitly recognized in biology as "products". I personally value all of them to some degree.

In terms of actual effect I believe that software, trainees, blog posts, and training materials are more impactful products than our papers.

In terms of taming the chaos of science, I view advances in our software's capabilities, and the development and evolution of our perspectives on data analysis, as a kind of ratchet that inexorably advances our science.

Papers, unless they accomplish the very difficult task of nailing down a concept and explaining it well, do very little to advance our lab's science.
They are merely artifacts that we produce because they meet metrics, with the side effect of being one relatively ineffective way to communicate methods and results.

A question that I've been considering is this:

To what extent is the focus on papers as a primary output in biology (or at least genomics and bioinformatics) skewing our field's perspectives and slowing progress by distracting us from more useful outputs?

A companion question:

How (if at all) is the rise of software and data products as putative equivalents to papers leading to epistemic confusion as to what constitutes actual progress in biology?

To explain this last point a bit more, it's not clear that many papers really advance biology directly, given the flood of papers and results and the resulting loss of ability to read and comprehend them all in a particular subject. (This is more true in some areas than in others, but you could also argue that big fields are maybe getting subdivided into more narrow fields because of our inability to comprehend the results in big fields.) More and more, the results of papers need to be incorporated into theory (difficult in bio) or databases and software before they become useful in biology. From this perspective, good data and software papers actually advance biology more than a specific finding.

I don't think this is entirely right but I feel like the field is trending in this direction. But most senior people are really focused on papers as outputs and ignore software and data. This makes it hard for me to talk to them sometimes.

Ultimately, of course, insight and cures, for lack of a better word, are the rightful end products of basic research and biomedical science, respectively. So the question is how to get there faster. Are papers the best way? Probably not.

## Some side notes

I've been pretty happy with the way UC Davis handles merit and promotion, in that faculty in my department really get to explain what they're doing and why.
It's not all about papers here, although of course for research-intensive profs that's still a major component.

## Acknowledgements

This blog post was greatly inspired by conversations with Becca Calisi-Rodriguez and Tracy Teal, as well as (as always) the members of the DIB Lab. Thanks!! (I'm not implying that they agree with me, of course!)

I'm particularly indebted to Dr. Tamer Mansour, who, a year ago, said (paraphrased): "This lab is not a research lab. Mostly we train people, and do software engineering. Research is a distinct third." I disagree but it sure was hard to figure out why :)

--titus

## February 13, 2018

### numfocus

#### Cantera Joins NumFOCUS Sponsored Projects

### Matthieu Brucher

#### Announcement: Audio TK 2.3.0

ATK is updated to 2.3.0 with major fixes and improved code coverage (see here). Lots of bugs were fixed during that effort, and the native build on embedded platforms was also fixed. On Linux, CMake builds no longer have to be installed before the Python tests can be run. SIMD filters are now also easier to implement.

Changelog:

2.3.0
* Increased test coverage and fixed lots of small mistakes in the API
* Allow in place Python tests (before make install) on Linux
* Split big files to allow native compilation on embedded platforms

2.2.2
* Fix a TDF2 IIR filter bug when the state was not reinitialized, leading to instabilities
* Fix a bug when delays were changed but not the underlying buffers, leading to buffer underflows
* Adding a new Broadcast filter (filling all SIMD vector lines with the same input value)
* Adding a new Reduce filter (summing all SIMD vector lines to the output value)

2.2.1
* Fix alignment issues in SIMD filters
* Fix SIMD EQ dispatcher export issues on Windows (too many possible filters!)
* Implemented relevant Tools SIMD filters

## February 12, 2018

### Matthew Rocklin

#### Dask Release 0.17.0

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.17.0. This is a significant major release with new features, breaking changes, and stability improvements. This blogpost outlines notable changes since the 0.16.0 release on November 21st.

You can conda install Dask:

    conda install dask -c conda-forge

or pip install from PyPI:

    pip install dask[complete] --upgrade

Full changelogs are available here:

Some notable changes follow.

### Deprecations

• Removed dask.dataframe.rolling_* methods, which were previously deprecated both in dask.dataframe and in pandas. These are replaced with the rolling.* namespace.
• We’ve generally stopped maintenance of the dask-ec2 project to launch dask clusters on Amazon’s EC2 using Salt. We generally recommend Kubernetes instead, both for Amazon’s EC2 and for Google and Azure as well: dask.pydata.org/en/latest/setup/kubernetes.html
• Internal state of the distributed scheduler has changed significantly. This may affect advanced users who were inspecting this state for debugging or diagnostics.

### Task Ordering

As Dask encounters more complex problems from more domains we continually run into problems where its current heuristics do not perform optimally. This release includes a rewrite of our static task prioritization heuristics. This will improve Dask’s ability to traverse complex computations in a way that keeps memory use low.

To aid debugging we also integrated these heuristics into the GraphViz-style plots that come from the visualize method.

    import dask.array as da

    x = da.random.random(...)
    ...
    x.visualize(color='order', cmap='RdBu')

### Nested Joblib

Dask supports parallelizing Scikit-Learn by extending Scikit-Learn’s underlying library for parallelism, Joblib. This allows Dask to distribute some SKLearn algorithms across a cluster just by wrapping them with a context manager.
This relationship has been strengthened, and particular attention has been paid to nesting one parallel computation within another, such as when you train a parallel estimator, like RandomForest, within another parallel computation, like GridSearchCV. Previously this would result in spawning too many threads/processes and generally oversubscribing hardware.

Due to recent combined development within both Joblib and Dask, these sorts of situations can now be resolved efficiently by handing them off to Dask, providing speedups even in single-machine cases:

    from sklearn.externals import joblib
    from sklearn.model_selection import GridSearchCV
    import distributed.joblib  # register the dask joblib backend

    from dask.distributed import Client
    client = Client()

    est = ParallelEstimator()  # placeholder for any estimator that is itself parallel
    gs = GridSearchCV(est)

    with joblib.parallel_backend('dask'):
        gs.fit()

See Tom Augspurger’s recent post with more details about this work:

Thanks to Tom Augspurger, Jim Crist, and Olivier Grisel who did most of this work.

### Scheduler Internal Refactor

The distributed scheduler has been significantly refactored to change it from a forest of dictionaries:

    priority = {'a': 1, 'b': 2, 'c': 3}
    dependencies = {'a': {'b'}, 'b': {'c'}, 'c': []}
    nbytes = {'a': 1000, 'b': 1000, 'c': 28}

To a bunch of objects:

    tasks = {'a': Task('a', priority=1, nbytes=1000, dependencies=...),
             'b': Task('b', priority=2, nbytes=1000, dependencies=...),
             'c': Task('c', priority=3, nbytes=28, dependencies=[])}

(there is much more state than what is listed above, but hopefully the examples above are clear.)

There were a few motivations for this:

1. We wanted to try out Cython and PyPy, for which objects like this might be more effective than dictionaries.
2. We believe that this is probably a bit easier for developers new to the schedulers to understand. The proliferation of state dictionaries was not highly discoverable.

Goal one ended up not working out.
We have not yet been able to make the scheduler significantly faster under Cython or PyPy with this new layout. There is even a slight memory increase with these changes. However we have been happy with the results in code readability, and we hope that others find this useful as well.

Thanks to Antoine Pitrou, who did most of the work here.

### User Priorities

You can now submit tasks with different priorities.

    x = client.submit(f, 1, priority=10)   # Higher priority preferred
    y = client.submit(f, 1, priority=-10)  # Lower priority happens later

To be clear, Dask has always had priorities; they just weren’t easily user-settable. Higher priorities are given precedence. The default priority for all tasks is zero. You can also submit priorities for collections (like arrays and dataframes):

    df = df.persist(priority=5)  # give this computation higher priority.

Several related projects are also undergoing releases:

• Tornado is updating to version 5.0 (there is a beta out now). This is a major change that will put Tornado on the Asyncio event loop in Python 3. It also includes many performance enhancements for high-bandwidth networks.
• Bokeh 0.12.14 was just released. Note that you will need to update Dask to work with this version of Bokeh.
• Daskernetes, a new project for launching Dask on Kubernetes clusters

## Acknowledgements

The following people contributed to the dask/dask repository since the 0.16.0 release on November 14th:

• Albert DeFusco
• Apostolos Vlachopoulos
• castalheiro
• James Bourbeau
• Jon Mease
• Ian Hopkinson
• Jakub Nowacki
• Jim Crist
• John A Kirkham
• Joseph Lin
• Keisuke Fujii
• Martijn Arts
• Martin Durant
• Matthew Rocklin
• Markus Gonser
• Nir
• Rich Signell
• Roman Yurchak
• S. Andrew Sheppard
• sephib
• Stephan Hoyer
• Tom Augspurger
• Uwe L. Korn
• Wei Ji
• Xander Johnson

The following people contributed to the dask/distributed repository since the 1.20.0 release on November 14th:

• Alexander Ford
• Antoine Pitrou
• Brett Naul
• Brian Broll
• Bruce Merry
• Cornelius Riemenschneider
• Daniel Li
• Jim Crist
• Kelvin Yang
• Matthew Rocklin
• Min RK
• rqx
• Russ Bubley
• Scott Sievert
• Tom Augspurger
• Xander Johnson

## February 09, 2018

### numfocus

#### Julia Joins Petaflop Club

### Continuum Analytics

#### Credit Modeling with Dask

I’ve been working with a large retail bank on their credit modeling system. We’re doing interesting work with Dask to manage complex computations (see task graph below) that I’d like to share. This is an example of using Dask for complex problems that are neither a big dataframe nor a big array, but are still …

## February 07, 2018

### numfocus

#### ActiveState Joins NumFOCUS as Corporate Sponsor

## February 06, 2018

### Continuum Analytics

#### Easy Distributed Training with Joblib and Dask

This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I’m thankful to them for hosting me and Anaconda for sending me there. This article will talk about some changes we made to improve training scikit-learn models using a …

### Matthew Rocklin

#### HDF in the Cloud

Multi-dimensional data, such as is commonly stored in HDF and NetCDF formats, is difficult to access on traditional cloud storage platforms. This post outlines the situation, the following possible solutions, and their strengths and weaknesses.

1. Cloud Optimized GeoTIFF: We can use modern and efficient formats from other domains, like Cloud Optimized GeoTIFF
2. HDF + FUSE: Continue using HDF, but mount cloud object stores as a file system with FUSE
3. HDF + Custom Reader: Continue using HDF, but teach it how to read from S3, GCS, ADL, …
4. Build a Distributed Service: Allow others to serve this data behind a web API, built however they think best
5. New Formats for Scientific Data: Design a new format, optimized for scientific data in the cloud

## Not Tabular Data

If your data fits into a tabular format, such that you can use tools like SQL, Pandas, or Spark, then this post is not for you. You should consider Parquet, ORC, or any of a hundred other excellent formats or databases that are well designed for use on cloud storage technologies.

## Multi-Dimensional Data

We’re talking about data that is multi-dimensional and regularly strided. This data often occurs in simulation output (like climate models) and biomedical imaging (like an MRI scan), or wherever data needs to be efficiently accessed across a number of different dimensions (like many complex time series). Here is an image from the popular XArray library to put you in the right frame of mind:

This data is often stored in blocks such that, say, each 100x100x100 chunk of data is stored together, and can be accessed without reading through the rest of the file. A few file formats allow this layout, the most popular of which is the HDF format, which has been the standard for decades and forms the basis for other scientific formats like NetCDF. HDF is a powerful and efficient format capable of handling both complex hierarchical data systems (filesystem-in-a-file) and efficiently blocked numeric arrays. Unfortunately HDF is difficult to access from cloud object stores (like S3), which presents a challenge to the scientific community.

## The Opportunity and Challenge of Cloud Storage

The scientific community generates several petabytes of HDF data annually. Supercomputer simulations (like a large climate model) produce a few petabytes. Planned NASA satellite missions will produce hundreds of petabytes a year of observational data. All of these tend to be stored in HDF.

To increase access, institutions now place this data on the cloud.
Hopefully this generates more social value from existing simulations and observations, as they are ideally now more accessible to any researcher or any company without coordination with the host institution.

Unfortunately, the layout of HDF files makes them difficult to query efficiently on cloud storage systems (like Amazon’s S3, Google’s GCS, or Microsoft’s ADL). The HDF format is complex and metadata is strewn throughout the file, so that a complex sequence of reads is required to reach a specific chunk of data. The only pragmatic way to read a chunk of data from an HDF file today is to use the existing HDF C library, which expects to receive a C FILE object pointing to a normal file system, not a cloud object store. (This is not entirely true, as we’ll see below.)

So organizations like NASA are dumping large amounts of HDF onto Amazon’s S3 that no one can actually read, except by downloading the entire file to their local hard drive, and then pulling out the particular bits that they need with the HDF library. This is inefficient. It misses out on the potential that cloud-hosted public data can offer to our society.

The rest of this post discusses a few of the options to solve this problem, including their advantages and disadvantages.

1. Cloud Optimized GeoTIFF: We can use modern and efficient formats from other domains, like Cloud Optimized GeoTIFF

   Good: Fast, well established

   Bad: Not sophisticated enough to handle some scientific domains

2. HDF + FUSE: Continue using HDF, but mount cloud object stores as a file system with Filesystem in Userspace, aka FUSE

   Good: Works with existing files, no changes to the HDF library necessary, useful in non-HDF contexts as well

   Bad: It’s complex, probably not performance-optimal, and has historically been brittle

3. HDF + Custom Reader: Continue using HDF, but teach it how to read from S3, GCS, ADL, …

   Good: Works with existing files, no complex FUSE tricks

   Bad: Requires plugins to the HDF library and tweaks to downstream libraries (like Python wrappers). Will require effort to make performance optimal

4. Build a Distributed Service: Allow others to serve this data behind a web API, built however they think best

   Good: Lets other groups think about this problem and evolve complex backend solutions while maintaining stable frontend API

   Bad: Complex to write and deploy. Probably not free. Introduces an intermediary between us and our data.

5. New Formats for Scientific Data: Design a new format, optimized for scientific data in the cloud

   Good: Fast, intuitive, and modern

   Bad: Not a community standard

Now we discuss each option in more depth.

## Use Other Formats, like Cloud Optimized GeoTIFF

We could use formats other than HDF and NetCDF that are already well established. The two that I hear most often proposed are Cloud Optimized GeoTIFF and Apache Parquet. Both are efficient, well designed for cloud storage, and well established as community standards. If you haven’t already, I strongly recommend reading Chris Holmes’ (Planet) blog series on Cloud Native Geospatial.

These formats are well designed for cloud storage because they support random access well with relatively few communications and with relatively simple code. If you needed to, you could look at the Cloud Optimized GeoTIFF spec and, within an hour of reading, get an image that you wanted using nothing but a few curl commands. Metadata is in a clear centralized place. That metadata provides enough information to issue further commands to get the relevant bytes from the object store. Those bytes are stored in a format that is easily interpreted by a variety of common tools across all platforms.

However, neither of these formats is sufficiently expressive to handle some of the use cases of HDF and NetCDF.
Recall our image earlier about atmospheric data:

Our data isn’t a parquet table, nor is it a stack of geo-images. While it’s true that you could store any data in these formats, for example by saving each horizontal slice as a GeoTIFF, or each spatial point as a row in a Parquet table, these storage layouts would be inefficient for regular access patterns. Some parts of the scientific community need blocked layouts for multi-dimensional array data.

## HDF and Filesystems in Userspace (FUSE)

We could access HDF data on the cloud now if we were able to convince our operating system that S3 or GCS or ADL were a normal file system. This is a reasonable goal; cloud object stores look and operate much like normal file systems. They have directories that you can list and navigate. They have files/objects that you can copy, move, rename, and from which you can read or write small sections.

We can achieve this using an operating systems trick, FUSE, or Filesystem in Userspace. This allows us to make a program that the operating system treats as a normal file system. Several groups have already done this for a variety of cloud providers. Here is an example with the gcsfs Python library:

    $ pip install gcsfs --upgrade
    $ mkdir /gcs
    $ gcsfuse bucket-name /gcs --background
    Mounting bucket bucket-name to directory /gcs

    $ ls /gcs
    my-file-1.hdf
    my-file-2.hdf
    my-file-3.hdf
    ...

Now we point our HDF library to a NetCDF file in that directory (which actually points to an object on Google Cloud Storage), and it happily uses C File objects to read and write data. The operating system passes the read/write requests to gcsfs, which goes out to the cloud to get data, and then hands it back to the operating system, which hands it to HDF. All normal HDF operations just work, although they may now be significantly slower. The cloud is further away than local disk.

This slowdown is significant because the HDF library makes many small 4kB reads in order to gather the metadata necessary to pull out a chunk of data. Each of those tiny reads made sense when the data was local, but now we’re sending out a web request each time. This means that users can sit for minutes just to open a file.

Fortunately, we can be clever. By buffering and caching data, we can reduce the number of web requests. For example, when asked to download 4kB we actually download 100kB or 1MB. If some of the future 4kB reads are within this 1MB then we can return them immediately. Looking at HDF traces, it looks like we can probably reduce “dozens” of web requests to “a few”.

FUSE also requires elevated operating system permissions, which can introduce challenges if working from Docker containers (which is likely on the cloud). Docker containers running FUSE need to be running in privileged mode. There are some tricks around this, but generally FUSE brings some excess baggage.

## HDF and a Custom Reader

The HDF library doesn’t need to use C File pointers; we can extend it to use other storage backends as well. Virtual File Layers are an extension mechanism within HDF5 that could allow it to target cloud object stores. This has already been done to support Amazon’s S3 object store twice:

1. Once by the HDF group, S3VFD (currently private),
2. Once by Joe Jevnik and Scott Sanderson (Quantopian) at https://h5s3.github.io/h5s3/ (highly experimental)

This provides an alternative to FUSE that is better because it doesn’t require privileged access, but is worse because it only solves this problem for HDF and not all file access.

In either case we’ll need to do look-ahead buffering and caching to get reasonable performance (or see below).

## Centralize Metadata

Alternatively, we might centralize metadata in the HDF file in order to avoid many hops throughout that file. This would remove the need to perform clever file-system caching and buffering tricks.

Here is a brief technical explanation from Joe Jevnik:

Regarding the centralization of metadata: this is already a feature of hdf5 and is used by many of the built-in drivers. This optimization is enabled by setting the H5FD_FEAT_AGGREGATE_METADATA and H5FD_FEAT_ACCUMULATE_METADATA feature flags in your VFL driver’s query function. The first flag says that the hdf5 library should pre-allocate a large region to store metadata, future metadata allocations will be served from this pool. The second flag says that the library should buffer metadata changes into large chunks before dispatching the VFL driver’s write function. Both the default driver (sec2) and h5s3 enable these optimizations. This is further supported by using the H5FD_FLMAP_DICHOTOMY free list option which uses one free list for metadata allocations and another for non-metadata allocations. If you really want to ensure that the metadata is aggregated, even without a repack, you can use the built-in ‘multi’ driver which dispatches different allocation flavors to their own driver.

## Distributed Service

We could offload this problem to a company, like the non-profit HDF group or a for-profit cloud provider like Google, Amazon, or Microsoft. They would solve this problem however they like, and expose a web API that we can hit for our data.
This would be a distributed service of several computers on the cloud near our data that takes our requests for what data we want, performs whatever tricks it deems appropriate to get that data, and then delivers it to us. This fleet of machines will still have to address the problems listed above, but we can let them figure it out, and presumably they’ll learn as they go.

However, this has both technical and social costs. Technically this is complex, and they’ll have to handle a new set of issues around scalability, consistency, and so on that are already solved(ish) in the cloud object stores. Socially this creates an intermediary between us and our data, which we may not want both for reasons of cost and trust.

The HDF group is working on such a service, HSDS, that works on Amazon’s S3 (or anything that looks like S3). They have created a h5pyd library that is a drop-in replacement for the popular h5py Python library.

Presumably a cloud provider, like Amazon, Google, or Microsoft could do this as well. By providing open standards like OpenDAP they might attract more science users onto their platform to more efficiently query their cloud-hosted datasets.

The satellite imagery company Planet already has such a service.

## New Formats for Scientific Data

Alternatively, we can move on from the HDF file format, and invent a new data storage specification that fits cloud storage (or other storage) more cleanly without worrying about supporting the legacy layout of existing HDF files.

This has already been going on, informally, for years. Most often we see people break large arrays into blocks, store each block as a separate object in the cloud object store with a suggestive name, and store a metadata file describing how the blocks relate to each other. This looks something like the following:

    /metadata.json
    /0.0.0.dat
    /0.0.1.dat
    /0.0.2.dat
     ...
    /10.10.8.dat
    /10.10.9.dat
    /10.10.10.dat

There are many variants:

• They might extend this to have groups or sub-arrays in sub-directories.
• They might choose to compress the individual blocks in the .dat files or not.
• They might choose different encoding schemes for the metadata and the binary blobs.

But generally most array people on the cloud do something like this with their research data, and they’ve been doing it for years. It works efficiently, is easy to understand and manage, and transfers well to any cloud platform, onto a local file system, or even into a standalone zip file or small database.

There are two groups that have done this in a more mature way, defining both modular standalone libraries to manage their data, as well as proper specification documents that inform others how to interpret this data in a long-term stable way.

These are both well maintained and well designed libraries (by my judgment), in Python and Java respectively. They offer layouts like the layout above, although with more sophistication. Entertainingly their specs are similar enough that another library, Z5, built a cross-compatible parser for each in C++. This unintended uniformity is a good sign. It means that both developer groups were avoiding creativity, and have converged on a sensible common solution. I encourage you to read the Zarr Spec in particular.

However, technical merits alone are not sufficient to justify a shift in data format, especially for archival datasets of record like the ones we’re discussing. The institutions in charge of this data have multi-decade horizons and so move slowly. For them, moving off the historical community standard would be a major shift.

And so we need to answer a couple of difficult questions:

1. How hard is it to make HDF efficient in the cloud?
2. How hard is it to shift the community to a new standard?

## A Sense of Urgency

These questions are important now.
NASA and other agencies are pushing NetCDF data into the Cloud today and will be increasing these rates substantially in the coming years.

From its current cumulative archive size of almost 22 petabytes (PB), the volume of data in the EOSDIS archive is expected to grow to almost 247 PB by 2025, according to estimates by NASA’s Earth Science Data Systems (ESDS) Program. Over the next five years, the daily ingest rate of data into the EOSDIS archive is expected to reach 114 terabytes (TB) per day, with the upcoming joint NASA/Indian Space Research Organization Synthetic Aperture Radar (NISAR) mission (scheduled for launch by 2021) contributing an estimated 86 TB per day of data when operational.

This is only one example of many agencies in many domains pushing scientific data to the cloud.

## Acknowledgements

Thanks to Joe Jevnik (Quantopian), John Readey (HDF Group), Rich Signell (USGS), and Ryan Abernathey (Columbia University) for their feedback when writing this article. This conversation started within the Pangeo collaboration.

## February 02, 2018

### Continuum Analytics

#### The Case for Numba in Community Code

The numeric Python community should consider adopting Numba more widely within community code. Numba is strong in performance and usability, but historically weak in ease of installation and community trust. This blog post discusses these strengths and weaknesses from the perspective of an OSS library maintainer. It uses other more technical blog posts written on …

## February 01, 2018

### Prabhu Ramachandran

#### VTK-8.1.0 wheels for all platforms on pypi!

I cannot believe it has been 6 years since my last blog post! Anyway, I have some good news to announce here.

In the Python community, VTK has always been somewhat difficult to install (in comparison to pure Python packages). One has been required to either use a specific package management tool or resort to source builds.
This has been a major problem when trying to install tools that rely on VTK, like Mayavi.

During the SciPy 2017 conference held at Austin last year, a few of the Kitware developers, notably Jean-Christophe Fillion-Robin (JC for short), and some of the VTK developers got together with some of us from the SciPy community and decided to try and put together wheels for VTK. JC did the hard work of figuring this out and setting up a nice VTKPythonPackage during the sprints to make this process easy. As of last week (Jan 27, 2018) Mac OS X wheels were not supported.

Last weekend, I finally got the time (thanks to Enthought) to play with JC's work. I figured out how to get the wheels working on OS X. With this, in principle, we could build VTK wheels on all the major platforms. We decided to try and push wheels at least for the major VTK releases. This in itself would be a massive improvement in making VTK easier to install.

Over the last few days, I have built wheels on Linux, OS X, and Windows. All of these are 64 bit wheels for VTK-8.1.0. Now, VTK 8.x adds a C++11 dependency, and so we cannot build these versions of VTK for Python 2.7 on Windows.

So now we have 64 bit wheels on Windows for Python versions 3.5.x and 3.6.x. Unfortunately, 3.4.x required a different Visual Studio installed and I lost patience setting things up on my Windows VM.

On Linux, we have 64 bit wheels for Python 2.7.x, 3.4.x, 3.5.x, and 3.6.x.

On MacOS, we have 64 bit wheels for Python 2.7.x, 3.4.x, 3.5.x, and 3.6.x.

So if you are using a 64 bit Python, you can now do

    $ pip install vtk

and have VTK-8.1.0 installed!
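
As a quick sanity check after the install, something like the following sketch can confirm that the wheel is importable and report which VTK version was picked up. The helper name `installed_vtk_version` is made up for this illustration; `vtk.vtkVersion.GetVTKVersion()` is the version query exposed by the VTK Python wrappers.

```python
def installed_vtk_version():
    """Return the installed VTK version string, or None if VTK is not importable.

    Hypothetical helper for illustration; not part of VTK itself.
    """
    try:
        import vtk
    except ImportError:
        return None
    return vtk.vtkVersion.GetVTKVersion()


if __name__ == "__main__":
    version = installed_vtk_version()
    if version is None:
        print("VTK is not installed in this environment")
    else:
        print("VTK version:", version)
```

Returning `None` instead of raising keeps the check usable in environments where the wheel has not been installed yet.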

This is really nice to have and should hopefully make VTK and other tools a lot easier to install.

A big thank you to JC, the other Kitware developers, the VTK Python developers, especially David Gobbi who has worked on the VTK Python wrappers for many many years now, for making this happen. Apologies if I missed anyone but thank you all!

Enjoy!

## January 30, 2018

### Matthieu Brucher

#### Book review: Python for Finance: Analyze Big Financial Data

Recently, I moved to the finance industry. As usual when I start in a new domain, I look at the Python books for it. And Python for Finance from Yves Hilpisch is one of the best known ones.

#### Discussion

The book is split in 3 unequal parts. The first one is short and presents the usage of Python in the finance industry, how to install it and a few examples of its usage for finance. The Python code is quite simple; strangely, the author decided to go for global variables and almost no parameters. Why not present classes here? At least he uses examples through IPython/Jupyter, so that’s good!

The second part tackles finance applications in Python and the useful modules. Obviously, the first chapter here handles Numpy. I liked the fact that vectorization is an important part here (not using explicit loops). Then of course an important point is visualizing data with plots, and especially time series. The third chapter tackles pandas, a library that was originally written for finance analysis, so obviously it has to be used!
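
To make the vectorization point concrete, here is a minimal sketch that computes log returns of a price series twice, once with an explicit loop and once with NumPy array operations. It is not taken from the book; the prices and function names are made up for the illustration.

```python
import numpy as np

# Made-up closing prices, purely illustrative.
prices = np.array([100.0, 101.5, 99.8, 102.3, 103.1])


def log_returns_loop(p):
    """Log returns computed element by element with an explicit Python loop."""
    out = []
    for i in range(1, len(p)):
        out.append(np.log(p[i] / p[i - 1]))
    return np.array(out)


def log_returns_vectorized(p):
    """Same quantity, expressed as whole-array operations."""
    return np.diff(np.log(p))


loop_result = log_returns_loop(prices)
vectorized_result = log_returns_vectorized(prices)
```

Both functions return the same values; the vectorized form is shorter and pushes the loop into NumPy's compiled code, which is what matters on long time series.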

Strangely, the chapter after that one is about reading and writing data. I’m not really sure it is worth spending so much time on some functions that are already in numpy and pandas. I agree that I/O is important, but I’m not sure it deserves so much space in a Python book, or that the book even needs to talk about SQL.

The next chapter tackles performance in Python. The author compares different ways of making your code faster. I liked the IPython example, as lots of people would work from Jupyter with several available cores behind. The multiprocessing module is nice, but can sometimes be… awkward to use. Not sure that the NumbaPro example was useful, as not many people will be able to use it (I felt this was more an ad than actually useful pages).

After this chapter, we are back to math tools for finance. The strange part is that the previous chapter may not really be used for this chapter. Not many algorithms can be efficiently parallelized when they come out of available packages (except when they are meant for this like sklearn pipeline model). So the chapter here will talk about regression (one of the main tools to understand a trend in time series, although the prediction may be completely bogus), interpolation or optimization. The latter one is what you need for lots of models. Later in the chapter, symbolic computation is also introduced, and I have to say that if you know an analytical approach to a problem, then this is quite effective (I always take a similar route for my electronic models).

The tenth chapter dives into the core of finance maths with stochastic equations (and the Black-Scholes one!). Of course here, it’s basically using random number generators, and then applying some rules on top of them. The chapter after that puts several of the previous topics together, like normality test for stats, or portfolio optimization for… optimization. There is a part on PCA, but I’m biased, I hate PCA since lots of people use it for dimensionality reduction on data that is not Euclidean…
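
As a rough illustration of "random number generators plus some rules on top", the sketch below draws standard normal numbers and maps them to terminal prices under geometric Brownian motion, the price model underlying Black-Scholes. This is not the book's code; the function name and all parameter values are assumptions chosen for the example.

```python
import numpy as np


def simulate_gbm_terminal(s0, r, sigma, t, n_paths, seed=0):
    """Terminal prices S_T = s0 * exp((r - sigma^2 / 2) * t + sigma * sqrt(t) * Z).

    Z is standard normal; all arguments are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_paths)  # the random number generator part
    # The "rule" applied on top: the exact GBM solution at time t.
    return s0 * np.exp((r - 0.5 * sigma**2) * t + sigma * np.sqrt(t) * z)


terminal = simulate_gbm_terminal(s0=100.0, r=0.05, sigma=0.2, t=1.0, n_paths=100_000)
expected_mean = 100.0 * np.exp(0.05 * 1.0)  # E[S_T] = s0 * exp(r * t)
```

With this many paths the sample mean of `terminal` should land close to `expected_mean`, which is a cheap way to check that the drift correction term is right.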

There is also a chapter on Excel, probably because lots of people use it to analyze data, and you need to be able to exchange data with it. I guess.

And then, the chapter where the author finally tackles classes!! Really!! And by saying that it’s an important aspect of Python. That’s what I don’t understand. Especially the way it’s presented. The part with traits is OK, although the online tutorials are just as good.

Then, there is a chapter on web apps; I’m not sure exactly why, to be frank.

After this part with ups and downs, there is a part on creating a derivative library. This is the part where there is some real finance computation, although the author refers back to his other book for the theory itself. The chapters are quite small and try to wrap everything from the previous part in a unique framework.

I just wish this integration was done in the second part instead.

#### Conclusion

So basically the content of the book is on some kind of Python. If you don’t know about finance, you will want to know much more at the end of this book. But if you want to learn about Python, you will know about modules, but actually not about good Python.

So unfortunately, avoid.

### Matthew Rocklin

#### The Case for Numba in Community Code

The numeric Python community should consider adopting Numba more widely within community code.

Numba is strong in performance and usability, but historically weak in ease of installation and community trust. This blogpost discusses these strengths and weaknesses from the perspective of an OSS library maintainer. It uses other more technical blogposts written on the topic as references. It is biased in favor of wider adoption given recent changes to the project.

Let’s start with a wildly unprophetic quote from Jake Vanderplas in 2013:

I’m becoming more and more convinced that Numba is the future of fast scientific computing in Python.

– Jake Vanderplas, 2013-06-15

http://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/

We’ll use the following blogposts by other community members throughout this post. They’re all good reads and are more technical, showing code examples, performance numbers, etc.

At the end of the blogpost these authors will also share some thoughts on Numba today, looking back with some hindsight.

Disclaimer: I work alongside many of the Numba developers within the same company and am partially funded through the same granting institution.

### Compiled code in Python

Many open source numeric Python libraries need to write efficient low-level code that works well on Numpy arrays, but is more complex than the Numpy library itself can express. Typically they use one of the following options:

1. C-extensions: mostly older projects like NumPy and Scipy
2. Cython: probably the current standard for mainline projects, like scikit-learn, pandas, scikit-image, geopandas, and so on
3. Standalone C/C++ codebases with Python wrappers: for newer projects that target inter-language operation, like XTensor and Arrow

Each of these choices has tradeoffs in performance, packaging, attracting new developers and so on. Ideally we want a solution that is …

1. Fast: about as fast as C/Fortran
2. Easy: Is accessible to a broad base of developers and maintainers
3. Builds easily: Introduces few complications in building and packaging
4. Installs easily: Introduces few install and runtime dependencies
5. Trustworthy: Is well trusted within the community, both in terms of governance and long term maintenance

The two main approaches today, Cython and C/C++, both do well on most of these objectives. However neither is perfect. Some issues that arise include the following:

• Cython
• Often requires effort to make fast
• Is often only used by core developers. Requires expertise to use well.
• Introduces mild packaging pain, though this pain is solved frequently enough that experienced community members are used to dealing with it
• Standalone C/C++
• Sometimes introduces complex build and packaging concerns
• Is often only used by core developers. These projects have difficulty attracting the Python community’s standard developer pool (though they do attract developers from other communities).

There are some other options out there like Numba and Pythran that, while they provide excellent performance and usability benefits, are rarely used. Let’s look into Numba’s benefits and drawbacks more closely.

### Numba Benefits

Numba is generally well regarded from a technical perspective (it’s fast, easy to use, well maintained, etc.) but has historically not been trusted due to packaging and community concerns.

In any test of either performance or usability Numba almost always wins (or ties for the win). It does all of the compiler optimization tricks you expect. It supports both for-loopy code as well as Numpy-style slicing and bulk operation code. It requires almost no additional information from the user (assuming that you’re ok with JIT behavior) and so is very approachable, and very easy for novices to use well.

This means that we get phrases like the following:

• https://dionhaefner.github.io/2016/11/suck-less-scientific-python-part-2-efficient-number-crunching/
    • “This is rightaway faster than NumPy.”
    • “In fact, we can infer from this that numba managed to generate pure C code from our function and that it did it already previously.”
    • “Numba delivered the best performance on this problem, while still being easy to use.”
• https://dionhaefner.github.io/2016/11/suck-less-scientific-python-part-2-efficient-number-crunching/
    • “Using numba is very simple; just apply the jit decorator to the function you want to get compiled. In this case, the function code is exactly the same as before”
    • “Wow! A speedup by a factor of about 400, just by applying a decorator to the function.”
• http://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/
    • “Much better! We’re now within about a factor of 3 of the Fortran speed, and we’re still writing pure Python!”
    • “I should emphasize here that I have years of experience with Cython, and in this function I’ve used every Cython optimization there is … By comparison, the Numba version is a simple, unadorned wrapper around plainly-written Python code.”
• http://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/
    • “Numba is extremely simple to use. We just wrap our python function with autojit (JIT stands for “just in time” compilation) to automatically create an efficient, compiled version of the function”
    • “Adding this simple expression speeds up our execution by over a factor of over 1400! For those keeping track, this is about 50% faster than the version of Numba that I tested last August on the same machine.”
    • “The Cython version, despite all the optimization, is a few percent slower than the result of the simple Numba decorator!”
• http://stephanhoyer.com/2015/04/09/numba-vs-cython-how-to-choose/
    • “Numba is usually easier to write for the simple cases where it works”
• https://murillogroupmsu.com/numba-versus-c/
    • “Numba allows for speedups comparable to most compiled languages with almost no effort”
    • “We find that Numba is more than 100 times as fast as basic Python for this application. In fact, using a straight conversion of the basic Python code to C++ is slower than Numba.”

In all cases where authors compared Numba to Cython for numeric code (Cython is probably the standard for these cases) Numba always performs as-well-or-better and is always much simpler to write.

Here is a code example from Jake’s second blogpost:

#### Example: Code Complexity

    # From http://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/

    # Numba                                 # Cython
    import numpy as np                      import numpy as np
    import numba                            cimport cython
                                            from libc.math cimport sqrt

                                            @cython.boundscheck(False)
    @numba.jit                              @cython.wraparound(False)
    def pairwise_python(X):                 def pairwise_cython(double[:, ::1] X):
        M = X.shape[0]                          cdef int M = X.shape[0]
        N = X.shape[1]                          cdef int N = X.shape[1]
                                                cdef double tmp, d
        D = np.empty((M, M), dtype=np.float64)  cdef double[:, ::1] D = np.empty((M, M),
                                                                                 dtype=np.float64)
        for i in range(M):                      for i in range(M):
            for j in range(M):                      for j in range(M):
                d = 0.0                                 d = 0.0
                for k in range(N):                      for k in range(N):
                    tmp = X[i, k] - X[j, k]                 tmp = X[i, k] - X[j, k]
                    d += tmp * tmp                          d += tmp * tmp
                D[i, j] = np.sqrt(d)                    D[i, j] = sqrt(d)
        return D                                return np.asarray(D)


The algorithmic body of each function (the nested for loops) is identical. However the Cython code is more verbose with annotations, both at the function definition (which we would expect for any AOT compiler), but also within the body of the function for various utility variables. The Numba code is just straight Python + Numpy code. We could remove the @numba.jit decorator and step through our function with normal Python.
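
Because the decorated function is ordinary Python, a library could in principle treat Numba as an optional accelerator. The following is a minimal sketch of that idea, not something taken from the referenced blogposts. The fallback `njit` and the toy kernel `sum_of_squared_differences` are hypothetical names introduced for illustration.

```python
try:
    from numba import njit  # real compiler decorator when Numba is installed
except ImportError:
    def njit(func):
        """Hypothetical fallback: leave the function as ordinary Python."""
        return func


@njit
def sum_of_squared_differences(n):
    """Loop-heavy toy kernel; the source is identical with or without Numba."""
    total = 0.0
    for i in range(n):
        diff = i - n / 2.0
        total += diff * diff
    return total


print(sum_of_squared_differences(4))
```

With Numba installed the loop is compiled on first call; without it the same code runs in the interpreter, which is also what makes stepping through it with a debugger straightforward.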

#### Example: Numpy Operations

Additionally Numba lets us use Numpy syntax directly in the function, so for example the following function is well accelerated by Numba, even though it already fits NumPy’s syntax well.

    # from https://flothesof.github.io/optimizing-python-code-numpy-cython-pythran-numba.html
    import numba
    import numpy as np

    @numba.jit
    def laplace_numba(image):
        """Laplace operator in NumPy for 2D images. Accelerated using numba."""
        laplacian = ( image[:-2, 1:-1] + image[2:, 1:-1]
                    + image[1:-1, :-2] + image[1:-1, 2:]
                    - 4*image[1:-1, 1:-1])
        thresh = np.abs(laplacian) > 0.05
        return thresh


Mixing and matching Numpy-style with for-loop style is often helpful when writing complex numeric algorithms.

Benchmarks in these blogposts show that Numba is both simpler to use and often as-fast-or-faster than more commonly used technologies like Cython.

### Numba drawbacks

So, given these advantages why didn’t Jake’s original prophecy hold true?

I believe that there are three primary reasons why Numba has not been more widely adopted among other open source projects:

1. LLVM Dependency: Numba depends on LLVM, which was historically difficult to install without a system package manager (like apt-get, brew) or conda. Library authors are not willing to exclude users that use other packaging toolchains, particularly Python’s standard tool, pip.
2. Community Trust: Numba is largely developed within a single for-profit company (Anaconda Inc.) and its developers are not well known by other library maintainers.
3. Lack of Interpretability: Numba’s output, LLVM, is less well understood by the community than Cython’s output, C (discussed in original-author comments in the last section)

All three of these are excellent reasons to avoid adding a dependency. Technical excellence alone is insufficient, and must be considered alongside community and long-term maintenance concerns.

### But Numba has evolved recently

#### LLVM

Numba now depends on the easier-to-install library llvmlite, which, as of a few months ago, is pip installable with binary wheels on Windows, Mac, and Linux. The llvmlite package is still a heavy-ish runtime dependency (42MB), but that’s significantly less than large Cython libraries like Pandas or SciPy.

If your concern was about the average user’s inability to install Numba, then I think that this concern has been resolved.

#### Community

Numba has three community problems:

1. Development of Numba has traditionally happened within the closed walls of Anaconda Inc (formerly Continuum Analytics)
2. The Numba maintainers are not well known within the broader Python community
3. There used to be a proprietary version, Numba Pro

This combination strongly attached Numba’s image to Continuum’s for-profit ventures, making community-oriented software maintainers understandably wary of dependence, for fear that dependence on this library might be used for Continuum’s financial gain at the expense of community users.

Things have changed significantly.

Numba Pro was abolished years ago. The funding for the project today comes more often from Anaconda Inc. consulting revenue, hardware vendors looking to ensure that Python runs as efficiently as possible on their systems, and from generous donations from the Gordon and Betty Moore foundation to ensure that Numba serves the open source Python community.

Developers outside of Anaconda Inc. now have core commit access, which forces communication to happen in public channels, notably GitHub (which was standard before) and Gitter chat (which is relatively new).

The maintainers are still relatively unknown within the broader community. This isn’t due to any sort of conspiracy, but is instead due more to shyness or having interests outside of OSS. Antoine, Siu, Stan, and Stuart are all considerate, funny, and clever fellows with strong enthusiasm for compilers, OSS, and performance. They are quite responsive on the Numba mailing list should you have any questions or concerns.

If your concern was about Numba trapping users into a for-profit mode, then that seems to have been resolved years ago.

If your concern is more about not knowing who is behind the project then I encourage you to reach out. I would be surprised if you don’t walk away pleased.

### The Continued Cases Against Numba

For completeness, let’s list a number of reasons why it is still quite reasonable to avoid Numba today:

1. It isn’t a community standard
2. Numba hasn’t attracted a wide developer base (compilers are hard), and so is probably still dependent on financial support for paid developers
3. You want to speed up non-numeric code that includes classes, dicts, lists, etc. for which you need Cython or PyPy
4. You want to build a library that is useful outside of Python, and so plan to build most numeric algorithms on C/C++/Fortran
5. You prefer ahead-of-time compilation and want to avoid JIT times
6. While llvmlite is cheaper than LLVM, it’s still 50MB
7. Understanding the compiled results is hard, and you don’t have good familiarity with LLVM
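
Item 5 in the list above refers to "JIT times". Numba compiles a function lazily, the first time it is called with a given combination of argument types, and reuses that compiled version afterwards. The toy decorator below only models that bookkeeping in pure Python. It is an illustrative assumption, not Numba's actual machinery, and `toy_jit` is a hypothetical name.

```python
import functools


def toy_jit(func):
    """Toy model of lazy, per-signature specialization (not Numba's real implementation)."""
    specializations = {}

    @functools.wraps(func)
    def dispatcher(*args):
        signature = tuple(type(arg).__name__ for arg in args)
        if signature not in specializations:
            # A real JIT would generate machine code here; this is the
            # one-time cost paid on the first call with new argument types.
            specializations[signature] = func
        return specializations[signature](*args)

    dispatcher.signatures = specializations
    return dispatcher


@toy_jit
def add(a, b):
    return a + b


add(1, 2)      # first call with (int, int): pays the "compile" cost
add(3, 4)      # same types: reuses the cached specialization
add(1.0, 2.0)  # new types: a second specialization is created
print(sorted(add.signatures))
```

Ahead-of-time compilers pay this cost at build time instead, which is the tradeoff the list item points at.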

### Numba features we didn’t talk about

1. Multi-core parallelism
2. GPUs
3. Run-time Specialization to the CPU you’re running on
4. Easy to swap out for other JIT compilers, like PyPy, if they arise in the future

### Update from the original blogpost authors

After writing the above I reached out both to Stan and Siu from Numba and to the original authors of the referenced blogposts to get some of their impressions now having the benefit of additional experience.

Here are a few choice responses:

• Stan:

I think one of the biggest arguments against Numba still is time. Due to a massive rewrite of the code base, Numba, in its present form, is ~3 years old, which isn’t that old for a project like this. I think it took PyPy at least 5-7 years to reach a point where it was stable enough to really trust. Cython is 10 years old. People have good reason to be conservative with taking on new core dependencies.

• Jake:

One thing I think would be worth fleshing-out a bit (you mention it in the final bullet list) is the fact that numba is kind of a black box from the perspective of the developer. My experience is that it works well for straightforward applications, but when it doesn’t work well it’s *extremely difficult to diagnose what the problem might be.*

Contrast that with Cython, where the html annotation output does wonders for understanding your bottlenecks both at a non-technical level (“this is dark yellow so I should do something different”) and a technical level (“let me look at the C code that was generated”). If there’s a similar tool for numba, I haven’t seen it.

• Florian:

Elaborating on Jake’s answer, I completely agree that Cython’s annotation tool does wonders in terms of understanding your code. In fact, numba does possess this too, but as a command-line utility. I tried to demonstrate this in my blogpost, but exporting the CSS in the final HTML render kind of mangles my blog post so here’s a screenshot:

This is a case where jit(nopython=True) works, so there seems to be no coloring at all.

Florian also pointed to the SciPy 2017 tutorial by Gil Forsyth and Lorena Barba

• Dion:

I hold Numba in high regard, and the speedups impress me every time. I use it quite often to optimize some bottlenecks in our production code or data analysis pipelines (unfortunately not open source). And I love how Numba makes some functions like scipy.optimize.minimize or scipy.ndimage.generic_filter well-usable with minimal effort.

However, I would never use Numba to build larger systems, precisely for the reason Jake mentioned. Subjectively, Numba feels hard to debug, has cryptic error messages, and seemingly inconsistent behavior. It is not a “decorate and forget” solution; instead it always involves plenty of fiddling to get right.

That being said, if I were to build some high-level scientific library à la Astropy with some few performance bottlenecks, I would definitely favor Numba over Cython (and if it’s just to spare myself the headache of getting a working C compiler on Windows).

• Stephan:

I wonder if there are any examples of complex codebases (say >1000 LOC) using Numba. My sense is that this is where Numba’s limitations will start to become more evident, but maybe newer features like jitclass would make this feasible.

As a final take-away, you might want to follow Florian’s advice and watch Gil and Lorena’s tutorial here:

## January 27, 2018

### Matthew Rocklin

#### Write Dumb Code

The best way you can contribute to an open source project is to remove lines of code from it. We should endeavor to write code that a novice programmer can easily understand without explanation or that a maintainer can understand without significant time investment.

As students we attempt increasingly challenging problems with increasingly sophisticated technologies. We first learn loops, then functions, then classes, etc. We are praised as we ascend this hierarchy, writing longer programs with more advanced technology. We learn that experienced programmers use monads while new programmers use for loops.

Then we graduate and find a job or open source project to work on with others. We search for something that we can add, and implement a solution pridefully, using all the tricks that we learned in school.

Ah ha! I can extend this project to do X! And I can use inheritance here! Excellent!

We implement this feature and feel accomplished, and with good reason. Programming in real systems is no small accomplishment. This was certainly my experience. I was excited to write code and proud that I could show off all of the things that I knew how to do to the world. As evidence of my historical love of programming technology, here is a linear algebra language built with another meta-programming language. Notice that no one has touched this code in several years.

However after maintaining code a bit more I now think somewhat differently.

1. We should not seek to build software. Software is the currency that we pay to solve problems, which is our actual goal. We should endeavor to build as little software as possible to solve our problems.
2. We should use technologies that are as simple as possible, so that as many people as possible can use and extend them without needing to understand our advanced techniques. We should use advanced techniques only when we are not smart enough to figure out how to use more common techniques.

Neither of these points is novel. Most people I meet agree with them to some extent, but somehow we forget them when we go to contribute to a new project. The instinct to contribute by building and to demonstrate sophistication often takes over.

## Software is a cost

Every line that you write costs people time. It costs you time to write it of course, but you are willing to make this personal sacrifice. However this code also costs the reviewers their time to understand it. It costs future maintainers and developers their time as they fix and modify your code. They could be spending this time outside in the sunshine or with their family.

So when you add code to a project you should feel meek. It should feel as though you are eating with your family and there isn’t enough food on the table. You should take only what you need and no more. The people with you will respect you for your efforts to restrict yourself. Solving problems with less code is hard, but it is a burden that you take on yourself to lighten the burdens of others.

## Complex technologies are harder to maintain

As students, we demonstrate merit by using increasingly advanced technologies. Our measure of worth depends on our ability to use functions, then classes, then higher order functions, then monads, etc. in public projects. We show off our solutions to our peers and feel pride or shame according to our sophistication.

However when working with a team to solve problems in the world the situation is reversed. Now we strive to solve problems with code that is as simple as possible. When we solve a problem simply we enable junior programmers to extend our solution to solve other problems. Simple code enables others and boosts our impact. We demonstrate our value by solving hard problems with only basic techniques.

Look! I replaced this recursive function with a for loop and it still does everything that we need it to. I know it’s not as clever, but I noticed that the interns were having trouble with it and I thought that this change might help.

If you are a good programmer then you don’t need to demonstrate that you know cool tricks. Instead, you can demonstrate your value by solving a problem in a simple way that enables everyone on your team to contribute in the future.
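
As a small, made-up illustration of the kind of swap described in the quote above (the function names are hypothetical), here is the same computation written recursively and as a plain loop:

```python
def total_recursive(values):
    """Sum a list by recursion: compact, but each call copies the tail of the list."""
    if not values:
        return 0
    return values[0] + total_recursive(values[1:])


def total_loop(values):
    """Same result with a plain loop that a newcomer can step through line by line."""
    total = 0
    for value in values:
        total += value
    return total


data = [3, 1, 4, 1, 5]
assert total_recursive(data) == total_loop(data) == 14
```

Both versions return the same answer; the loop version simply asks less of whoever has to read or modify it next.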

## But moderation, of course

That being said, over-adherence to the “build things with simple tools” dogma can be counterproductive. Often a recursive solution can be much simpler than a for-loop solution, and oftentimes using a Class or a Monad is the right approach. But we should be mindful when using these technologies that we are building for ourselves our own system; a system with which others have had no experience.

## January 24, 2018

### Bruno Pinho

#### Fast and Reliable Top of Atmosphere (TOA) calculations of Landsat-8 data in Python

How to efficiently extract reflectance information from Landsat-8 Level-1 Data Product images.

## January 22, 2018

### numfocus

#### Testing a NumPy-based code on Travis with plain pip and wheels

Installing the scientific Python stack is not the most obvious task in a scientist's routine. This is especially annoying for automated deployments such as for continuous integration testing. I present here a short way to deploy Travis CI testing for a small library that depends only on NumPy.

### The goal

I developed a small library that relies only on Python and NumPy, as a design requirement. I wanted a simple pip-based deployment of my Python package testing via continuous integration, including the version of NumPy of my choice and with no rebuild of NumPy.

I started by performing the tests on my machines, simply issuing python -m pytest when changing the code. This is a limitation, mostly because I am limited to a few Python/NumPy versions.

### How to set up Travis

Travis has instructions and support for Python-based projects. The typical "SciPy stack" is not covered (except for one version of NumPy that ships with their images), so most Python-based scientific software downloads Anaconda or Miniconda as part of their continuous integration testing, getting access to plenty of binary packages.

I have no specific argument against the conda solution, apart from the fact that it is a large dependency in terms of download size, and that I believe "plain pip" is the most general solution for Python and I like to stick to it.

So, I set up Travis with a test matrix for Python 2.7, 3.5 and 3.6. I wanted to test several NumPy versions as well. I couldn't find a lightweight solution (i.e. a nice sample .travis.yml file) as most projects use (ana/mini)conda. Since the arrival of manylinux wheels, it is actually easy to rely on "plain pip" to install NumPy on Travis. Make sure to update pip itself first and to install "wheel" as well.

The timing of the build on travis is between 30 and 80s, so there is obviously no build of NumPy occurring there and this is a reasonable use of resources.

In the example, I exclude NumPy 1.11.0 from the Python 3.6 test because there are no "Python 3.6 NumPy 1.11.0" manylinux wheels.

    language: python

    python:
      - 2.7
      - 3.5
      - 3.6

    env:
      - NUMPY_VERSION=1.11.0
      - NUMPY_VERSION=1.12.1
      - NUMPY_VERSION=1.14.0

    matrix:
      exclude:
        - python: 3.6
          env: NUMPY_VERSION=1.11.0

    script:
      - virtualenv --python=python venv
      - source venv/bin/activate
      - python -m pip install -U pip
      - pip install -U wheel
      - pip install numpy==$NUMPY_VERSION
      - pip install pytest
      - python setup.py build
      - python -m pytest
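
To see which jobs the configuration above should produce, the following sketch expands the Python and NumPy versions into a build matrix and drops the excluded pair. It only mimics the matrix logic in plain Python and is not part of Travis itself; the variable names are illustrative.

```python
from itertools import product

pythons = ["2.7", "3.5", "3.6"]
numpy_versions = ["1.11.0", "1.12.1", "1.14.0"]
excluded = {("3.6", "1.11.0")}  # no manylinux wheel exists for this pair

# Every combination of interpreter and NumPy version, minus the exclusions.
jobs = [
    (py, npv)
    for py, npv in product(pythons, numpy_versions)
    if (py, npv) not in excluded
]

for py, npv in jobs:
    print(f"python {py}  NUMPY_VERSION={npv}")
```

Three Python versions times three NumPy versions gives nine combinations, and the single exclusion leaves eight jobs.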


### Ending

I hope that this solution will be useful to others. If you want to see the repository itself, it is here (with a badge to the travis-ci builds).

The resulting .travis.yml file is really short, which is (in my opinion) a benefit. As SciPy also provides manylinux wheels, this is really a powerful and easy way to deploy. Any scientific package that depends on NumPy/SciPy can use it and add a build of the compiled package with, for instance, an extra dependency on GCC or Cython.

### Matthew Rocklin

#### Pangeo: JupyterHub, Dask, and XArray on the Cloud

This work is supported by Anaconda Inc, the NSF EarthCube program, and UC Berkeley BIDS

A few weeks ago a few of us stood up pangeo.pydata.org, an experimental deployment of JupyterHub, Dask, and XArray on Google Container Engine (GKE) to support atmospheric and oceanographic data analysis on large datasets. This follows on recent work to deploy Dask and XArray for the same workloads on supercomputers. This system is a proof of concept that has taught us a great deal about how to move forward. This blogpost briefly describes the problem and the system, then the collaboration, and finally discusses a number of challenges that we’ll be working on in coming months.

## The Problem

Atmospheric and oceanographic sciences collect (with satellites) and generate (with simulations) large datasets that they would like to analyze with distributed systems. Libraries like Dask and XArray already solve this problem computationally if scientists have their own clusters, but we seek to expand access by deploying on cloud-based systems. We build a system to which people can log in, get Jupyter Notebooks, and launch Dask clusters without much hassle. We hope that this increases access, and connects more scientists with more cloud-based datasets.

## The System

We integrate several pre-existing technologies to build a system where people can log in, get access to a Jupyter notebook, launch distributed compute clusters using Dask, and analyze large datasets stored in the cloud. They have a full user environment available to them through a website, can leverage thousands of cores for computation, and use existing APIs and workflows that look similar to how they work on their laptop.

A video walk-through follows below:

We assembled this system from a number of pieces and technologies:

• JupyterHub: Provides both the ability to launch single-user notebook servers and handles user management for us. In particular we use the KubeSpawner and the excellent documentation at Zero to JupyterHub, which we recommend to anyone interested in this area.
• KubeSpawner: A JupyterHub spawner that makes it easy to launch single-user notebook servers on Kubernetes systems
• JupyterLab: The newer version of the classic notebook, which we use to provide a richer remote user interface, complete with terminals, file management, and more.
• XArray: Provides computation on NetCDF-style data. XArray extends NumPy and Pandas to enable scientists to express complex computations on complex datasets in ways that they find intuitive.
• Dask: Provides the parallel computation behind XArray
• Kubernetes: In case it’s not already clear, all of this is based on Kubernetes, which manages launching programs (like Jupyter notebook servers or Dask workers) on different machines, while handling load balancing, permissions, and so on
• Google Container Engine: Google’s managed Kubernetes service. Every major cloud provider now has such a system, which makes us happy about not relying too heavily on one system
• GCSFS: A Python library providing intuitive access to Google Cloud Storage, either through Python file interfaces or through a FUSE file system
• Zarr: A chunked array storage format that is suitable for the cloud
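
To give a feel for why a chunked format suits object storage, the sketch below works out which chunks a rectangular selection touches, so that only those objects would need to be fetched. The dotted key naming (`"0.1"`) is an assumption modelled on Zarr's version 2 layout, and `chunk_keys` is a hypothetical helper, not part of the Zarr API.

```python
from itertools import product


def chunk_keys(selection, chunk_shape):
    """Return the keys of the chunks touched by a rectangular selection.

    selection: one (start, stop) index range per dimension, stop exclusive.
    chunk_shape: chunk length along each dimension.
    """
    per_dim = [
        range(start // size, (stop - 1) // size + 1)
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return [".".join(map(str, index)) for index in product(*per_dim)]


# Rows 0..9 and columns 150..249 of an array stored in 100 x 100 chunks.
keys = chunk_keys([(0, 10), (150, 250)], chunk_shape=(100, 100))
print(keys)
```

Each key corresponds to one independently stored object, so a reader can request exactly the chunks it needs, and different workers can fetch different chunks in parallel.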

## Collaboration

We were able to build, deploy, and use this system to answer real science questions in a couple weeks. We feel that this result is significant in its own right, and is largely because we collaborated widely. This project required the expertise of several individuals across several projects, institutions, and funding sources. Here are a few examples of who did what from which organization. We list institutions and positions mostly to show the roles involved.

• Alistair Miles, Professor, Oxford: Helped to optimize Zarr for XArray on GCS
• Jacob Tomlinson, Staff, UK Met Informatics Lab: Developed original JADE deployment and early Dask-Kubernetes work.
• Joe Hamman, Postdoc, National Center for Atmospheric Research: Provided scientific use case, data, and workflow. Tuned XArray and Zarr for efficient data storing and saving.
• Martin Durant, Software developer, Anaconda Inc.: Tuned GCSFS for many-access workloads. Also provided FUSE system for NetCDF support
• Matt Pryor, Staff, Centre for Environmental Data Analysis: Extended original JADE deployment and early Dask-Kubernetes work.
• Matthew Rocklin, Software Developer, Anaconda Inc.: Integration. Also performance testing.
• Ryan Abernathey, Assistant Professor, Columbia University: XArray + Zarr support, scientific use cases, coordination
• Stephan Hoyer, Software engineer, Google: XArray support
• Yuvi Panda, Staff, UC Berkeley BIDS and Data Science Education Program: Provided assistance configuring JupyterHub with KubeSpawner. Also prototyped the Daskernetes Dask + Kubernetes tool.

Notice the mix of academic and for-profit institutions. Also notice the mix of scientists, staff, and professional software developers. We believe that this mixture helps ensure the efficient construction of useful solutions.

## Lessons

This experiment has taught us a few things that we hope to explore further:

1. Users can launch Kubernetes deployments from Kubernetes pods, such as launching Dask clusters from their JupyterHub single-user notebooks.

To do this well we need to start defining user roles more explicitly within JupyterHub. We need to give users a safe and isolated space on the cluster to use without affecting their neighbors.

2. HDF5 and NetCDF on cloud storage is an open question

The file formats used for this sort of data are pervasive, but not particularly convenient or efficient on cloud storage. In particular the libraries used to read them make many small reads, each of which is costly when operating on cloud object storage.

I see a few options:

    1. Use FUSE file systems, but tune them with tricks like read-ahead and caching in order to compensate for HDF’s access patterns
    2. Use the HDF group’s proposed HSDS service, which promises to resolve these issues
    3. Adopt new file formats that are more cloud friendly. Zarr is one such example that has so far performed admirably, but certainly doesn’t have the long history of trust that HDF and NetCDF have earned.
3. Environment customization is important and tricky, especially when adding distributed computing.

Immediately after showing this to science groups they want to try it out with their own software environments. They can do this easily in their notebook session with tools like pip or conda, but to apply those same changes to their dask workers is a bit more challenging, especially when those workers come and go dynamically.

We have solutions for this. They can build and publish docker images. They can add environment variables to specify extra pip or conda packages. They can deploy their own pangeo deployment for their own group.

However these have all taken some work to do well so far. We hope that some combination of Binder-like publishing and small modification tricks like environment variables resolve this problem.

4. Our docker images are very large. This means that users sometimes need to wait a minute or more for their session or their dask workers to start up (less after things have warmed up a bit).

It is surprising how much of this comes from conda and node packages. We hope to resolve this both by improving our Docker hygiene and by engaging packaging communities to audit package size.

5. Explore other clouds

We started with Google just because their Kubernetes support has been around the longest, but all major cloud providers (Google, AWS, Azure) now provide some level of managed Kubernetes support. Everything we’ve done has been cloud-vendor agnostic, and various groups with data already on other clouds have reached out and are starting deployment on those systems.

6. Combine efforts with other groups

We’re actually not the first group to do this. The UK Met Informatics Lab quietly built a similar prototype, JADE (Jupyter and Dask Environment), many months ago. We’re now collaborating to merge efforts.

It’s also worth mentioning that they prototyped the first iteration of Daskernetes.

7. Reach out to other communities

While we started our collaboration with atmospheric and oceanographic scientists, these same solutions apply to many other disciplines. We should investigate other fields and start collaborations with those communities.

8. Improve Dask + XArray algorithms

When we try new problems in new environments we often uncover new opportunities to improve Dask’s internal scheduling algorithms. This case is no different :)

Much of this upcoming work is happening in the upstream projects, so this experimentation is both of concrete use to ongoing scientific research and of broader use to the open source communities that these projects serve.

## Community uptake

We presented this at a couple of conferences over the past week.

We found that this project aligns well with current efforts from many government agencies to publish large datasets on cloud stores (mostly S3). Many of these data publication endeavors seek a computational system to enable access for the scientific public. Our project seems to complement these needs without significant coordination.

## Disclaimers

While we encourage people to try out pangeo.pydata.org, we also warn you that this system is immature. In particular, it has the following issues:

1. it is insecure; please do not host sensitive data
2. it is unstable, and may be taken down at any time
3. it is small; we only have a handful of cores deployed at any time, mostly for experimentation purposes

However, it is also open, and instructions to deploy your own live here.

## Come help

We are a growing group comprised of many institutions including technologists, scientists, and open source projects. There is plenty to do and plenty to discuss. Please engage with us at github.com/pangeo-data/pangeo/issues/new