February 20, 2018

Matthieu Brucher

Announcement: ATKAutoSwell 2.0.0

I’m happy to announce the update of ATK Auto Swell, based on the Audio Toolkit and JUCE. It is available on Windows (AVX-compatible processors) and OS X (min. 10.9, SSE4.2) in different formats.

This plugin requires the universal runtime on Windows, which is automatically deployed with Windows Update (see this discussion on the JUCE forum). If you don’t have it installed, please check the Microsoft website.

ATK Auto Swell 2.0.0

The supported formats are:

  • VST2 (32bits/64bits on Windows, 32/64bits on OS X)
  • VST3 (32bits/64bits on Windows, 32/64bits on OS X)
  • Audio Unit (32/64bits, OS X)

Direct link for ATKGuitarPreamp.

The files, as well as the previous plugins and the source code, can be downloaded from SourceForge.


by Matt at February 20, 2018 08:09 AM

February 15, 2018

Continuum Analytics

VS Code in Anaconda Distribution 5.1

A few months ago, Anaconda, Inc., creator of the world’s most popular Python data science platform, announced a partnership with Microsoft that included providing Anaconda Distribution users easy access to Visual Studio Code (VS Code). We are pleased to announce that, with the February 15th release of Anaconda Distribution 5.1, this goal is now a …
Read more →

by Rory Merritt at February 15, 2018 04:39 PM

Titus Brown

Do software and data products advance biology more than papers?

There are many outputs from our lab and our collaborators - off the top of my head, the big ones are:

  • papers and preprints
  • software
  • data sets
  • blog posts and tweets
  • talk slides and videos
  • grant proposal text
  • training materials and tutorials
  • trainees (core lab members, rotation students, people who attend our workshops, etc)

Traditionally, only the first (papers) and some small part of the last (trainees who get a PhD or do a postdoc in the lab) are explicitly recognized in biology as "products". I personally value all of them to some degree.

In terms of actual effect I believe that software, trainees, blog posts, and training materials are more impactful products than our papers.

In terms of taming the chaos of science, I view advances in our software's capabilities, and the development and evolution of our perspectives on data analysis, as a kind of ratchet that inexorably advances our science.

Papers, unless they accomplish the very difficult task of nailing down a concept and explaining it well, do very little to advance our lab's science. They are merely artifacts that we produce because they meet metrics, with the side effect of being one relatively ineffective way to communicate methods and results.

A question that I've been considering is this:

To what extent is the focus on papers as a primary output in biology (or at least genomics and bioinformatics) skewing our field's perspectives and slowing progress by distracting us from more useful outputs?

A companion question:

How (if at all) is the rise of software and data products as putative equivalents to papers leading to epistemic confusion as to what constitutes actual progress in biology?

To explain this last point a bit more,

it's not clear that many papers really advance biology directly, given the flood of papers and results and the resulting loss of ability to read and comprehend them all in a particular subject. (This is more true in some areas than in others, but you could also argue that big fields are maybe getting subdivided into more narrow fields because of our inability to comprehend the results in big fields.)

More and more, the results of papers need to be incorporated into theory (difficult in bio) or databases and software before they become useful in biology.

From this perspective, good data and software papers actually advance biology more than a specific finding.

I don't think this is entirely right but I feel like the field is trending in this direction.

But most senior people are really focused on papers as outputs and ignore software and data. This makes it hard for me to talk to them sometimes.

Ultimately, of course, insight and cures, for lack of a better word, are the rightful end products of basic research and biomedical science, respectively. So the question is how to get there faster.

Are papers the best way? Probably not.

Some side notes

I've been pretty happy with the way UC Davis handles merit and promotion, in that faculty in my department really get to explain what they're doing and why. It's not all about papers here, although of course for research-intensive profs that's still a major component.

Acknowledgements

This blog post was greatly inspired by conversations with Becca Calisi-Rodriguez and Tracy Teal, as well as (as always) the members of the DIB Lab. Thanks!! (I'm not implying that they agree with me, of course!)

I'm particularly indebted to Dr. Tamer Mansour, who, a year ago, said (paraphrased): "This lab is not a research lab. Mostly we train people, and do software engineering. Research is a distinct third." I disagree but it sure was hard to figure out why :)

--titus

by C. Titus Brown at February 15, 2018 09:20 AM

February 13, 2018

Matthieu Brucher

Announcement: Audio TK 2.3.0

ATK is updated to 2.3.0 with major fixes and code coverage improvement (see here). Lots of bugs were fixed during that effort and native build on embedded platforms was also fixed.

CMake builds on Linux no longer need to be installed before the Python tests can be run. SIMD filters are now also easier to implement.

Download link: ATK 2.3.0

Changelog:
2.3.0
* Increased test coverage and fix lots of small mistakes in the API
* Allow in place Python tests (before make install) on Linux
* Split big files to allow native compilation on embedded platforms
2.2.2
* Fix a TDF2 IIR filter bug when the state was not reinitialized, leading to instabilities
* Fix a bug when delays were changed but not the underlying buffers, leading to buffer underflows
* Adding a new Broadcast filter (filling all SIMD vector lines with the same input value)
* Adding a new Reduce filter (summing all SIMD vector lines to the output value)
2.2.1
* Fix alignment issues in SIMD filters
* Fix SIMD EQ dispatcher export issues on Windows (too many possible filters!)
* Implemented relevant Tools SIMD filters


by Matt at February 13, 2018 08:00 AM

February 12, 2018

Matthew Rocklin

Dask Release 0.17.0

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.17.0. This is a significant major release with new features, breaking changes, and stability improvements. This blogpost outlines notable changes since the 0.16.0 release on November 21st.

You can conda install Dask:

conda install dask -c conda-forge

or pip install from PyPI:

pip install dask[complete] --upgrade

Full changelogs are available here:

Some notable changes follow.

Deprecations

  • Removed dask.dataframe.rolling_* methods, which were previously deprecated both in dask.dataframe and in pandas. These are replaced with the rolling.* namespace; see the sketch after this list.
  • We’ve generally stopped maintenance of the dask-ec2 project for launching Dask clusters on Amazon’s EC2 using Salt. We now recommend Kubernetes instead, both for Amazon’s EC2 and for Google and Azure:

    dask.pydata.org/en/latest/setup/kubernetes.html

  • Internal state of the distributed scheduler has changed significantly. This may affect advanced users who were inspecting this state for debugging or diagnostics.
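
Here is a hedged sketch of the rolling migration mentioned in the first bullet; the toy dataframe is illustrative and not taken from the release notes:

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)

# Before (removed): dd.rolling_mean(df.x, 3)
# Now: use the rolling namespace, mirroring pandas
result = df.x.rolling(window=3).mean().compute()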

Task Ordering

As Dask encounters more complex problems from more domains we continually run into problems where its current heuristics do not perform optimally. This release includes a rewrite of our static task prioritization heuristics. This will improve Dask’s ability to traverse complex computations in a way that keeps memory use low.

To aid debugging we also integrated these heuristics into the GraphViz-style plots that come from the visualize method.

x = da.random.random(...)
...
x.visualize(color='order', cmap='RdBu')

Nested Joblib

Dask supports parallelizing Scikit-Learn by extending Scikit-Learn’s underlying library for parallelism, Joblib. This allows Dask to distribute some SKLearn algorithms across a cluster just by wrapping them with a context manager.

This relationship has been strengthened, and particular attention has been focused when nesting one parallel computation within another, such as occurs when you train a parallel estimator, like RandomForest, within another parallel computation, like GridSearchCV. Previously this would result in spawning too many threads/processes and generally oversubscribing hardware.

Due to recent combined development within both Joblib and Dask, these sorts of situations can now be resolved efficiently by handing them off to Dask, providing speedups even in single-machine cases:

from sklearn.externals import joblib
from sklearn.model_selection import GridSearchCV
import distributed.joblib  # register the dask joblib backend

from dask.distributed import Client
client = Client()

# ParallelEstimator stands in for any parallel scikit-learn estimator
# (e.g. a random forest); data arguments to fit() are omitted here
est = ParallelEstimator()
gs = GridSearchCV(est)

with joblib.parallel_backend('dask'):
    gs.fit()

See Tom Augspurger’s recent post with more details about this work:

Thanks to Tom Augspurger, Jim Crist, and Olivier Grisel who did most of this work.

Scheduler Internal Refactor

The distributed scheduler has been significantly refactored to change it from a forest of dictionaries:

priority = {'a': 1, 'b': 2, 'c': 3}
dependencies = {'a': {'b'}, 'b': {'c'}, 'c': set()}
nbytes = {'a': 1000, 'b': 1000, 'c': 28}

To a bunch of objects:

tasks = {'a': Task('a', priority=1, nbytes=1000, dependencies=...),
         'b': Task('b', priority=2, nbytes=1000, dependencies=...),
         'c': Task('c', priority=3, nbytes=28, dependencies=[])}

(there is much more state than what is listed above, but hopefully the examples above are clear.)

There were a few motivations for this:

  1. We wanted to try out Cython and PyPy, for which objects like this might be more effective than dictionaries.
  2. We believe that this is probably a bit easier for developers new to the schedulers to understand. The proliferation of state dictionaries was not highly discoverable.

Goal one ended up not working out. We have not yet been able to make the scheduler significantly faster under Cython or PyPy with this new layout. There is even a slight memory increase with these changes. However we have been happy with the results in code readability, and we hope that others find this useful as well.

Thanks to Antoine Pitrou, who did most of the work here.

User Priorities

You can now submit tasks with different priorities.

x = client.submit(f, 1, priority=10)   # Higher priority preferred
y = client.submit(f, 1, priority=-10)  # Lower priority happens later

To be clear, Dask has always had priorities; they just weren’t easily user-settable. Higher priorities are given precedence. The default priority for all tasks is zero. You can also submit priorities for collections (like arrays and dataframes):

df = df.persist(priority=5)  # give this computation higher priority.

Several related projects are also undergoing releases:

  • Tornado is updating to version 5.0 (there is a beta out now). This is a major change that will put Tornado on the Asyncio event loop in Python 3. It also includes many performance enhancements for high-bandwidth networks.
  • Bokeh 0.12.14 was just released.

    Note that you will need to update Dask to work with this version of Bokeh.

  • Daskernetes, a new project for launching Dask on Kubernetes clusters

Acknowledgements

The following people contributed to the dask/dask repository since the 0.16.0 release on November 14th:

  • Albert DeFusco
  • Apostolos Vlachopoulos
  • castalheiro
  • James Bourbeau
  • Jon Mease
  • Ian Hopkinson
  • Jakub Nowacki
  • Jim Crist
  • John A Kirkham
  • Joseph Lin
  • Keisuke Fujii
  • Martijn Arts
  • Martin Durant
  • Matthew Rocklin
  • Markus Gonser
  • Nir
  • Rich Signell
  • Roman Yurchak
  • S. Andrew Sheppard
  • sephib
  • Stephan Hoyer
  • Tom Augspurger
  • Uwe L. Korn
  • Wei Ji
  • Xander Johnson

The following people contributed to the dask/distributed repository since the 1.20.0 release on November 14th:

  • Alexander Ford
  • Antoine Pitrou
  • Brett Naul
  • Brian Broll
  • Bruce Merry
  • Cornelius Riemenschneider
  • Daniel Li
  • Jim Crist
  • Kelvin Yang
  • Matthew Rocklin
  • Min RK
  • rqx
  • Russ Bubley
  • Scott Sievert
  • Tom Augspurger
  • Xander Johnson

February 12, 2018 12:00 AM

February 09, 2018

Continuum Analytics

Credit Modeling with Dask

I’ve been working with a large retail bank on their credit modeling system. We’re doing interesting work with Dask to manage complex computations (see task graph below) that I’d like to share. This is an example of using Dask for complex problems that are neither a big dataframe nor a big array, but are still …
Read more →

by Rory Merritt at February 09, 2018 03:56 PM

February 06, 2018

Continuum Analytics

Easy Distributed Training with Joblib and Dask

This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I’m thankful to them for hosting me and Anaconda for sending me there. This article will talk about some changes we made to improve training scikit-learn models using a …
Read more →

by Rory Merritt at February 06, 2018 03:32 PM

Matthew Rocklin

HDF in the Cloud

Multi-dimensional data, such as is commonly stored in HDF and NetCDF formats, is difficult to access on traditional cloud storage platforms. This post outlines the situation and the following possible solutions, along with their strengths and weaknesses.

  1. Cloud Optimized GeoTIFF: We can use modern and efficient formats from other domains, like Cloud Optimized GeoTIFF
  2. HDF + FUSE: Continue using HDF, but mount cloud object stores as a file system with FUSE
  3. HDF + Custom Reader: Continue using HDF, but teach it how to read from S3, GCS, ADL, …
  4. Build a Distributed Service: Allow others to serve this data behind a web API, built however they think best
  5. New Formats for Scientific Data: Design a new format, optimized for scientific data in the cloud

Not Tabular Data

If your data fits into a tabular format, such that you can use tools like SQL, Pandas, or Spark, then this post is not for you. You should consider Parquet, ORC, or any of a hundred other excellent formats or databases that are well designed for use on cloud storage technologies.

Multi-Dimensional Data

We’re talking about data that is multi-dimensional and regularly strided. This data often occurs in simulation output (like climate models), biomedical imaging (like an MRI scan), or needs to be efficiently accessed across a number of different dimensions (like many complex time series). Here is an image from the popular XArray library to put you in the right frame of mind:

This data is often stored in blocks such that, say, each 100x100x100 chunk of data is stored together, and can be accessed without reading through the rest of the file.

A few file formats allow this layout, the most popular of which is the HDF format, which has been the standard for decades and forms the basis for other scientific formats like NetCDF. HDF is a powerful and efficient format capable of handling both complex hierarchical data systems (filesystem-in-a-file) and efficiently blocked numeric arrays. Unfortunately HDF is difficult to access from cloud object stores (like S3), which presents a challenge to the scientific community.
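
For concreteness, here is a minimal h5py sketch of that blocked layout; the file and dataset names are illustrative, not tied to any particular archive:

import numpy as np
import h5py

# Each 100x100x100 chunk is stored contiguously on disk
with h5py.File('example.h5', 'w') as f:
    dset = f.create_dataset('temperature', shape=(1000, 1000, 1000),
                            chunks=(100, 100, 100), dtype='f4')
    dset[0:100, 0:100, 0:100] = np.random.random((100, 100, 100))

# Reading a single chunk does not require scanning the rest of the file
with h5py.File('example.h5', 'r') as f:
    block = f['temperature'][0:100, 0:100, 0:100]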

The Opportunity and Challenge of Cloud Storage

The scientific community generates several petabytes of HDF data annually. Supercomputer simulations (like a large climate model) produce a few petabytes. Planned NASA satellite missions will produce hundreds of petabytes a year of observational data. All of these tend to be stored in HDF.

To increase access, institutions now place this data on the cloud. Hopefully this generates more social value from existing simulations and observations, as they are ideally now more accessible to any researcher or any company without coordination with the host institution.

Unfortunately, the layout of HDF files makes them difficult to query efficiently on cloud storage systems (like Amazon’s S3, Google’s GCS, or Microsoft’s ADL). The HDF format is complex and metadata is strewn throughout the file, so that a complex sequence of reads is required to reach a specific chunk of data. The only pragmatic way to read a chunk of data from an HDF file today is to use the existing HDF C library, which expects to receive a C FILE object, pointing to a normal file system (not a cloud object store) (this is not entirely true, as we’ll see below).

So organizations like NASA are dumping large amounts of HDF onto Amazon’s S3 that no one can actually read, except by downloading the entire file to their local hard drive, and then pulling out the particular bits that they need with the HDF library. This is inefficient. It misses out on the potential that cloud-hosted public data can offer to our society.

The rest of this post discusses a few of the options to solve this problem, including their advantages and disadvantages.

  1. Cloud Optimized GeoTIFF: We can use modern and efficient formats from other domains, like Cloud Optimized GeoTIFF

    Good: Fast, well established

    Bad: Not sophisticated enough to handle some scientific domains

  2. HDF + FUSE: Continue using HDF, but mount cloud object stores as a file system with Filesystem in Userspace, aka FUSE

    Good: Works with existing files, no changes to the HDF library necessary, useful in non-HDF contexts as well

    Bad: It’s complex, probably not performance-optimal, and has historically been brittle

  3. HDF + Custom Reader: Continue using HDF, but teach it how to read from S3, GCS, ADL, …

    Good: Works with existing files, no complex FUSE tricks

    Bad: Requires plugins to the HDF library and tweaks to downstream libraries (like Python wrappers). Will require effort to make performance optimal

  4. Build a Distributed Service: Allow others to serve this data behind a web API, built however they think best

    Good: Lets other groups think about this problem and evolve complex backend solutions while maintaining stable frontend API

    Bad: Complex to write and deploy. Probably not free. Introduces an intermediary between us and our data.

  5. New Formats for Scientific Data: Design a new format, optimized for scientific data in the cloud

    Good: Fast, intuitive, and modern

    Bad: Not a community standard

Now we discuss each option in more depth.

Use Other Formats, like Cloud Optimized GeoTIFF

We could use formats other than HDF and NetCDF that are already well established. The two that I hear most often proposed are Cloud Optimized GeoTIFF and Apache Parquet. Both are efficient, well designed for cloud storage, and well established as community standards. If you haven’t already, I strongly recommend reading Chris Holmes’ (Planet) blog series on Cloud Native Geospatial.

These formats are well designed for cloud storage because they support random access well with relatively few communications and with relatively simple code. If you needed to you could look at the Cloud Optimized GeoTIFF spec, and within an hour of reading, get an image that you wanted using nothing but a few curl commands. Metadata is in a clear centralized place. That metadata provides enough information to issue further commands to get the relevant bytes from the object store. Those bytes are stored in a format that is easily interpreted by a variety of common tools across all platforms.
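
To illustrate why this matters, here is a rough sketch of that kind of random access using plain HTTP range requests; the URL is hypothetical and the byte range arbitrary:

import requests

url = 'https://example.com/imagery/scene.tif'  # hypothetical cloud-hosted COG

# Fetch only the first 16 kB, which for a Cloud Optimized GeoTIFF is enough
# to read the centralized metadata and plan further ranged reads.
response = requests.get(url, headers={'Range': 'bytes=0-16383'})
print(response.status_code)  # 206 Partial Content when range requests are supported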

However, neither of these formats is sufficiently expressive to handle some of the use cases of HDF and NetCDF. Recall our image earlier about atmospheric data:

Our data isn’t a parquet table, nor is it a stack of geo-images. While it’s true that you could store any data in these formats, for example by saving each horizontal slice as a GeoTIFF, or each spatial point as a row in a Parquet table, these storage layouts would be inefficient for regular access patterns. Some parts of the scientific community need blocked layouts for multi-dimensional array data.

HDF and Filesystems in Userspace (FUSE)

We could access HDF data on the cloud now if we were able to convince our operating system that S3 or GCS or ADL were a normal file system. This is a reasonable goal; cloud object stores look and operate much like normal file systems. They have directories that you can list and navigate. They have files/objects that you can copy, move, rename, and from which you can read or write small sections.

We can achieve this using an operating-system trick, FUSE, or Filesystem in Userspace. This allows us to make a program that the operating system treats as a normal file system. Several groups have already done this for a variety of cloud providers. Here is an example with the gcsfs Python library:

$ pip install gcsfs --upgrade
$ mkdir /gcs
$ gcsfuse bucket-name /gcs --background
Mounting bucket bucket-name to directory /gcs

$ ls /gcs
my-file-1.hdf
my-file-2.hdf
my-file-3.hdf
...

Now we point our HDF library to a NetCDF file in that directory (which actually points to an object on Google Cloud Storage), and it happily uses C File objects to read and write data. The operating system passes the read/write requests to gcsfs, which goes out to the cloud to get data, and then hands it back to the operating system, which hands it to HDF. All normal HDF operations just work, although they may now be significantly slower. The cloud is further away than local disk.

This slowdown is significant because the HDF library makes many small 4kB reads in order to gather the metadata necessary to pull out a chunk of data. Each of those tiny reads made sense when the data was local, but now each one triggers a web request. This means that users can sit for minutes just to open a file.

Fortunately, we can be clever. By buffering and caching data, we can reduce the number of web requests. For example, when asked to download 4kB we actually download 100kB or 1MB. If some of the future 4kB reads fall within this 1MB then we can return them immediately. Looking at HDF traces, it looks like we can probably reduce “dozens” of web requests to “a few”.
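
Here is a minimal sketch of that buffering idea; raw_read is a hypothetical function that performs one web request against the object store, and reads spanning two blocks are ignored for brevity:

BLOCK = 1024 * 1024  # fetch 1 MB at a time, even for tiny reads
cache = {}

def cached_read(offset, size):
    start = (offset // BLOCK) * BLOCK
    if start not in cache:
        cache[start] = raw_read(start, BLOCK)  # one web request per 1 MB block
    data = cache[start]
    begin = offset - start
    return data[begin:begin + size]  # serve the small read from memory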

FUSE also requires elevated operating system permissions, which can introduce challenges if working from Docker containers (which is likely on the cloud). Docker containers running FUSE need to be running in privileged mode. There are some tricks around this, but generally FUSE brings some excess baggage.

HDF and a Custom Reader

The HDF library doesn’t need to use C File pointers; we can extend it to use other storage backends as well. Virtual File Layers are an extension mechanism within HDF5 that could allow it to target cloud object stores. This has already been done to support Amazon’s S3 object store twice:

  1. Once by the HDF group, S3VFD (currently private),
  2. Once by Joe Jevnik and Scott Sanderson (Quantopian) at https://h5s3.github.io/h5s3/ (highly experimental)

This provides an alternative to FUSE that is better because it doesn’t require privileged access, but is worse because it only solves this problem for HDF and not all file access.

In either case we’ll need to do look-ahead buffering and caching to get reasonable performance (or see below).

Centralize Metadata

Alternatively, we might centralize metadata in the HDF file in order to avoid many hops throughout that file. This would remove the need to perform clever file-system caching and buffering tricks.

Here is a brief technical explanation from Joe Jevnik:

Regarding the centralization of metadata: this is already a feature of hdf5 and is used by many of the built-in drivers. This optimization is enabled by setting the H5FD_FEAT_AGGREGATE_METADATA and H5FD_FEAT_ACCUMULATE_METADATA feature flags in your VFL driver’s query function. The first flag says that the hdf5 library should pre-allocate a large region to store metadata, future metadata allocations will be served from this pool. The second flag says that the library should buffer metadata changes into large chunks before dispatching the VFL driver’s write function. Both the default driver (sec2) and h5s3 enable these optimizations. This is further supported by using the H5FD_FLMAP_DICHOTOMY free list option which uses one free list for metadata allocations and another for non-metadata allocations. If you really want to ensure that the metadata is aggregated, even without a repack, you can use the built-in ‘multi’ driver which dispatches different allocation flavors to their own driver.

Distributed Service

We could offload this problem to an organization, like the non-profit HDF group or a for-profit cloud provider like Google, Amazon, or Microsoft. They would solve this problem however they like, and expose a web API that we can hit for our data.

This would be a distributed service of several computers on the cloud near our data that takes our requests for the data we want, performs whatever tricks it deems appropriate to get that data, and then delivers it to us. This fleet of machines will still have to address the problems listed above, but we can let them figure it out, and presumably they’ll learn as they go.

However, this has both technical and social costs. Technically this is complex, and they’ll have to handle a new set of issues around scalability, consistency, and so on that are already solved(ish) in the cloud object stores. Socially this creates an intermediary between us and our data, which we may not want both for reasons of cost and trust.

The HDF group is working on such a service, HSDS, which works on Amazon’s S3 (or anything that looks like S3). They have created an h5pyd library that is a drop-in replacement for the popular h5py Python library.
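
A hedged sketch of what that drop-in usage could look like, assuming the domain path and dataset name are placeholders and that the HSDS endpoint and credentials are already set up in the h5pyd configuration:

import h5pyd  # drop-in replacement for h5py, talking to an HSDS server

f = h5pyd.File('/shared/climate/tas.h5', 'r')  # hypothetical domain path
dset = f['tas']                                # hypothetical dataset name
block = dset[0:100, 0:100]                     # the service fetches only the needed chunks
f.close()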

Presumably a cloud provider, like Amazon, Google, or Microsoft could do this as well. By providing open standards like OpenDAP they might attract more science users onto their platform to more efficiently query their cloud-hosted datasets.

The satellite imagery company Planet already has such a service.

New Formats for Scientific Data

Alternatively, we can move on from the HDF file format, and invent a new data storage specification that fits cloud storage (or other storage) more cleanly without worrying about supporting the legacy layout of existing HDF files.

This has already been going on, informally, for years. Most often we see people break large arrays into blocks, store each block as a separate object in the cloud object store with a suggestive name, and store a metadata file describing how the blocks relate to each other. This looks something like the following:

/metadata.json
/0.0.0.dat
/0.0.1.dat
/0.0.2.dat
 ...
/10.10.8.dat
/10.10.9.dat
/10.10.10.dat

There are many variants:

  • They might extend this to have groups or sub-arrays in sub-directories.
  • They might choose to compress the individual blocks in the .dat files or not.
  • They might choose different encoding schemes for the metadata and the binary blobs.

But generally most array people on the cloud do something like this with their research data, and they’ve been doing it for years. It works efficiently, is easy to understand and manage, and transfers well to any cloud platform, onto a local file system, or even into a standalone zip file or small database.

There are two groups that have done this in a more mature way, defining both modular standalone libraries to manage their data, as well as proper specification documents that inform others how to interpret this data in a long-term stable way.

These are both well maintained and well designed libraries (by my judgment): Zarr, in Python, and N5, in Java. They offer layouts like the layout above, although with more sophistication. Entertainingly their specs are similar enough that another library, Z5, built a cross-compatible parser for each in C++. This unintended uniformity is a good sign. It means that both developer groups were avoiding creativity, and have converged on a sensible common solution. I encourage you to read the Zarr Spec in particular.
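
As a small illustration of this style of storage, here is a minimal Zarr sketch; the store name is illustrative:

import numpy as np
import zarr

# Writes one small object per 1000x1000 chunk plus a small metadata file,
# much like the layout sketched above.
z = zarr.open('example.zarr', mode='w', shape=(10000, 10000),
              chunks=(1000, 1000), dtype='f4')
z[0:1000, 0:1000] = np.random.random((1000, 1000))

# Reading back touches only the chunks that overlap the requested slice
z2 = zarr.open('example.zarr', mode='r')
block = z2[0:1000, 0:1000]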

However, technical merits alone are not sufficient to justify a shift in data format, especially for the archival datasets of record that we’re discussing. The institutions in charge of this data have multi-decade horizons and so move slowly. For them, moving off of the historical community standard would be a major shift.

And so we need to answer a couple of difficult questions:

  1. How hard is it to make HDF efficient in the cloud?
  2. How hard is it to shift the community to a new standard?

A Sense of Urgency

These questions are important now. NASA and other agencies are pushing NetCDF data into the Cloud today and will be increasing these rates substantially in the coming years.

From earthdata.nasa.gov/cmr-and-esdc-in-cloud (via Ryan Abernathey)

From its current cumulative archive size of almost 22 petabytes (PB), the volume of data in the EOSDIS archive is expected to grow to almost 247 PB by 2025, according to estimates by NASA’s Earth Science Data Systems (ESDS) Program. Over the next five years, the daily ingest rate of data into the EOSDIS archive is expected to reach 114 terabytes (TB) per day, with the upcoming joint NASA/Indian Space Research Organization Synthetic Aperture Radar (NISAR) mission (scheduled for launch by 2021) contributing an estimated 86 TB per day of data when operational.

This is only one example of many agencies in many domains pushing scientific data to the cloud.

Acknowledgements

Thanks to Joe Jevnik (Quantopian), John Readey (HDF Group), Rich Signell (USGS), and Ryan Abernathey (Columbia University) for their feedback when writing this article. This conversation started within the Pangeo collaboration.

February 06, 2018 12:00 AM

February 02, 2018

Continuum Analytics

The Case for Numba in Community Code

The numeric Python community should consider adopting Numba more widely within community code. Numba is strong in performance and usability, but historically weak in ease of installation and community trust. This blog post discusses these strengths and weaknesses from the perspective of an OSS library maintainer. It uses other more technical blog posts written on …
Read more →

by Rory Merritt at February 02, 2018 03:17 PM

February 01, 2018

Prabhu Ramachandran

VTK-8.1.0 wheels for all platforms on pypi!


I cannot believe it has been 6 years since my last blog post!  Anyway, I have some good news to announce here.

In the Python community, VTK has always been somewhat difficult to install (in comparison to pure Python packages). One has had to either use a specific package management tool or resort to source builds. This has been a major problem when trying to install tools that rely on VTK, like Mayavi.

During the SciPy 2017 conference held in Austin last year, a few of the Kitware developers, notably Jean-Christophe Fillion-Robin (JC for short), and some of the VTK developers got together with some of us from the SciPy community and decided to try and put together wheels for VTK.

JC did the hard work of figuring this out and setting up a nice VTKPythonPackage during the sprints to make this process easy. As of last week (Jan 27, 2018) Mac OS X wheels were not supported. Last weekend, I finally got the time (thanks to Enthought) to play with JC's work. I figured out how to get the wheels working on OS X. With this, in principle, we could build VTK wheels on all the major platforms.

We decided to try and push wheels at least for the major VTK releases. This in itself would be a massive improvement in making VTK easier to install. Over the last few days, I have built wheels on Linux, OS X, and Windows. All of these are 64 bit wheels for VTK-8.1.0.

Now, VTK 8.x adds a C++11 dependency, and so we cannot build these versions of VTK for Python 2.7 on Windows.

So now we have 64 bit wheels on Windows for Python versions 3.5.x and 3.6.x.
Unfortunately, 3.4.x required a different Visual Studio installation and I lost patience setting things up on my Windows VM.

On Linux, we have 64 bit wheels for Python 2.7.x, 3.4.x, 3.5.x, and 3.6.x.

On MacOS, we have 64 bit wheels for Python 2.7.x, 3.4.x, 3.5.x, and 3.6.x.

So if you are using a 64 bit Python, you can now do

   $ pip install vtk

and have VTK-8.1.0 installed!

This is really nice to have and should hopefully make VTK and other tools a lot easier to install.

A big thank you to JC, the other Kitware developers, and the VTK Python developers, especially David Gobbi, who has worked on the VTK Python wrappers for many, many years now, for making this happen. Apologies if I missed anyone, but thank you all!

Enjoy!

by Prabhu Ramachandran (noreply@blogger.com) at February 01, 2018 12:18 AM

January 30, 2018

Matthieu Brucher

Book review: Python for Finance: Analyze Big Financial Data

Recently, I moved to the finance industry. As usual when I start in a new domain, I look at the Python books for it. And Python for Finance by Yves Hilpisch is one of the best known ones.

Discussion

The book is split into three unequal parts. The first one is short and presents the usage of Python in the finance industry, how to install it, and a few examples of its usage for finance. The Python code is quite simple; strangely, the author decided to go for global variables and almost no parameters. Why not present classes here? At least he uses examples through IPython/Jupyter, so that’s good!

The second part tackles finance applications in Python and the useful modules. Obviously, the first chapter here handles Numpy. I liked the fact that vectorization is an important part here (not using explicit loops). Then of course an important point is visualizing plots, and especially time series. The third chapter tackles pandas, a library that was originally written for finance analysis, so obviously it has to be used!

Strangely, the chapter after that one is about reading and writing data. I’m not really sure it is worth spending so much time on functions that are already in numpy and pandas. I agree that I/O is important, but I’m not sure it deserves so much space in a Python book. Or even talking about SQL.

The next chapter tackles performance in Python. The author compares different ways of making your code faster. I liked the IPython example, as lots of people work from Jupyter with several cores available. The multiprocessing module is nice, but can sometimes be… awkward to use. I’m not sure the NumbaPro example was useful, as not many people will be able to use it (I felt this was more an ad than actually useful pages).

After this chapter, we are back to math tools for finance. The strange part is that the previous chapter may not really be needed for this one. Not many algorithms can be efficiently parallelized when they come out of available packages (except when they are meant for this, like sklearn’s pipeline model). So this chapter talks about regression (one of the main tools to understand a trend in time series, although the prediction may be completely bogus), interpolation, and optimization. The latter is what you need for lots of models. Later in the chapter, symbolic computation is also introduced, and I have to say that if you know an analytical approach to a problem, then this is quite effective (I always take a similar route for my electronic models).

The tenth chapter dives into the core of finance maths with stochastic equations (and the Black-Scholes one!). Of course here, it’s basically using random number generators and then applying some rules on top of them. The chapter after that puts several of the previous topics together, like normality tests for stats, or portfolio optimization for… optimization. There is a part on PCA, but I’m biased: I hate PCA since lots of people use it for dimensionality reduction on data that is not Euclidean…

There is also a chapter on Excel, probably because lots of people use it to analyze data, and you need to be able to exchange data with it. I guess.

And then, the chapter where the author finally tackles classes!! Really!! And by saying that it’s an important aspect of Python. That’s what I don’t understand, especially the way it’s presented. The part with traits is OK, although the online tutorials are just as good.

Then, there is a chapter on web apps; I’m not sure exactly why it is there, to be frank.

After this part with its ups and downs, there is a part on creating a derivative library. This is the part with some real finance computation, although the author refers back to his other book for the theory itself. The chapters are quite small and try to wrap everything from the previous part in a single framework.

I just wish this integration was done in the second part instead.

Conclusion

So basically the content of the book is about some kind of Python. If you don’t know about finance, you will still want to know much more at the end of this book. And if you want to learn about Python, you will know about modules, but not about good Python.

So unfortunately, avoid.

by Matt at January 30, 2018 08:39 AM

Matthew Rocklin

The Case for Numba in Community Code

The numeric Python community should consider adopting Numba more widely within community code.

Numba is strong in performance and usability, but historically weak in ease of installation and community trust. This blogpost discusses these strengths and weaknesses from the perspective of an OSS library maintainer. It uses other more technical blogposts written on the topic as references. It is biased in favor of wider adoption given recent changes to the project.

Let’s start with a wildly unprophetic quote from Jake Vanderplas in 2013:

I’m becoming more and more convinced that Numba is the future of fast scientific computing in Python.

– Jake Vanderplas, 2013-06-15

http://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/

We’ll use the following blogposts by other community members throughout this post. They’re all good reads and are more technical, showing code examples, performance numbers, etc..

At the end of the blogpost these authors will also share some thoughts on Numba today, looking back with some hindsight.

Disclaimer: I work alongside many of the Numba developers within the same company and am partially funded through the same granting institution.

Compiled code in Python

Many open source numeric Python libraries need to write efficient low-level code that works well on Numpy arrays, but is more complex than the Numpy library itself can express. Typically they use one of the following options:

  1. C-extensions: mostly older projects like NumPy and Scipy
  2. Cython: probably the current standard for mainline projects, like scikit-learn, pandas, scikit-image, geopandas, and so on
  3. Standalone C/C++ codebases with Python wrappers: for newer projects that target inter-language operation, like XTensor and Arrow

Each of these choices has tradeoffs in performance, packaging, attracting new developers and so on. Ideally we want a solution that is …

  1. Fast: about as fast as C/Fortran
  2. Easy: Is accessible to a broad base of developers and maintainers
  3. Builds easily: Introduces few complications in building and packaging
  4. Installs easily: Introduces few install and runtime dependencies
  5. Trustworthy: Is well trusted within the community, both in terms of governance and long term maintenance

The two main approaches today, Cython and C/C++, both do well on most of these objectives. However neither is perfect. Some issues that arise include the following:

  • Cython
    • Often requires effort to make fast
    • Is often only used by core developers. Requires expertise to use well.
    • Introduces mild packaging pain, though this pain is solved frequently enough that experienced community members are used to dealing with it
  • Standalone C/C++
    • Sometimes introduces complex build and packaging concerns
    • Is often only used by core developers. These projects have difficulty attracting the Python community’s standard developer pool (though they do attract developers from other communities).

There are some other options out there like Numba and Pythran that, while they provide excellent performance and usability benefits, are rarely used. Let’s look into Numba’s benefits and drawbacks more closely.

Numba Benefits

Numba is generally well regarded from a technical perspective (it’s fast, easy to use, well maintained, etc.) but has historically not been trusted due to packaging and community concerns.

In any test of either performance or usability Numba almost always wins (or ties for the win). It does all of the compiler optimization tricks you expect. It supports both for-loopy code as well as Numpy-style slicing and bulk operation code. It requires almost no additional information from the user (assuming that you’re ok with JIT behavior) and so is very approachable, and very easy for novices to use well.
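
As a minimal sketch of that ease of use (the function and data are illustrative, not taken from any of the posts quoted below):

import numba
import numpy as np

@numba.jit(nopython=True)
def total(x):
    # a plain Python loop, compiled to machine code by Numba
    s = 0.0
    for i in range(x.shape[0]):
        s += x[i]
    return s

print(total(np.arange(1_000_000, dtype=np.float64)))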

This means that we get phrases like the following:

  • https://dionhaefner.github.io/2016/11/suck-less-scientific-python-part-2-efficient-number-crunching/
    • “This is rightaway faster than NumPy.”
    • “In fact, we can infer from this that numba managed to generate pure C code from our function and that it did it already previously.”
    • “Numba delivered the best performance on this problem, while still being easy to use.”
  • https://dionhaefner.github.io/2016/11/suck-less-scientific-python-part-2-efficient-number-crunching/
    • “Using numba is very simple; just apply the jit decorator to the function you want to get compiled. In this case, the function code is exactly the same as before”
    • “Wow! A speedup by a factor of about 400, just by applying a decorator to the function. “
  • http://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/
    • “Much better! We’re now within about a factor of 3 of the Fortran speed, and we’re still writing pure Python!”
    • “I should emphasize here that I have years of experience with Cython, and in this function I’ve used every Cython optimization there is … By comparison, the Numba version is a simple, unadorned wrapper around plainly-written Python code.”
  • http://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/
    • Numba is extremely simple to use. We just wrap our python function with autojit (JIT stands for “just in time” compilation) to automatically create an efficient, compiled version of the function
    • Adding this simple expression speeds up our execution by over a factor of over 1400! For those keeping track, this is about 50% faster than the version of Numba that I tested last August on the same machine.
    • The Cython version, despite all the optimization, is a few percent slower than the result of the simple Numba decorator!
  • http://stephanhoyer.com/2015/04/09/numba-vs-cython-how-to-choose/
    • “Using Numba is usually about as simple as adding a decorator to your functions”
    • “Numba is usually easier to write for the simple cases where it works”
  • https://murillogroupmsu.com/numba-versus-c/
    • “Numba allows for speedups comparable to most compiled languages with almost no effort”
    • “We find that Numba is more than 100 times as fast as basic Python for this application. In fact, using a straight conversion of the basic Python code to C++ is slower than Numba.”

In all cases where authors compared Numba to Cython for numeric code (Cython is probably the standard for these cases) Numba always performs as-well-or-better and is always much simpler to write.

Here is a code example from Jake’s second blogpost:

Example: Code Complexity

# From http://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/

# Numba                                 # Cython
import numpy as np                      import numpy as np
import numba                            cimport cython
                                        from libc.math cimport sqrt

                                        @cython.boundscheck(False)
@numba.jit                              @cython.wraparound(False)
def pairwise_python(X):                 def pairwise_cython(double[:, ::1] X):
    M = X.shape[0]                          cdef int M = X.shape[0]
    N = X.shape[1]                          cdef int N = X.shape[1]
                                            cdef double tmp, d
    D = np.empty((M, M), dtype=np.float)    cdef double[:, ::1] D = np.empty((M, M),
                                                                             dtype=np.float64)
    for i in range(M):                      for i in range(M):
        for j in range(M):                      for j in range(M):
            d = 0.0                                 d = 0.0
            for k in range(N):                      for k in range(N):
                tmp = X[i, k] - X[j, k]                 tmp = X[i, k] - X[j, k]
                d += tmp * tmp                          d += tmp * tmp
            D[i, j] = np.sqrt(d)                    D[i, j] = sqrt(d)
    return D                                return np.asarray(D)

The algorithmic body of each function (the nested for loops) is identical. However the Cython code is more verbose with annotations, both at the function definition (which we would expect for any AOT compiler), but also within the body of the function for various utility variables. The Numba code is just straight Python + Numpy code. We could remove the @numba.jit decorator and step through our function with normal Python.

Example: Numpy Operations

Additionally Numba lets us use Numpy syntax directly in the function, so for example the following function is well accelerated by Numba, even though it already fits NumPy’s syntax well.

# from https://flothesof.github.io/optimizing-python-code-numpy-cython-pythran-numba.html

import numpy as np
import numba

@numba.jit
def laplace_numba(image):
    """Laplace operator in NumPy for 2D images. Accelerated using numba."""
    laplacian = ( image[:-2, 1:-1] + image[2:, 1:-1]
                + image[1:-1, :-2] + image[1:-1, 2:]
                - 4*image[1:-1, 1:-1])
    thresh = np.abs(laplacian) > 0.05
    return thresh

Mixing and matching Numpy-style with for-loop style is often helpful when writing complex numeric algorithms.

Benchmarks in these blogposts show that Numba is both simpler to use and often as-fast-or-faster than more commonly used technologies like Cython.

Numba drawbacks

So, given these advantages why didn’t Jake’s original prophecy hold true?

I believe that there are three primary reasons why Numba has not been more widely adopted among other open source projects:

  1. LLVM Dependency: Numba depends on LLVM, which was historically difficult to install without a system package manager (like apt-get, brew) or conda. Library authors are not willing to exclude users that use other packaging toolchains, particularly Python’s standard tool, pip.
  2. Community Trust: Numba is largely developed within a single for-profit company (Anaconda Inc.) and its developers are not well known by other library maintainers.
  3. Lack of Interpretability: Numba’s output, LLVM, is less well understood by the community than Cython’s output, C (discussed in original-author comments in the last section)

All three of these are excellent reasons to avoid adding a dependency. Technical excellence alone is insufficient, and must be considered alongside community and long-term maintenance concerns.

But Numba has evolved recently

LLVM

Numba now depends on the easier-to-install library llvmlite which, as of a few months ago, is pip installable with binary wheels on Windows, Mac, and Linux. The llvmlite package is still a heavy-ish runtime dependency (42MB), but that’s significantly less than large Cython libraries like Pandas or SciPy.

If your concern was about the average user’s inability to install Numba, then I think that this concern has been resolved.

Community

Numba has three community problems:

  1. Development of Numba has traditionally happened within the closed walls of Anaconda Inc (formerly Continuum Analytics)
  2. The Numba maintainers are not well known within the broader Python community
  3. There used to be a proprietary version, Numba Pro

This combination strongly attached Numba’s image to Continuum’s for-profit ventures, making community-oriented software maintainers understandably wary of dependence, for fear that dependence on this library might be used for Continuum’s financial gain at the expense of community users.

Things have changed significantly.

Numba Pro was abolished years ago. The funding for the project today comes more often from Anaconda Inc. consulting revenue, hardware vendors looking to ensure that Python runs as efficiently as possible on their systems, and from generous donations from the Gordon and Betty Moore foundation to ensure that Numba serves the open source Python community.

Developers outside of Anaconda Inc. now have core commit access, which forces communication to happen in public channels, notably GitHub (which was standard before) and Gitter chat (which is relatively new).

The maintainers are still relatively unknown within the broader community. This isn’t due to any sort of conspiracy, but is instead due more to shyness or having interests outside of OSS. Antoine, Siu, Stan, and Stuart are all considerate, funny, and clever fellows with strong enthusiasm for compilers, OSS, and performance. They are quite responsive on the Numba mailing list should you have any questions or concerns.

If your concern was about Numba trapping users into a for-profit mode, then that seems to have been resolved years ago.

If your concern is more about not knowing who is behind the project then I encourage you to reach out. I would be surprised if you don’t walk away pleased.

The Continued Cases Against Numba

For completeness, let’s list a number of reasons why it is still quite reasonable to avoid Numba today:

  1. It isn’t a community standard
  2. Numba hasn’t attracted a wide developer base (compilers are hard), and so is probably still dependent on financial support for paid developers
  3. You want to speed up non-numeric code that includes classes, dicts, lists, etc., for which you need Cython or PyPy
  4. You want to build a library that is useful outside of Python, and so plan to build most numeric algorithms on C/C++/Fortran
  5. You prefer ahead-of-time compilation and want to avoid JIT times
  6. While llvmlite is cheaper than LLVM, it’s still 50MB
  7. Understanding the compiled results is hard, and you don’t have good familiarity with LLVM

Numba features we didn’t talk about

  1. Multi-core parallelism
  2. GPUs
  3. Run-time Specialization to the CPU you’re running on
  4. Easy to swap out for other JIT compilers, like PyPy, if they arise in the future

Update from the original blogpost authors

After writing the above I reached out both to Stan and Siu from Numba and to the original authors of the referenced blogposts to get some of their impressions now having the benefit of additional experience.

Here are a few choice responses:

  1. Stan:

    I think one of the biggest arguments against Numba still is time. Due to a massive rewrite of the code base, Numba, in its present form, is ~3 years old, which isn’t that old for a project like this. I think it took PyPy at least 5-7 years to reach a point where it was stable enough to really trust. Cython is 10 years old. People have good reason to be conservative with taking on new core dependencies.

  2. Jake:

    One thing I think would be worth fleshing-out a bit (you mention it in the final bullet list) is the fact that numba is kind of a black box from the perspective of the developer. My experience is that it works well for straightforward applications, but when it doesn’t work well it’s *extremely difficult to diagnose what the problem might be.*

    Contrast that with Cython, where the html annotation output does wonders for understanding your bottlenecks both at a non-technical level (“this is dark yellow so I should do something different”) and a technical level (“let me look at the C code that was generated”). If there’s a similar tool for numba, I haven’t seen it.

  3. Florian:

    Elaborating on Jake’s answer, I completely agree that Cython’s annotation tool does wonders in terms of understanding your code. In fact, numba does possess this too, but as a command-line utility. I tried to demonstrate this in my blogpost, but exporting the CSS in the final HTML render kind of mangles my blog post so here’s a screenshot:

    Numba HTML annotations

    This is a case where jit(nopython=True) works, so there seems to be no coloring at all.

    Florian also pointed to the SciPy 2017 tutorial by Gil Forsyth and Lorena Barba

  4. Dion:

    I hold Numba in high regard, and the speedups impress me every time. I use it quite often to optimize some bottlenecks in our production code or data analysis pipelines (unfortunately not open source). And I love how Numba makes some functions like scipy.optimize.minimize or scipy.ndimage.generic_filter well-usable with minimal effort.

    However, I would never use Numba to build larger systems, precisely for the reason Jake mentioned. Subjectively, Numba feels hard to debug, has cryptic error messages, and seemingly inconsistent behavior. It is not a “decorate and forget” solution; instead it always involves plenty of fiddling to get right.

    That being said, if I were to build some high-level scientific library à la Astropy with some few performance bottlenecks, I would definitely favor Numba over Cython (and if it’s just to spare myself the headache of getting a working C compiler on Windows).

  5. Stephan:

    I wonder if there are any examples of complex codebases (say >1000 LOC) using Numba. My sense is that this is where Numba’s limitations will start to become more evident, but maybe newer features like jitclass would make this feasible.

As a final take-away, you might want to follow Florian’s advice and watch Gil and Lorena’s tutorial here:

January 30, 2018 12:00 AM

January 27, 2018

Matthew Rocklin

Write Dumb Code

The best way you can contribute to an open source project is to remove lines of code from it. We should endeavor to write code that a novice programmer can easily understand without explanation or that a maintainer can understand without significant time investment.

As students we attempt increasingly challenging problems with increasingly sophisticated technologies. We first learn loops, then functions, then classes, etc.. We are praised as we ascend this hierarchy, writing longer programs with more advanced technology. We learn that experienced programmers use monads while new programmers use for loops.

Then we graduate and find a job or an open source project to work on with others. We search for something that we can add, and implement a solution pridefully, using all the tricks that we learned in school.

Ah ha! I can extend this project to do X! And I can use inheritance here! Excellent!

We implement this feature and feel accomplished, and with good reason. Programming in real systems is no small accomplishment. This was certainly my experience. I was excited to write code and proud that I could show off all of the things that I knew how to do to the world. As evidence of my historical love of programming technology, here is a linear algebra language built with another meta-programming language. Notice that no one has touched this code in several years.

However after maintaining code a bit more I now think somewhat differently.

  1. We should not seek to build software. Software is the currency that we pay to solve problems, which is our actual goal. We should endeavor to build as little software as possible to solve our problems.
  2. We should use technologies that are as simple as possible, so that as many people as possible can use and extend them without needing to understand our advanced techniques. We should use advanced techniques only when we are not smart enough to figure out how to use more common techniques.

Neither of these points is novel. Most people I meet agree with them to some extent, but somehow we forget them when we go to contribute to a new project. The instinct to contribute by building and to demonstrate sophistication often takes over.

Software is a cost

Every line that you write costs people time. It costs you time to write it of course, but you are willing to make this personal sacrifice. However this code also costs the reviewers their time to understand it. It costs future maintainers and developers their time as they fix and modify your code. They could be spending this time outside in the sunshine or with their family.

So when you add code to a project you should feel meek. It should feel as though you are eating with your family and there isn’t enough food on the table. You should take only what you need and no more. The people with you will respect you for your efforts to restrict yourself. Solving problems with less code is hard, but it is a burden that you take on yourself to lighten the burdens of others.

Complex technologies are harder to maintain

As students, we demonstrate merit by using increasingly advanced technologies. Our measure of worth depends on our ability to use functions, then classes, then higher order functions, then monads, etc. in public projects. We show off our solutions to our peers and feel pride or shame according to our sophistication.

However when working with a team to solve problems in the world the situation is reversed. Now we strive to solve problems with code that is as simple as possible. When we solve a problem simply we enable junior programmers to extend our solution to solve other problems. Simple code enables others and boosts our impact. We demonstrate our value by solving hard problems with only basic techniques.

Look! I replaced this recursive function with a for loop and it still does everything that we need it to. I know it’s not as clever, but I noticed that the interns were having trouble with it and I thought that this change might help.

If you are a good programmer then you don’t need to demonstrate that you know cool tricks. Instead, you can demonstrate your value by solving a problem in a simple way that enables everyone on your team to contribute in the future.

But moderation, of course

That being said, over-adherence to the “build things with simple tools” dogma can be counterproductive. Often a recursive solution is much simpler than a for-loop solution, and oftentimes using a Class or a Monad is the right approach. But we should be mindful when using these technologies that we are building a system for ourselves; a system with which others have had no experience.

January 27, 2018 12:00 AM

January 24, 2018

Bruno Pinho

Fast and Reliable Top of Atmosphere (TOA) calculations of Landsat-8 data in Python

How to efficiently extract reflectance information from Landsat-8 Level-1 Data Product images.

by Bruno Ruas de Pinho at January 24, 2018 04:01 PM

January 22, 2018

Pierre de Buyl

Testing a NumPy-based code on Travis with plain pip and wheels

Installing the scientific Python stack is not the most obvious task in a scientist's routine. This is especially annoying for automated deployments such as for continuous integration testing. I present here a short way to deploy Travis CI testing for a small library that depends only on NumPy.

The goal

I developed a small library that relies only on Python and NumPy, as a design requirement. I wanted a simple pip-based deployment of my Python package testing via continuous integration, including the version of NumPy of my choice and with no rebuild of NumPy.

I started by performing the tests on my machines, simply issuing python -m pytest when changing the code. This is limiting, mostly because I only test against a few Python/NumPy versions.

How to set up Travis

Travis has instructions and support for Python-based projects. The typical "SciPy stack" is not covered (except for one version of NumPy that ships with their images), so most Python-based scientific software downloads Anaconda or Miniconda as part of its continuous integration testing, getting access to plenty of binary packages.

I have no specific argument against the conda solution, apart from the fact that it is a large dependency in terms of download size, and I believe "plain pip" is the most general solution for Python, so I like to stick to it.

So, I set up Travis with a test matrix for Python 2.7, 3.5 and 3.6. I wanted to test several NumPy versions as well. I couldn't find a lightweight solution (i.e. a nice sample .travis.yml file) as most projects use (ana/mini)conda. Since the arrival of manylinux wheels, it is actually easy to rely on "plain pip" to install NumPy on Travis. Make sure to update pip itself first and to install the "wheel" package as well.

Each build on Travis takes between 30 and 80 seconds, so there is obviously no build of NumPy occurring there, and this is a reasonable use of resources.

In the example, I exclude NumPy 1.11.0 from the Python 3.6 test because there are no "Python 3.6 NumPy 1.11.0" manylinux wheels.

language: python

python:
 - 2.7
 - 3.5
 - 3.6

env:
  - NUMPY_VERSION=1.11.0
  - NUMPY_VERSION=1.12.1
  - NUMPY_VERSION=1.14.0

matrix:
  exclude:
    - python: 3.6
      env: NUMPY_VERSION=1.11.0

script:
  - virtualenv --python=python venv
  - source venv/bin/activate
  - python -m pip install -U pip
  - pip install -U wheel
  - pip install numpy==$NUMPY_VERSION
  - pip install pytest
  - python setup.py build
  - python -m pytest

Ending

I hope that this solution will be useful to others. If you want to see the repository itself, it is here (with a badge linking to the Travis CI builds).

The resulting .travis.yml file is really short, which is (in my opinion) a benefit. As SciPy also provides manylinux wheels, this is really a powerful and easy way to deploy. Any scientific package that depends on NumPy/SciPy can use it and add a build of the compiled package with, for instance, an extra dependency on GCC or Cython.

by Pierre de Buyl at January 22, 2018 03:00 PM

Matthew Rocklin

Pangeo: JupyterHub, Dask, and XArray on the Cloud

This work is supported by Anaconda Inc, the NSF EarthCube program, and UC Berkeley BIDS

A few weeks ago a few of us stood up pangeo.pydata.org, an experimental deployment of JupyterHub, Dask, and XArray on Google Container Engine (GKE) to support atmospheric and oceanographic data analysis on large datasets. This follows on recent work to deploy Dask and XArray for the same workloads on super computers. This system is a proof of concept that has taught us a great deal about how to move forward. This blogpost briefly describes the problem, the system, then describes the collaboration, and finally discusses a number of challenges that we’ll be working on in coming months.

The Problem

Atmospheric and oceanographic sciences collect (with satellites) and generate (with simulations) large datasets that they would like to analyze with distributed systems. Libraries like Dask and XArray already solve this problem computationally if scientists have their own clusters, but we seek to expand access by deploying on cloud-based systems. We build a system to which people can log in, get Jupyter Notebooks, and launch Dask clusters without much hassle. We hope that this increases access, and connects more scientists with more cloud-based datasets.

The System

We integrate several pre-existing technologies to build a system where people can log in, get access to a Jupyter notebook, launch distributed compute clusters using Dask, and analyze large datasets stored in the cloud. They have a full user environment available to them through a website, can leverage thousands of cores for computation, and use existing APIs and workflows that feel familiar from how they work on their laptops.
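
As a rough illustration of that user-facing workflow, here is a minimal sketch (not taken from the Pangeo deployment itself). The bucket path and variable names are hypothetical, and the exact gcsfs/xarray entry points may differ between versions.

import gcsfs
import xarray as xr
from dask.distributed import Client

client = Client()  # on pangeo.pydata.org this would instead connect to a
                   # Dask cluster launched on Kubernetes via Daskernetes

fs = gcsfs.GCSFileSystem(token='anon')                            # anonymous read access
store = gcsfs.mapping.GCSMap('pangeo-data/example.zarr', gcs=fs)  # hypothetical bucket path
ds = xr.open_zarr(store)                                          # lazy, Dask-backed arrays

# the familiar XArray API, now executed in parallel across the cluster
ds['sst'].groupby('time.month').mean('time').compute()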

A video walk-through follows below:

We assembled this system from a number of pieces and technologies:

  • JupyterHub: Provides both the ability to launch single-user notebook servers and handles user management for us. In particular we use the KubeSpawner and the excellent documentation at Zero to JupyterHub, which we recommend to anyone interested in this area.
  • KubeSpawner: A JupyterHub spawner that makes it easy to launch single-user notebook servers on Kubernetes systems
  • JupyterLab: The newer version of the classic notebook, which we use to provide a richer remote user interface, complete with terminals, file management, and more.
  • XArray: Provides computation on NetCDF-style data. XArray extends NumPy and Pandas to enable scientists to express complex computations on complex datasets in ways that they find intuitive.
  • Dask: Provides the parallel computation behind XArray
  • Daskernetes: Makes it easy to launch Dask clusters on Kubernetes
  • Kubernetes: In case it’s not already clear, all of this is based on Kubernetes, which manages launching programs (like Jupyter notebook servers or Dask workers) on different machines, while handling load balancing, permissions, and so on
  • Google Container Engine: Google’s managed Kubernetes service. Every major cloud provider now has such a system, which makes us happy about not relying too heavily on one system
  • GCSFS: A Python library providing intuitive access to Google Cloud Storage, either through Python file interfaces or through a FUSE file system
  • Zarr: A chunked array storage format that is suitable for the cloud

Collaboration

We were able to build, deploy, and use this system to answer real science questions in a couple weeks. We feel that this result is significant in its own right, and is largely because we collaborated widely. This project required the expertise of several individuals across several projects, institutions, and funding sources. Here are a few examples of who did what from which organization. We list institutions and positions mostly to show the roles involved.

  • Alistair Miles, Professor, Oxford: Helped to optimize Zarr for XArray on GCS
  • Jacob Tomlinson, Staff, UK Met Informatics Lab: Developed original JADE deployment and early Dask-Kubernetes work.
  • Joe Hamman, Postdoc, National Center for Atmospheric Research: Provided scientific use case, data, and work flow. Tuned XArray and Zarr for efficient data storing and saving.
  • Martin Durant, Software developer, Anaconda Inc.: Tuned GCSFS for many-access workloads. Also provided FUSE system for NetCDF support
  • Matt Pryor, Staff, Centre for Environmental Data Analysis: Extended original JADE deployment and early Dask-Kubernetes work.
  • Matthew Rocklin, Software Developer, Anaconda Inc. Integration. Also performance testing.
  • Ryan Abernathey, Assistant Professor, Columbia University: XArray + Zarr support, scientific use cases, coordination
  • Stephan Hoyer, Software engineer, Google: XArray support
  • Yuvi Panda, Staff, UC Berkeley BIDS and Data Science Education Program: Provided assistance configuring JupyterHub with KubeSpawner. Also prototyped the Daskernetes Dask + Kubernetes tool.

Notice the mix of academic and for-profit institutions. Also notice the mix of scientists, staff, and professional software developers. We believe that this mixture helps ensure the efficient construction of useful solutions.

Lessons

This experiment has taught us a few things that we hope to explore further:

  1. Users can launch Kubernetes deployments from Kubernetes pods, such as launching Dask clusters from their JupyterHub single-user notebooks.

    To do this well we need to start defining user roles more explicitly within JupyterHub. We need to give users a safe and isolated space on the cluster to use without affecting their neighbors.

  2. HDF5 and NetCDF on cloud storage is an open question

    The file formats used for this sort of data are pervasive, but not particularly convenient or efficient on cloud storage. In particular the libraries used to read them make many small reads, each of which is costly when operating on cloud object storage.

    I see a few options:

    1. Use FUSE file systems, but tune them with tricks like read-ahead and caching in order to compensate for HDF’s access patterns
    2. Use the HDF group’s proposed HSDS service, which promises to resolve these issues
    3. Adopt new file formats that are more cloud friendly. Zarr is one such example that has so far performed admirably, but certainly doesn’t have the long history of trust that HDF and NetCDF have earned.
  3. Environment customization is important and tricky, especially when adding distributed computing.

    Immediately after showing this to science groups they want to try it out with their own software environments. They can do this easily in their notebook session with tools like pip or conda, but to apply those same changes to their dask workers is a bit more challenging, especially when those workers come and go dynamically.

    We have solutions for this. They can build and publish Docker images. They can add environment variables to specify extra pip or conda packages. They can deploy their own Pangeo deployment for their own group.

    However these have all taken some work to do well so far. We hope that some combination of Binder-like publishing and small modification tricks like environment variables will resolve this problem.

  4. Our docker images are very large. This means that users sometimes need to wait a minute or more for their session or their dask workers to start up (less after things have warmed up a bit).

    It is surprising how much of this comes from conda and node packages. We hope to resolve this both by improving our Docker hygiene and by engaging packaging communities to audit package size.

  5. Explore other clouds

    We started with Google just because their Kubernetes support has been around the longest, but all major cloud providers (Google, AWS, Azure) now provide some level of managed Kubernetes support. Everything we’ve done has been cloud-vendor agnostic, and various groups with data already on other clouds have reached out and are starting deployment on those systems.

  6. Combine efforts with other groups

    We’re actually not the first group to do this. The UK Met Informatics Lab quietly built a similar prototype, JADE (Jupyter and Dask Environment), many months ago. We’re now collaborating to merge efforts.

    It’s also worth mentioning that they prototyped the first iteration of Daskernetes.

  7. Reach out to other communities

    While we started our collaboration with atmospheric and oceanographic scientists, these same solutions apply to many other disciplines. We should investigate other fields and start collaborations with those communities.

  8. Improve Dask + XArray algorithms

    When we try new problems in new environments we often uncover new opportunities to improve Dask’s internal scheduling algorithms. This case is no different :)

Much of this upcoming work is happening in the upstream projects, so this experimentation is both of concrete use to ongoing scientific research and of broader use to the open source communities that these projects serve.

Community uptake

We presented this at a couple conferences over the past week.

We found that this project aligns well with current efforts from many government agencies to publish large datasets on cloud stores (mostly S3). Many of these data publication endeavors seek a computational system to enable access for the scientific public. Our project seems to complement these needs without significant coordination.

Disclaimers

While we encourage people to try out pangeo.pydata.org we also warn you that this system is immature. In particular it has the following issues:

  1. it is insecure, please do not host sensitive data
  2. it is unstable, and may be taken down at any time
  3. it is small, we only have a handful of cores deployed at any time, mostly for experimentation purposes

However it is also open, and instructions to deploy your own live here.

Come help

We are a growing group spanning many institutions and including technologists, scientists, and open source projects. There is plenty to do and plenty to discuss. Please engage with us at github.com/pangeo-data/pangeo/issues/new

January 22, 2018 12:00 AM

January 19, 2018

January 10, 2018

Titus Brown

Some resources for the data science ecosystem

ref my talk at CSUPERB, "Data Science is, like, the new critical thinking!" --

Training and teaching

Datacarpentry.org - the home base for Data Carpentry, which runs two day workshops around the world and has an instructor training program that teaches people to teach their materials.

data8.org - the home base for the UC Berkeley "Foundations of Data Science" course. All open/free materials running with open source tools.

Reading and background

Influential works in Data-Driven Discovery, a paper by Mark Stalzer and Chris Mentzel, that outlines topics that probably fit within the "data science" field.

Project Jupyter: Computational Narratives as the Engine of Collaborative Data Science, a grant proposal by the Jupyter team.

Some Web sites worth visiting

mybinder.org - a Web site for running Jupyter Notebooks in the cloud, for free! Try out my Monty Hall problem notebook! (or see the source).

Media for thinking the unthinkable by Bret Victor - an inspiring video lecture that I happened across as I was preparing my talk...


Please add resources below in the comments!

--titus

by C. Titus Brown at January 10, 2018 11:00 PM

Bruno Pinho

Automated Bulk Downloads of Landsat-8 Data Products in Python

Earth Explorer provides a very good interface to download Landsat-8 data. However, we usually want to automate the process and run everything without spending time with GUIs. In this tutorial, I will show how to automate the bulk download of low Cloud Covered Landsat-8 images, in Python, using Amazon S3 or Google Storage servers.

by Bruno Ruas de Pinho at January 10, 2018 01:00 PM

January 09, 2018

Titus Brown

Some interview questions for a job building data analysis pipelines

Recently we interviewed for a staff job that involved building bioinformatics data analysis pipelines. We came up with the following interview questions, which seemed to work quite well for a first round interview, & I thought I'd share --

Question 1

Scenario: you've been maintaining a data analysis pipeline that involves running a shell script by hand. The shell script works perfectly about 95% of the time, and breaks the remaining 5% of the time because of many small issues. This is OK so far because you've been asked to process 1 data set a week and the rest of the time is spent on other tasks. But now the job has changed and you're working 50% or more of your time on this and expected to analyze 100 data sets a month. How would you allocate your time and efforts? Feel free to fill in backstory from your own previous work experiences.

Question 2

Scenario: You're running the same data analysis pipeline as above, and after two months, you suddenly get feedback from your boss's boss that the results are wrong now. How do you approach this situation?

Bonus: Question 3

You are building a workflow or pipeline with a bunch of software that is incompatible in its dependencies and installation requirements. What approaches would you consider, what kinds of questions would you ask about the workflow and pipeline to choose between the approaches, and what are the drawbacks of the various approaches?

--titus

by C. Titus Brown at January 09, 2018 11:00 PM

January 07, 2018

Titus Brown

Reassessing the ‘Digital Commons’

Part I -- Sustainability and Funding

Funding FLOSS contributions

Individual contributors

Sustainability of Free/Libre and Open Source Software (FLOSS) has been an ongoing subject of concern. In 2017, Sustain released a practical report on the topic, sharing findings and recommendations pertaining to the sustainability of FLOSS [Nickolls, 2017]. The authors of this report, also known as the Sustainers, use the term 'FOSS' and, more often, 'OSS.' We prefer using the term 'FLOSS' to express neutrality [Stallman, 2013].

At the lowest level, FLOSS consists of lines of code contributed by individuals. The latter contribute either voluntarily or because they are paid to do so [Schweik, 2011]. Some organizations, whether for-profit or not-for-profit, hire people to work on FLOSS projects, either part-time or full-time. To quote the Sustainers, "contributions are often made on the basis of immediate and individual needs." And so is the funding of these contributions, from a standpoint where we equate a contribution with its funding: That is, a contribution would not have landed, had it not been funded somehow, whether directly or indirectly, whether in the form of money or time.

This individual-centric perspective makes funding, if not sustainability, a non-issue. It is only good at answering, as an individual, the all-too-common question "How do you make money working on FLOSS?" We consider that sustainability includes, but is not restricted to, funding. Indeed, what if you have funding, but no talent existing or available to take advantage of it? Now, we may ask the holistic question: Isn't a FLOSS project more than the sum of its individual contributions?

Collective projects

FLOSS projects which see communities of practice emerge and organize around them are definitely much more. An example we cherish would be SciPy, a Python-based ecosystem for scientific computing. Interactions between members of these communities create value, knowledge, and culture. These members do not have to be code contributors; they may be end users, power users, or contributors in a broader sense. Remarkably, the yt project has pushed the definition of its "members" (yt is a Python package for analyzing and visualizing high-dimensional scientific data): Quoting [Turk, 2016], "yt has a model in place for recognizing contributions that go beyond code."

So, can we grasp this collective dimension? We sense that sustainability should be a concern shared throughout the community. When, additionally, whole segments of technological, cultural, educational, and economic activities rely on FLOSS projects, we agree with the Sustainers that the concern for sustainability (including funding) should be shared by "stakeholders" who are many and diverse, far beyond the small circle of (code) contributors. The Sustainers call this key subset of FLOSS our "essential digital infrastructure." Further, they identify it as a "public good." For each piece of this infrastructure, the circle of contributors is indeed very small with respect to its end-user base, made up of "consumers" or "users" [Nickolls, 2017].

Although the Sustainers link to Elinor Ostrom's "8 Principles for Managing A Commmons [sic]" when recommending good governance, we argue that their report is a missed opportunity for leveraging the concept of Commons. In the following, we explain why we care about viewing FLOSS as a digital commons (rather than a public good). We note that other digital (information, knowledge) commons have been approached as public goods. One example would be information acquisition, as studied by [Ramachandran and Chaintreau, 2015]. They also report very low ratios of "contributors" to "consumers," falling within a production/consumption view.

Bringing the Commons into play

They say FLOSS is a digital commons

Historically, the Commons have described natural resources that were shared within a community---not only as a matter of fact, but through intentional rules and collective self-management which ensured their sustainability and fair access [Maurel, 2016]. As more and more commons were enclosed and sacrificed to private interests (including that of the State), they all but disappeared from the official economic discourse. Instead, the discussion narrowed down to the private/public dichotomy. In that paradigm, "public" goods merely qualified what could not (at the time) be realistically privatized, such as air or water. They were described in contrast to private goods, as non-rivalrous and non-excludable [Hess and Ostrom, 2011a].

The concept of Commons reappeared with the rise of environmental concerns [Bollier, 2011] as well as the development of technologies, which suddenly enabled "the capture of what were once free and open public goods" [Hess and Ostrom, 2011a]. In a somewhat similar way, knowledge has long been able to straddle the ambiguous border between private and public: the necessity of print grounding it in the private property realm, while the public domain materialized its non-rivalrous, non-excludable nature.

The Internet and the advent of the digital era have changed this situation. Once the limitations inherent to print are gone, the complex status of knowledge is revealed. Defining it as a commons is an attempt to grasp and honor this complexity. Indeed, the term "Commons" translates a desire to move away and beyond the simplistic understanding of private vs public. By speaking of commons, its advocates seek to build a new framework for analysis, which integrates the philosophical, political, and social dimensions along with the traditional, market-centred economic one [Bollier, 2011].

How is the collective dimension enforced?

In 2017, we celebrated the 10th anniversary of the GNU General Public License version 3 (GPLv3). It is one of the most popular Free Software licenses. To remain neutral, we wish we could use the term 'FLOSS license' to mean any software license approved by both the Free Software Foundation and the Open Source Initiative. Free Software licenses are tools designed to safeguard and advance the freedom of software users. Indeed, user freedom is the ultimate motivation underlying Free Software. But, since it is not that of Open Source, we cannot casually replace 'Free Software' with 'FLOSS' in our second-to-last sentence.

What we can highlight is that FLOSS licenses grant individual freedoms. There is no built-in mechanism to account for a community. The sense of community is typically derived from the practice of sharing (allowed by FLOSS licensing), in-person or remote participation in events (conferences, hackathons, etc.), and collaboration on certain contributions (possibly event organization, project maintenance, etc.). We can describe FLOSS as pro-sharing, alongside other movements such as Creative Commons or Open Science. We note that the concern for sharing has been at the heart of Free Software since the very beginning [Stallman, 1983].

At the end of the day, distribution and dissemination are one-way ideas. They do not bear on collective responsibility. Still, we recognize that copyleft---and the related ShareAlike offered by Creative Commons licenses---represent a means to extend some responsibility to all community members. Therefore, we hypothesize that, even though copyleft and related tools do serve the project of building digital commons, they might not be sufficient. And, although FLOSS has been a great source of inspiration to other digital commons [Laurent, 2012], the FLOSS way does not have to be the only way to the digital Commons.

Generally, most of the Commons literature seems to present copyright enclosure as the one big threat to the digital Commons. Since digital knowledge is in essence non-rivalrous, there is a presumption that Hardin's famous "tragedy of commons"1 does not apply [Hess and Ostrom, 2011a]. In fact, the opposite is considered more likely to be true: "the tragedy of the anticommons (...) lies in the potential underuse of scarce scientific resources caused by excessive intellectual property rights and overpatenting in biomedical research" [Hess and Ostrom, 2011a]. As a reaction, commons-oriented initiatives tend to overemphasize accessibility, at the expense of sustainability and governance---as if these concerns ranked second in the definition of commons.

Issues, solutions, and questions

A critical take on the priorities put forward for these commons

Commons movements deemed successful include FLOSS, Open Access (OA), or free culture. Why does their focus on free access and use fall short? Mostly because they reflect only the authors' or maintainers' intentions, with little regard for or feedback from the other stakeholders' needs. First of all, the very definitions of what constitutes 'freedom' (in FLOSS and free culture) or 'open access' (in eponymous OA) are subject to a cultural bias. Open Access, for instance, operates a hierarchy between so-called barriers. While the removal of some (price and permission) is a compulsory prerequisite to be labelled OA, others ("handicap access," "connectivity," language, etc.), arguably harder obstacles to overcome, are merely acknowledged as works in progress [Suber, 2011].

This helps us see "free and unfettered access" [Hess and Ostrom, 2011a] as a relative concept, and the set of criteria which determine it as mere guidelines, rather than objective conditions. In this light, we would like to argue for a more comprehensive view of "accessibility." If we are to treat digital Commons as commons, then we may need to do more (or less, depending) than giving up privileges traditionally associated with copyright---a privilege unto itself, ironically! We want to find whatever specific provisions are most likely to serve and engage the community. Yet the 'free/open' argument, insofar as it is arbitrary and partial, necessarily promotes the concerns of some over those of others.

Here is an example of different interests conflicting. In his contribution on OA to book [Hess and Ostrom, 2011b], Suber posits that the concept of open access can be extended to royalty-producing literature [Suber, 2011]. Yet the focus on eliminating the "price barrier" creates a contention. His argument that OA does not adversely affect sales is based on the assumption that people do not read whole books in electronic format---a surprising opinion, which seems irrevocably outdated. Moreover, if maximum dissemination is the goal, then distribution and searchability are more important factors than price---or rather, lack thereof. In many fields where traditional sales channels are still the norm, putting a price on something---even a symbolic one---remains the best guarantee of effectively sharing one's work.

As a matter of fact, in the chapter that follows Suber's, Ghosh argues that a well-regulated marketplace can help realize the process of exchange, which is crucial to the Commons [Ghosh, 2011]. This does not negate Suber's defense of scholars' "insulation from the market", but it does put it into perspective.

Implications for funding and sustainability

In reality, the issue is cyclical. "Consumers" might object to a price which does not appropriately meet their means, or their perception of the work's value. On the other hand, commoners, who may have a say in setting the price and defining exactly what they are paying for (access, use of a resource, or the work which made both possible), would, presumably, agree to such a (financial) contribution. Here, we are drawing on the now well-known research on common-pool resource systems. It has shown success not to be linked with any specific set of rules, but broader principles [Hess and Ostrom, 2011a]. One of them is of particular interest to us: "Individuals affected by these rules can usually participate in modifying the rules."

But are such findings relevant to digital commons? Usually, the latter are considered a separate category, on the basis of their non-rivalry (or low subtractability) [Hess and Ostrom, 2011a]. Unlike common-pool resources, they cannot be depleted or destroyed through overuse. This might be somewhat true of the resource itself, but what about the human labour needed to create and maintain it? We want to point out that time and work capacity are uniquely rivalrous resources. The Sustainers recognize this duality as well ("the sustainability of resources and the sustainability of people"). Therefore, wouldn't a certain level of institutions still be in order, if not to regulate the use of the digital resource, then at least to take care of the human resource?

We are led to believe that a strong sense of community, implying shared values and adherence to the rules in place, is as significant for the sustainability of digital commons as it is for that of other types of commons. When it comes to funding, we have already mentioned that the more engaged users are, the less they should be tempted to free-ride. We also think that, in the case of collective projects, treating every potential user as another commoner can only help with the recruitment and long-term integration of contributors. The orientation of the pandas project (Python data analysis library), as stated in their governance document, seems to support this claim: "we strive to keep the barrier between Contributors and Users as low as possible"; "In general all Project decisions are made through consensus among the Core Team with input from the Community." Evidently, they see value in doing so.

However, we may note that the funding of the project is left to the care of a distinct organization, i.e., NumFOCUS (which, as a side note, Marianne loves). We can also concede that each digital commons has its own specific requirements and culture. For example, formal, centralized types of institutions, which have worked well for environmental commons, will not necessarily be successful with FLOSS commons [Schweik and English, 2007]. Again, rules and systems will be diverse, since they must above all be designed to "[match] local needs and conditions", to quote Hess and Ostrom.

Conclusion

In this article, we chose to target funding as a key to digital commons' sustainability. However, it is obviously not the only issue. Preservation, legitimate use, and diversity should all be core concerns to anyone looking to build and enrich the Commons. For, when we speak of 'the Knowledge Commons', we never mean a particular piece of knowledge, but rather the entire ecosystem which allows as many people as possible to keep creating and sharing knowledge.

References

Bollier, David. 2011. “The Growth of the Commons Paradigm.” In Understanding Knowledge as a Commons: From Theory to Practice, edited by Charlotte Hess and Elinor Ostrom. MIT Press.

Ghosh, Shubha. 2011. “How to Build a Commons: Is Intellectual Property Constrictive, Facilitating, or Irrelevant?” In Understanding Knowledge as a Commons: From Theory to Practice, edited by Charlotte Hess and Elinor Ostrom. MIT Press.

Hess, Charlotte, and Elinor Ostrom. 2011a. “Introduction: An Overview of the Knowledge Commons.” In Understanding Knowledge as a Commons: From Theory to Practice, edited by Charlotte Hess and Elinor Ostrom. MIT Press.

———, eds. 2011b. Understanding Knowledge as a Commons: From Theory to Practice. MIT Press.

Laurent, Philippe. 2012. “Free and Open Source Software Licensing: A Reference for the Reconstruction of ‘Virtual Commons’?” In Conference for the 30th Anniversary of the CRID, 1–19. s.n. http://www.crid.be/pdf/public/7133.pdf.

Maurel, Lionel. 2016. “Les Little Free Libraries, victimes d’une Tragédie des Communs ?” http://www.les-communs-dabord.org/les-little-free-libraries-victimes-dune-tragedie-des-communs/ Accessed on Thu, December 21, 2017.

Nickolls, Ben. 2017. A One Day Conversation for Open Source Software Sustainers. Sustain. GitHub HG (SF). https://sustainoss.org/assets/pdf/SustainOSS-west-2017-report.pdf Accessed on Thu, December 21, 2017.

Ramachandran, Arthi, and Augustin Chaintreau. 2015. “Who Contributes to the Knowledge Sharing Economy?” In Proceedings of the 2015 ACM on Conference on Online Social Networks, 37–48. COSN ’15. New York, NY: ACM. https://doi.org/10.1145/2817946.2817963.

Schweik, Charles M. 2011. “Free/Open-Source Software as a Framework for Establishing Commons in Science.” In Understanding Knowledge as a Commons: From Theory to Practice, edited by Charlotte Hess and Elinor Ostrom. MIT Press.

Schweik, Charles M., and Robert English. 2007. “Tragedy of the FOSS Commons? Investigating the Institutional Designs of Free/Libre and Open Source Software Projects.” First Monday 12 (2). https://doi.org/10.5210/fm.v12i2.1619.

Stallman, Richard. 1983. “Why Programs Should be Shared.” https://www.gnu.org/gnu/why-programs-should-be-shared.html Accessed on Thu, December 21, 2017.

———. 2013. “FLOSS and FOSS.” https://www.gnu.org/philosophy/floss-and-foss.en.html Accessed on Thu, December 21, 2017.

Suber, Peter. 2011. “Creating an Intellectual Commons Through Open Access.” In Understanding Knowledge as a Commons: From Theory to Practice, edited by Charlotte Hess and Elinor Ostrom. MIT Press.

Turk, Matthew. 2016. “The Royal ‘We’ in Scientific Software Development.” https://medium.com/@matthewturk/the-royal-we-in-scientific-software-development-9deea495b3b6 Accessed on Thu, December 21, 2017.


  1. The tragedy of the commons describes the overexploitation or free-riding that leads to a shared resource's destruction. 

by Marianne Corvellec and Jeanne Corvellec at January 07, 2018 11:00 PM

Filipe Saraiva

Discussing the future of Cantor

Hello devs! Happy new year!

It is common to use the new year as a date to start new projects or give new directions to old ones. The latter is the case for Cantor.

Since I became the maintainer of Cantor, I have been working to improve the community around the software. Thanks to Qt's great plugin system, it is easy to write new backends for Cantor, and in fact over the last years Cantor has reached 11 backends.

On the one hand this is a nice thing, because Cantor can run different mathematical engines; on the other hand, it is very common for developers to create backends, release them with Cantor upstream, and then forget about that piece of software after a few months. The consequence is a lot of unsolved bugs in Bugzilla, unexpected behaviour in some backends, and more.

For instance, the R backend has been broken for some years now (thanks to Rishabh it was fixed during his GSoC/KDE Edu Sprint 2017 work, but the fix has not been released yet). The Sage backend breaks with each new release of Sage.

Different backends use different technologies. The Scilab and Octave backends use QProcess + standard streams; the Python 2 backend uses the Python/C API; the Python 3, R, and Julia backends use D-Bus.

In addition, remember that each programming language used as a mathematical engine for Cantor has its own release schedule, and it is very common for new versions to break the way the backends are implemented.

So, yes, maintaining Cantor is hell.

In order to remedy this I invited developers to become co-maintainers of their respective backends, but it did not have the effect I hoped for. I also implemented a way to show which versions of the programming languages each backend supports, but that did not work well either.

So my main work on Cantor during these years has been trying to fix bugs in backends I don't use and, sometimes, don't know how they work, while new features were impossible to plan and implement.

If we take a look at Jupyter, the main software for notebook-based mathematical computation, we can see that it supports several programming languages. But, in fact, this support is provided by the community – Jupyter focuses its effort on Python support only (the IPython kernel) and on new features for Jupyter itself.

So I would like to hear from the KDE and Cantor communities about the future of Cantor. My proposal is to split out the code of the other backends and turn them into third-party plugins, maintained by their respective communities. Only the Python 3 backend would be “officially” maintained and delivered in the KDE Applications bundle.

This way I could focus on providing new features, and I could say “well, this bug with backend X must be reported to the X backend community, because they are accountable for that piece of software”.

So, what do you think?

by Filipe Saraiva at January 07, 2018 02:07 PM

January 03, 2018

Jarrod Millman

BIDS is hiring NumPy developers

The Berkeley Institute for Data Science (BIDS) is hiring Open Source Scientific Python Developers to contribute to NumPy.  You can read more about the new positions here.  For more information about the work this grant will support, please see this NumPy lecture by BIDS Computational Fellow Nathaniel Smith.  Interested applicants can find more information in the job posting.

by Jarrod Millman (noreply@blogger.com) at January 03, 2018 11:46 AM

January 02, 2018

January 01, 2018

William Stein

Low latency local CoCalc and SageMath on the Google Pixelbook: playing with Crouton, Gallium OS, Rkt, Docker

I just got CoCalc fully working locally on my Google Pixelbook Chromebook! I want this, since (1) I was inspired by a recent blog post about computer latency, and (2) I'm going to be traveling a lot next week (the JMM in San Diego -- come see me at the Sage booth), and may have times without Internet during which I want to work on CoCalc's development.


I first tried Termux, which is a "Linux in Android" userland that runs on the Pixelbook (via Android), but there were way, way too many problems for CoCalc, which is a very complicated application, so this was out. The only option was to enable ChromeOS dev mode.

I next considered partitioning the hard drive, installing Linux natively (in addition to ChromeOS), and dual booting. However, it seems the canonical option is Gallium OS, and nobody has got that to work with the Pixelbook yet (?). In fact, it appears that Gallium OS development may have stopped a year ago (?). Bummer. So I gave up on that approach...

The next option was to try Crouton + Docker, since we have a CoCalc Docker image. Unfortunately, it seems currently impossible to use Docker with the standard ChromeOS kernel.  The next thing I considered was to use Crouton + Rkt, since there are blog posts claiming Rkt can run vanilla Docker containers on Crouton.

I set up Crouton, installed the cli-extra chroot, and easily installed Rkt. I learned how Rkt is different from Docker, and tried a bunch of simple standard Docker containers, which worked. However, when I tried running the (huge) CoCalc Docker container, I hit major performance issues, and things broke down. If I had the 16GB Chromebook and more patience, maybe this would have worked. But with only 8GB RAM, it really wasn't feasible.

The next idea was to just use Crouton Linux directly (so no containers), and fix whatever issues arose. I did this, and it worked well, with CoCalc giving me a very nice local browser-based interface to my Crouton environment. Also, since we've spent so much time optimizing CoCalc to be fast over the web, it feels REALLY fast when used locally. I made some changes to the CoCalc sources and added a directory, to hopefully make this easier if anybody else tries. This is definitely not a 1-click solution.

Finally, for SageMath I first tried the Ubuntu PPA, but realized it is hopelessly out of date. I then downloaded and extracted the Ubuntu 16.04 binary and it worked fine. Of course, I'm also building Sage from source (I'm the founder of SageMath after all), but that takes a long time...


Anyway, Crouton works really, really well on the Pixelbook, especially if you do not need to run Docker containers.

by William Stein (noreply@blogger.com) at January 01, 2018 10:29 PM

December 20, 2017

December 19, 2017

December 18, 2017

Jake Vanderplas

Simulating Chutes & Ladders in Python

[img: Chutes and Ladders animated simulation]

This weekend I found myself in a particularly drawn-out game of Chutes and Ladders with my four-year-old. If you've not had the pleasure of playing it, Chutes and Ladders (also sometimes known as Snakes and Ladders) is a classic kids board game wherein players roll a six-sided die to advance forward through 100 squares, using "ladders" to jump ahead, and avoiding "chutes" that send you backward. It's basically a glorified random walk with visual aids to help you build a narrative. Thrilling. But she's having fun practicing counting, learning to win and lose gracefully, and developing the requisite skills to be a passionate sports fan, so I play along.

On the approximately twenty third game of the morning, as we found ourselves in a near endless cycle of climbing ladders and sliding down chutes, never quite reaching that final square to end the game, I started wondering how much longer the game could last: what is the expected length of a game? How heavy are the tails of the game length distribution? How succinctly could I answer those questions in Python? And then, at some point, it clicked: Chutes and Ladders is memoryless — the effect of a roll depends only on where you are, not where you've been — and so it can be modeled as a Markov process! By the time we (finally) hit square 100, I basically had this blog post written, at least in my head.
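
As a rough sketch of that idea (this is not Jake's code; the board layout below is the commonly cited one, and the "overshoot means stay put" rule is an assumption, so adjust both to your edition of the game), the transition matrix and the expected game length can be computed like this:

import numpy as np

# Chutes and ladders as a square -> square mapping (assumed standard layout).
JUMPS = {1: 38, 4: 14, 9: 31, 21: 42, 28: 84, 36: 44, 51: 67, 71: 91, 80: 100,
         16: 6, 47: 26, 49: 11, 56: 53, 62: 19, 64: 60, 87: 24, 93: 73,
         95: 75, 98: 78}

def transition_matrix(n_squares=100):
    """Markov transition matrix over states 0..n_squares (0 = off-board start)."""
    T = np.zeros((n_squares + 1, n_squares + 1))
    for start in range(n_squares):          # square 100 is absorbing
        for roll in range(1, 7):
            end = start + roll
            if end > n_squares:             # assumed rule: overshooting means you stay put
                end = start
            end = JUMPS.get(end, end)       # slide down chutes, climb ladders
            T[start, end] += 1 / 6
    T[n_squares, n_squares] = 1.0
    return T

T = transition_matrix()
Q = T[:-1, :-1]                             # transitions among transient states
N = np.linalg.inv(np.eye(Q.shape[0]) - Q)   # fundamental matrix of the absorbing chain
print("expected number of rolls from the start:", N[0].sum())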

When I tweeted about this, people pointed me to a number of similar treatments of Chutes & Ladders, so I'm under no illusion that this idea is original. Think of this as a blog post version of a dad joke: my primary goal is not originality, but self-entertainment, and if anyone else finds it entertaining that's just an added bonus.

by Jake VanderPlas at December 18, 2017 06:00 PM

Continuum Analytics

The Most Popular Anaconda Webinars of 2017

Happy holidays, #AnacondaCREW! Our experts love participating in the Python data science community by sharing their experiences through live, interactive webinars. Below are our most-viewed Anaconda webinars of the year. They’re now available on-demand, so even if you missed them the first time around, you can still watch and learn! Taming the Python Data Visualization …
Read more →

by Janice Zhang at December 18, 2017 02:43 PM

December 14, 2017

Paul Ivanov

SciPy 2018 dates and call for abstracts

I'm helping with next year's SciPy conference, so here are the details:

July 9-15, 2018 | Austin, Texas

Tutorials: July 9-10, 2018
Conference (Talks and Posters): July 11-13, 2018
Sprints: July 14-15, 2018

SciPy 2018, the 17th annual Scientific Computing with Python conference, will be held July 9-15, 2018 in Austin, Texas. The annual SciPy Conference brings together over 700 participants from industry, academia, and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development. The call for abstracts for SciPy 2018 for talks, posters and tutorials is now open. The deadline for submissions is February 9, 2018.

Talks and Posters (July 11-13, 2018)

In addition to the general track, this year will have specialized tracks focused on:

  • Data Visualization
  • Reproducibility and Software Sustainability

Mini Symposia

  • Astronomy
  • Biology and Bioinformatics
  • Data Science
  • Earth, Ocean and Geo Science
  • Image Processing
  • Language Interoperability
  • Library Science and Digital Humanities
  • Machine Learning
  • Materials Science
  • Political and Social Sciences

There will also be a SciPy Tools Plenary Session each day with 2 to 5 minute updates on tools and libraries.

Tutorials (July 9-10, 2018)

Tutorials should be focused on covering a well-defined topic in a hands-on manner. We are looking for awesome techniques or packages, helping new or advanced Python programmers develop better or faster scientific applications. We encourage submissions to be designed to allow at least 50% of the time for hands-on exercises even if this means the subject matter needs to be limited. Tutorials will be 4 hours in duration. In your tutorial application, you can indicate what prerequisite skills and knowledge will be needed for your tutorial, and the approximate expected level of knowledge of your students (i.e., beginner, intermediate, advanced). Instructors of accepted tutorials will receive a stipend.

Mark Your Calendar for SciPy 2018!

by Paul Ivanov at December 14, 2017 08:00 AM

December 13, 2017

Enthought

Cheat Sheets: Pandas, the Python Data Analysis Library

Download all 8 Pandas Cheat Sheets

Learn more about the Python for Data Analysis and Pandas Mastery Workshop training courses

Pandas (the Python Data Analysis library) provides a powerful and comprehensive toolset for working with data. Fundamentally, Pandas provides a data structure, the DataFrame, that closely matches real-world data such as experimental results, SQL tables, and Excel spreadsheets, and that no other mainstream Python package provides. In addition, it includes tools for reading and writing diverse files, data cleaning and reshaping, analysis and modeling, and visualization. Using Pandas effectively can give you super powers, regardless of whether you’re working in data science, finance, neuroscience, economics, advertising, web analytics, statistics, social science, or engineering.

However, learning Pandas can be a daunting task because the API is so rich and large. This is why we created a set of cheat sheets built around the data analysis workflow illustrated below. Each cheat sheet focuses on a given task. It shows you the 20% of functions you will be using 80% of the time, accompanied by simple and clear illustrations of the different concepts. Use them to speed up your learning, or as a quick reference to refresh your mind.

Here’s the summary of the content of each cheat sheet:

  1. Reading and Writing Data with Pandas: This cheat sheet presents common usage patterns when reading data from text files with read_table, from Excel documents with read_excel, from databases with read_sql, or when scraping web pages with read_html. It also introduces how to write data to disk as text files, into an HDF5 file, or into a database.
  2. Pandas Data Structures: Series and DataFrames: It presents the two main data structures, the DataFrame and the Series. It explains how to think about them in terms of common Python data structures and how to create them. It gives guidelines about how to select subsets of rows and columns, with clear explanations of the difference between label-based indexing, with .loc, and position-based indexing, with .iloc (see the short example after this list).
  3. Plotting with Series and DataFrames: This cheat sheet presents some of the most common kinds of plots together with their arguments. It also explains the relationship between Pandas and matplotlib and how to use them effectively. It highlights the similarities and differences between plotting data stored in Series and in DataFrames.
  4. Computation with Series and DataFrames: This one codifies the behavior of DataFrames and Series as following 3 rules: alignment first, element-by-element mathematical operations, and column-based reduction operations. It covers the built-in methods for most common statistical operations, such as mean or sum. It also covers how missing values are handled by Pandas.
  5. Manipulating Dates and Times Using Pandas: The first part of this cheatsheet describes how to create and manipulate time series data, one of Pandas’ most celebrated features. Having a Series or DataFrame with a Datetime index allows for easy time-based indexing and slicing, as well as for powerful resampling and data alignment. The second part covers “vectorized” string operations, which is the ability to apply string transformations on each element of a column, while automatically excluding missing values.
  6. Combining Pandas DataFrames: The sixth cheat sheet presents the tools for combining Series and DataFrames together, with SQL-type joins and concatenation. It then goes on to explain how to clean data with missing values, using different strategies to locate, remove, or replace them.
  7. Split/Apply/Combine with DataFrames: “Group by” operations involve splitting the data based on some criteria, applying a function to each group to aggregate, transform, or filter them and then combining the results. It’s an incredibly powerful and expressive tool. The cheat sheet also highlights the similarity between “group by” operations and window functions, such as resample, rolling and ewm (exponentially weighted functions).
  8. Reshaping Pandas DataFrames and Pivot Tables: The last cheatsheet introduces the concept of “tidy data”, where each observation, or sample, is a row, and each variable is a column. Tidy data is the optimal layout when working with Pandas. It illustrates various tools, such as stack, unstack, melt, and pivot_table, to reshape data into a tidy form or to a “wide” form.
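
As a quick taste of the label-based versus position-based indexing covered by the second cheat sheet, here is a tiny, hypothetical example (not taken from the cheat sheets themselves):

import pandas as pd

df = pd.DataFrame({"temp": [21.0, 23.5, 19.8]},
                  index=["paris", "lyon", "nice"])

df.loc["lyon", "temp"]   # label-based: the row labelled "lyon"
df.iloc[1, 0]            # position-based: second row, first column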

Download all 8 Pandas Cheat Sheets

Data Analysis Workflow

Ready to accelerate your skills with Pandas?

Enthought’s Pandas Mastery Workshop (for experienced Python users) and Python for Data Analysis (for those newer to Python) classes are ideal for those who work heavily with data. Contact us to learn more about onsite corporate or open class sessions.

 


by admin at December 13, 2017 10:12 PM

December 12, 2017

Continuum Analytics

Parallel Python with Numba and ParallelAccelerator

With CPU core counts on the rise, Python developers and data scientists often struggle to take advantage of all of the computing power available to them. CPUs with 20 or more cores are now available, and at the extreme end, the Intel® Xeon Phi™ has 68 cores with 4-way Hyper-Threading. (That’s 272 active threads!) To …
Read more →
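
The linked post covers the details; as a generic illustration of the kind of loop Numba's parallel target can spread across CPU cores (a sketch, not code from the post), consider:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_norms(a):
    out = np.empty(a.shape[0])
    for i in prange(a.shape[0]):   # iterations are distributed across CPU cores
        out[i] = np.sqrt((a[i] ** 2).sum())
    return out

row_norms(np.random.rand(10000, 100))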

by Janice Zhang at December 12, 2017 06:28 PM

Anaconda Welcomes Lars Ewe as SVP of Engineering

Anaconda, Inc., provider of the most popular Python data science platform, today announced Lars Ewe as the company’s new senior vice president (SVP) of engineering. With more than 20 years of enterprise engineering experience, Ewe brings a strong foundation in big data, real-time analytics and security. He will lead the Anaconda Enterprise and Anaconda Distribution engineering teams.

by Team Anaconda at December 12, 2017 02:00 PM

December 11, 2017

Jake Vanderplas

Optimization of Scientific Code with Cython: Ising Model

Python is quick and easy to code, but can be slow when doing intensive numerical operations. Translating code to Cython can be helpful, but in most cases requires a bit of trial and error to achieve the optimal result. Cython's tutorials contain a lot of information, but for iterative workflows like optimization with Cython, it's often useful to see it done "live".

For that reason, I decided to record some screencasts showing this iterative optimization process, using an Ising Model as an example application.

by Jake VanderPlas at December 11, 2017 08:00 PM

Filipe Saraiva

KDE Edu Sprint 2017

Two months ago I attended the KDE Edu Sprint 2017 in Berlin. It was my first KDE sprint (really, I have been sending code to KDE software since 2010 and had never been to a sprint!), so I was really excited about the event.

KDE Edu is the umbrella for KDE's educational software. There is a lot of it, and it is the main educational software suite in the free software world. Despite that, KDE Edu has received little attention on the organizational side; for instance, the previous KDE Edu sprint happened several years ago, our website has some problems, and more.

Therefore, this sprint was an opportunity not only for developers to work on software development, but to work on the organizational side as well.

On the organizational side, we discussed the rebranding of some software that is more related to university work than to “education” itself, like Cantor and LabPlot. There was a wish to create something like KDE Research/Science in order to put software like them, and others like Kile and KBibTeX, under the same umbrella. There is an ongoing discussion about this.

Another topic was the discussion about a new website, oriented more towards teaching how to use KDE software in an educational context than towards presenting a set of software. In fact, I think we need to do this and strengthen the “KDE Edu” brand in order to have a specific icon+link on the KDE products page.

Next, the developers at the sprint agreed on a multi-operating-system policy for KDE Edu. KDE software can be built and distributed to users of several operating systems, not only Linux. During the sprint some developers worked on installers for Windows and Mac OS, on porting applications to Android, and on creating distribution-independent Linux installers using Flatpak.

Besides these discussions, I worked on a rule to send an e-mail to the KDE Edu mailing list for each new Differential Revision of KDE Edu software in Phabricator. Sorry devs, our mailboxes are full of e-mails because of me.

Now, on the development side, my focus was working hard on Cantor. First, I did some task triage on our workboard: closing, opening, and adding more information to some tasks. Secondly, I reviewed some work done by Rishabh Gupta, my student during GSoC 2017. He ported the Lua and R backends to QProcess, and this will be available soon.

After that I worked on porting the Python 3 backend to the Python/C API. This work is in progress and I expect to finish it in time for the 18.04 release.

Of course, besides this amount of work we have fun with some beers and German food (and some American food and Chinese food and Arab food and Italian food as well)! I was happy because my 31 years birthday was in the first day of the sprint, so thank you KDE for coming to my birthday party full of code and good beers and pork dishes. 🙂

To finish, it is always a pleasure to meet the gearheads like my Spanish friends Albert and Aleix, the only other Mageia user I found personally in my life Timothée, my GSoC student Rishabh, my irmão brasileiro Sandro, and the new friends Sanjiban and David.

Thank you KDE e.V for provide resources to the sprint and thank you Endocode for hosting the sprint.

by Filipe Saraiva at December 11, 2017 03:22 PM

December 10, 2017

Titus Brown

The #CommonsPilot kicks off!!

(Just in case it's not clear, I do not speak for the NIH or for the Data Commons Pilot Phase Consortium in this blog post! These are my own views and perspectives, as always.)

I'm just coming back from the #CommonsPilot kickoff meeting. This was our first face-to-face meeting on the Data Commons Pilot effort, which is a new trans-NIH effort.

The Data Commons Pilot started with the posting of a funding call to assemble a Pilot Phase Consortium using a little-known NIH funding mechanism called an "Other Transactions" agreement. This is a fundamentally different award system from grants, contracts, and cooperative agreements: it lets the NIH interact closely with awardees, adjust funding as needed on a very short time scale, and otherwise gives them a level of engagement with the actual work that I've never seen before.

I, along with many others, applied to this funding call (I'll be posting our initial application soon!) and after many trials and tribbleations I ended up being selected to work on training and outreach. Since then I've also become involved in internal coordination, which dovetails nicely with the training/outreach role.

The overall structure of the Data Commons Pilot Phase Consortium is hard to explain and not fully worked out yet, but we have a bunch of things that kind of resemble focus groups, called "Key Capabilities", that correspond to elements of the funding call -- we've put together a draft Web site that lists them all. For example, Key Capability 2 is "GUIDs" - this group of nice people is going to be concerned with identification of "objects" in Data Commonses. Likewise, there's a Scientific Use Cases KC (KC8) that is focused on what researchers and clinicians actually want to do.

(The complete list of awardees is here.)

This kickoff meeting was ...interesting. There were about 100 people (NIH folk, data stewards, OT awardees, cloud providers, and others) at the meeting, and the goal was to dig in to what we actually needed to do during the first 180 days of this effort - aka Pilot Phase I. (Stay tuned on that front.) We managed to put together something that was more "Unconference style" than the typical NIH organizational meeting, and this resulted in what I would call "chaos lite", which was not uniformly enjoyable but also not uniformly miserable. I'm not sure how close we came to actually nailing down what we needed to do, but we are certainly closer to it than we were before!

So... really, what is a Data Commons?

No one really knows, in detail. Let's start there!

(I recalled that Cameron Neylon had written about this, and a quick Google search found this post from 2008. (I find it grimly amusing how many of the links in his blog post no longer work...) Some pretty good stuff in there!) I don't know of earlier mentions of the Commons, but a research commons has been under discussion for about a decade.

What is clear from my 2017 vantage point is that a data commons should provide some combination of tools, data, and compute infrastructure, so that people can bring their own tools and bring their own data and combine it with other tools and other data to do data analysis. In the context of a biomedical data commons we have to also be cognizant of legal and ethical issues surrounding access to and use of controlled data, which was a pretty big topic for us (there's a whole Key Capability devoted just to that - see KC6!)

There are, in fact, many biomedical data commons efforts - e.g. the NCI Genomic Data Commons, which shares a number of participants with the #CommonsPilot, and others that I discovered just this week (e.g. the Analysis Commons). So this Data Commons (#CommonsPilot, to be clear) is just one of many. And I think that has interesting implications that I'm only beginning to appreciate.

Something else that has changed since Cameron's 2008 blog post is the power and ubiquity of cloud platforms. "Cloud" is now an everyday word, and many researchers in academia, industry, and nonprofits use it every day. So it has become much clearer that the cloud is one future of biomedical compute, if not the only one.

(I would like to make it clear that Bitcoin is not part of the #CommonsPilot effort. Just in case anyone was wondering how buzzword compliant we were going to try to be.)

But this still leaves us at a bit of an impasse. OK, so we're talking about tools, data, and compute infrastructure... in the cloud... that leaves a lot of room :).

Here are some things that I haven't mentioned yet, but that are explicitly or implicitly part of the Commons Pilot effort as I see it.

  • openness. We must build an open platform to enable a true commons that is accessible to everyone, setting aside the issues of controlled data access. See: Commons.

  • eventual community governance, in some shape or form. (Geoff Bilder, Jennifer Lin, and Cameron Neylon cover this brilliantly in their Principles for Open Scholarly Infrastructure.)

  • multi-tenant. This isn't going to run just on one cloud provider, or one HPC.

  • platform mentality. This is gonna have to be a platform, folks, and we're gonna have to dogfood it. (obligatory link to Yegge rant)

  • larger than any one funding organization. This is necessary for long-term sustainability reasons, but also is an important requirement for a Commons in the first place. There may be disproportionate inputs from certain funders at the beginning, but ultimately I suspect that any Commons will need to be a meeting place for research writ large - which inevitably means not only NIH funded researchers, not just US researchers, but researchers world wide.

I haven't quite wrapped my head around the scope that these various requirements imply, but I think it becomes quite interesting in its implications. More on that as I noodle.

Why are Commonses needed, and what would a successful #CommonsPilot enable?

Perhaps my favorite section of the #CommonsPilot meeting was the brainstorming bit around why we needed a Commons, and what this effort could enable (as part of the larger Commons ecosystem). Here, in no particular order, is what we collectively came up with. (David Siedzik ran the session very capably!)

(As I write up the list below, I'd like to point out that it is really very incomplete: we only did this exercise for about 30 minutes, and many important issues raised afterwards weren't captured here. It also reflects only my memory and notes, so riff on it as you will in the comments!)

  • The current scale of data overwhelms naive/simple platforms.
  • The #CommonsPilot must enable access to restricted data in a more uniform way, such that e.g. cross-data set integration becomes more possible.
  • The #CommonsPilot must have a user interface for browsing and exploratory investigation.
  • The #CommonsPilot will enable alignments of approaches across data sets.
  • Integration of tools and data is much easier in a #Commons.
  • Distribution and standardization of tools, data formats, and metadata will enhance robustness of analyses
  • There will be a community of users that will drive extensions to and enhancement of the platform over time.
  • Time to results will decrease as we more and more effectively employ compute across large clouds, and reuse previous results.
  • We expect standardization around formats and approaches, writ large (that is, the #CommonsPilot will contribute significantly to the refinement and deployment of standards and conventions).
  • The #CommonsPilot will expand accessibility of tools, compute, and data to many more scientists.
  • We hope to reduce redundant and repeated analyses where it makes sense.
  • Methods sharing will happen more, if we are successful!
  • Lower costs to data analysis, and lower barriers to doing research as a result.
  • An enhanced ability to share in new ways that we can't fully appreciate yet.
  • A resulting encouragement and development of new types of questions and inquiry.
  • Enhanced sustainability of data as a currency in research.
  • We hope to enhance and extend the life cycle of data.
  • We hope to enable comparison and benchmarking of approaches on a common platform.
  • We hope to help shape policy by demonstrating the value of cloud, and the value of open.
  • More ethical and effective use of more (all!?) data
  • More robust security/auditing of data access and tool use.
  • Enhanced training and documentation around responsible conduct of computational research.

So as you can see it's a pretty unambitious effort and I wouldn't be at all surprised if we were done in a year.

I'd love to explore these issues in comments or in blog posts that other people write about why we're wrong, or incomplete, or short-sighted, or too visionary. Fair game, folks - go for it!

How open are we gonna be about all of this?

That's a good question and I have two answers:

  1. We are hoping to be more open than ever before. As a sign of this, Vivien Bonazzi claims that at least one loud-mouthed open science advocate is involved in the effort. I'll let you know who it is when I find out myself.

  2. Not as open as I'd like to be, and for good reasons. While this effort is partly about building a platform for community, and community engagement will be an intrinsic part of this effort (more on that, sooner or later!), there are contractual issues and NIH requirements that need to be met. Moreover, we need to thread the needle of permitting internal frank discussions while promoting external engagement.

So we'll see!

--titus

by C. Titus Brown at December 10, 2017 11:00 PM

December 06, 2017

Matthew Rocklin

Dask Development Log

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Current development in Dask and Dask-related projects includes the following efforts:

  1. A possible change to our community communication model
  2. A rewrite of the distributed scheduler
  3. Kubernetes and Helm Charts for Dask
  4. Adaptive deployment fixes
  5. Continued support for NumPy and Pandas’ growth
  6. Spectral Clustering in Dask-ML

Community Communication

Dask community communication generally happens in Github issues for bug and feature tracking, the Stack Overflow #dask tag for user questions, and an infrequently used Gitter chat.

Separately, Dask developers who work for Anaconda Inc (there are about five of us part-time) use an internal company chat and a closed weekly video meeting. We’re now trying to migrate away from closed systems when possible.

Details about future directions are in dask/dask #2945. Thoughts and comments on that issue would be welcome.

Scheduler Rewrite

When you start building clusters with 1000 workers the distributed scheduler can become a bottleneck on some workloads. After working with PyPy and Cython development teams we’ve decided to rewrite parts of the scheduler to make it more amenable to acceleration by those technologies. Note that no actual acceleration has occurred yet, just a refactor of internal state.

Previously the distributed scheduler was organized around a large set of Python dictionaries, sets, and lists that indexed into each other heavily. This was done both to keep the code technology simple and for performance reasons (Python core data structures are fast). However, compiler technologies like PyPy and Cython can optimize Python object access down to C speeds, so we're experimenting with switching from plain Python data structures to Python objects to see how much this helps.

This change will be invisible operationally (the full test suite remains virtually unchanged), but it is a significant change to the scheduler's internal state. We're keeping around a compatibility layer, but people who were building their own diagnostics around the internal state should check out the new changes.

Ongoing work by Antoine Pitrou in dask/distributed #1594
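To make the shape of that refactor concrete, here is a toy illustration of my own (not code from the linked pull request) of the difference between dict-based and object-based scheduler state:

# Toy illustration only -- not the actual Dask scheduler code.

# Before: many parallel dictionaries, all indexed by task key.
task_state = {'x': 'memory', 'y': 'waiting'}
task_dependencies = {'x': set(), 'y': {'x'}}

# After: one object per task, which compilers like Cython and PyPy
# can optimize down to C-speed attribute access.
class TaskState(object):
    __slots__ = ('key', 'state', 'dependencies')

    def __init__(self, key, state='waiting'):
        self.key = key
        self.state = state
        self.dependencies = set()

tasks = {'x': TaskState('x', 'memory'), 'y': TaskState('y')}
tasks['y'].dependencies.add(tasks['x'])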

Kubernetes and Helm Charts for Dask

In service of the Pangeo project to enable scalable data analysis of atmospheric and oceanographic data we’ve been improving the tooling around launching Dask on Cloud infrastructure, particularly leveraging Kubernetes.

To that end we’re making some flexible Docker containers and Helm Charts for Dask, and hope to combine them with JupyterHub in the coming weeks.

Work done by me in the following repositories. Feedback would be very welcome; I am learning Helm on the job here.

If you use Helm on Kubernetes then you might want to try the following:

helm repo add dask https://dask.github.io/helm-chart
helm update
helm install dask/dask

This installs a full Dask cluster and a Jupyter server. The Docker images include entry points that make it easy to update their environments with custom packages.

This work extends prior work on the previous package, dask-kubernetes, but is slightly more modular for use alongside other systems.

Adaptive deployment fixes

Adaptive deployments, where a cluster manager scales a Dask cluster up or down based on the current workload, recently got a makeover, including a number of bug fixes around odd or infrequent behavior.

Work done by Russ Bubley.
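As a reminder of what an adaptive deployment looks like from the user's side, here is a minimal sketch of my own (not code from that work), assuming a distributed release where the LocalCluster.adapt() convenience method is available:

from dask.distributed import Client, LocalCluster

# Start a local cluster and let it scale itself between 1 and 8 workers
# depending on how much work is queued up.
cluster = LocalCluster(n_workers=1)
cluster.adapt(minimum=1, maximum=8)
client = Client(cluster)

futures = client.map(lambda x: x ** 2, range(1000))
total = client.submit(sum, futures).result()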

Keeping up with NumPy and Pandas

NumPy 1.14 is due to be released soon. Dask.array had to update how it handles structured dtypes in dask/dask #2694 (work by Tom Augspurger).

Dask.dataframe is gaining the ability to merge/join simultaneously on columns and indices, following a similar feature released in Pandas 0.22. Work done by Jon Mease in dask/dask #2960

Spectral Clustering in Dask-ML

Dask-ML recently added an approximate and scalable Spectral Clustering algorithm in dask/dask-ml #91 (gallery example).
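A small usage sketch of my own, with random data; any constructor arguments beyond n_clusters may differ between dask-ml versions:

import dask.array as da
from dask_ml.cluster import SpectralClustering

# Random data purely for illustration, chunked so the work can be parallelized.
X = da.random.random((10000, 4), chunks=(1000, 4))

model = SpectralClustering(n_clusters=3)
model.fit(X)

# Following the scikit-learn convention, cluster assignments live on the
# fitted estimator as `labels_` (which may itself be a dask array).
labels = model.labels_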

December 06, 2017 12:00 AM

December 05, 2017

Continuum Analytics

How to Get Ready for the Release of conda 4.4

As the year winds down it’s time to say out with the old and in with the new. Well, conda is no different. What does conda 4.4 have in store for you? Say goodbye to “source activate” in conda. That is so 2017. With conda 4.4 you can snappily “conda activate” and “conda deactivate” your …
Read more →

by Rory Merritt at December 05, 2017 10:06 PM

Jake Vanderplas

Installing Python Packages from a Jupyter Notebook

In software, it's said that all abstractions are leaky, and this is true for the Jupyter notebook as it is for any other software. I most often see this manifest itself with the following issue:

I installed package X and now I can't import it in the notebook. Help!

This issue is a perennial source of StackOverflow questions (e.g. this, that, here, there, another, this one, that one, and this... etc.).

Fundamentally the problem is usually rooted in the fact that the Jupyter kernels are disconnected from Jupyter's shell; in other words, the installer points to a different Python version than is being used in the notebook. In the simplest contexts this issue does not arise, but when it does, debugging the problem requires knowledge of the intricacies of the operating system, the intricacies of Python package installation, and the intricacies of Jupyter itself. In other words, the Jupyter notebook, like all abstractions, is leaky.

In the wake of several discussions on this topic with colleagues, some online (exhibit A, exhibit B) and some off, I decided to treat this issue in depth here. This post will address a couple things:

  • First, I'll provide a quick, bare-bones answer to the general question, how can I install a Python package so it works with my jupyter notebook, using pip and/or conda?.

  • Second, I'll dive into some of the background of exactly what the Jupyter notebook abstraction is doing, how it interacts with the complexities of the operating system, and how you can think about where the "leaks" are, and thus better understand what's happening when things stop working.

  • Third, I'll talk about some ideas the community might consider to help smooth over these issues, including some changes that the Jupyter, Pip, and Conda developers might consider to ease the cognitive load on users.

This post will focus on two approaches to installing Python packages: pip and conda. Other package managers exist (including platform-specific tools like yum, apt, homebrew, etc., as well as cross-platform tools like enstaller), but I'm less familiar with them and won't be remarking on them further.
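As a teaser, here is a minimal sketch of my own (under the assumption that the notebook kernel and the shell may disagree about which Python they use) of installing into the interpreter that the kernel is actually running:

import sys
import subprocess

# Which Python interpreter is this kernel actually running?
print(sys.executable)

# Install a package (numpy here, purely as an example) into that same
# interpreter, rather than whatever "pip" is first on the shell's PATH.
subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy"])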

by Jake VanderPlas at December 05, 2017 05:00 PM

Leonardo Uieda

GMT and open-source at #AGU17 and a GMT/Python online demo

Thumbnail image for publication.

The AGU Fall Meeting is happening next week in New Orleans, potentially gathering more than 20,000 geoscientists in a single place. Paul and I will be there to talk about the next version of the Generic Mapping Tools, my work on GMT/Python, and the role of open-source software in the Geosciences.

There is so much going on at AGU that it can be daunting just to browse the scientific program. I haven't even started and my calendar is already packed. For now, I'll just share the sessions and events in which I'm taking part.

Earth ArXiv meetup

Thursday evening - TBD

The Earth ArXiv logo

The Earth ArXiv is a brand new community-developed preprint server for the Earth and Planetary Sciences. Some other folks who are involved and I will get together for dinner/drinks on Thursday to nerd out offline for a change.

If you are interested in getting involved in Earth ArXiv, join the Loomio group and the ESIP Slack channel and say "Hi". The community is very welcoming and it needs all the help it can get to grow.

We don't know where we'll meet yet but keep posted on Slack and Loomio if you're interested in joining us.

Panel session in the AGU Data Fair

Wednesday 12:30pm - Room 203

I was invited to be a panelist on the Data Capacity Building session of the AGU Data Fair. The fair has other very interesting panels happening throughout the week. They all center around "data": where to get it, what to do with it, how to preserve it, and how to give and receive credit for it.

We'll be discussing what to do with the data once you acquire it. From the panel description:

The panel will discuss the challenges the researcher faces and how methods for managing data are currently available or are expected in the future that will help the researcher build value and capacity in the research data lifecycle.

The discussion will be in an Ask-Me-Anything style (AMA) with moderated questions from the audience (on and offline). If you have any questions that you want us to tackle, tweet them using the hashtag #AGUDataCapacity. They'll be added to a list for the moderators.

I'm really looking forward to this panel and getting to meet some new people in the process.

Paul's talk about GMT6

Thursday 4:15pm - Room 228

Paul is giving the talk The Generic Mapping Tools 6: Classic versus Modern Mode at the Challenges and Benefits of Open-Source Software and Open Data session. He'll be showcasing the new changes that are coming in GMT6, including "modern mode" and a new gmt subplot command. These are awesome new features of GMT aimed at making it more accessible to new users. For all the GMT gurus out there: don't worry, they're also a huge time saver, eliminating many repeated command line options and a lot of boilerplate code.

Panel session about open-source software

Thursday 4-6pm - Room 238

I'll also be a panelist on the session Open-Source Software in the Geosciences. The lineup of panelists is amazing and I'm honored to be included among them. It'll be hard to contain the fan-boy in me. I wonder if geophysicists are used to getting asked for autographs.

The discussion will center around the role of open-source software in our science, how it's affected the careers of those who make it, and what we can do to make it a viable career path for new geoscientists.

My contribution is the abstract "Nurturing reliable and robust open-source scientific software".

Many thanks to the chairs and conveners for putting it together. I'll surely have a lot more to say after the panel.

Poster about GMT/Python

Friday morning - Poster Hall D-F

Last but not least, I'll be presenting the poster "A modern Python interface for the Generic Mapping Tools" about my work on GMT/Python. Come see the poster and chat with me and Paul! I'd love to hear what you want to see in this software. I'll also have a laptop and tablets for you to play around with a demo.

My AGU 2017 poster

You can download a PDF of the poster from figshare at doi:10.6084/m9.figshare.5662411.

A lot has happened since my last update after Scipy2017. Much of the infrastructure work to interface with the C API is done, but there is still a lot to do. Luckily, we just got our first code contributor last week, so it looks like I'll have some help!

You can try out the latest features in an online demo Jupyter notebook by visiting agu2017demo.gmtpython.xyz

The notebook is running on the newly re-released mybinder.org service. The Jupyter team did an amazing job!

Come say "Hi"

If you'll be at AGU next week, stop by the poster on Friday or join the panel sessions if you want to chat or have any questions/suggestions. If you won't, there is always Twitter and the Software Underground Slack group.

See you in New Orleans!


The photo of Bourbon Street in the thumbnail is copyright Chris Litherland and licensed CC-BY-SA.


Comments? Leave one below or let me know on Twitter @leouieda.

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!


December 05, 2017 12:00 PM

December 04, 2017

Continuum Analytics

Anaconda Training: A Learning Path for Data Scientists

Here at Anaconda, our mission has always been to make the art of data science accessible to all. We strive to empower people to overcome technical obstacles to data analysis so they can focus on asking better questions of their data and solving actual, real-world problems. With this goal in mind, we’re excited to announce …
Read more →

by Rory Merritt at December 04, 2017 10:00 PM

December 01, 2017

Titus Brown

Four steps in five minutes to deploy a Carpentry lesson for a class of 30

binder is an awesome technology for making GitHub repositories "executable". With binder 2.0, this is now newly stable and feature-full!

Yesterday I gave a two hour class, introducing graduate students to things like Jupyter Notebook and Pandas, via the Data Carpentry Python for Ecologists lesson. Rather than having everyone install their own software on their laptop (hah!), I decided to give them a binder!

For this purpose, I needed a repository that contained the lesson data and told binder how to install pandas and matplotlib.

Since I'm familiar with GitHub and Python requirements.txt files, it took me about 5 minutes. And the class deployment was flawless!

Building the binder repo

  1. Create a github repo (https://github.com/ngs-docs/2017-davis-ggg201a-day1), optionally with a README.md file.

  2. Upload surveys.csv into the github repository (I got this file as per Data Carpentry's Python for Ecology lesson).

  3. Create a requirements.txt file containing:

pandas
numpy
matplotlib

-- this tells binder to install those things when running this repository.

  4. Paste the GitHub URL into the 'URL' entry box at mybinder.org and click 'launch'.

Two optional steps

These steps aren't required but make life nicer for users.

  1. Upload an index.ipynb notebook so that people will be sent to a default notebook rather than being dropped into the Jupyter Console; note, you'll have to put 'index.ipynb' into the 'Path to a notebook file' field at mybinder.org for the redirect to happen.

  2. Grab the 'launch mybinder' Markdown text from the little dropdown menu to the right of 'launch', and paste it into the README.md in your github repo. This lets people click on a cute little 'launch binder' button to launch a binder from the repo.

Try it yourself!

Click on the button below,

Binder

or visit this URL.

-- in either case you should be sent ~instantly to a running Jupyter Notebook with the surveys.csv data already present.

Magic!!
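For context, the first cell students run in that notebook looks something like this minimal sketch (my own illustration), loading the bundled Data Carpentry data with pandas:

import pandas as pd

# Load the Data Carpentry surveys data that ships with the repository.
surveys = pd.read_csv("surveys.csv")
print(surveys.head())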

What use is this?

This is an excellent way to do a quick demo in a classroom!

It could serve as a quickfix for people attending a Carpentry workshop who are having trouble installing the software.

(I've used it for both - since you can get a command line terminal as well as Python Notebooks and RStudio environments, it's ideal for short R, Python, and shell workshops.)

The big downside so far is that the environment produced by mybinder.org is temporary and expires after some amount of inactivity, so it's not ideal for workshops with lots of discussion - the running environment may go away! No good way to deal with that currently; that's something that a custom JupyterHub deployment would fix, but that is too heavyweight for me at the moment.

(We came up with a lot of additional use cases for binder here.)

Thoughts & comments welcome, as always!

--titus

by C. Titus Brown at December 01, 2017 11:00 PM

November 30, 2017

November 28, 2017

Matthieu Brucher

Announcement: ATKColoredExpander 2.0.0

I’m happy to announce the update of ATK Colored Expander based on the Audio Toolkit and JUCE. They are available on Windows (AVX compatible processors) and OS X (min. 10.9, SSE4.2) in different formats.

This plugin requires the universal runtime on Windows, which is automatically deployed with Windows update (see this discussion on the JUCE forum). If you don’t have it installed, please check the Microsoft website.

ATK Colored Expander 2.0.0

The supported formats are:

  • VST2 (32bits/64bits on Windows, 32/64bits on OS X)
  • VST3 (32bits/64bits on Windows, 32/64bits on OS X)
  • Audio Unit (32/64bits, OS X)

Direct link for ATKColoredExpander.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.


by Matt at November 28, 2017 08:09 AM

November 25, 2017

numfocus

My Favourite Tool: IPython

re-posted with permission from Software Carpentry My favorite tool is … IPython. IPython is a Python interpreter with added features that make it an invaluable tool for interactive coding and data exploration. IPython is most commonly taught via the Jupyter notebook, an interactive web-based tool for evaluating code, but IPython can be used on its own directly in […]

The post My Favourite Tool: IPython appeared first on NumFOCUS.

by NumFOCUS Staff at November 25, 2017 08:08 AM

November 23, 2017

Titus Brown

Why are taxonomic assignments so different for Tara bins? (Black Friday Morning Bioinformatics)

Happy (day after) Thanksgiving!

Now that we can parse custom taxonomies in sourmash and use them for genome classification (tutorial) I thought I'd revisit the Tara ocean genome bins produced by Delmont et al. and Tully et al. (see this blog post for details).

Back when I first looked at the Tully and Delmont bins, my tools for parsing taxonomy were quite poor, and I was limited to using the Genbank taxonomy. This meant that I couldn't deal properly with places where the imputed taxonomies assigned by the authors extended beyond Genbank.

Due to requests from Trina McMahon and Sarah Stevens, this is no longer a constraint! We can now load in completely custom taxonomies and mix and match them as we wish.

How does this change things?

A new sourmash lca command, compare_csv

For the purposes of today's blog post, I added a new command to sourmash lca. compare_csv takes in two taxonomy spreadsheets and compares the classifications, generating a list of compatible and incompatible classifications.
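To make "compatible" versus "incompatible" concrete, here is a toy sketch of the comparison logic as I understand it (my own illustration, not the sourmash implementation): two lineages are compatible when one is a prefix of the other reading down the ranks, and incompatible at the first rank where they disagree.

RANKS = ['superkingdom', 'phylum', 'class', 'order',
         'family', 'genus', 'species']

def compare_lineages(lin_a, lin_b):
    """lin_a, lin_b: dicts mapping rank -> name, with unassigned ranks omitted."""
    for rank in RANKS:
        a, b = lin_a.get(rank), lin_b.get(rank)
        if a is None or b is None:
            return ('compatible', None)      # one lineage is an ancestor of the other
        if a != b:
            return ('incompatible', rank)    # disagreement in the trees
    return ('identical', None)

print(compare_lineages(
    {'superkingdom': 'Eukaryota'},
    {'superkingdom': 'Eukaryota', 'phylum': 'Haptophyta', 'genus': 'Emiliania'}))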

Some quality control and evaluation

(Let's start by asking if our scripts work in the first place!)

When I was working on the sourmash lca stuff I noticed something curious: when I read in the delmont classification spreadsheet and re-classified the delmont genomes, I found more classifications for the genomes than were in the input spreadsheet.

So, for example, when I do:

sourmash lca classify --db delmont-MAGs-k31.lca.json.gz \
    --query delmont-genome-sigs --traverse-directory \
    -o delmont-genome-sigs.classify.csv

and then compare the output CSV with the original table from the Delmont et al. paper,

sourmash lca compare_csv delmont-genome-sigs.classify.csv \
    tara-delmont-SuppTable3.csv

I get the following:

957 total assignments, 24 differ between spreadsheets.
24 are compatible (one lineage is ancestor of another.
0 are incompatible (there is a disagreement in the trees).

What!? Why would we be able to classify new things?

Looking into it, it turns out that these differences are because one input genome's classification informs others, but the way that Delmont et al. did their classifications did not take into account their own genome bins.

For example, TARA_ASW_MAG_00041 is classified as genus Emiliania by sourmash, but is simply Eukaryota in Delmont et al.'s paper. The new classification for 00041 comes from another genome bin, TARA_ASW_MAG_00032, which was firmly classified as Emiliania and shares approximately 1.6% of its k-mers with 00041.

If this holds up, it provides some nice context for Trina's original request for a quick way to classify new genomes against previously classified bins. Quickly feeding custom classifications into new classifications seems quite useful!

We see the same thing when I reclassify the Tully et al. genome sigs against themselves. If I do:

sourmash lca classify \
    --db tully-MAGs-k31.lca.json.gz \
    --query tully-genome-sigs --traverse-directory \
    -o tully-genome-sigs.classify.csv
sourmash lca compare_csv tully-genome-sigs.classify.csv \
    tara-tully-Table4.csv

then I get:

2009 total assignments, 7 differ between spreadsheets.
7 are compatible (one lineage is ancestor of another.
0 are incompatible (there is a disagreement in the trees).
-- so no incompatibilities, but a few "extensions".

What about incompatibilities?

The above was really just internal validation - can we classify genomes against themselves and get consistent answers? It was unexpectedly interesting but not terribly so.

But what if we take the collections of genome bins from tully and reclassify them based on the delmont classifications? And vice versa?

Reclassifying tully with delmont

Let's give it a try!

First, classify the tully genome signatures with an LCA database built from the delmont data:

sourmash lca classify \
    --db delmont-MAGs-k31.lca.json.gz \
    --query tully-genome-sigs --traverse-directory \
    -o tully-query.delmont-db.sigs.classify.csv

Then, compare:

sourmash lca compare_csv \
    tully-genome-sigs.classify.csv \
    tully-query.delmont-db.sigs.classify.csv \
    --start-column=3

and we get:

987 total assignments, 889 differ between spreadsheets.
296 are compatible (one lineage is ancestor of another.
593 are incompatible (there is a disagreement in the trees).
164 incompatible at rank superkingdom
255 incompatible at rank phylum
107 incompatible at rank class
54 incompatible at rank order
13 incompatible at rank family
0 incompatible at rank genus
0 incompatible at rank species

Ouch: almost two thirds are incompatible, 164 of them at the superkingdom level!

For example, in the tully data set, TOBG_MED-875 is classified as a Euryarchaeota, novelFamily_I, but using the delmont data set, it gets classified as Actinobacteria! Digging a bit deeper, this is based on approximately 290kb of sequence, much of it from TARA_MED_MAG_00029, which is classified as Actinobacteria and shares about 8.6% of its k-mers with TOBG_MED-875. So that's the source of that disagreement.

(Some provisional digging suggests that there's a lot of Actinobacterial proteins in TOBG_MED-875, but this would need to be verified by someone more skilled in protein-based taxonomic analysis than me.)

Reclassifying delmont with tully

What happens in the other direction?

First, classify the delmont signatures with the tully database:

sourmash lca classify \
    --db tully-MAGs-k31.lca.json.gz \
    --query delmont-genome-sigs --traverse-directory \
    -o delmont-query.tully-db.sigs.classify.csv

Then, compare:

sourmash lca compare_csv delmont-genome-sigs.classify.csv \
    delmont-query.tully-db.sigs.classify.csv \
    --start-column=3

And see:

604 total assignments, 537 differ between spreadsheets.
193 are compatible (one lineage is ancestor of another.
344 are incompatible (there is a disagreement in the trees).
95 incompatible at rank superkingdom
151 incompatible at rank phylum
66 incompatible at rank class
25 incompatible at rank order
7 incompatible at rank family
0 incompatible at rank genus
0 incompatible at rank species

As you'd expect, this more or less agrees with the results above - lots of incompatibilities, with fully 1/6th incompatible at the rank of superkingdom (!!).

Why are things classified so differently!?

First, a big caveat: my code may be completely wrong. If so, well, best to find out now! I've done only the lightest of spot checks and I welcome further investigation. (TBH, I'm actually kind of hoping that Meren, the senior author on the Delmont et al. study, dives into the Tully data sets and does a more robust reclassification using his methods - he has an inspiring history of doing things like that. ;)

But, assuming my code isn't completely wrong...

On first blush, there are three other possibilities. For each classification, the tully classification could be wrong, the delmont classification could be wrong, or both classifications could be wrong. Either way, they're inconsistent!

On second blush, this all strikes me as a bit of a disaster. Were the taxonomic classification methods used by the Delmont and Tully papers really so different!? How do we trust our own classifications, much less anyone else's?

I will fall back on my usual refrain: we need tools that let us detect and resolve such disagreements quickly and reliably. Maybe sourmash can provide the former, but I'm pretty sure k-mers are too specific to do a good job of resolving disagreements above the genus level.

Anyhoo, I'm out of time for today, so I'll just end with some thoughts for What Next.

What next?

Other than untangling disagreements, what other things could we do? Well, we've just added 60,000 genomes from the JGI IMG database to our previous collection of 100,000 genomes from Genbank, so we can do a classification against all available genomes! And, if we're feeling ambitious, we could reclassify all the genomes against themselves. That might be interesting...

Appendix: Building the databases

Install sourmash lca as in the tutorial.

Grab and unpack the genome signatures for the tully and delmont studies:

curl -L https://osf.io/vngdz/download -o delmont-genome-sigs.tar.gz
tar xzf delmont-genome-sigs.tar.gz
curl -L https://osf.io/28r6m/download -o tully-genome-sigs.tar.gz
tar xzf tully-genome-sigs.tar.gz

Grab the classifications, too:

curl -L -O https://github.com/ctb/2017-sourmash-lca/raw/master/tara-delmont-SuppTable3.csv
curl -L -O https://github.com/ctb/2017-sourmash-lca/raw/master/tara-tully-Table4.csv

Then, build the databases:

sourmash lca index -k 31 --scaled=10000 \
    tara-tully-Table4.csv tully-MAGs-k31.lca.json.gz \
    tully-genome-sigs --traverse-directory
sourmash lca index -k 31 --scaled=10000 \
    tara-delmont-SuppTable3.csv delmont-MAGs-k31.lca.json.gz \
    delmont-genome-sigs --traverse-directory

and now all of the commands above should work.

The whole thing takes about 5 minutes on my laptop, and requires less than 1 GB of RAM and < 100 MB of disk space for the data.

by C. Titus Brown at November 23, 2017 11:00 PM

November 21, 2017

Matthieu Brucher

Announcement: ATKColoredCompressor 2.0.0

I’m happy to announce the update of ATK Colored Compressor based on the Audio Toolkit and JUCE. They are available on Windows (AVX compatible processors) and OS X (min. 10.9, SSE4.2) in different formats.

ATK Colored Compressor 2.0.0

The supported formats are:

  • VST2 (32bits/64bits on Windows, 32/64bits on OS X)
  • VST3 (32bits/64bits on Windows, 32/64bits on OS X)
  • Audio Unit (32/64bits, OS X)

Direct link for ATKColoredCompressor.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.


by Matt at November 21, 2017 08:35 AM

numfocus

My Favourite Tool: Jupyter Notebook

re-posted with permission from Software Carpentry My favourite tool is … the Jupyter Notebook. One of my favourite tools is the Jupyter notebook. I use it for teaching my students scientific computing with Python. Why I like it: Using Jupyter with the plugin RISE, I can create presentations including code cells that I can edit and execute live during the […]

The post My Favourite Tool: Jupyter Notebook appeared first on NumFOCUS.

by NumFOCUS Staff at November 21, 2017 08:01 AM

Matthew Rocklin

Dask Release 0.16.0

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.16.0. This is a major release with new features, breaking changes, and stability improvements. This blogpost outlines notable changes since the 0.15.3 release on September 24th.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Conda packages are available on both conda-forge and default channels.

Full changelogs are available here:

Some notable changes follow.

Breaking Changes

  • The dask.async module was moved to dask.local for Python 3.7 compatibility. This was previously deprecated and is now fully removed (see the short import sketch after this list).
  • The distributed scheduler’s diagnostic JSON pages have been removed and replaced by more informative templated HTML.
  • The commonly used private methods _keys and _optimize have been replaced with the Dask collection interface (see below).
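For the first item, the change is just an import path; a minimal sketch (get_sync is one of the schedulers that lived in that module):

# Before (deprecated, now removed):
# from dask.async import get_sync

# After:
from dask.local import get_sync

# A tiny task graph, evaluated with the synchronous scheduler.
result = get_sync({'x': 1, 'y': (lambda v: v + 1, 'x')}, 'y')
print(result)  # 2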

Dask collection interface

It is now easier to implement custom collections using the Dask collection interface.

Dask collections (arrays, dataframes, bags, delayed) interact with Dask schedulers (single-machine, distributed) with a few internal methods. We formalized this interface into protocols like .__dask_graph__() and .__dask_keys__() and have published that interface. Any object that implements the methods described in that document will interact with all Dask scheduler features as a first-class Dask object.

class MyDaskCollection(object):
    def __dask_graph__(self):
        ...

    def __dask_keys__(self):
        ...

    def __dask_optimize__(self, ...):
        ...

    ...

This interface has already been implemented within the XArray project for labeled and indexed arrays. Now all XArray classes (DataSet, DataArray, Variable) are fully understood by all Dask schedulers. They are as first-class as dask.arrays or dask.dataframes.

import xarray as xa
from dask.distributed import Client

client = Client()

ds = xa.open_mfdataset('*.nc', ...)

ds = client.persist(ds)  # XArray objects integrate seamlessly with Dask schedulers

Work on Dask’s collection interfaces was primarily done by Jim Crist.

Bandwidth and Tornado 5 compatibility

Dask is built on the Tornado library for concurrent network programming. In an effort to improve inter-worker bandwidth on exotic hardware (Infiniband), Dask developers are proposing changes to Tornado’s network infrastructure.

However, in order to use these changes Dask itself needs to run on the next version of Tornado in development, Tornado 5.0.0, which breaks a number of interfaces on which Dask has relied. Dask developers have been resolving these and we encourage other PyData developers to do the same. For example, neither Bokeh nor Jupyter work on Tornado 5.0.0-dev.

Dask inter-worker bandwidth is peaking at around 1.5-2GB/s on a network theoretically capable of 3GB/s. GitHub issue: pangeo #6

Dask worker bandwidth

Network performance and Tornado compatibility are primarily being handled by Antoine Pitrou.

Parquet Compatibility

Dask.dataframe can use either of the two common Parquet libraries in Python, Apache Arrow and Fastparquet. Each has its own strengths and its own base of users who prefer it. We’ve significantly extended Dask’s parquet test suite to cover each library, improving roundtrip compatibility. Notably, you can now both read and write with PyArrow.

df.to_parquet('...', engine='fastparquet')
df = dd.read_parquet('...', engine='pyarrow')

There is still work to be done here. The variety of parquet reader/writers and conventions out there makes completely solving this problem difficult. It’s nice seeing the various projects slowly converge on common functionality.

This work was jointly done by Uwe Korn, Jim Crist, and Martin Durant.

Retrying Tasks

One of the most requested features for the Dask.distributed scheduler is the ability to retry failed tasks. This is particularly useful to people using Dask as a task queue, rather than as a big dataframe or array.

future = client.submit(func, *args, retries=5)

Task retries were primarily built by Antoine Pitrou.

Transactional Work Stealing

The Dask.distributed task scheduler performs load balancing through work stealing. Previously this would sometimes result in the same task running simultaneously in two locations. Now stealing is transactional, meaning that it will avoid accidentally running the same task twice. This behavior is especially important for people using Dask tasks for side effects.

It is still possible for the same task to run twice, but now this only happens in more extreme situations, such as when a worker dies or a TCP connection is severed, neither of which are common on standard hardware.

Transactional work stealing was primarily implemented by Matthew Rocklin.

New Diagnostic Pages

There is a new set of diagnostic web pages available in the Info tab of the dashboard. These pages provide more in-depth information about each worker and task, but are not dynamic in any way. They use Tornado templates rather than Bokeh plots, which means that they are less responsive but are much easier to build. This is an easy and cheap way to expose more scheduler state.

Task page of Dask's scheduler info dashboard

Nested compute calls

Calling .compute() within a task now invokes the same distributed scheduler. This enables writing more complex workloads with less thought to starting worker clients.

import dask
from dask.distributed import Client
client = Client()  # only works for the newer scheduler

@dask.delayed
def f(x):
    ...
    return dask.compute(...)  # can call dask.compute within delayed task

dask.compute([f(i) for ...])

Nested compute calls were primarily developed by Matthew Rocklin and Olivier Grisel.

More aggressive Garbage Collection

The workers now explicitly call gc.collect() at various times when under memory pressure and when releasing data. This helps to avoid some memory leaks, especially when using Pandas dataframes. Doing this carefully proved to require a surprising degree of detail.
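The general idea, as a toy sketch of my own rather than Dask's actual logic (the psutil dependency and the 70% threshold are arbitrary choices for illustration):

import gc
import os
import psutil

def maybe_collect(threshold_fraction=0.7):
    """Run an explicit GC pass if this process is using a lot of memory."""
    process = psutil.Process(os.getpid())
    used_fraction = process.memory_info().rss / psutil.virtual_memory().total
    if used_fraction > threshold_fraction:
        gc.collect()

maybe_collect()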

Improved garbage collection was primarily implemented and tested by Fabian Keller and Olivier Grisel, with recommendations by Antoine Pitrou.

Dask-ML

A variety of Dask Machine Learning projects are now being assembled under one unified repository, dask-ml. We encourage users and researchers alike to read through that project. We believe there are many useful and interesting approaches contained within.

The work to assemble and curate these algorithms is primarily being handled by Tom Augspurger.

XArray

The XArray project for indexed and labeled arrays is also releasing their major 0.10.0 release this week, which includes many performance improvements, particularly for using Dask on larger datasets.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.3 release on September 24th:

  • Ced4
  • Christopher Prohm
  • fjetter
  • Hai Nguyen Mau
  • Ian Hopkinson
  • James Bourbeau
  • James Munroe
  • Jesse Vogt
  • Jim Crist
  • John Kirkham
  • Keisuke Fujii
  • Matthias Bussonnier
  • Matthew Rocklin
  • mayl
  • Martin Durant
  • Olivier Grisel
  • severo
  • Simon Perkins
  • Stephan Hoyer
  • Thomas A Caswell
  • Tom Augspurger
  • Uwe L. Korn
  • Wei Ji
  • xwang777

The following people contributed to the dask/distributed repository since the 1.19.1 release on September 24th:

  • Alvaro Ulloa
  • Antoine Pitrou
  • chkoar
  • Fabian Keller
  • Ian Hopkinson
  • Jim Crist
  • Kelvin Yang
  • Krisztián Szűcs
  • Matthew Rocklin
  • Mike DePalatis
  • Olivier Grisel
  • rbubley
  • Tom Augspurger

The following people contributed to the dask/dask-ml repository

  • Evan Welch
  • Matthew Rocklin
  • severo
  • Tom Augspurger
  • Trey Causey

In addition, we are proud to announce that Olivier Grisel has accepted commit rights to the Dask projects. Olivier has been particularly active on the distributed scheduler, and on related projects like Joblib, SKLearn, and Cloudpickle.

November 21, 2017 12:00 AM