November 21, 2017

numfocus

My Favourite Tool: Jupyter Notebook

re-posted with permission from Software Carpentry My favourite tool is … the Jupyter Notebook. One of my favourite tools is the Jupyter notebook. I use it for teaching my students scientific computing with Python. Why I like it: Using Jupyter with the plugin RISE, I can create presentations including code cells that I can edit and execute live during […]

by NumFOCUS Staff at November 21, 2017 02:22 PM

Matthieu Brucher

Announcement: ATK Colored Compressor 2.0.0

I’m happy to announce the update to ATK Colored Compressor based on the Audio Toolkit and JUCE. They are available on Windows (AVX compatible processors) and OS X (min. 10.9, SSE4.2) in different formats.

ATK Colored Compressor 2.0.0

The supported formats are:

  • VST2 (32/64 bits on Windows, 32/64 bits on OS X)
  • VST3 (32/64 bits on Windows, 32/64 bits on OS X)
  • Audio Unit (32/64 bits, OS X)

Direct link for ATK Colored Compressor.

The files, as well as the previous plugins and the source code, can be downloaded on SourceForge.


by Matt at November 21, 2017 08:35 AM

Matthew Rocklin

Dask Release 0.16.0

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.16.0. This is a major release with new features, breaking changes, and stability improvements. This blogpost outlines notable changes since the 0.15.3 release on September 24th.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Conda packages are available on both conda-forge and default channels.

Full changelogs are available here:

Some notable changes follow.

Breaking Changes

  • The dask.async module was moved to dask.local for Python 3.7 compatibility. This was previously deprecated and is now fully removed.
  • The distributed scheduler’s diagnostic JSON pages have been removed and replaced by more informative templated HTML.
  • The commonly used private methods _keys and _optimize have been replaced by the Dask collection interface (see below).

Dask collection interface

It is now easier to implement custom collections using the Dask collection interface.

Dask collections (arrays, dataframes, bags, delayed) interact with Dask schedulers (single-machine, distributed) with a few internal methods. We formalized this interface into protocols like .__dask_graph__() and .__dask_keys__() and have published that interface. Any object that implements the methods described in that document will interact with all Dask scheduler features as a first-class Dask object.

class MyDaskCollection(object):
    def __dask_graph__(self):
        ...

    def __dask_keys__(self):
        ...

    def __dask_optimize__(self, *args, **kwargs):
        ...

    ...

This interface has already been implemented within the XArray project for labeled and indexed arrays. Now all XArray classes (Dataset, DataArray, Variable) are fully understood by all Dask schedulers. They are as first-class as dask.arrays or dask.dataframes.

import xarray as xa
from dask.distributed import Client

client = Client()

ds = xa.open_mfdataset('*.nc', ...)

ds = client.persist(ds)  # XArray objects integrate seamlessly with Dask schedulers

Work on Dask’s collection interfaces was primarily done by Jim Crist.

Bandwidth and Tornado 5 compatibility

Dask is built on the Tornado library for concurrent network programming. In an effort to improve inter-worker bandwidth on exotic hardware (Infiniband), Dask developers are proposing changes to Tornado’s network infrastructure.

However, in order to use these changes Dask itself needs to run on the next version of Tornado in development, Tornado 5.0.0, which breaks a number of interfaces on which Dask has relied. Dask developers have been resolving these and we encourage other PyData developers to do the same. For example, neither Bokeh nor Jupyter work on Tornado 5.0.0-dev.

Dask inter-worker bandwidth is peaking at around 1.5-2GB/s on a network theoretically capable of 3GB/s. GitHub issue: pangeo #6

Dask worker bandwidth

Network performance and Tornado compatibility are primarily being handled by Antoine Pitrou.

Parquet Compatibility

Dask.dataframe can use either of the two common Parquet libraries in Python, Apache Arrow and Fastparquet. Each has its own strengths and its own base of users who prefer it. We’ve significantly extended Dask’s parquet test suite to cover each library, extending roundtrip compatibility. Notably, you can now both read and write with PyArrow.

import dask.dataframe as dd

df.to_parquet('...', engine='fastparquet')     # write with one library...
df = dd.read_parquet('...', engine='pyarrow')  # ...read it back with the other

There is still work to be done here. The variety of parquet reader/writers and conventions out there makes completely solving this problem difficult. It’s nice seeing the various projects slowly converge on common functionality.

This work was jointly done by Uwe Korn, Jim Crist, and Martin Durant.

Retrying Tasks

One of the most requested features for the Dask.distributed scheduler is the ability to retry failed tasks. This is particularly useful to people using Dask as a task queue, rather than as a big dataframe or array.

future = client.submit(func, *args, retries=5)
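As a toy illustration (the flaky function and the local Client below are my own placeholders, not from the release notes), the scheduler can be asked to resubmit a failing task automatically:

import random
from dask.distributed import Client

client = Client()  # assumes a scheduler is reachable; here a local one is started

def flaky(x):
    # Placeholder function that fails transiently about a third of the time.
    if random.random() < 0.3:
        raise RuntimeError("transient failure")
    return x + 1

# Each task is retried up to 5 times before its failure is reported.
futures = [client.submit(flaky, i, retries=5) for i in range(10)]
print(client.gather(futures))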

Task retries were primarily built by Antoine Pitrou.

Transactional Work Stealing

The Dask.distributed task scheduler performs load balancing through work stealing. Previously this would sometimes result in the same task running simultaneously in two locations. Now stealing is transactional, meaning that it will avoid accidentally running the same task twice. This behavior is especially important for people using Dask tasks for side effects.

It is still possible for the same task to run twice, but now this only happens in more extreme situations, such as when a worker dies or a TCP connection is severed, neither of which are common on standard hardware.

Transactional work stealing was primarily implemented by Matthew Rocklin.

New Diagnostic Pages

There is a new set of diagnostic web pages available in the Info tab of the dashboard. These pages provide more in-depth information about each worker and task, but are not dynamic in any way. They use Tornado templates rather than Bokeh plots, which means that they are less responsive but are much easier to build. This is an easy and cheap way to expose more scheduler state.

Task page of Dask's scheduler info dashboard

Nested compute calls

Calling .compute() within a task now invokes the same distributed scheduler. This enables writing more complex workloads with less thought to starting worker clients.

import dask
from dask.distributed import Client
client = Client()  # only works for the newer scheduler

@dask.delayed
def f(x):
    ...
    return dask.compute(...)  # can call dask.compute within delayed task

dask.compute([f(i) for i in ...])  # '...' stands in for your own iterable of inputs

Nested compute calls were primarily developed by Matthew Rocklin and Olivier Grisel.

More aggressive Garbage Collection

The workers now explicitly call gc.collect() at various times when under memory pressure and when releasing data. This helps to avoid some memory leaks, especially when using Pandas dataframes. Doing this carefully proved to require a surprising degree of detail.
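As a cartoon of the idea (this is not Dask's actual logic; the threshold and the psutil-based check are my own placeholders):

import gc

import psutil

def maybe_collect(threshold=0.8):
    # Placeholder illustration: force a full collection when this
    # process's resident memory exceeds a fraction of total system RAM.
    rss = psutil.Process().memory_info().rss
    if rss > threshold * psutil.virtual_memory().total:
        gc.collect()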

Improved garbage collection was primarily implemented and tested by Fabian Keller and Olivier Grisel, with recommendations by Antoine Pitrou.

Dask-ML

A variety of Dask Machine Learning projects are now being assembled under one unified repository, dask-ml. We encourage users and researchers alike to read through that project. We believe there are many useful and interesting approaches contained within.

The work to assemble and curate these algorithms is primarily being handled by Tom Augspurger.

XArray

The XArray project for indexed and labeled arrays is also releasing their major 0.10.0 release this week, which includes many performance improvements, particularly for using Dask on larger datasets.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.3 release on September 24th:

  • Ced4
  • Christopher Prohm
  • fjetter
  • Hai Nguyen Mau
  • Ian Hopkinson
  • James Bourbeau
  • James Munroe
  • Jesse Vogt
  • Jim Crist
  • John Kirkham
  • Keisuke Fujii
  • Matthias Bussonnier
  • Matthew Rocklin
  • mayl
  • Martin Durant
  • Olivier Grisel
  • severo
  • Simon Perkins
  • Stephan Hoyer
  • Thomas A Caswell
  • Tom Augspurger
  • Uwe L. Korn
  • Wei Ji
  • xwang777

The following people contributed to the dask/distributed repository since the 1.19.1 release on September 24th:

  • Alvaro Ulloa
  • Antoine Pitrou
  • chkoar
  • Fabian Keller
  • Ian Hopkinson
  • Jim Crist
  • Kelvin Yang
  • Krisztián Szűcs
  • Matthew Rocklin
  • Mike DePalatis
  • Olivier Grisel
  • rbubley
  • Tom Augspurger

The following people contributed to the dask/dask-ml repository:

  • Evan Welch
  • Matthew Rocklin
  • severo
  • Tom Augspurger
  • Trey Causey

In addition, we are proud to announce that Olivier Grisel has accepted commit rights to the Dask projects. Olivier has been particularly active on the distributed scheduler, and on related projects like Joblib, SKLearn, and Cloudpickle.

November 21, 2017 12:00 AM

November 20, 2017

numfocus

Hackseq17: Canada’s Genomics Hackathon for Open Science

This post contributed by Jake Lever, a PhD student in the University of British Columbia’s Bioinformatics program. NumFOCUS was pleased to provide funding support to the hackseq17 hackathon. hackseq17: Canada’s genomics hackathon hackseq17: Canada’s genomics hackathon was held at the University of British Columbia (UBC) in late October. This event brought together a diverse set […]

by NumFOCUS Staff at November 20, 2017 06:00 PM

November 17, 2017

numfocus

Theano and the Future of PyMC

This is a guest post by Christopher Fonnesbeck of PyMC3. — PyMC, now in its third iteration as PyMC3, is a project whose goal is to provide a performant, flexible, and friendly Python interface to Bayesian inference.  The project relies heavily on Theano, a deep learning framework, which has just announced that development will not be […]

by NumFOCUS Staff at November 17, 2017 05:25 PM

November 15, 2017

numfocus

Quantopian commits to fund pandas as a new NumFOCUS Corporate Partner

NumFOCUS welcomes Quantopian as our first Emerging Leader Corporate Partner, a partnership for small but growing companies who are leading by providing fiscal support to our open source projects. — Quantopian Supports Open Source by John Fawcett, CEO and founder of Quantopian       It all started with a single tweet. While scrolling through […]

by NumFOCUS Staff at November 15, 2017 12:03 AM

November 14, 2017

Matthieu Brucher

Audio Toolkit: Performance on the IIR SIMD filters

In release 2.2.0, ATK gained new EQ filters that are vectorized. These filters cannot be used to filter different bands from the same input signal (yet), but they can be used to filter several channels in the same way.

The question is whether this is really faster than filtering several independent channels, so I've set up a test case with a SIMD solution.

The first file has five test cases. Four of them use a vectorized DF1 with different inputs; the fifth is the TDF2 equivalent of one of these cases. The second file has fully vectorized DF1 and TDF2 test cases. It uses a dispatcher to select the best implementation according to the CPU it runs on.

So the question is: which one of these is the fastest?

I ran the application on Linux through valgrind, after compiling it with gcc 7 and no specific instruction set. Here are the results:

EQ with no specific instruction set

I’ve only selected the process calls, and it is obvious the timings are dominated by two things:

  • The conversion of the individual signals to the SIMD signal
  • The actual EQ processing

Let’s compare first the non SIMD versions. The DF1 versions spend 13.9 million cycles, when the TDF2 only 3.6, but for only for one channel, so that’s 14.4 million cycles. The DF1 has dedicated SIMD lines that makes it faster that the TDF2 version.

On the SIMD side, things are different. The DF1 version is almost twice as slow as the TDF2 version, and the TDF2 version itself is only slightly slower than the non-SIMD version when taking the conversion times into account (there are probably things I need to do to optimize there!).

When the non-SIMD filters are compiled with the AVX2 instruction set, some of them get faster:

EQ with AVX2 instruction set

The SIMD filters were not supposed to get faster, as their code is still exactly the same. The non-SIMD DF1 is now 20% faster, and the TDF2 stays the same.

The conclusions are simple: the SIMD TDF2 is good, but the framework around it still needs to get better. The non-SIMD TDF2 filters will require care to get faster. By making this one faster, perhaps I'll be able to make the SIMD version faster as well!

There are more and more SIMD filters in ATK; let me know what you think of this effort.


by Matt at November 14, 2017 08:06 AM

November 09, 2017

Jake Vanderplas

Exploring Line Lengths in Python Packages

This week, Twitter upped their single-tweet character limit from 140 to 280, purportedly based on this interesting analysis of tweet lengths published on Twitter's engineering blog. The gist of the analysis is this: English language tweets display a roughly log-normal distribution of character counts, except near the 140-character limit, at which the distribution spikes:

The analysis takes this as evidence that Twitter users often "cram" their longer thoughts into the 140-character limit, and suggests that a 280-character limit would more naturally accommodate the distribution of people's desired tweet lengths.

This immediately brought to mind another character limit that many Python programmers face in their day-to-day lives: the 79-character line limit suggested by Python's PEP8 style guide:

Limit all lines to a maximum of 79 characters.

I began to wonder whether popular Python packages (e.g. NumPy, SciPy, Pandas, Scikit-Learn, Matplotlib, AstroPy) display anything similar to what is seen in the distribution of tweet lengths.
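As a rough sketch of how one might check (my own illustration, not the code from the post; any installed pure-Python package can stand in for numpy here):

import collections
import os

import numpy  # stand-in: any installed package with Python sources will do

lengths = collections.Counter()
for root, dirs, files in os.walk(os.path.dirname(numpy.__file__)):
    for name in files:
        if name.endswith('.py'):
            with open(os.path.join(root, name), errors='ignore') as f:
                for line in f:
                    lengths[len(line.rstrip('\n'))] += 1

# crude text histogram around the 79-character PEP8 limit
for n in range(70, 85):
    print('%3d %s' % (n, '#' * (lengths[n] // 100)))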

Spoiler alert: they do! And the details of the distribution reveal some insights into the programming habits and stylistic conventions of the communities who write them.

by Jake VanderPlas at November 09, 2017 10:00 PM

November 08, 2017

Titus Brown

How specific are k-mers for taxonomic assignment of microbes, anyway?

I've been on a bit of a k-mers and taxonomy kick lately, as readers may have seen. Now that I can parse "free" taxonomies from sources other than NCBI, I decided the code was well-enough baked to put it into sourmash. So, over the last week I started integrating the lowest common ancestor code and taxonomy parsing code into sourmash; once that pull request is merged, sourmash will have lca index, lca classify, lca summarize, and lca rankinfo commands - see the tutorial for more information on how to run this stuff.

To celebrate the addition of all this code, I thought I'd address the question, "how specific are k-mers?" That is, if we reach into a bucket of metagenome sequence and pick out a k-mer that is identified with a particular species, how certain can we be of that identification (vs. it actually being a different species in the same genus, family, order, etc.)?

This question comes up routinely in discussions and now I have the information to answer it, at least in the context of all known microbial genomes! (The answer will, of course, be very biased by what's not in our databases.)

After downloading the genbank LCA database, I ran the command:

sourmash lca rankinfo genbank-k31.lca.json.gz

and after a little data munging got this:

superkingdom: 38077 (0.4%)
phylum: 24627 (0.3%)
class: 39306 (0.4%)
order: 66423 (0.7%)
family: 97908 (1.1%)
genus: 1103223 (12.0%)
species: 7818876 (85.1%)

That is, 85% of the 31-mers in genbank are specific to the species level; 12% are specific to the genus level; and the remaining ~3% are family or above.
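(The munging is nothing deep; it just divides each rank's k-mer count by the total. A throwaway sketch, with the counts copied from the output above; the raw rankinfo output format may differ:)

counts = {
    'superkingdom': 38077,
    'phylum': 24627,
    'class': 39306,
    'order': 66423,
    'family': 97908,
    'genus': 1103223,
    'species': 7818876,
}

total = sum(counts.values())
for rank, n in counts.items():
    print('%s: %d (%.1f%%)' % (rank, n, 100. * n / total))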

I would argue this makes genus level assignments pretty reliable, when using k-mers. (It doesn't say much about what you're missing, of course.)

--titus

by C. Titus Brown at November 08, 2017 11:00 PM

November 07, 2017

Continuum Analytics

Utilizing the New Compilers in Anaconda Distribution 5

Part of what made the recent release of Anaconda Distribution 5 so exciting was our switch from OS-provided compiler tools to our own Anaconda toolsets. This change has allowed us to make major steps forward in the capabilities of our compilers, specifically regarding security and performance. In this post, we’ll show you how to use …
Read more →

by Rory Merritt at November 07, 2017 04:45 PM

November 03, 2017

numfocus

SciPy 1.0 — 16 Years in the Making

Congratulations, SciPy! SciPy—a NumFOCUS Affiliated Project—recently crossed a major milestone for any open source project: version 1.0! NumFOCUS extends our hearty congratulations to all of the SciPy contributors and community members who helped get the project to this point. “SciPy the library has been a cornerstone for the scientific Python community. By providing a consistent […]

by NumFOCUS Staff at November 03, 2017 03:59 PM

Matthew Rocklin

Optimizing Data Structure Access in Python

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation

Last week at PyCon DE I had the good fortune to meet Stefan Behnel, one of the core developers of Cython. Together we worked to optimize a small benchmark that is representative of Dask’s central task scheduler, a pure-Python application that is primarily data structure bound.

Our benchmark is a toy problem that creates three data structures that index each other with dictionaries, lists, and sets, and then does some simple arithmetic. (You don’t need to understand this benchmark deeply to read this article.)

import random
import time

nA = 100
nB = 100
nC = 100

A = {'A-%d' % i: ['B-%d' % random.randint(0, nB - 1)
                  for i in range(random.randint(0, 5))]
     for i in range(nA)}

B = {'B-%d' % i: {'C-%d' % random.randint(0, nC - 1)
                  for i in range(random.randint(1, 3))}
     for i in range(nB)}

C = {'C-%d' % i: i for i in range(nC)}

data = ['A-%d' % i for i in range(nA)]


def f(A, B, C, data):
    for a_key in data:
        b_keys = A[a_key]
        for b_key in b_keys:
            for c_key in B[b_key]:
                C[c_key] += 1


start = time.time()

for i in range(10000):
    f(A, B, C, data)

end = time.time()

print("Duration: %0.3f seconds" % (end - start))

$ python benchmark.py
Duration: 1.12 seconds

This is an atypical Python optimization problem because it is primarily bound by data structure access (dicts, lists, sets), rather than numerical operations commonly optimized by Cython (nested for loops over floating point arithmetic). Python is already decently fast here, typically within a factor of 2-5x of compiled languages like Java or C++, but still we'd like to improve this when possible.

In this post we combine two different methods to optimize data-structure bound workloads:

  1. Compiling Python code with Cython with no other annotations
  2. Interning strings for more rapid dict lookups

Finally at the end of the post we also run the benchmark under PyPy to compare performance.

Cython

First we compile our Python code with Cython. Normally when using Cython we annotate our variables with types, giving the compiler enough information to avoid using Python altogether. However in our case we don’t have many numeric operations and we’re going to be using Python data structures regardless, so this won’t help much. We compile our original Python code without alteration.

cythonize -i benchmark.py

And run

$ python -c "import benchmark"
Duration: 0.73 seconds

This gives us a decent speedup from 1.1 seconds to 0.73 seconds. This isn’t huge relative to typical Cython speedups (which are often 10-100x) but would be a very welcome change for our scheduler, where we’ve been chasing 5% optimizations for a while now.

Interning Strings

Our second trick is to intern strings. This means that we try to always have only one copy of every string. This improves performance when doing dictionary lookups because of the following:

  1. Python computes the hash of the string only once (strings cache their hash value once computed)
  2. Python checks for object identity (fast) before moving on to value equality (slow)

Or, anecdotally, text is text is faster in Python than text == text. If you ensure that there is only one copy of every string then you only need to do identity comparisons like text is text.

So, if any time we see a string "abc" it is exactly the same object as all other "abc" strings in our program, then string-dict lookups will only require a pointer/integer equality check, rather than having to do a full string comparison.
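A quick way to convince yourself of the identity fast path (my own micro-check, not part of the original benchmark; absolute timings will vary by machine):

import timeit

a = 'x' * 100000
b = 'x' * 99999 + 'x'   # equal value, but a distinct object
c = a                   # the very same object

# Comparing an object to itself short-circuits on identity;
# equal-but-distinct strings fall back to a full character comparison.
print(timeit.timeit(lambda: a == c, number=100000))
print(timeit.timeit(lambda: a == b, number=100000))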

Adding string interning to our benchmark looks like the following:

inter = {}

def intern(x):
    try:
        return inter[x]
    except KeyError:
        inter[x] = x
        return x

A = {intern('A-%d' % i): [intern('B-%d' % random.randint(0, nB - 1))
                          for i in range(random.randint(0, 5))]
     for i in range(nA)}

B = {intern('B-%d' % i): {intern('C-%d' % random.randint(0, nC - 1))
                          for i in range(random.randint(1, 3))}
     for i in range(nB)}

C = {intern('C-%d' % i): i for i in range(nC)}

data = [intern('A-%d' % i) for i in range(nA)]

# The rest of the benchmark is as before

This brings our duration from 1.1s down to 0.75s. Note that this is without the separate Cython improvements described just above.

Cython + Interning

We can combine both optimizations. This brings us to around 0.45s, a 2-3x improvement over our original time.

cythonize -i benchmark2.py

$ python -c "import benchmark2"
Duration: 0.46 seconds

PyPy

Alternatively, we can just run everything in PyPy.

$ pypy3 benchmark1.py  # original
Duration: 0.25 seconds

$ pypy3 benchmark2.py  # includes interning
Duration: 0.20 seconds

So PyPy can be quite a bit faster than Cython on this sort of code (which is not a big surprise). Interning helps a bit, but not quite as much.

This is fairly encouraging. The Dask scheduler can run under PyPy even while Dask clients and workers run under normal CPython (for use with the full PyData stack).

Preliminary Results on Dask Benchmark

We started this experiment with the assumption that our toy benchmark somehow represented Dask's scheduler in terms of performance characteristics. This assumption, of course, is false. The Dask scheduler is significantly more complex and it is difficult to build a single toy example to represent its performance.

When we try these tricks on a slightly more complex benchmark that actually uses the Dask scheduler we find the following results:

  • Cython: almost no effect
  • String Interning: almost no effect
  • PyPy: almost no effect

However I have only spent a brief amount of time on this (twenty minutes?) and so I hope that the lack of a performance gain here is due to lack of effort.

If anyone is interested in this, I hope that this blogpost contains enough information to get them started on further investigation.

November 03, 2017 12:00 AM

November 02, 2017

Continuum Analytics

AnacondaCON 2018: Anaconda Opens Registration and Call for Speakers for Second Annual User Conference

Four-day event brings together data science community to discover trends in data science, connect with thought leaders and learn all things Python AUSTIN, TX—November 2, 2017—Anaconda, Inc., the most popular Python data science platform provider, today announced that registration is now open for AnacondaCON 2018, taking place April 8-11, 2018 in Austin, Texas. The company …
Read more →

by Team Anaconda at November 02, 2017 03:07 PM

November 01, 2017

Enthought

Webinar: Machine Learning Mastery Workshop: An Exclusive Peek “Under the Hood” of Enthought Training

What: A guided walkthrough and live Q&A about Enthought’s new “Machine Learning Mastery Workshop” training course.

Who Should Attend: If predictive modeling and analytics would be valuable in your work, come to the webinar to find out what all the fuss is about and what there is to know. Whether you are looking to get started with machine learning, interested in refining your machine learning skills, or want to transfer your skills from another toolset to Python, come to the webinar to find out if Enthought’s highly interactive, expertly taught Machine Learning Mastery Workshop might be a good fit for accelerating your development!

Register


Why Has Machine Learning Become So Popular?

Artificial Intelligence and Machine Learning are a defining feature of the 21st century and are quickly becoming a key factor in gaining and maintaining competitive advantage in every industry that incorporates them. Why is machine learning so beneficial? Because it provides a fast and flexible way to build models that can surface signal, find patterns, and predict future behavior. These powerful models are used for:

  • Forecasting supply chain availability
  • Clustering product defects for QA
  • Anticipating movements in financial markets
  • Predicting chemical tolerances
  • Optimizing the placement of advertisements
  • Managing process engineering
  • Modeling reservoir production
  • and much more.

In response to growing demand for Machine Learning expertise, Enthought has developed an intensive 3-day guided practicum to bring you up to speed quickly on key concepts and skills in this exciting realm. Join us in this webinar for an in-depth overview of Enthought’s Machine Learning Mastery Workshop — a training course designed to accelerate the development of intuition, skill, and confidence in applying machine learning methods to solve real-world problems.

In the webinar we’ll describe how Enthought’s training course combines conceptual knowledge of machine learning models with intensive experience applying them to real-world data to develop skill in applying Python’s machine learning tools, such as the scikit-learn package, to make predictions about complicated phenomena by leveraging the information contained in numerical data, natural language, 2D images, and discrete categories.

The hands-on, interactive course was created ground up by our training experts to enable you to develop transferable skills in Machine Learning that you can apply back at work the next day.

In this webinar, we’ll give you the key information and insight you need to quickly evaluate whether Enthought’s Machine Learning Mastery Workshop course is the right solution for you to build skills in using Python for advanced analytics, including:

  • Who will benefit most from the course, and what pre-requisite knowledge is required
  • What topics the course covers – a guided tour
  • What new knowledge, skills, and capabilities you’ll take away, and how the course design supports those outcomes
  • What the (highly interactive) learning experience is like
  • Why this course is different from other training alternatives (with a preview of actual course materials!)
  • What previous workshop attendees say about our courses

Register


Presenter: Dr. Dillon Niederhut,

Enthought Training Instructor

Ph.D., University of California at Berkeley

 


 

Additional Resources

Upcoming Open Machine Learning Mastery Workshop Sessions:

Austin, TX, Feb. 21-23, 2018
Houston, TX, Apr. 18-20, 2018
Cambridge, UK, May 9-11, 2018

Upcoming Open Python for Data Science Sessions:

New York City, NY, Dec. 4-8, 2017
London, UK, Feb. 19-23, 2018
Washington, DC, Apr. 23-27, 2018
San Jose, CA, May 14-18, 2018

Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.

Download Enthought’s Machine Learning with Python’s Scikit-Learn Cheat Sheets

Enthought's Machine Learning with Python Cheat Sheets

Additional Webinars in the Training Series:

Python for MATLAB Users: What You Need to Know

Python for Scientists and Engineers: A Tour of Enthought’s Professional Technical Training Course

Python for Data Science: A Tour of Enthought’s Professional Technical Training Course

Python for Professionals: The Complete Guide to Enthought’s Technical Training Courses

An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop


by admin at November 01, 2017 07:01 PM

October 31, 2017

numfocus

NumFOCUS welcomes Jim Weiss, Events Coordinator

NumFOCUS is pleased to announce Jim Weiss has been hired as our new Events Coordinator, bringing over seven years of event management experience. Prior to joining NumFOCUS, Jim worked in Washington, D.C., coordinating congressional hearings for the U.S. House of Representatives and managing events for the U.S. Air Force. Jim hails from Pennsylvania and graduated […]

by NumFOCUS Staff at October 31, 2017 09:08 PM

October 30, 2017

Titus Brown

For want of $2.41... some background on reimbursements.

A week ago, I submitted a reimbursement request for about $2200 - I'd just come back from a pretty extensive east coast trip, visiting five different institutions (including the FBI!) on various grant-funded projects. The reimbursement ended up going on five different grants, and involved a fair amount of explanatory paperwork (e.g. I'd scheduled a meeting for project B because I had to be near there for grant A; how do you allocate reimbursements for the plane flight back? Angels on the heads of pins stuff.)

On Monday, that reimbursement got kicked back to me because I'd misidentified a $2.41 purchase of milk at a convenience store as "miscellaneous supplies" rather than "food". This occasioned a snarky tweet on my part about how annoying the reimbursement process is.

As these things do, my tweet got interpreted by some people as I intended - "ugh, paperwork is annoying and frustrating" - and by some others as me being an entitled git who thought $2.41 was worth reimbursing. While I would characterize some of the responses on the latter point as not very thoughtful, I think there are some interesting points to be discussed in response. So here are my thoughts!


Reimbursements are frustrating for most academics I know, and (as always) the burden falls more on junior people than on senior people. At least two people from my lab (i.e. junior to me) responded to my tweet with their own complaints about their reimbursements being kicked back, for similar reasons. So this issue of "blind bureaucracy", as someone put it, affects others - mostly people junior to me.

If I'd known the $2.41 charge was going to be a problem, I'd probably have just eaten (hah!) it. But you never know what bit of paperwork you'll fail to fill out properly - it could have just as easily been the $400 flight back to California. Regardless, the effect would be the same: until I fill the paperwork out right for the $2.41, I don't get any of my $2200 back.

Why even bother putting the $2.41 on the reimbursement? Honestly, I put all my receipts in my shoulder bag and then mechanically scan them in and upload them into the system. This is because anything I don't have a receipt for is essentially invisible and I usually end up forgoing the reimbursement process for it. I also cannot reimburse late fees or interest charges due to slow filing (or slow approval) of expenses. So on top of the stuff that I can't reimburse (like cookies and apples for the lab) I've probably lost about $1000 this year alone, and conservatively estimate my overall burn for the last 10 years as between $5000-10,000. Even small things add up! So I have A Process to minimize this loss.

In this case, all of this was on my personal card, so I didn't need to put the $2.41 into the system. But since I'm behind on other reimbursements from the summer institute I ran, my travel card is not valid at the moment. If I'd been using my travel card, the $2.41 could have gone on there and I would have had to put in the receipt because every transaction is entered into the system and demands a receipt. That's why I upload all the receipts reflexively as part of My Process :).

It turns out I can bear $2200 for a month (although this month was actually about $4000, if you include the Binder workshop), since I'm (a) reasonably well paid and (b) partnered with someone who is reasonably well paid and (c) neither of us have any lingering student loans. This is not necessarily true of people in my lab, which is one reason why I have another $1,200 sitting on my personal card: UC Davis can't pay for some things in advance, and some things can't be put on travel cards, so even people in my lab who have travel cards are occasionally asked to sit on thousands of dollars of credit card charges. In two recent situations I simply paid for it in advance myself and am waiting for the people in question to get reimbursed so they can pay me back.

I know many faculty who do variants on this - they outright pay for things, or have an in-lab slush fund, or otherwise help out with money for their students and postdocs.

Of course, many other faculty - especially the junior faculty, and/or the faculty with young kids, and/or the faculty with non-working spouses, and/or the multitude of reasons why people don't have any spare cash - cannot do this, and often there's no recourse for these people or their labs, as far as I can tell. Every institution I've been at has had different rules, all of them somewhat arbitrary and bureaucratic, for who gets travel or purchasing cards, what gets reimbursed when, and whether or not travel advances are allowed. It's often super hard to know what rules are there and how to exploit them in the service of your lab - an underappreciated aspect of professordom, I think.

One person brought up the massive unfairness inherent in the situation. Why am I even getting anything reimbursed when there are starving grad students to be fed?? Well, each of the five funding sources I'm using for reimbursement for this (and my students and postdocs) is money that I received in the service of accomplishing a specific goal. Much of that required this travel (e.g. I did the osfclient grant closeout on this trip, and that had to be done in person). I don't think it's reasonable to ask me to pay for that travel myself - or, at least, if the grant came with that expectation, I'd be a lot more hesitant about taking on the work. And it's certainly not ok for the grad students to pay for their own travel. So the work that the funding agencies wanted done wouldn't get done. And we wouldn't have the money in the first place if there wasn't a way to pay for the things that needed doing. So I'm definitely not taking money away from graduate students who might otherwise get reimbursed were it not for my Scrooge-like ways.

Why talk about this, anyway?

Travel and conference costs have been cited on my Twitter feed a few times in just the past few months as one of the big challenges for diversity in research. If you don't have much money, being asked to spot $1000 (or lots more) for plane flights and hotel rooms is essentially impossible. Even students who aren't massively in debt balk at this! And since many underrepresented minorities don't have lots of money to start with, this burden falls disproportionately on them. That's actually one of the problematic aspects of structural inequality, as I understand it - the small stuff that falls on the privileged hits the underprivileged with considerably more weight.

Interestingly, the message "look, just cover the small stuff yourself if you're a professor" is arguably the wrong one to send here, because it assumes a lot about the situation of the professor. (I know professors who are far more cash poor than their grad students.) I think a far better message is to encourage institutions to pay for more stuff up front.

I've also noticed that many people less fortunate than me (or simply more careful than me) are hesitant to spend money on good things because of the extra burden of reimbursement. For example: one of the unpublicized success stories of our workshops is that frequently we provide free food and coffee as part of the workshop, which really makes for a positive learning atmosphere. But for long-running or big workshops, this can really add up - this summer I was carrying a $7,000 tab for a few weeks until I got the big reimbursements through the system. Most faculty would simply not do this, I think, because of the time and cost involved (several have expressed amazement at my willingness to do this). But it's something that really helps our workshops be successful, thus it's worth it to me.

So overall I believe the academy is poorer - in diversity, and in activities - because there is this expectation of reimbursement for activities, rather than up-front payment.

I have to say that our bureaucracy doesn't really care about any of this. The rules are put in place and enforced by people who care about financial compliance, and not getting sued by the federal or state government for bad accounting. It takes champions - generally faculty champions - to advocate for any kind of human-centric change, and frankly driving change within large bureaucratic institutions is not often very effective or at least requires massive and focused effort. So I'm not currently charging the castle trying to make the reimbursement rules at UC Davis more humane; I've got other battles to fight that are more important to me (and that I think are more likely to serve my aspirations in outreach and diversity). The most I feel I can do is act within my sphere of influence to help those I'm directly responsible for. So that's what I do.

(Not to mention, the cost involved in spending so many person-hours on reviewing every jot and tittle of reimbursement requests must be significant. There's gotta be a better way!)


Awareness of the larger context and impact of a painful reimbursement process is important, though, and hopefully this is at least mildly informative :).

And the situation could definitely use some fixing.

I'd be really interested in hearing what academic institutions do well in this space. I know that some institutions are pretty flexible about travel advances, and others give travel cards to anyone who wants them. Some institutions pay for all big travel costs in advance (airfare, hotel rooms). I've heard of a few that even offer debit cards for job interviewees. I'd be interested in advocating for this kind of thing at UC Davis, but I'm not sure what actually works well for people and institutions. Pointers and tips welcome!

--titus

by C. Titus Brown at October 30, 2017 11:00 PM

Continuum Analytics

Getting Started with GPU Computing in Anaconda

Anaconda Distribution makes it easy to get started with GPU computing with several GPU-enabled packages that can be installed directly from our package repository. In this blog post, we’ll give you some pointers on where to get started with GPUs in Anaconda Distribution.

by Sheyna Webster at October 30, 2017 04:56 PM

October 29, 2017

Titus Brown

The 2017 binder workshop!

tl;dr? We ran a workshop on binder. It was fun!

workshop attendee photo

What is binder?

Imagine... that you are visiting the data repository for a preprint you are reviewing, and with the click of a button you are brought to a fully configured RStudio Server containing that data.

Imagine... you are running a workshop, and you want to introduce everyone in the workshop to a machine-learning approach. You give them all the same URL, and within seconds everyone in the room is looking at their own live environment, copied from your blueprint but individually modifiable and exportable.

Imagine... your lab has a collection of standard data analysis protocols in Jupyter Notebooks on your GitHub site, and anyone in your lab can, with a single click, bring them to life and run them on a new data set.

Binder is a concept and technology that makes all of the above, and more, tantalizingly close to everyday realization! The techie version is this: currently,

  • upon Web request, binder grabs a GitHub repository, inspects it, and builds a custom Docker image based on a variety of configuration detection;

  • then, binder spins up a Docker container and redirects the Web browser to that repo;

  • at some point, binder detects lack of activity and shuts down the container.

All of this is (currently) done without authentication or payment of any kind, which makes it a truly zero configuration/single-click experience for the user.

Just as important, the binder infrastructure is meant to be widely distributed, reusable, and hackable open source tech that supports multiple deployments and customization!

The workshop!

In 2016, I wrote a proposal to fund a workshop on binder to the Sloan Foundation and it was funded!! We finally ran the workshop last week, with the following organizing committee:

Why a workshop?

Many people, including myself, see massive potential in binder, but it is still young. The workshop was intended to explore possible technical directions for binder's evolution, build community around the binder ecosystem, and explore issues of sustainability.

One particular item that came up early on in the workshop was that there are many possible integration points for binder into current data and compute infrastructure providers. That's great! But, in the long term, we also need to plan for the current set of endeavors failing or evolving, so we should be building a community around the core binder concepts and developing de facto standards and practice. This will allow us to evolve along with those endeavors as well as find new partners.

So that's why we ran a workshop!

Who came to the workshop?

The workshop attendees were a collection of scientists, techies, librarians, and data people. For this first workshop I did my best to reach out to people from a variety of communities - researchers from a variety of disciplines, librarians, trainers, data scientists, programmers, HPC admins, and research infrastructure specialists. In this, we somewhat succeeded! We didn't advertise very widely, partly just because of a last minute time crunch, and also because too many people would have been a problem for the space we had.

As we figure out more of a framework and sales pitch for binder, I expect the set of possible attendees to expand. Still, for hackfest-like workshops, I'm a big fan of small diverse groups of people in a friendly environment.

What is the current state of binder?

The original mybinder.org Web site was created and supported by the Freeman Lab, but maintenance on the site suffered when Jeremy Freeman moved to the Chan-Zuckerberg Initiative and became even busier than before.

The Jupyter folk picked up the binder concept and reimplemented the Web site with somewhat enhanced functionality, building the new BinderHub software in Python around JupyterHub and splitting the repository-to-docker code out into repo2docker. This is now running on a day-to-day basis on a beta site.

A rough breakdown, and links to documentation, follow:

JupyterHub - JupyterHub manages multiple instances of the single-user Jupyter notebook server. JupyterHub can be used to serve notebooks to a class of students, a corporate data science group, or a scientific research group.

Zero-to-JupyterHub - Zero to JupyterHub with Kubernetes is a tutorial to help install and manage JupyterHub.

BinderHub - BinderHub builds "binders" containing data+code from GitHub repos and then serves the binders in a custom computing environment. beta.mybinder.org is a public BinderHub.

repo2docker - repo2docker builds, runs, and pushes Docker images from source code repositories.

Highlights of the binder workshop!

What did we do? We ran things as an unconference, and had a lot of discussions and brainstorming around use cases and the like, with some super cool results. The notes from those are linked below!

A few highlights of the meeting deserve, well, highlighting --

  • Amazingly, we got to the point where binder ran an RStudio Server instance, started from a Jupyter console!! Some tweets of this made the rounds, but it may take a few more weeks for this to make it into production. (This was based on Ryan Lovett's earlier work, which was then hacked on by Carl Boettiger, Yuvi Panda and Aaron Culich at the workshop. I have it on authority that Adelaide Rhodes asked lots of questions by way of encouragement ;).

  • Everyone who attended the workshop got to the point where we had our own BinderHub instance on Google!! (We used these JupyterHub and BinderHub instructions). w00000t! (Session notes)

  • Yuvi Panda gave us a rundown on the data8 / "Foundations of Data Science" course at UC Berkeley, which uses JupyterHub to host several thousand users, with up to 700 concurrent sessions!

We came up with lots of use cases - see ~duplicate set of notes, here.

Other stuff we did at the workshop

(All the notes are on GitHub, here)

Here is a fairly comprehensive list of the other activities at the workshop --

Issues that we only barely touched on:

  • "I have a read only large dataset I want to provide access to for untrusted users, who can do whatever they want but in a safe way." What are good practices for this situation? How do we provide good access without downloading the whole thing?

  • It would be nice to initiate and control (?) Common Workflow Language workflows from binder - see nice Twitter conversation with Michael Crusoe.

  • How do we do continuous integration on notebooks??

  • We need some sort of introspection and badging framework for how reproducible a notebook is likely to be - what are best practices here? Is it "just" a matter of specifying software versions etc and bundling data, or ...??

Far reaching issues and questions --

  • it's likely that the future of binder involves many people running many different binderhub instances. What kind of clever things can we do with federation? Would it be possible for people to run a binder backend "close" to their data and then allow other binderhubs to connect to that, for example?

  • Many issues of publishing workflows, provenance, legality - notes

  • It would be super cool if realtime collaboration was supported by JupyterHub or BinderHub... it's coming, I hear. Soon, one hopes!

Topics we left almost completely untouched:

What's next?

I'm hoping to find money or time to run at least two more hackfests or conferences -- perhaps we can run one in Europe, too.

It would be good to run something with a focus on developing training materials (and/or exemplary notebooks) - see Use Cases, above.

I'm hoping to find support to do some demo integrations with scholarly infrastructure, as in the Imagine... section, above.

If (if) we ran a conference, I could see having some of the following sessions:

  • A hackfest building notebooks
  • A panel on deployment
  • A keynote on the roadmap for binder and JupyterHub
  • Some sort of community fest

If you're interested in any of this, please indicate your interest in future workshops!!

Where to get started with binder

There are lots of example repositories, here:

github.com/binder-examples

you can click "Launch Binder" in any of the READMEs to see examples!


There is a gitter chat channel that is pretty active and good for support: see gitter.im/jupyterhub/binder


And, finally, there is a google groups forum, binderhub-dev

Some other links worth mentioning:

  • nbflow - one-button reproducible workflows with Jupyter Notebook and Scons (see video).
  • papermill - parameterize, execute, and analyze notebooks.
  • dataflow - a kernel to support Python dataflows in the Jupyter Notebook environment.

  • an example of how to use the new postBuild functionality to install jupyter notebook extensions.

aaaand some notes from singularity:

One way to convert docker images to singularity images, using docker2singularity

docker run -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/image:/output \
    --privileged -t --rm singularityware/docker2singularity ubuntu:14.04  

Another way to simply run docker containers in singularity:

singularity exec docker://my/container <runcommand>

The End

I have no particular conclusion other than we'll have to do this again!

--titus

by C. Titus Brown at October 29, 2017 11:00 PM

October 28, 2017

Titus Brown

Classifying genome bins using a custom reference database - maybe this time it'll work?

The story so far:

Trina would like to taxonomically classify new genomes based on a custom classification of old genomes.

Titus wrote some trial code to do this.

Sarah (in Trina's lab) tried the code and realized that it was limited by the NCBI taxonomy, which defeated the whole point of the thing.

Titus retired from the field to consider his options.

My new attempt

We have a new script, classify-free-tax.py.

This script does the following:

  • parses a spreadsheet containing custom taxonomic lineages (i.e. this format);
  • builds a custom taxonomic classification based on the spreadsheet but rooted in NCBI taxonomies;
  • creates a k-mer classifier that combines the k-mers from the custom classification with all of Genbank;
  • applies that k-mer classifier to a set of query genomes.

It's a long, ugly script, but it seems to work OK on my test subsets. Let's give it a try!

Running classify-free-tax.py

Let's reclassify 100 randomly chosen genomes from the Delmont et al., 2017 study, as per this blog post, but using our new code.

Here we're interested in validation, so we're going to use the 100 randomly chosen genomes to classify the 100 randomly chosen genomes and compare the results.

There are two new commands. The first indexes the custom genomes so we can use them in the next classification step:

2017-sourmash-revindex/extract-hashvals-by-sample.py delmont delmont/*.sig

The next command uses the indexed custom genomes, plus the Genbank LCA database, to classify all of the genomes under delmont/:

PYTHONPATH=$PYTHONPATH:2017-sourmash-revindex/ \
    ../classify-free-tax.py tara-delmont-SuppTable3.csv delmont \
    delmont/*.sig -o reclassify.csv

Validating the results

To see how well things worked, we can now run a comparison of the input spreadsheet vs the reclassified spreadsheet using cmp-csv.py:

../cmp-csv.py tara-delmont-SuppTable3.csv reclassify.csv

The output is below. The first lineage line is from the input spreadsheet, and the second lineage line is the re-classification using our software.

tl;dr? Out of 100 genome bins from Delmont et al., we get five disagreements, and it looks like the reclassification is fine.

TARA_ANW_MAG_00051
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Rickettsiales', 'Pelagibacteraceae', '', '']
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Pelagibacterales', 'Pelagibacteraceae', '', '']
--
TARA_IOS_MAG_00076
         ['Eukaryota', '', '', '', '', '', '']
         ['Eukaryota', 'Haptophyta', 'Prymnesiophyceae', 'Isochrysidales', 'Noelaerhabdaceae', 'Emiliania', '']
--
TARA_PSE_MAG_00092
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Rickettsiales', 'Pelagibacteraceae', '', '']
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Pelagibacterales', 'Pelagibacteraceae', '', '']
--
TARA_PSW_MAG_00133
         ['Eukaryota', '', '', '', '', '', '']
         ['Eukaryota', 'Haptophyta', 'Prymnesiophyceae', 'Isochrysidales', 'Noelaerhabdaceae', 'Emiliania', '']
--
TARA_RED_MAG_00001
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Sphingomonadales', 'Erythrobacteraceae', 'Citromicrobium', '']
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Sphingomonadales', 'Sphingomonadaceae', 'Citromicrobium', '']
--

In three of the five, there is an internal disagreement in the taxonomies, e.g. Pelagibacteraceae has been rerooted under Pelagibacterales in our spreadsheet. This is probably due to changes in the NCBI taxonomy between when Delmont et al. did their classification and when I downloaded the NCBI taxonomy a few months ago.

In the other two of the five, Delmont et al. could only classify the bins down to Eukaryota, but our k-mer classifier is telling us that they belong to the genus Emiliania. On inspection (hint: pass the --debug flag to the classify script), approximately 470,000 of the k-mers in TARA_IOS_MAG_00076 belong to genus Emiliania, and 1.41 million of the k-mers in TARA_PSW_MAG_00133 belong to genus Emiliania, so I'm going to provisionally argue that our re-classification is correct.

What next?

The code is ugly and needs refactoring, tests, etc.

It'd be fun to repeat the taxonomic examination of the Tara ocean bins once I have it working more nicely. I should be able to do a much better (more thorough) job of comparison than before!

More broadly, this is a useful extension of the least-common-ancestor code that I posted a few weeks back and I need to revisit all the old code in light of this. I think I'll be retooling the LCA code substantially to allow the use of custom taxonomies/genome classifications.

I really want to dig into the 8,000 genomes that were posted a little while back (from this paper), and I should make sure my taxonomy code works with the Genome Taxonomy Database.

In the meantime, I'd love to get feedback on the above code if anyone is interested in trying it out!

Appendix: full set of commands

First, repeat (most of) the steps in this blog post to install and compute a bunch of things.

# create a virtualenv and install sourmash
python3.5 -m venv ~/py3
. ~/py3/bin/activate
pip install -U pip
pip install -U Cython
pip install -U jupyter jupyter_client ipython pandas matplotlib scipy scikit-learn khmer

pip install -U https://github.com/dib-lab/sourmash/archive/master.zip

# grab the 2017-sourmash-lca repository
git clone https://github.com/ctb/2017-sourmash-lca

# download subsample data
cd 2017-sourmash-lca/
curl -L https://osf.io/73yfz/download?version=1 -o subsample.tar.gz
tar xzf subsample.tar.gz
cd subsample

# compute signatures for delmont
cd delmont
sourmash compute -k 21,31,51 --scaled 10000 --name-from-first *.fa.gz

# fix names to match identifiers in spreadsheet (below)
python ../delmont-fix-sig-names.py *.sig

for i in *.sig
do
    mv $i.fixed $i
done
cd ../

# download NCBI LCA database
curl -L https://osf.io/zfmbd/download?version=1 -o genbank-lca-2017.08.26.tar.gz
mkdir -p db
cd db/
tar xzf ../genbank-lca-2017.08.26.tar.gz
cd ../

# grab delmont spreadsheet
curl -O -L https://github.com/ctb/2017-sourmash-lca/raw/master/tara-delmont-SuppTable3.csv

If you've already done the above, you'll need to also clone the 2017-sourmash-revindex repository:

git clone https://github.com/ctb/2017-sourmash-revindex.git

and now you're ready to run the commands above.

by C. Titus Brown at October 28, 2017 10:00 PM

October 25, 2017

Continuum Analytics

Announcing the Release of Anaconda Distribution 5.0

We’re thrilled to announce the release of Anaconda Distribution 5.0! With over 4.5 million active users, Anaconda Distribution is the world’s most popular and trusted distribution for data science. It allows you to easily install 1,000+ Python and R data science packages and manage your packages, dependencies, and environments—all with the single click of a …
Read more →

by Team Anaconda at October 25, 2017 08:06 PM

October 21, 2017

Titus Brown

Grokking "custom" taxonomies.

(Some more Saturday/Sunday Morning Bioinformatics...)

A while back I posted some hacky code to classify genome bins using a custom reference database. Much to my delight, Sarah Stevens (a grad student in Trina McMahon's lab) actually tried using it! And of course it didn't do what she wanted.

The problem is that all of my least-common-ancestor scripts rely utterly and completely on the taxonomic IDs from NCBI, but anyone who wants to do classification of new genome bins against their own old genome bins doesn't have NCBI taxonomic IDs. (This should have been obvious to me but I was avoiding thinking about it, because I have a whole nice pile of Python code that has to change. Ugh.)

So basically my code ends up doing a classification with custom genomes that is only useful so far as the custom genomes match NCBI's taxonomy. NOT SO USEFUL.

I'd already noticed that my code for getting taxonomic IDs from the Tara binned genomes barfed on lots of assignments (see script links after "used the posted lineages to assign NCBI taxonomic IDs" in this blog post). This was because the Tully and Delmont genome bins contained a lot of novel species. But I didn't know what to do about that so I punted.

Unfortunately I can punt no more. So I'm working on fixing that, somehow. Today's blog post is about the fallout from some initial work.


Statement of Problem, revisited

We want to classify NEW genome bins using Genbank as well as a custom collection of OLD genome bins that contains both NCBI taxonomy (where available) and custom taxonomic elements.

My trial solution

The approach I took was to build custom extensions to the NCBI taxonomy by rooting the custom taxonomies to NCBI taxids.

Basically, given a custom spreadsheet like this that contains unique identifiers and a taxonomic lineage, I walk through each row, and for each row, find the least common ancestor that shares a name and rank with NCBI. Then the taxonomy for each row consists of triples (rank, name, taxid) where the taxid below the least common ancestor is set to None.

This yields assignments like this, one for each row.

[('superkingdom', 'Bacteria', 2),
 ('phylum', 'Proteobacteria', 1224),
 ('class', 'Alphaproteobacteria', 28211),
 ('order', 'unclassifiedAlphaproteobacteria', None),
 ('family', 'SAR116cluster', None)]

So far so good!

You can see the code here if you are interested.
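In case it helps to see the shape of the idea, here is a minimal sketch of the rooting step. This is not the actual parse-free-tax.py code, and the lookup table name (ncbi_taxid_for) is made up -- it stands for whatever maps a (rank, name) pair to an NCBI taxid.

def root_lineage_to_ncbi(lineage, ncbi_taxid_for):
    """Attach NCBI taxids to a custom lineage.

    lineage: list of (rank, name) pairs, most general rank first.
    ncbi_taxid_for: dict-like mapping (rank, name) -> NCBI taxid.
    """
    rooted = []
    left_ncbi = False
    for rank, name in lineage:
        taxid = None if left_ncbi else ncbi_taxid_for.get((rank, name))
        if taxid is None:
            left_ncbi = True   # once we leave NCBI, everything below is custom
        rooted.append((rank, name, taxid))
    return rooted

Run on the SAR116 lineage above, this would yield exactly the list of triples shown: taxids for Bacteria, Proteobacteria, and Alphaproteobacteria, and None for everything below.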

Problems and challenges with my trial solution

So I ran this on the Delmont and Tully spreadsheets, and by and large I could compute NCBI-rooted custom taxonomies for most of the rows. (You can try it for yourself if you like; install stuff, then run

python parse-free-tax.py db/genbank/{names,nodes}.dmp.gz tara-delmont-SuppTable3.csv > tara-delmont-ncbi-disagree.txt
python parse-free-tax.py db/genbank/{names,nodes}.dmp.gz tara-tully-Table4.csv > tara-tully-ncbi-disagree.txt

in the 2017-sourmash-lca directory.)

But! There were a few places with apparent disagreements between what was in NCBI and what was in the spreadsheet.

Other than typos and parsing errors, these disagreements came in two flavors.

First, there were things that looked like minor alterations in spelling - e.g., below, the custom taxonomy file had "Flavobacteria", while NCBI spells it "Flavobacteriia".

confusing lineage #15
    CSV:  Bacteria, Bacteroidetes, Flavobacteria, Flavobacteriales, Flavobacteriaceae, Xanthomarina
    NCBI: Bacteria, Bacteroidetes, Flavobacteriia, Flavobacteriales, Flavobacteriaceae, Xanthomarina
(2 rows in spreadsheet)

Second, there were just plain ol' disagreements, e.g. below the custom taxonomy file says that genus "Haliea" belongs under order "Alteromonadales_3" while NCBI claims that it belongs under order "Cellvibrionales". This could be a situation where the researchers reused the genus name "Haliea" under a new order, OR (and I think this is more likely) perhaps the researchers are correcting the NCBI taxonomy.

confusing lineage #12
    CSV:  Bacteria, Proteobacteria, Gammaproteobacteria, Alteromonadales_3, Alteromonadaceae, Haliea
    NCBI: Bacteria, Proteobacteria, Gammaproteobacteria, Cellvibrionales, Halieaceae, Haliea
(2 rows in spreadsheet)

See tara-tully-ncbi-disagree.txt and tara-delmont-ncbi-disagree.txt for the full list.

I have no conclusion.

I'm not sure what to do at this point. For now, I'll plan to override the NCBI taxonomy with the researcher-specific taxonomy, but these disagreements seem like useful diagnostic output. It would be interesting to know what the root cause of this is, though; do these names come from using CheckM to assign taxonomy, as in Delmont et al's workflow? And if so, can I resolve a lot of this by just using that taxonomy? :) Inquiring minds must know!!

Regardless, I'm enthusiastic about the bit where my internal error checking in the script catches these issues. (And hopefully I'll never have to write that code again. It's very unpleasant.)

Onwards! Upwards!

--titus

by C. Titus Brown at October 21, 2017 10:00 PM

Bruno Pinho

A simple guide to calculate the Hoek-Brown Failure Criteria in Python

The presence of discontinuities usually reduces the strength of a rock mass relative to that of the intact rock. Engineers and geologists must take this effect into account to predict failure in slope faces and tunnel excavations. In this post, I will take you through the process of calculating the Hoek-Brown Failure Criteria in Python.

by Bruno Ruas de Pinho at October 21, 2017 02:30 AM

October 19, 2017

Continuum Analytics

InfoWorld: 5 essential Python tools for data science—now improved

If you want to master, or even just use, data analysis, Python is the place to do it. Python is easy to learn, it has vast and deep support, and most every data science library and machine learning framework out there has a Python interface.

by Sheyna Webster at October 19, 2017 04:30 PM

October 17, 2017

Matthieu Brucher

Announcement: Audio TK 2.2.0

ATK is updated to 2.2.0 with the major introduction of vectorized filters. This means that some filters (EQ for now) can use vectorization for maximum performance. More filters will be introduced later as well as the Python support. Vector lanes of size 4 and 8 are supported as well as instruction sets from SSE2 to AVX512.

This is also the first major release that officially supports the JUCE framework. This means that ATK can be added as modules (source code directly, without requiring any binaries) in the Projucer. The caveat is that SIMD filters are not available in this configuration, as building them requires CMake support.

Download link: ATK 2.2.0

Changelog:
2.2.0
* Introduced SIMD filters with libsimdpp
* Refactored EQ filters to work with SIMD filters
* Added module files for JUCE Projucer
2.1.1
* Added a Gain Max Compressor filter with wrappers
* Added a dry run call on BaseFilter to setup maximum sizes on a pipeline
* Added an IIR TDF2 (Transposed Direct Form 2) filter implementation (no Python wrappers for now)
* Fixed max gain reduction in the expanders to use 20 log10 instead of 10 log10 (as it is applied to the amplitude and not power)
* Fixed a bug in OutCircularPointerFilter with offset handling
* Fixed a bug in RIAA inverse filters

Buy Me a Coffee!
Other Amount:
Your Email Address:

by Matt at October 17, 2017 07:21 AM

October 16, 2017

Matthew Rocklin

Streaming Dataframes

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation

This post is about experimental software. This is not ready for public use. All code examples and API in this post are subject to change without warning.

Summary

This post describes a prototype project to handle continuous data sources of tabular data using Pandas and Streamz.

Introduction

Some data never stops. It arrives continuously in a constant, never-ending stream. This happens in financial time series, web server logs, scientific instruments, IoT telemetry, and more. Algorithms to handle this data are slightly different from what you find in libraries like NumPy and Pandas, which assume that they know all of the data up-front. It’s still possible to use NumPy and Pandas, but you need to combine them with some cleverness and keep enough intermediate data around to compute marginal updates when new data comes in.

Example: Streaming Mean

For example, imagine that we have a continuous stream of CSV files arriving and we want to print out the mean of our data over time. Whenever a new CSV file arrives we need to recompute the mean of the entire dataset. If we’re clever we keep around enough state so that we can compute this mean without looking back over the rest of our historical data. We can accomplish this by keeping running totals and running counts as follows:

import pandas as pd

total = 0
count = 0

for filename in filenames:  # filenames is an infinite iterator
    df = pd.read_csv(filename)
    total = total + df.sum()
    count = count + df.count()
    mean = total / count
    print(mean)

Now as we add new files to our filenames iterator our code prints out new means that are updated over time. We don’t have a single mean result, we have continuous stream of mean results that are each valid for the data up to that point. Our output data is an infinite stream, just like our input data.

When our computations are linear and straightforward like this, a for loop suffices. However, when our computations have several streams branching out or converging, possibly with rate limiting or buffering between them, this for-loop approach can grow complex and difficult to manage.

Streamz

A few months ago I pushed a small library called streamz, which handled control flow for pipelines, including linear map operations, operations that accumulated state, branching, joining, as well as back pressure, flow control, feedback, and so on. Streamz was designed to handle all of the movement of data and signaling of computation at the right time. This library was quietly used by a couple of groups and now feels fairly clean and useful.

Streamz was designed to handle the control flow of such a system, but did nothing to help you with streaming algorithms. Over the past week I’ve been building a dataframe module on top of streamz to help with common streaming tabular data situations. This module uses Pandas and implements a subset of the Pandas API, so hopefully it will be easy to use for programmers with existing Python knowledge.

Example: Streaming Mean

Our example above could be written as follows with streamz:

source = Stream.filenames('path/to/dir/*.csv')  # stream of filenames
sdf = (source.map(pd.read_csv)                  # stream of Pandas dataframes
             .to_dataframe(example=...))        # logical streaming dataframe

sdf.mean().stream.sink(print)                   # printed stream of mean values

This example is no more clear than the for-loop version. On its own this is probably a worse solution than what we had before, just because it involves new technology. However it starts to become useful in two situations:

  1. You want to do more complex streaming algorithms

    sdf = sdf[sdf.name == 'Alice']
    sdf.x.groupby(sdf.y).mean().sink(print)
    
    # or
    
    sdf.x.rolling('300ms').mean()
    

    It would require more cleverness to build these algorithms with a for loop as above.

  2. You want to do multiple operations, deal with flow control, etc..

    sdf.mean().sink(print)
    sdf.x.sum().rate_limit(0.500).sink(write_to_database)
    ...
    

    Branching off computations, routing data correctly, and handling time can all be challenging to accomplish consistently.

Jupyter Integration and Streaming Outputs

During development we’ve found it very useful to have live updating outputs in Jupyter.

Usually when we evaluate code in Jupyter we have static inputs and static outputs:

However now both our inputs and our outputs are live:

We accomplish this using a combination of ipywidgets and Bokeh plots, both of which provide nice hooks to change previous Jupyter outputs and work well with the Tornado IOLoop (streamz, Bokeh, Jupyter, and Dask all use Tornado for concurrency). We’re able to build nicely responsive feedback whenever things change.

In the following example we build our CSV to dataframe pipeline that updates whenever new files appear in a directory. Whenever we drag files to the data directory on the left we see that all of our outputs update on the right.

What is supported?

This project is very young and could use some help. There are plenty of holes in the API. That being said, the following works well:

Elementwise operations:

sdf['z'] = sdf.x + sdf.y
sdf = sdf[sdf.z > 2]

Simple reductions:

sdf.sum()
sdf.x.mean()

Groupby reductions:

sdf.groupby(sdf.x).y.mean()

Rolling reductions by number of rows or time window:

sdf.rolling(20).x.mean()
sdf.rolling('100ms').x.quantile(0.9)

Real time plotting with Bokeh (one of my favorite features):

sdf.plot()

What’s missing?

  1. Parallel computing: The core streamz library has an optional Dask backend for parallel computing. I haven’t yet made any attempt to attach this to the dataframe implementation.
  2. Data ingestion from common streaming sources like Kafka. We’re in the process now of building asynchronous-aware wrappers around Kafka Python client libraries, so this is likely to come soon.
  3. Out-of-order data access: soon after parallel data ingestion (like reading from multiple Kafka partitions at once) we’ll need to figure out how to handle out-of-order data access. This is doable, but will take some effort. This is where more mature libraries like Flink are quite strong.
  4. Performance: Some of the operations above (particularly rolling operations) do involve non-trivial copying, especially with larger windows. We’re relying heavily on the Pandas library which wasn’t designed with rapidly changing data in mind. Hopefully future iterations of Pandas (Arrow/libpandas/Pandas 2.0?) will make this more efficient.
  5. Filled out API: Many common operations (like variance) haven’t yet been implemented. Some of this is due to laziness and some is due to wanting to find the right algorithm.
  6. Robust plotting: Currently this works well for numeric data with a timeseries index but not so well for other data.

But most importantly this needs use by people with real problems to help us understand what here is valuable and what is unpleasant.

Help would be welcome with any of this.

You can install this from github

pip install git+https://github.com/mrocklin/streamz.git

Documentation and code are here:

Current work

Current and upcoming work is focused on data ingestion from Kafka and parallelizing with Dask.

October 16, 2017 12:00 AM

October 15, 2017

Titus Brown

A brief introduction to osfclient, a command line client for the Open Science Framework

Over the last few months, Tim Head has been pushing forward the osfclient project, an effort to build a simple and friendly command-line interface to the Open Science Framework's file storage. This project was funded by a gift to my lab through the Center for Open Science (COS) to the tune of about $20k, given by an anonymous donor.

The original project was actually to write an OSF integration for Galaxy, but that project was first delayed by my move to UC Davis and then suffered from Michael Crusoe's move to work on the Common Workflow Language. After talking with the COS folk, we decided to repurpose the money to something that addresses a need in my lab - using the Open Science Framework to share files.

Our (Tim's) integration effort resulted in osfclient, a combination Python API and command-line program. The project is still in its early stages, but a few people have found it useful - in addition to increasing usage within my lab, @zkamvar has used it to transfer "tens of thousands of files", and @danudwary found it "just worked" for grabbing some big files. And new detailed use cases are emerging regularly.

Most exciting of all, we've had contributions from a number of other people already, and I'm looking forward to this project growing to meet the needs of the open science community!

Taking a step back: why OSF, and why a command-line client?

I talked a bit about "why OSF?" in a previous blog post, but the short version is that it's a globally accessible place to store files for science, and it works well for that! It fits a niche that we haven't found any other solutions for - free storage for medium size genomics files - and we're actively exploring its use in about a dozen different projects.

Our underlying motivations for building a command-line client for OSF were several:

  • we often need to retrieve full folder/directory hierarchies of files for research and training purposes;

  • frequently, we want to retrieve those file hierarchies on remote (cloud or HPC) systems;

  • we're often grabbing files that are larger than GitHub supports;

  • sometimes these files are from private projects that we cannot (or don't want to) publicize;

Here, the Open Science Framework was already an 80% solution (supporting folder hierarchies, large file storage, and a robust permissions system), but it didn't have a command-line client - we were reduced to using curl or wget on individual files, or (in theory) writing our own REST queries.

Enter osfclient!

Using osfclient, a quickstart

(See "Troubleshooting osfclient installs" at the bottom if you run into any troubles running these commands!)

In a Python 3 environment, do:

pip install osfclient

and then execute:

osf -p fuqsk clone

This will go to the osfclient test project on http://osf.io, and download all the files that are part of that project -- if you execute:

find fuqsk

you should see:

fuqsk
fuqsk/figshare
fuqsk/figshare/this is a test text file
fuqsk/figshare/this is a test text file/hello.txt
fuqsk/googledrive
fuqsk/googledrive/google test file.gdoc
fuqsk/googledrive/googledrive-hello.txt
fuqsk/osfstorage
fuqsk/osfstorage/hello.txt
fuqsk/osfstorage/test-subfolder
fuqsk/osfstorage/test-subfolder/hello-from-subfolder.txt

which showcases a particularly nice feature of the OSF that I'll talk about below.

A basic overview of what osfclient did

If you go to the project URL, http://osf.io/fuqsk, you will see a file storage hierarchy that looks like so:

OSF folders screenshot

What osfclient is doing is grabbing all of the different storage files and downloading them to your local machine. Et voila!

What's with the 'figshare' and 'googledrive' stuff? Introducing add-ons/integrations.

In the above, you'll notice that there are these subdirectories named figshare and googledrive. What are those?

The Open Science Framework can act as an umbrella integration for a variety of external storage services - see the docs. They support Amazon S3, Dropbox, Google Drive, Figshare, and a bunch of others.

In the above project, I linked in my Google Drive and Figshare accounts to OSF, and connected specific remote folders/projects into the OSF project (this one from Google Drive, and this one from figshare). This allows me (and others with permissions on the project) to access and manage those files from within a single Web UI on the OSF.

osfclient understands some of these integrations (and it's pretty trivial to add a new one to the client, at least), and it does the most obvious thing possible with them when you do an osfclient clone: it grabs the files and downloads them! (It should also be able to push to those remote storages, but I haven't tested that today.)

Interestingly, this appears to be a good simple way to layer OSF's project hierarchy and permission system on top of more complex and/or less flexible and/or non-command-line-friendly systems. For example, Luiz Irber recently uploaded a very large file to google drive via rclone and it showed up in his OSF project just fine.

This reasonably flexible imposition of an overall namespace on a disparate collection of storages is pretty nice, and could be a real benefit for large, complex projects.

Other things you can do with osfclient

osfclient also has file listing and file upload functionality, along with some configurability in terms of providing a default project and permissions within specific directories. The osfclient User Guide has some brief instructions along these lines.
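For a quick flavor of those commands, something like the following should work; treat the exact subcommand names and arguments as illustrative and check osf --help or the User Guide for the authoritative syntax:

# list the files in a project
osf -p fuqsk ls

# upload a local file into the project's osfstorage
osf -p fuqsk upload local/data.csv osfstorage/data.csv

(Private projects additionally need credentials; see the User Guide for how to supply them.)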

osfclient also contains a Python API for OSF, and you can see a bit more about that here, in Tim Head and Erin Braswell's webinar materials.
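As a rough illustration of that API (based on my reading of the docs at the time; the webinar materials are the authoritative reference), listing every file in the test project looks something like this:

from osfclient import OSF

osf = OSF()                      # anonymous access is fine for public projects
project = osf.project('fuqsk')   # the test project used above

for storage in project.storages:     # osfstorage, figshare, googledrive, ...
    for f in storage.files:
        print(f.path)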

What's next?

There are a few inconveniences about the OSF that could usefully be worked around, and a lot of features to be added in osfclient. In no particular order, here are a few of the big ones that require significant refactoring or design decisions or even new REST API functionality on the OSF side --

  • we want to make osf behave a bit more like git - see the issue. This would make it easier to teach and use, we think. In particular we want to avoid having to specify the project name every time.
  • speaking of project names, I don't think the project UIDs on the OSF (fuqsk above) are particularly intuitive or type-able, and it would be great to have a command line way of discovering the project UID for your project of interest.
  • I'd also like to add project creation and maybe removal via the command line, as well as project registration - more on that later.
  • the file storage hierarchy above, with osfstorage/ and figshare/ as top level directories, isn't wonderful for command line folk - there are seemingly needless hierarchies in there. I'm not sure how to deal with this but there are a couple of possible solutions, including adding a per-project 'remapping' configuration that would move the files around.

Concluding thoughts

The OSF offers a simple, free, Web friendly, and convenient way to privately and publicly store collections of files under 5 GB in size on a Web site. osfclient provides a simple and reasonably functional way to download files from and upload files to the OSF via the command line. Give it a try!

Appendix: Troubleshooting osfclient installs

  1. If you can't run pip install on your system, you may need to either run the command as root, OR establish a virtual environment -- something like

python -m virtualenv -p python3.5 osftest
. osftest/bin/activate
pip install osfclient

will create a virtualenv, activate it, and install osfclient.

  2. If you get a requests.exceptions.SSLError, you may be on a Mac and using an old version of openssl. You can try pip install -U pyopenssl. If that doesn't work, please add a comment to this issue.

  3. Note that a conda install for osfclient exists, and you should be able to do conda install -c conda-forge osfclient.

by C. Titus Brown at October 15, 2017 10:00 PM

October 10, 2017

Matthew Rocklin

Notes on Kafka in Python

Summary

I recently investigated the state of Python libraries for Kafka. This blogpost contains my findings.

Both PyKafka and confluent-kafka have mature implementations and are maintained by invested companies. Confluent-kafka is generally faster while PyKafka is arguably better designed and documented for Python usability.

Conda packages are now available for both. I hope to extend one or both to support asynchronous workloads with Tornado.

Disclaimer: I am not an expert in this space. I have no strong affiliation with any of these projects. This is a report based on my experience of the past few weeks. I don’t encourage anyone to draw conclusions from this work. I encourage people to investigate on their own.

Introduction

Apache Kafka is a common data system for streaming architectures. It manages rolling buffers of byte messages and provides a scalable mechanism to publish or subscribe to those buffers in real time. While Kafka was originally designed within the JVM space, the fact that it only manages bytes makes it easy to access from native code systems like C/C++ and Python.
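As a tiny illustration of that byte-oriented interface, publishing a message from Python takes only a few lines. Here is a hedged sketch using confluent-kafka (one of the clients discussed below); the broker address and topic name are placeholders:

from confluent_kafka import Producer

p = Producer({'bootstrap.servers': 'localhost:9092'})  # placeholder broker address
p.produce('my-topic', b'hello, kafka')                 # payloads are just bytes
p.flush()                                              # block until delivery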

Python Options

Today there are three independent Kafka implementations in Python, two of which are optionally backed by a C implementation, librdkafka, for speed:

  • kafka-python: The first on the scene, a Pure Python Kafka client with robust documentation and an API that is fairly faithful to the original Java API. This implementation has the most stars on GitHub and the most active development team (by number of committers), but also lacks a connection to the fast C library. I’ll admit that I didn’t spend enough time on this project to judge it well because of this.

  • PyKafka: The second implementation chronologically. This library is maintained by Parse.ly, a web analytics company that heavily uses both streaming systems and Python. PyKafka’s API is more creative and designed to follow common Python idioms rather than the Java API. PyKafka has both a pure Python implementation and connections to the low-level librdkafka C library for increased performance.

  • Confluent-kafka: The final implementation chronologically. It is maintained by Confluent, the primary for-profit company that supports and maintains Kafka. This library is the fastest, but also the least accessible from a Python perspective. The implementation is written as CPython extensions, and the documentation is minimal. However, if you are coming from the Java API then this is entirely consistent with that experience, so that documentation probably suffices.

Performance

Confluent-kafka message-consumption bandwidths are around 50% higher and message-production bandwidths are around 3x higher than PyKafka, both of which are significantly higher than kafka-python. I’m taking these numbers from this blogpost which gives benchmarks comparing the three libraries. The primary numeric results follow below:

Note: It’s worth noting that this blogpost was moving smallish 100 byte messages around. I would hope that Kafka would perform better (closer to network bandwidths) when messages are of a decent size.

Producer Throughput

                           time_in_seconds   MBs/s   Msgs/s
confluent_kafka_producer               5.4      17   183000
pykafka_producer_rdkafka                16     6.1    64000
pykafka_producer                        57     1.7    17000
python_kafka_producer                   68     1.4    15000

Consumer Throughput

                           time_in_seconds   MBs/s   Msgs/s
confluent_kafka_consumer               3.8      25   261000
pykafka_consumer_rdkafka               6.1      17   164000
pykafka_consumer                        29     3.2    34000
python_kafka_consumer                   26     3.6    38000

Note: I discovered this article on parsely/pykafka #559, which has good conversation about the three libraries.

I profiled PyKafka in these cases and it doesn’t appear that these code paths have yet been optimized. I expect that modest effort could close that gap considerably. This difference seems to be more from lack of interest than any hard design constraint.

It’s not clear how critical these speeds are. According to the PyKafka maintainers at Parse.ly they haven’t actually turned on the librdkafka optimizations in their internal pipelines, and are instead using the slow Pure Python implementation, which is apparently more than fast enough for common use. Getting messages out of Kafka just isn’t their bottleneck. It may be that these 250,000 messages/sec limits are not significant in most applications. I suspect that this matters more in bulk analysis workloads than in online applications.

Pythonic vs Java APIs

It took me a few tries to get confluent-kafka to work. It wasn’t clear what information I needed to pass to the constructor to connect to Kafka, and when I gave the wrong information I received no message that I had done anything incorrectly. Docstrings and documentation were both minimal. In contrast, PyKafka’s API and error messages quickly led me to correct behavior and I was up and running within a minute.

However, I persisted with confluent-kafka, found the right Java documentation, and eventually did get things up and running. Once this happened everything fell into place and I was able to easily build applications with Confluent-kafka that were both simple and fast.
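To make the constructor difference concrete, here is roughly what a minimal consumer looks like in each library. The broker address, topic, and group id are placeholders, and exact options may vary by version:

# confluent-kafka: configuration is a dict of (Java-style) key/value options
from confluent_kafka import Consumer

c = Consumer({'bootstrap.servers': 'localhost:9092',
              'group.id': 'my-group'})
c.subscribe(['my-topic'])
msg = c.poll(1.0)                       # returns None or a Message
if msg is not None and not msg.error():
    print(msg.value())

# PyKafka: a more Pythonic, object-oriented path to the same place
from pykafka import KafkaClient

client = KafkaClient(hosts='localhost:9092')
consumer = client.topics[b'my-topic'].get_simple_consumer()
for message in consumer:
    if message is not None:
        print(message.value)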

Development experience

I would like to add asynchronous support to one or both of these libraries so that they can read or write data in a non-blocking fashion and play nicely with other asynchronous systems like Tornado or Asyncio. I started investigating this with both libraries on GitHub.

Developers

Both libraries have a maintainer who is somewhat responsive and whose time is funded by the parent company. Both maintainers seem active on a day-to-day basis and handle contributions from external developers.

Both libraries are fully active with a common pattern of a single main dev merging work from a number of less active developers. Distributions of commits over the last six months look similar:

confluent-kafka-python$ git shortlog -ns --since "six months ago"
38  Magnus Edenhill
5  Christos Trochalakis
4  Ewen Cheslack-Postava
1  Simon Wahlgren

pykafka$ git shortlog -ns --since "six months ago"
52  Emmett Butler
23  Emmett J. Butler
20  Marc-Antoine Parent
18  Tanay Soni
5  messense
1  Erik Stephens
1  Jeff Widman
1  Prateek Shrivastava
1  aleatha
1  zpcui

Codebase

In regards to the codebases I found that PyKafka was easier to hack on for a few reasons:

  1. Most of PyKafka is written in Python rather than C extensions, and so it is more accessible to a broader development base. I find that Python C extensions are not pleasant to work with, even if you are comfortable with C.
  2. PyKafka appears to be much more extensively tested. PyKafka actually spins up a local Kafka instance to do comprehensive integration tests, while Confluent-kafka seems to only test the API without actually running against a real Kafka instance.
  3. For what it’s worth, PyKafka maintainers responded quickly to an issue on Tornado. Confluent-kafka maintainers still have not responded to a comment on an existing Tornado issue, even though that comment had significantly more content (a working prototype).

To be clear, no maintainer has any responsibility to answer my questions on github. They are likely busy with other things that are of more relevance to their particular mandate.

Conda packages

I’ve pushed/updated recipes for both packages on conda-forge. You can install them as follows:

conda install -c conda-forge pykafka                 # Linux, Mac, Windows
conda install -c conda-forge python-confluent-kafka  # Linux, Mac

In both cases these are built against the fast librdkafka C library (except on Windows) and install that library as well.

Future plans

I’ve recently started work on streaming systems and pipelines for Dask, so I’ll probably continue to investigate this space. I’m still torn between the two implementations. There are strong reasons to use either of them.

Culturally I am drawn to Parse.ly’s PyKafka library. They’re clearly Python developers writing for Python users. However the costs of using a non-Pythonic system here just aren’t that large (Kafka’s API is small), and Confluent’s interests are more aligned with investing in Kafka long term than are Parse.ly’s.

October 10, 2017 12:00 AM

October 09, 2017

Continuum Analytics

Strata Data Conference Grows Up

The Strata conference will always hold a place in my heart, as it’s one of the events that inspired Travis and me to found Anaconda. We listened to open source-driven talks about data lakes and low-cost storage and knew there would be a demand for tools to help organizations and data scientists derive value from these mountains of information.

by Team Anaconda at October 09, 2017 02:31 PM

October 05, 2017

Continuum Analytics

Database Trends & Applications: Machine Learning and Data Science are Top Trends at Strata Data

Data professionals and vendors converged at Strata Data in New York to trade tips and tricks for handling big data. Top of mind for most was the impact of machine learning and how it’s continuing to evolve as the “next big thing.”

by Sheyna Webster at October 05, 2017 04:29 PM

October 04, 2017

October 03, 2017

Continuum Analytics

Seven Things You Might Not Know About Numba

In this post, I want to dive deeper and demonstrate several aspects of using Numba on the GPU that are often overlooked. I’ll quickly breeze through a number of topics, but I’ll provide links throughout for additional reading.

by Team Anaconda at October 03, 2017 06:00 PM

Database Trends & Applications: Anaconda Partners with Microsoft to Provide Data Science Python Programs

Anaconda, Inc., a Python data science platform provider, is partnering with Microsoft to embed Anaconda into Azure Machine Learning, Visual Studio and SQL Server to deliver data insights in real time.

by Sheyna Webster at October 03, 2017 04:28 PM

October 02, 2017

Continuum Analytics

Intellyx: Anaconda: Delivering a Python Data Science Platform for the Enterprise

If you’re going down the data science road, there’s a pretty good chance that you’re using Python and Anaconda as part of your toolset. That’s because Anaconda is the most popular open source Python distribution for data science and is both downloaded directly and included in numerous data science platforms.

by Sheyna Webster at October 02, 2017 04:27 PM

App Developer Magazine: Python-powered machine learning with Anaconda and MS partnership

Anaconda, Inc. has announced it is partnering with Microsoft to embed Anaconda into Azure Machine Learning, Visual Studio and SQL Server to deliver data insights in real time.

by Sheyna Webster at October 02, 2017 04:24 PM

September 29, 2017

Continuum Analytics

ZDNet: Strata NYC 2017 to Hadoop: Go jump in a data lake

http://www.zdnet.com/article/strata-nyc-2017-to-hadoop-go-jump-in-a-data-lake/

by Sheyna Webster at September 29, 2017 04:37 PM

September 28, 2017

September 26, 2017

Continuum Analytics

Anaconda and Microsoft Partner to Deliver Python-Powered Machine Learning

Strata Data Conference, NEW YORK––September 26, 2017––Anaconda, Inc., the most popular Python data science platform provider, today announced it is partnering with Microsoft to embed Anaconda into Azure Machine Learning, Visual Studio and SQL Server to deliver data insights in real time.

by Sheyna Webster at September 26, 2017 01:00 PM

Matthieu Brucher

Book Review: Mastering The Challenges Of Leading Change: Inspire the People and Succeed Where Others Fail

I like change. More precisely, I like improving things. And, as some of the people in my entourage would say, I can be a bull in a china shop. So this book sounded interesting.

Discussion

The book is split into four parts that are explained in the introduction: priorities, politics, people and perseverance.
At the end of each chapter, there is a small questionnaire with actions you can take to improve your ability to change.

Let’s start with priorities. Obviously, this is all about how you want to start the change. If you stay put, you are not going forward. So the first chapter is about looking for people who will help you change. The second chapter deals with setting up a team ready to lead the change, and the last one of this part is about the type of changes we are aiming for.

The second part tackles politics, or more precisely what happens behind the change. Communication is crucial (the focus of the fourth chapter), but I think communication is also about getting the message that lies behind a situation. It’s also about being ready to listen to solutions other than yours to improve a situation. The bad part is that there are also people who just want to mess with you. I do remember some cases where that happened, and I wish I had had all the tools from the first two parts in place; they would probably have solved my problems with Machiavellis, as the author calls them!

Then we move to handling people. We go back to communication, and the importance of direct rather than indirect communication, then to how to handle a group and move it in the same direction. I also liked the last part, about not doing something just for your job, but also for making new friends.

The last part is about perseverance. There are going to be issues, fires to put out, and you need a proper team. Perseverance is also about staying in motion and being ready for the next change. And to do so, you also need to cultivate the people who may lead that next change!

Conclusion

Great book. Even if part of it is just common sense, it’s also about noticing that some of the things you would do for your friends, you need to do when leading change. For me, this proves that communication, in all its aspects, is everything.

So now, let’s apply this book to my professional life!

by Matt at September 26, 2017 07:31 AM

September 25, 2017

Enthought

Enthought at the 2017 Society of Exploration Geophysicists (SEG) Conference

2017 will be Enthought’s 11th year at the SEG (Society of Exploration Geophysicists) Annual Meeting, and we couldn’t be more excited to be at the leading edge of the digital transformation in oil & gas being driven by the capabilities provided by machine learning and artificial intelligence.

Now in its 87th year, the Annual SEG (Society of Exploration Geophysicists) Meeting will be held in Houston, Texas on September 24-27, 2017 at the George R. Brown Convention Center. The SEG Annual Meeting will be the first large conference to take place in Houston since Hurricane Harvey and its devastating floods, and we’re so pleased to be a small part of getting Houston back “open for business.”

Pre-Event Kickoff: The Machine Learning Geophysics Hackathon

We had such a great experience at the EAGE Subsurface Hackathon in Paris in June that when we heard our friends at Agile Geoscience were planning a machine learning in geophysics hackathon for the US, we had to join! Brendon Hall, Enthought’s Energy Solutions Group Director will be there as a participant and coach and Enthought CEO Eric Jones will be on the judging panel.

Come Meet Us on the SEG Expo Floor & Learn About Our AI-Enabled Solutions for Oil & Gas

Presentations in Enthought Booth #318 (just to the left of the main entrance, before the main aisle):

  • Monday, Sept 25, 12-12:45 PM: Lessons Learned From the Front Line: Moving AI From Research to Application
  • Tues, Sept 26, 1-1:45 PM: Canopy Geoscience: Building Innovative, AI-Enabled Geoscience Applications
  • Wed, Sept 27, 12-12:45 PM: Applying Artificial Intelligence to CT, Photo, and Well Log Analysis with Virtual Core

Hart Energy’s E&P Magazine Features Canopy Geoscience

Canopy Geoscience, Enthought’s cross-domain AI platform for oil & gas, was featured in the September 2017 edition of E&P magazine. See the coverage in the online SEG Technology Showcase, in the September print edition, or in the online E&P Flipbook.


Enthought's Canopy Geoscience featured in E&P's September 2017 edition

The post Enthought at the 2017 Society of Exploration Geophysicists (SEG) Conference appeared first on Enthought Blog.

by admin at September 25, 2017 05:14 PM

September 24, 2017

Matthew Rocklin

Dask Release 0.15.3

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.15.3. This release contains stability enhancements and bug fixes. This blogpost outlines notable changes since the 0.15.2 release on August 30th.

You can conda install Dask:

conda install -c conda-forge dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Conda packages are available on the conda-forge channel. They will be on defaults in a few days.

Full changelogs are available here:

Some notable changes follow.

Masked Arrays

Dask.array now supports masked arrays similar to NumPy.

In [1]: import dask.array as da

In [2]: x = da.arange(10, chunks=(5,))

In [3]: mask = x % 2 == 0

In [4]: m = da.ma.masked_array(x, mask)

In [5]: m
Out[5]: dask.array<masked_array, shape=(10,), dtype=int64, chunksize=(5,)>

In [6]: m.compute()
Out[6]:
masked_array(data = [-- 1 -- 3 -- 5 -- 7 -- 9],
             mask = [ True False  True False  True False  True False  True False],
       fill_value = 999999)

This work was primarily done by Jim Crist and partially funded by the UK Met office in support of the Iris project.

Constants in atop

Dask.array experts will be familiar with the atop function, which powers a non-trivial amount of dask.array and is commonly used by people building custom algorithms. This function now supports constants when the index given is None.

atop(func, 'ijk', x, 'ik', y, 'kj', CONSTANT, None)

Memory management for workers

Dask workers spill excess data to disk when they reach 60% of their allotted memory limit. Previously we only measured memory use by adding up the memory use of every piece of data produced by the worker. This could fail in a few situations:

  1. Our per-data estimates were faulty
  2. User code consumed a large amount of memory without our tracking it

To compensate, we now also periodically check the memory use of the worker process with the psutil module. We dump data to disk if the process rises above 70% memory use, stop running new tasks if it rises above 80%, and restart the worker if it rises above 95% (assuming that the worker has a nanny process).
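As a rough sketch (not the actual distributed source), the periodic check amounts to comparing the worker process's resident memory, as reported by psutil, against the configured limit:

import psutil

def memory_fraction(memory_limit):
    """Fraction of the configured memory limit used by this process."""
    rss = psutil.Process().memory_info().rss   # resident set size, in bytes
    return rss / memory_limit

# Roughly the policy described above, for a 100 GB limit:
# if memory_fraction(100e9) > 0.70: spill least-recently-used data to disk
# if memory_fraction(100e9) > 0.80: pause running new tasks
# if memory_fraction(100e9) > 0.95: restart the worker (requires a nanny)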

Breaking Change: Previously the --memory-limit keyword to the dask-worker process specified the 60% “start pushing to disk” limit. So if you had 100GB of RAM then you previously might have started a dask-worker as follows:

dask-worker ... --memory-limit 60e9  # before specify 60% target

And the worker would start pushing to disk once it had 60GB of data in memory. However, now we are changing this meaning to be the full amount of memory given to the process.

dask-worker ... --memory-limit 100e9  # now specify 100% target

Of course, you don’t have to specify this limit (many don’t). It will be chosen for you automatically. If you’ve never cared about this then you shouldn’t start caring now.

More about memory management here: http://distributed.readthedocs.io/en/latest/worker.html?highlight=memory-limit#memory-management

Statistical Profiling

Workers now poll their worker threads every 10ms and keep a running count of which functions are being used. This information is available on the diagnostic dashboard as a new “Profile” page. It provides information that is orthogonal to, and generally more detailed than, the typical task-stream plot.

These plots are available on each worker, and an aggregated view is available on the scheduler. The timeseries on the bottom allows you to select time windows of your computation to restrict the parallel profile.

More information about diagnosing performance available here: http://distributed.readthedocs.io/en/latest/diagnosing-performance.html

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.2 release on August 30th

  • Adonis
  • Christopher Prohm
  • Danilo Horta
  • jakirkham
  • Jim Crist
  • Jon Mease
  • jschendel
  • Keisuke Fujii
  • Martin Durant
  • Matthew Rocklin
  • Tom Augspurger
  • Will Warner

The following people contributed to the dask/distributed repository since the 1.18.3 release on September 2nd:

  • Casey Law
  • Edrian Irizarry
  • Matthew Rocklin
  • rbubley
  • Tom Augspurger
  • ywangd

September 24, 2017 12:00 AM

September 23, 2017

Bruno Pinho

Integrating & Exploring 3: Download and Process DEMs in Python

This tutorial shows how to automate downloading and processing DEM files. It shows a one-liner to download SRTM (30 or 90 m) data and how to use rasterio to reproject the downloaded data into a desired CRS, spatial resolution, or bounds.

by Bruno Ruas de Pinho at September 23, 2017 01:00 AM

September 22, 2017

numfocus

Open Journals joins NumFOCUS Sponsored Projects

NumFOCUS is pleased to welcome the Open Journals as a fiscally sponsored project. Open Journals is a collection of open source, open access journals. The team behind Open Journals believes that code review and high-quality, reusable software are a critical–but often overlooked–part of academic research. The primary goal of Open Journals is to provide venues […]

by NumFOCUS Staff at September 22, 2017 02:00 PM

William Stein

DataDog's pricing: don't make the same mistake I made

(I wrote a 1-year followup post here.)

I stupidly made a mistake recently by choosing to use DataDog for monitoring the infrastructure for my startup (SageMathCloud).

I got bit by their pricing UI design that looks similar to many other sites, but is different in a way that caused me to spend far more money than I expected.

I'm writing this post so that you won't make the same mistake I did.  As a product, DataDog is of course a lot of hard work to create, and they can try to charge whatever they want. However, my problem is that what they are going to charge was confusing and misleading to me.

I wanted to see some nice web-based data about my new autoscaled Kubernetes cluster, so I looked around at options. DataDog looked like a new and awesomely-priced service for seeing live logging. And when I looked (not carefully enough) at the pricing, it looked like only $15/month to monitor a bunch of machines. I'm naive about the cost of cloud monitoring -- I've been using Stackdriver on Google cloud platform for years, which is completely free (for now, though that will change), and I've also used self hosted open solutions, and some quite nice solutions I've written myself. So my expectations were way out of whack.

Ever busy, I signed up for the "$15/month plan":


One of the people on my team spent a little time and installed datadog on all the VM's in our cluster, and also made DataDog automatically start running on any nodes in our Kubernetes cluster. That's a lot of machines.

Today I got the first monthly bill, which is for the month that just happened. The cost was $639.19 USD charged to my credit card. I was really confused for a while, wondering if I had bought a year subscription.



After a while I realized that the cost is per host! When I looked at the pricing page the first time, I had just seen in big letters "$15", "$18 month-to-month", and "up to 500 hosts". I completely missed the "Per Host" line, because I was so naive that I didn't think the price could possibly be that high.

I tried immediately to delete my credit card and cancel my plan, but the "Remove Card" button is greyed out, and it says you can "modify your subscription by contacting us at success@datadoghq.com":



So I wrote to success@datadoghq.com:

Dear Datadog,

Everybody on my team was completely mislead by your
horrible pricing description.

Please cancel the subscription for wstein immediately
and remove my credit card from your system.

This is the first time I've wasted this much money
by being misled by a website in my life.

I'm also very unhappy that I can't delete my credit
card or cancel my subscription via your website. It's
like one more stripe API call to remove the credit card
(I know -- I implemented this same feature for my site).


And they responded:

Thanks for reaching out. If you'd like to cancel your
Datadog subscription, you're able to do so by going into
the platform under 'Plan and Usage' and choose the option
downgrade to 'Lite', that will insure your credit card
will not be charged in the future. Please be sure to
reduce your host count down to the (5) allowed under
the 'Lite' plan - those are the maximum allowed for
the free plan.

Also, please note you'll be charged for the hosts
monitored through this month. Please take a look at
our billing FAQ.


They were right -- I was able to uninstall the daemons, downgrade to Lite, remove my card, etc. all through the website without manual intervention.

When people have been confused with billing for my site, I have apologized, immediately refunded their money, and opened a ticket to make the UI clearer.  DataDog didn't do any of that.

I wish DataDog would at least clearly state that when you use their service you are potentially on the hook for an arbitrarily large charge for any month. Yes, if they had made that clear, they wouldn't have had me as a customer, so they are not incentivized to do so.

A fool and their money are soon parted. I hope this post reduces the chances you'll be a fool like me.  If you choose to use DataDog (and their monitoring tools are very impressive), I hope you'll be aware of the cost.


ADDED:

On Hacker News somebody asked: "How could their pricing page be clearer? It says per host in fairly large letters underneath it. I'm asking because I will be designing a similar page soon (that's also billed per host) and I'd like to avoid the same mistakes."  My answer:

[EDIT: This pricing page by the top poster in this thread is way better than I suggest below -- https://www.serverdensity.com/pricing/]

1. VERY clearly state that when you sign up for the service, then you are on the hook for up to $18*500 = $9000 + tax in charges for any month. Even Google compute engine (and Amazon) don't create such a trap, and have a clear explicit quota increase process.
2. Instead of "HUGE $15" newline "(small light) per host", put "HUGE $18 per host" all on the same line. It would easily fit. I don't even know how the $15/host datadog discount could ever really work, given that the number of hosts might constantly change and there is no prepayment.
3. Inform users clearly in the UI at any time how much they are going to owe for that month (so far), rather than surprising them at the end. Again, Google Cloud Platform has a very clear running total in their billing section, and any time you create a new VM it gives the exact amount that VM will cost per month.
4. If one works with a team, 3 is especially important. The reason that I had monitors on 50+ machines is that another person working on the project, who never looked at pricing or anything, just thought -- hey, I'll just set this up everywhere. He had no idea there was a per-machine fee.

by William Stein (noreply@blogger.com) at September 22, 2017 01:03 PM

DataDog: Don't make the same mistake I did -- a followup and thoughts about very unhappy customers

This is a followup to my previous blog post about DataDog billing.

TL;DR:
- I don't recommend DataDog,
- dealing with unhappy customers is hard,
- monitoring for data science nerds?

Hacker News Comments

DataDog at Google Cloud Summit

I was recently at the Seattle Google Cloud Summit and DataDog was well represented, with the biggest booth and top vendor billing during the keynote. Clearly they are doing something right. I had a past unpleasant experience with them, and I had just been auditing my records and discovered that last year DataDog had actually charged me a lot more than I thought, so I was kind of annoyed. Nonetheless, they kept coming up and talking to me, server monitoring is life-and-death important to me, and their actual software is very impressive in some ways.

Nick Parisi started talking with me about DataDog. He sincerely wanted to know about my past experience with monitoring and the DataDog product, which he was clearly very enthusiastic about. So I told him a bit, and he encouraged me to tell him more, explaining that he did not work at DataDog last year, and that he would like to know what happened. So he gave me his email, and he was very genuinely concerned and helpful, so I sent him an email with a link to my post, etc. I didn't receive a response, so a week later I asked why, and received a followup email from Jay Robichau, who is DataDog's Director of Sales.

Conference Call with DataDog

Jay set up a conference call with me today at 10am (September 22, 2017). Before the call, I sent him a summary of my blog post, and also requested a refund, especially for the surprise bill they sent me nearly 6 weeks after my post.

During the call, Jay explained that he was "protecting" Nick from me, and that I would mostly talk with Michelle Danis who is in charge of customer success. My expectation for the call is that we would find some common ground, and that they would at least appreciate the chance to make things right and talk with an unhappy customer. I was also curious about how a successful startup company addresses the concerns of an unhappy customer (me).

I expected the conversation to be difficult but go well, with me writing a post singing the praises of the charming DataDog sales and customer success people. A few weeks ago CoCalc.com (my employer) had a very unhappy customer who got (rightfully) angry over a miscommunication, told us he would no longer use our product, and would definitely not recommend it to anybody else. I wrote to him wanting to at least continue the discussion and help, but he was completely gone. I would do absolutely anything I could to ensure he was satisfied, if only he would give me the chance. Also, there was a recent blog post from somebody unhappy with using CoCalc/Sage for graphics, and I reached out to them as best I could to at least clarify things...

In any case, here's what DataDog charged us as a result of us running their daemon on a few dozen containers in our Kubernetes cluster (a contractor who is not a native English speaker actually setup these monitors for us):

07/22/2016  449215JWJH87S8N4  DATADOG 866-329-4466 NY  $639.19
08/29/2016 2449215L2JH87V8WZ DATADOG 866-329-4466 NY $927.22

I was shocked by the 07/22 bill, which spurred my post, and discovered the 8/29 one only later. We canceled our subscription on July 22 (cancelling was difficult in itself).

Michelle started the conference call by explaining that the 08/29 bill was for charges incurred before 07/22, and that their billing system has over a month lag (and it does even today, unlike Google Cloud Platform, say). Then Michelle explained at length many of the changes that DataDog has made to their software to address the issues I (and others?) have pointed out with their pricing description and billing. She was very interested in whether I would be a DataDog customer in the future, and when I said no, she explained that they would not refund my money since the bill was not a mistake.

I asked if they now provide a periodic summary of the upcoming bill, as Google cloud platform (say) does. She said that today they don't, though they are working on it. They do now provide a summary of usage so far in an admin page.

Finally, I explained in no uncertain terms that I felt misled by their pricing. I expected that they would understand, and pointed out that they had just described to me many ways in which they were addressing this very problem. Very surprisingly, Michelle's response was that she absolutely would not agree that there was any problem with their pricing description a year ago, and they definitely would not refund my money. She kept bringing up the terms of service. I agreed that I didn't think legally they were in the wrong, given what she had explained, just that -- as they had just pointed out -- their pricing and billing was unclear in various ways. They would not agree at all.

I can't recommend doing business with DataDog. I had very much hoped to write the opposite in this updated post. Unfortunately, their pricing and terms are still confusing today compared to competitors, and they are unforgiving of mistakes. This dog bites.    

(Disclaimer: I took notes during the call, but most of the above is from memory, and I probably misheard or misunderstood something. I invite comments from DataDog to set the record straight.)

Also, for what it is worth, I definitely do recommend Google Cloud Platform.  They put in the effort to do many things right regarding clear billing.

How do Startups Deal with Unhappy Customers?

I am very curious about how other startups deal with unhappy customers. At CoCalc we have had very few "major incidents" yet... but I want to be as prepared as possible. At the Google Cloud Summit, I went to an amazing "war stories by SREs" session in which they talked about situations they had been in years ago where their decisions meant the difference between whether they would have a company or a job tomorrow or not. These guys clearly have amazing instincts for when a problem was do-or-die serious and when it wasn't. And their deep preparation "in depth" was WHY they were on that stage, and a big reason why older companies like Google are still around. Having a strategy for addressing very angry customers is surely just as important.


Google SRE's: these guys are serious.

I mentioned my DataDog story to a long-time Google employee there (16 years!) and he said he had recently been involved in a similar situation with Google's Stackdriver monitoring, where the bill to a customer was $85K in a month just for Stackdriver. I asked what Google did, and he said they refunded the money, then worked with the customer to better use their tools.

There is of course no way to please all of the people all of the time. However, I genuinely feel that I was ripped off and misled by DataDog, but I have the impression that Jay and Michelle honestly view me as some jerk trying to rip them off for $1500.   And they probably hate me for telling you about my experiences.

So far, with CoCalc we charge customers in advance for any service we provide, so fewer people are surprised by a bill. Sometimes there are problems with recurring subscriptions: when a person is charged for the upcoming month of a subscription and doesn't want to continue (e.g., because their course is over), we always fully refund the charge. What does your company do? Why? I do worry that our billing model means that we miss out on potential revenue.

We all know what successful huge consumer companies like Amazon and Wal-Mart do.

Monitoring for Data Science Nerds?

I wonder if there is interest in a service like DataDog, but targeted at Data Science Nerds, built on CoCalc, which provides hosted collaborative Jupyter notebooks with Pandas, R, etc., pre-installed. This talk at PrometheusCon 2017 mostly discussed the friction that people face moving data from Prometheus to analyze using data science tools (e.g., R, Python, Jupyter). CoCalc provides a collaborative data science environment, so if we were to smooth over those points of friction, perhaps it could be useful to certain people. And much more efficient...

by William Stein (noreply@blogger.com) at September 22, 2017 12:32 PM

September 21, 2017

Continuum Analytics

Back to School: Data Science & AI Open New Doors for Students

School is back in session and now is the time students are thinking about their future. When considering options, science-oriented students are likely thinking about what is arguably today’s hottest technology: artificial intelligence (AI).

by Sheyna Webster at September 21, 2017 06:40 PM

Anaconda to Present at Strata Data Conference, New York

Anaconda, the most popular Python data science platform provider, today announced that several company experts will present two sessions and one tutorial at The Strata Data Conference on September 26 and 27 at the Javits Center in New York City.

by Sheyna Webster at September 21, 2017 02:11 PM

Matthew Rocklin

Fast GeoSpatial Analysis in Python

This work is supported by Anaconda Inc., the Data Driven Discovery Initiative from the Moore Foundation, and NASA SBIR NNX16CG43P

This work is a collaboration with Joris Van den Bossche. This blogpost builds on Joris’s EuroSciPy talk (slides) on the same topic. You can also see Joris’ blogpost on this same topic.

TL;DR:

Python’s Geospatial stack is slow. We accelerate the GeoPandas library with Cython and Dask. Cython provides 10-100x speedups. Dask gives an additional 3-4x on a multi-core laptop. Everything is still rough, please come help.

We start by reproducing a blogpost published last June, but with 30x speedups. Then we talk about how we achieved the speedup with Cython and Dask.

All code in this post is experimental. It should not be relied upon.

Experiment

In June Ravi Shekhar published a blogpost Geospatial Operations at Scale with Dask and GeoPandas in which he counted the number of rides originating from each of the official taxi zones of New York City. He read, processed, and plotted 120 million rides, performing an expensive point-in-polygon test for each ride, and produced a figure much like the following:

This took about three hours on his laptop. He used Dask and a bit of custom code to parallelize Geopandas across all of his cores. Using this combination he got close to the speed of PostGIS, but from Python.

Today, using an accelerated GeoPandas and a new dask-geopandas library, we can do the above computation in around eight minutes (half of which is reading CSV files) and so can produce a number of other interesting images with faster interaction times.

A full notebook producing these plots is available below:

The rest of this article talks about GeoPandas, Cython, and speeding up geospatial data analysis.

Background in Geospatial Data

The Shapely User Manual begins with the following passage on the utility of geospatial analysis to our society.

Deterministic spatial analysis is an important component of computational approaches to problems in agriculture, ecology, epidemiology, sociology, and many other fields. What is the surveyed perimeter/area ratio of these patches of animal habitat? Which properties in this town intersect with the 50-year flood contour from this new flooding model? What are the extents of findspots for ancient ceramic wares with maker’s marks “A” and “B”, and where do the extents overlap? What’s the path from home to office that best skirts identified zones of location based spam? These are just a few of the possible questions addressable using non-statistical spatial analysis, and more specifically, computational geometry.

Shapely is part of Python’s GeoSpatial stack which is currently composed of the following libraries:

  1. Shapely: Manages shapes like points, linestrings, and polygons. Wraps the GEOS C++ library
  2. Fiona: Handles data ingestion. Wraps the GDAL library
  3. Rasterio: Handles raster data like satellite imagery
  4. GeoPandas: Extends Pandas with a column of shapely geometries to intuitively query tables of geospatially annotated data.

These libraries provide intuitive Python wrappers around the OSGeo C/C++ libraries (GEOS, GDAL, …) which power virtually every open source geospatial library, like PostGIS, QGIS, etc.. They provide the same functionality, but are typically much slower due to how they use Python. This is acceptable for small datasets, but becomes an issue as we transition to larger and larger datasets.

In this post we focus on GeoPandas, a geospatial extension of Pandas which manages tabular data that is annotated with geometry information like points, paths, and polygons.

GeoPandas Example

GeoPandas makes it easy to load, manipulate, and plot geospatial data. For example, we can download the NYC taxi zones, load and plot them in a single line of code.

import geopandas

(geopandas.read_file('taxi_zones.shp')
          .to_crs({'init': 'epsg:4326'})
          .plot(column='borough', categorical=True))

Cities are now doing a wonderful job publishing data into the open. This provides transparency and an opportunity for civic involvement to help analyze, understand, and improve our communities. Here are a few fun geospatially-aware datasets to make you interested:

  1. Chicago Crimes from 2001 to present (one week ago)
  2. Paris Velib (bikeshare) in real time
  3. Bike lanes in New Orleans
  4. New Orleans Police Department incidents involving the use of force

Performance

Unfortunately GeoPandas is slow. This limits interactive exploration on larger datasets. For example the Chicago crimes data (the first dataset above) has seven million entries and is several gigabytes in memory. Analyzing a dataset of this size interactively with GeoPandas is not feasible today.

This slowdown is because GeoPandas wraps each geometry (like a point, line, or polygon) with a Shapely object and stores all of those objects in an object-dtype column. When we compute a GeoPandas operation on all of our shapes we just iterate over these shapes in Python. As an example, here is how one might implement a distance method in GeoPandas today.

def distance(self, other):
    # a Python-level loop: one Shapely object per row
    result = [geom.distance(other)
              for geom in self.geometry]
    return pd.Series(result)

Unfortunately this just iterates over elements in the series, each of which is an individual Shapely object. This is inefficient for two reasons:

  1. Iterating through Python objects is slow relative to iterating through those same objects in C.
  2. Shapely Python objects consume more memory than the GEOS Geometry objects that they wrap.

This results in slow performance.
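
To make the first point concrete, here is a small illustrative sketch (not GeoPandas code, just plain Shapely and NumPy): a Python-level loop over 100,000 Shapely points next to a vectorized NumPy computation on plain floats.

import numpy as np
from shapely.geometry import Point

# 100,000 Shapely objects, stored the way an object-dtype column would hold them
points = [Point(i, i) for i in range(100000)]
origin = Point(0, 0)

# Python-level loop: one interpreter round-trip per geometry
slow = [p.distance(origin) for p in points]

# for plain floats the analogous computation never leaves C
xs = np.arange(100000, dtype=float)
fast = np.hypot(xs, xs)  # distance of (i, i) from the origin

Timing the two (for example with %timeit in a notebook) shows the interpreter overhead that the Cython rewrite described below removes for geometry columns.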

Cythonizing GeoPandas

Fortunately, we’ve rewritten GeoPandas with Cython to directly loop over the underlying GEOS pointers. This provides a 10-100x speedup depending on the operation. So instead of using a Pandas object-dtype column that holds shapely objects we instead store a NumPy array of direct pointers to the GEOS objects.

Before

After

As an example, our function for distance now looks like the following Cython implementation (some liberties taken for brevity):

cpdef distance(self, other):
    cdef int n = self.size
    cdef GEOSGeometry *left_geom
    cdef GEOSGeometry *right_geom = other.__geom__  # a geometry pointer
    geometries = self._geometry_array

    with nogil:
        for idx in xrange(n):
            left_geom = <GEOSGeometry *> geometries[idx]
            if left_geom != NULL:
                # call straight into GEOS; the real implementation stores
                # each result in an output array
                distance = GEOSDistance_r(left_geom, right_geom)
            else:
                distance = NaN

For fast operations we see speedups of 100x. For slower operations we’re closer to 10x. Now these operations run at full C speed.

In his EuroSciPy talk Joris compares the performance of GeoPandas (both before and after Cython) with PostGIS, the standard geospatial plugin for the popular PostgreSQL database (original notebook with the comparison). I’m stealing some plots from his talk below:

Cythonized GeoPandas and PostGIS run at almost exactly the same speed. This is because they use the same underlying C library, GEOS. These algorithms are not particularly complex, so it is not surprising that everyone implements them in exactly the same way.

This is great. The Python GIS stack now has a full-speed library that operates as fast as any other open GIS system is likely to manage.

Problems

However, this is still a work in progress, and there is still plenty of work to do.

First, we need Pandas to track our arrays of GEOS pointers differently from how it tracks a normal integer array. This is both for usability reasons (we want to render these columns differently and don’t want users to perform numeric operations like sum and mean on them) and for stability reasons, because we need to track these pointers and release their allocated GEOSGeometry objects from memory at the appropriate times. Currently, this goal is pursued by creating a new block type, the GeometryBlock (‘blocks’ are the internal building blocks of pandas that hold the data of the different columns). This will require some changes to Pandas itself to enable custom block types (see this issue on the pandas issue tracker).

Second, data ingestion is still quite slow. This relies not on GEOS, but on GDAL/OGR, which is handled in Python today by Fiona. Fiona is optimized more for consistency and usability than for raw speed. Previously, when GeoPandas was slow, this made sense because no one was operating on particularly large datasets. Now, however, we observe that data loading is often several times more expensive than all of our manipulations, so this will probably need some effort in the future.

Third, there are some algorithms within GeoPandas that we haven’t yet Cythonized. This includes both particular features like overlay and dissolve operations as well as small components like GeoJSON output.

Finally as with any rewrite on a codebase that is not exhaustively tested (we’re trying to improve testing as we do this) there are probably several bugs that we won’t detect until some patient and forgiving user runs into them first.

Still though, all linear geospatial operations work well and are thoroughly tested. Also spatial joins (a backbone of many geospatial operations) are up and running at full speed. If you work in a non-production environment then Cythonized GeoPandas may be worth your time to investigate.

You can track future progress on this effort at geopandas/geopandas #473 which includes installation instructions.

Parallelize with Dask

Cythonizing gives us speedups in the 10x-100x range. We use a single core as effectively as is possible with the GEOS library. Now we move on to using multiple cores in parallel. This gives us an extra 3-4x on a standard 4 core laptop. We can also scale to clusters, though I’ll leave that for a future blogpost.

To parallelize we need to split apart our dataset into multiple chunks. We can do this naively by placing the first million rows in one chunk, the second million rows in another chunk, etc. or we can partition our data spatially, for example by placing all of the data for one region of our dataset in one chunk and all of the data for another region in another chunk, and so on. Both approaches are implemented in a rudimentary dask-geopandas library available on GitHub.

So just as dask-array organizes many NumPy arrays along a grid and dask-dataframe organizes many Pandas dataframes along a linear index

the dask-geopandas library organizes many GeoPandas dataframes into spatial regions. In the example below we might partition data in the city of New York into its different boroughs. Data for each borough would be handled separately by a different thread or, in a distributed situation, might live on a different machine.

This gives us two advantages:

  1. Even without geospatial partitioning, we can use many cores (or many machines) to accelerate simple operations.
  2. For spatially aware operations, like spatial joins or subselections we can engage only those parts of the parallel dataframe that we know are relevant for various parts of the computation.

However this is also expensive and not always necessary. In our initial exercise with the NYC Taxi data we didn’t do this, and still got significant speedups just from normal multicore operation.

Exercise

And so to produce the images at the top of this post we used a combination of dask.dataframe to load in CSV files, dask-geopandas to perform the spatial join, and then dask.dataframe and normal pandas to perform the actual computations. Our code looked something like the following:

import dask.dataframe as dd
import dask_geopandas as dg

df = dd.read_csv('yellow_tripdata_2015-*.csv')
gf = dg.set_geometry(df, geometry=df[['pickup_longitude', 'pickup_latitude']],
                     crs={'init': 'epsg:4326'})
# `zones` is the taxi-zones GeoDataFrame read with geopandas.read_file earlier
gf = dg.sjoin(gf, zones[['zone', 'borough', 'geometry']])
# keep passenger_count so we can count rides per zone below
full = gf[['zone', 'passenger_count', 'payment_type', 'tip_amount', 'fare_amount']]

full.to_parquet('nyc-zones.parquet')  # compute and cache result on disk
full = dd.read_parquet('nyc-zones.parquet')

And then we can do typical groupbys and joins on the more typical pandas-like data now properly annotated with zones.

result = full.passenger_count.groupby(full.zone).count().compute()
result.name = 'count'
joined = pd.merge(result.to_frame(), zones,
                  left_index=True, right_on='zone')
joined = geopandas.GeoDataFrame(joined)  # convert back for plotting
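
From there, a choropleth is one GeoPandas call away. A minimal sketch, assuming the 'count' column created above and a working matplotlib backend:

# color each taxi zone by the number of pickups counted above
ax = joined.plot(column='count', legend=True, figsize=(10, 10))
ax.set_title('Taxi pickups per NYC taxi zone')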

We’ve replaced most of Ravi’s custom analysis with a few lines of new standard code. This maxes out our CPUs when doing spatial joins. Everything here releases the GIL well and the entire computation operates in under a couple gigabytes of RAM.

Problems

The dask-geopandas project is currently a prototype. It will easily break for non-trivial applications (and indeed many trivial ones). It was designed to see how hard it would be to implement some of the trickier operations like spatial joins, repartitioning, and overlays. This is why, for example, it supports a fully distributed spatial join, but lacks simple operations like indexing. There are other longer-term issues as well.

Serialization costs are manageable, but decently high. We currently use the standard “well known binary” WKB format common in other geospatial applications but have found it to be fairly slow, which bogs down inter-process parallelism.

Similarly distributed and spatially partitioned data stores don’t seem to be common (or at least I haven’t run across them yet).

It’s not clear how dask-geopandas dataframes and normal dask dataframes should interact. It would be very convenient to reuse all of the algorithms in dask.dataframe, but the index structures of the two libraries are very different. This may require some clever software engineering on the part of the Dask developers.

Still though, these seem surmountable and generally this process has been easy so far. I suspect that we can build an intuitive and performant parallel GIS analytics system with modest effort.

The notebook for the example at the start of the blogpost shows using dask-geopandas with good results.

Conclusion

With established technologies in the PyData space like Cython and Dask we’ve been able to accelerate and scale GeoPandas operations above and beyond industry standards. However this work is still experimental and not ready for production use. This work is a bit of a side project for both Joris and Matthew and they would welcome effort from other experienced open source developers. We believe that this project can have a large social impact and are enthusiastic about pursuing it in the future. We hope that you share our enthusiasm.

September 21, 2017 12:00 AM

September 20, 2017

Titus Brown

Using the Open Science Framework to share files, for Science.

Sharing files in the 100 MB-5 GB range has been a perennial problem for my lab - we want to post them someplace central, perhaps publicly or perhaps privately, and we have several use cases that complicate things:

  • sometimes we need to share files with collaborators who are not so command-line savvy, so an easy-to-use Web interface is important;
  • other times we want to provide them as part of documentation or protocols (e.g. see my recent blog post), and provide simple download links that can be used with curl or wget;
  • when we're doing benchmarking, it's nice to be able to save the input and output files somewhere, but here we want to be able to use the command line to save (and sometimes to retrieve) the files to/from servers;
  • occasionally we need to share a collection of medium-sized files with dozens or hundreds of people for a workshop or a course, and it's nice to have a high performance always-up site for that purpose;
  • if we publish something, we'd like a location that we can "freeze" into an archive;

and of course it'd be nice to have a single system that does all of this!

Over the past decade we've used a variety of systems (figshare, AWS S3, Dropbox, mega.nz, Zenodo) and ignored a number of other systems (Box, anyone?), but none of them have really met our needs. GitHub is the closest, but GitHub itself is too restrictive with file sizes, and Git LFS was frustratingly bad the one time we engaged closely with it.

But now we have a solution that has been working out pretty well!

Enter the Open Science Framework, osf.io, which is a ...well, it's complicated, but it's a site that brings together some project management tools with user accounts and data storage.

I first encountered OSF in 2015, when we visited the Center for Open Science as part of the Open Source/Open Science meeting. COS does many things, including running various reproducibility projects, being engaged with research study preregistration and hosting a number of preprint sites. osf.io underpins many of these efforts and is an open source platform that is under continued development. They are serious about their software engineering, in it for the long run, and to the best of my understanding simply want to enable #openscience.

The osf.io site addresses essentially all of our use cases above:

  • it supports both public and private projects;
  • it has a simple and reasonably robust permissions model;
  • it is quite usable via the Web interface;
  • with only a little bit of trouble you can get download URLs for individual files;
  • we (by which I mean mostly Tim Head) have a ~beta version of a command line utility for interacting with the OSF storage system, called osfclient, which actually just works;
  • it's free! and will store individual files up to 5 GB, with no project limit;
  • they support "registering" projects which freezes your project at a particular point in time and gives you an archival location for that version (think GitHub + Zenodo style DOI generation);
  • they have a sustainability fund that guarantees that even if COS itself shuts down, the osf.io site will continue serving files for an extended period of time (last time I checked, it was 50 years);
  • you can even do things like host UCSC genome tracks on osf.io (and when you can't, because of bugs, they are really responsive and fix things quickly!)

and, coolest of all, they have "community advocates" like Natalie Meyers who are actively looking for use cases to support, which is kind of how we got involved with them.

So... give it a try!

In the past six months, we've used osf.io for --

and more.

There's a bunch more things that osf.io can in theory do that I haven't explored, but, at least for now, it's meeting our file redistribution and project file sharing needs quite handily.

Where next? I'd suggest trying it out; it only takes about 5 minutes to figure out most of the basic stuff. You can ask them questions on Twitter at @OSFramework and there are FAQs and Guides ready to hand.

--titus

p.s. I've had extended conversations with them about whether they really want the genomics community descending upon them. They said "sure!" So, you know, ...go for it. :)

by C. Titus Brown at September 20, 2017 10:00 PM

Jarrod Millman

NetworkX 2.0 released

I am happy to announce the release of NetworkX 2.0!  NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

This release supports Python 2.7 and 3.4-3.6 and is the result of over two years of work with 1212 commits and 193 merges by 86 contributors.  We have made major changes to the methods in the Multi/Di/Graph classes.  There is a  migration guide for people moving from 1.X to 2.0.
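
For example (a minimal sketch; the migration guide has the full details), one widely cited change in 2.0 is that reporting methods such as nodes and edges now return view objects rather than lists, so 1.x code that expected lists can wrap them in list():

import networkx as nx

G = nx.path_graph(4)

# NetworkX 2.0 returns views; wrap in list() where 1.x code expected lists
print(list(G.nodes))   # [0, 1, 2, 3]
print(list(G.edges))   # [(0, 1), (1, 2), (2, 3)]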

For more information, please visit our website and our gallery of examples.  Please send comments and questions to the networkx-discuss mailing list.

by Jarrod Millman (noreply@blogger.com) at September 20, 2017 01:43 PM

September 19, 2017

Continuum Analytics

How to Troubleshoot Python Software in Anaconda Distribution

Below is a question that was recently asked on StackOverflow and I decided it would be helpful to publish an answer explaining the various ways in which to troubleshoot a problem you may be having in Anaconda.

by Sheyna Webster at September 19, 2017 02:00 PM

Gaël Varoquaux

Beyond computational reproducibility, let us aim for reusability

Note

Scientific progress calls for reproducing results. Due to limited resources, this is difficult even in computational sciences. Yet, reproducibility is only a means to an end. It is not enough by itself to enable new scientific results. Rather, new discoveries must build on reuse and modification of the state of the art. As time goes by, this state of the art must be consolidated in software libraries, just as scientific knowledge has been consolidated on the bookshelves of brick-and-mortar libraries.

I am reposting an essay that I wrote on reproducible science and software libraries. The full discussion is in IEEE CIS TC Cognitive and Developmental Systems, but I’ve been told that it is hard to find.

Science is based on the ability to falsify claims. Thus, reproduction or replication of published results is central to the progress of science. Researchers failing to reproduce a result will raise questions: Are these investigators not skilled enough? Did they misunderstand the original scientific endeavor? Or is the scientific claim unfounded? For this reason, the quality of the methods description in a research paper is crucial. Beyond papers, computers —central to science in our digital era— bring the hope of automating reproduction. Indeed, computers excel at doing the same thing several times.

However, there are many challenges to computational reproducibility. To begin with, computers enable reproducibility only if all steps of a scientific study are automated. In this sense, interactive environments —productivity-boosters for many— are detrimental unless they enable easy recording and replay of the actions performed. Similarly, as a computational-science study progresses, it is crucial to keep track of changes to the corresponding data and scripts. With a software-engineering perspective, version control is the solution. It should be in the curriculum of today’s scientists. But it does not suffice. Automating a computational study is difficult. This is because it comes with a large maintenance burden: operations change rapidly, straining limited resources —processing power and storage. Saving intermediate results helps. As does devising light experiments that are easier to automate. These are crucial to the progress of science, as laboratory classes or thought experiments in physics. A software engineer would relate them to unit tests, elementary operations checked repeatedly to ensure the quality of a program.

Archiving computers in thermally-regulated nuclear-proof vaults?

Once a study is automated and published, ensuring reproducibility should be easy; just a matter of archiving the computer used, preferably in a thermally-regulated nuclear-proof vault. Maybe, dear reader, the scientist in you frowns at this solution. Indeed, studies should also be reproduced by new investigators. Hardware and software variations then get in the way. Portability, ie achieving identical results across platforms, is well-known by the software industry as being a difficult problem. It faces great hurdles due to incompatibilities in compilers, libraries, or operating systems. Beyond these issues, portability also faces numerical and statistical stability issues in scientific computing. Hiding instability problems with heavy restrictions on the environment is like rearranging deck chairs on the Titanic. While enough freezing will recover reproducibility, unstable operations cast doubt upon scientific conclusions they might lead to. Computational reproducibility is more than a software engineering challenge; it must build upon solid numerical and statistical methods.

Reproducibility is not enough. It is only a means to an end, scientific progress. Setting in stone a numerical pipeline that produces a figure is of little use to scientific thinking if it is a black box. Researchers need to understand the corresponding set of operations to relate them to modeling assumptions. New scientific discoveries will arise from varying those assumptions, or applying the methodology to new questions or new data. Future studies build upon past studies, standing on the shoulders of giants, as Isaac Newton famously wrote. In this process, published results need to be modified and adapted, not only reproduced. Enabling reuse is an important goal.

Libraries as reusable computational experiments

To a software architect, a reusable computational experiment may sound like a library. Software libraries are not only a good analogy, but also an essential tool. The demanding process of designing a good library involves isolating elementary steps, ensuring their quality, and documenting them. It is akin to the editorial work needed to assemble a textbook from the research literature.

Science should value libraries made of code, and not only bookshelves. But they are expensive to develop, and even more so to maintain. Where to set the cursor? It is clear that in physics not every experimental setup can be stored for later reuse. Costs are less tangible with computational science; but they should not be underestimated. In addition, the race to publish creates legions of studies. As an example, Google Scholar lists 28000 publications concerning compressive sensing in 2015. Arguably many are incremental and research could do with fewer publications. Yet the very nature of research is to explore new ideas, not all of which are to stay.

Identifying and consolidating major results for reuse

Computational research will best create scientific progress by identifying and consolidating the major results. It is a difficult but important task. These studies should be made reusable. Limited resources imply that the remainder will suffer from “code rot”, with results becoming harder and harder to reproduce as their software environment becomes obsolete. Libraries, curated and maintained, are the building blocks that can enable progress.


If you want to cite this essay in an academic publication, please cite the version in IEEE CIS TC Cognitive and Developmental Systems (volume 32, number 2, 2016).

by Gaël Varoquaux at September 19, 2017 10:10 AM

Matthieu Brucher

Book review: Audio Effects: Theory, Implementation and Application

I love reading books on signal processing, especially on audio signal processing. From books on using effects to a so-called reference, I still enjoy reading them, even if they are frustrating. The only one that wasn't is DAFX: Digital Audio Effects, but I haven't written a review of it!

This book came after I read a book on JUCE and started using JUCE myself. As this book uses JUCE for its plugins, I thought it would be a good one.

Discussion

The book starts with a great introduction on signal processing. It actually gives all the theory for linear time-invariant filters. Once this is set, I'm not even sure you need more DSP theory!

The second chapter dials back and tackles delays, but with a theoretical approach. It would actually have been better IMHO to give the theory bit by bit, as we don't need to know anything about the z-transform for delays. I still think that Zölzer's approach to chorus (low-frequency random noise instead of an LFO) is the best…

After that, there are two chapters on general LTI effects. First filter design, with all the equations to switch between different types of filters, and then their application to EQ, wah-wah and others. It was interesting, but lots of talk, and the code quality is… bad. Really, you don't call modulo in the DSP loop.

The next chapter is yet another mystery: amplitude modulation. The concepts are so basic that it should have been before delays… Not even talking about it.

OK, now the chapter on dynamic processing did introduce compression properly, but the authors forgot that compressors and expanders are more or less the same thing, and they confused compressors of type I (compressing above threshold) and expanders of type II (expanding below threshold) with compressors and expanders in general. The only thing you can't do is a compressor of type II, but expanders of type I do exist.

The chapter on overdrive is “cute”. I mean, waveshaping and oversampling, that’s great, but a proper book should tackle analog modeling as well in this kind of chapter…

The following chapter almost tackles vocoders. I say almost because, while it starts by talking about the theory and the short-term FFT, it completely fails at the theory of the FFT. It's quite astonishing when compared with the first chapter, but seriously, what the authors present as processing assumes a periodic signal. And even then, depending on the FFT window, the result is different. Why didn't they warn about the fact that this kind of effect is not causal? And then there are people designing effects directly in the frequency domain who completely mess it up. Seriously!

After that chapter, it was all downhill: spatial audio effects (pan, more or less, with a mention of HRTF), the Doppler effect (a rehash of the delay chapter with some more information), reverberation (which should have come right after delays), a chapter on audio production (the signal path in a console or a DAW) and then a short chapter on JUCE (so short that I'm not really sure it matters). I suppose that with my tone, you understand that the second half of the book was a big disappointment.

Conclusion

So there is still no better DSP book than DAFX, and definitely no proper book on DSP programming… Avoid this book, too expensive for what it brings to the table.

by Matt at September 19, 2017 07:05 AM

September 18, 2017

Titus Brown

Classifying genome bins using a custom reference database, part II

Note: this blog post uses a pile of alpha code, which is not systematically tested yet. It will eventually be introduced into sourmash, probably as sourmash lca. I'll put a link to a new tutorial here when that happens.

See updated discussion here - this code probably doesn't really do what you want ;).


This is Part II of a blog post on some tools to taxonomically classify a set of genome assemblies against a private reference database. Part I is here.

The commands below should be a complete walkthrough of the process.

We'll be using data sets from two different binning approaches applied to the Tara ocean data set (Sunagawa et al., 2015). The two approaches were posted to bioRxiv as Tully et al., 2017 and Delmont et al., 2017. Below, I will refer to them as 'Tully' and 'Delmont'.

Installing sourmash

The below commands use a bunch of commands from the sourmash Python API, so you'll need to have sourmash installed. You can do that by following the beginning of the sourmash tutorial, or, if you are comfortable installing Python packages, you can set up your own virtualenv and go from there using the following commands (which do not require root privileges):

python3.5 -m venv ~/py3
. ~/py3/bin/activate
pip install -U pip
pip install -U Cython
pip install -U jupyter jupyter_client ipython pandas matplotlib scipy scikit-learn khmer

pip install https://github.com/dib-lab/sourmash/archive/master.zip

Grabbing the 2017-sourmash-lca repository

Most of the scripts below are in the 2017-sourmash-lca github repository. So, after changing to a workspace directory, clone the repo:

git clone https://github.com/ctb/2017-sourmash-lca

Getting the data

I've selected 100 genomes at random from both tully and delmont (although all the commands below should work with the full data sets, too). Below, we'll grab the data.

cd 2017-sourmash-lca/
curl -L https://osf.io/73yfz/download?version=1 -o subsample.tar.gz
tar xzf subsample.tar.gz
cd subsample

After you get and unpack the data, the command

ls -FC

should show you delmont/ and tully/ directories with 100 genomes in each.

Computing sourmash signatures for your "reference" genomes

Let's use the 'delmont' genomes as the reference genomes, against which you want to classify the 'tully' genomes; for that, we'll have to build a least-common-ancestor database for all the genomes in delmont/.

First, we need to compute sourmash signatures for the delmont genomes.

cd delmont
sourmash compute -k 21,31,51 --scaled 10000 --name-from-first *.fa.gz

This will take about 2 minutes.

(Note that sourmash compute will autodetect gzipped and bzipped files, as well as straight FASTA or FASTQ - you don't need to name the files in any special way.)

You should now have .sig files for every .fa.gz file in your delmont/ directory:

ls -1 *.fa.gz | wc -l
ls -1 *.sig | wc -l

should both yield 100.

These signature files contain the k-mers to which we will link taxonomic information.
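
As a quick sanity check (an optional sketch, mirroring the sourmash API used by the renaming script further below), you can load the signatures back and print the names they were given:

import glob
import sourmash_lib.signature

# print the name stored in each signature file in the current directory
for filename in glob.glob('*.sig'):
    for sig in sourmash_lib.signature.load_signatures(filename):
        print(filename, sig.d['name'])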

Connecting genome names to NCBI taxonomic IDs

Next, we need to connect each signature to an NCBI taxonomic ID; the next step (Building a least-common-ancestor database) requires a CSV file with the following information:

genomename,taxid,lineage

here, genome_name needs to be the first word (space-separated) of the first contig in each of the genome files above, and taxid needs to be the integer taxonomic ID from NCBI -- i.e. the number at the end of NCBI taxonomy database URLs. (It doesn't need to point to a specific species or strain, if the species or strain hasn't been entered into the database yet.)

If you're like most bioinformaticians, you've probably developed your own custom way of doing taxonomy, so getting it into proper shape for this step might involve a bit of scripting.
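
For instance, here is a minimal sketch of scripting that CSV with Python's csv module; the genome/taxid pairing is purely illustrative (1236 is the NCBI taxid for the class Gammaproteobacteria), not the real assignment for that bin, and the header row just follows the format line shown above:

import csv

# hypothetical assignments -- replace with your own (genome, taxid, lineage) triples
rows = [
    ('TARA_ANE_MAG_00011', 1236, 'Bacteria;Proteobacteria;Gammaproteobacteria'),
]

with open('my_taxids.csv', 'w') as fp:
    writer = csv.writer(fp)
    writer.writerow(['genomename', 'taxid', 'lineage'])  # header, per the format above
    writer.writerows(rows)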

For the Delmont et al. data, the predicted taxonomy of each genome bin is provided in Supplemental Table 3 of Delmont et al., which I saved as a CSV file here. However, we face another problem here, which is that the contigs are named things like TARA_ANE_MAG_00011_00000000001, but that's not what's in that spreadsheet. So to connect the spreadsheet to the signatures, we need to "fix" each signature name.

(Where did these names come from? The argument --name-from-first to sourmash compute tells sourmash to take the name of the signature from the first sequence in the FASTA file. Hey, it's better than setting it manually for 100 individual files!)

"Fixing" the names of the signatures

The following Python script will do the trick; it walks through every signature file provided on the command line, and updates the name fields to have the format that's in the spreadsheet.

#! /usr/bin/env python
import sourmash_lib, sourmash_lib.signature
import sys

for filename in sys.argv[1:]:
    print('...', filename)
    outsigs = []
    for sig in sourmash_lib.signature.load_signatures(filename):
        # keep only the first four underscore-separated fields of the name,
        # e.g. TARA_ANE_MAG_00011_00000000001 -> TARA_ANE_MAG_00011
        name = sig.d['name'].split('_')[:4]
        sig.d['name'] = "_".join(name)
        outsigs.append(sig)

    with open(filename + '.fixed', 'wt') as fp:
        sourmash_lib.signature.save_signatures(outsigs, fp)

This script is available as delmont-fix-sig-names.py in the directory above the delmont directory, and to run it, execute:

python ../delmont-fix-sig-names.py *.sig

This will produce a bunch of .sig.fixed files, and you can rename them to .sig by doing the following:

for i in *.sig
do
    mv $i.fixed $i
done

OK! Now we've got the names in our signature files fixed up properly. Next, let's correlate them with the taxonomic IDs published in the paper!

Running a custom script to get taxids into the spreadsheet.

Next, we need to run the script to produce the CSV file that maps signature names to taxonomic IDs. This is checked into the 2017-sourmash-lca repository as the script get_taxid_tara_delmont.py. This script is a hacky mess that searches the downloaded NCBI files for the names in the spreadsheet, and does some disambiguation if there are duplicate names.

For this, you'll need the names.dmp and nodes.dmp files from the NCBI taxonomy dump database, so download it, unpack the two files into a subdirectory, and then gzip them so they take up a bit less space.

cd ../
mkdir db/
cd db/
curl -O -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xzf taxdump.tar.gz names.dmp nodes.dmp
gzip names.dmp nodes.dmp
rm taxdump.tar.gz
cd ../

You'll also need the taxonomy spreadsheet provided with Delmont et al.; let's put it in the subsample/ directory above delmont/, which should be our current working directory:

curl -O -L https://github.com/ctb/2017-sourmash-lca/raw/master/tara-delmont-SuppTable3.csv

Now, run the script to convert the taxonomy in the Delmont et al. paper into NCBI taxids:

python ../get_taxid_tara_delmont.py db/names.dmp.gz db/nodes.dmp.gz \
    tara-delmont-SuppTable3.csv tara_delmont_taxids.csv

It will take a few minutes to run, but it should produce a file named tara_delmont_taxids.csv that should look preeeeeeetty similar to the one in github. This is the file you'll need for the next step.

Building a least-common-ancestor database

Now we get to put all this information together!

To build the least-common-ancestor (LCA) database, we'll need to specify: a k-mer size (here 21), a scaled value (here 10,000 - more on what this does in another blog post), the output name of the LCA database (here delmont.k21.lca), the taxonomic spreadsheet we just created (tara_delmont_taxids.csv), the NCBI nodes dump table which we downloaded above (db/nodes.dmp.gz), and the signatures -- here, all in the subdirectory delmont, and found by specifying --traverse-directory (so the extract script will go find all the signatures under delmont/).

Finally, we'll specify an LCA metadata database delmont.lca.json that ties together all of the files and k-mer size/scaled values for later use.

To run the extract script that produces the LCA database, then, we have this ungainly command:

../extract.py -k 21 --scaled 10000 \
    --traverse-directory \
    db/delmont.k21.lca tara_delmont_taxids.csv \
    db/nodes.dmp.gz delmont \
    --lca-json db/delmont.lca.json

If that worked without complaint, you should see some output text talking about how it "found root 2431 times".

Run extract.py two more times, once for each k-mer value we used in building the signatures above with sourmash compute (k=31 and k=51):

../extract.py -k 31 --scaled 10000 \
    --traverse-directory \
    db/delmont.k31.lca tara_delmont_taxids.csv \
    db/nodes.dmp.gz delmont \
    --lca-json db/delmont.lca.json

../extract.py -k 51 --scaled 10000 \
    --traverse-directory \
    db/delmont.k51.lca tara_delmont_taxids.csv \
    db/nodes.dmp.gz delmont \
    --lca-json db/delmont.lca.json

At the end of this, you should have a directory db/ containing three *.lca files, along with names.dmp.gz and nodes.dmp.gz and two cache files. You will also have a delmont.lca.json file. Collectively, the contents of db/ are self-contained and can be packaged up and distributed.

Classifying a collection of genome bins with your newly constructed reference database.

The original goal of this two-part blog post was to be able to classify new genome bins against existing private collections of genomes, not just Genbank. So let's do that!

Let's suppose your new collection of genome bins is in the directory tully/. First, compute signatures for all of them:

cd tully
sourmash compute -k 21,31,51 --scaled 10000 --name-from-first *.fna.gz
cd ../

and now... classify!

To do this, we will use the classify-genome-sigs.py script that is part of the github.com/ctb/2017-tara-binning repository. Rather than grabbing that entire repository, let's just download the one script --

curl -L -O https://github.com/ctb/2017-tara-binning/raw/master/classify-genome-sigs.py
chmod +x classify-genome-sigs.py
mv classify-genome-sigs.py ../

and now run it:

../classify-genome-sigs.py --lca db/delmont.lca.json -k 21 tully

This uses our just-created LCA database, delmont.lca.json, with a k-mer size of 21, to classify all of the signatures in the tully/ subdirectory.

You should see the following output:

loading LCA database from db/delmont.lca.json
loading taxonomic nodes from: db/nodes.dmp.gz
loading taxonomic names from: db/names.dmp.gz
loading k-mer DB from: delmont.k21.lca
loading all signatures in directory: tully
saving to cache: tully.pickle
...loaded 100 signatures at k=21

classified sourmash signatures in directory: 'tully'
LCA databases used: delmont.lca.json

total signatures found: 100
no classification information: 95

could classify 5
of those, 0 disagree at some rank.

number classified unambiguously, by lowest classification rank:
        class: 3
        family: 1
        genus: 1

disagreements by rank:

classification output as CSV, here: tully.taxonomy.csv

which indicates that only 5 out of 100 of the signatures could be classified -- more detail on those 5 can be seen by looking in tully.taxonomy.csv:

grep found tully.taxonomy.csv

which should yield

TOBG_SP-33.fna.gz,2742,found,genus,Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter
TOBG_EAC-53.fna.gz,246874,found,family,Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Cryomorphaceae
TOBG_MED-644.fna.gz,28211,found,class,Bacteria;Proteobacteria;Alphaproteobacteria
TOBG_CPC-18.fna.gz,1236,found,class,Bacteria;Proteobacteria;Gammaproteobacteria
TOBG_SP-120.fna.gz,1236,found,class,Bacteria;Proteobacteria;Gammaproteobacteria

(You can look at the first part of this blog post to see how to interpret the 3rd column.)
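
If you want a quick summary instead of grepping, pandas works fine on the output CSV. A sketch, assuming the file has no header row and the five columns shown above (filename, taxid, status, rank, lineage); adjust if the real file differs:

import pandas as pd

cols = ['filename', 'taxid', 'status', 'rank', 'lineage']
df = pd.read_csv('tully.taxonomy.csv', names=cols)

# tally the classified signatures by their lowest classification rank
print(df[df.status == 'found'].groupby('rank').size())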

Using the genbank LCA, too.

We've already prepared the genbank LCA database (which takes about 2 hours total), and you can download it and use it, along with the delmont data, to classify the tully sequences; all you need to do is specify multiple lca.json databases on the command line.

To do this, first download and unpack the genbank file:

curl -L https://osf.io/zfmbd/download?version=1 -o genbank-lca-2017.08.26.tar.gz
cd db/
tar xzf ../genbank-lca-2017.08.26.tar.gz
cd ../

and then re-run classify-genome-sigs.py:

../classify-genome-sigs.py \
    --lca db/delmont.lca.json db/genbank.lca.json \
    -k 21 tully

-- now you should see that you can classify 56 genomes, 51 of them unambiguously.

Using other k-mer sizes than 21

Both the genbank LCA database and the one we calculated from Delmont et al. were built for three k-mer sizes: 21, 31, and 51. So you can set the -k parameter to any of those, e.g. k=51:

../classify-genome-sigs.py \
    --lca db/delmont.lca.json db/genbank.lca.json \
    -k 51 tully -o tully-k51.taxonomy.csv

Bonus: plotting your new classification using metacodeR

Let's generate a plot. I'm going to use metacodeR but Taylor Reiter from our lab has some other options here.

Note: you'll need to install metacoder in R for the below to work.

First, generate a classification:

../classify-genome-sigs.py \
    --lca db/delmont.lca.json db/genbank.lca.json \
    -k 21 tully

Grab two scripts, plot-metacoder.R and load_classify_csv.R:

curl -O -L https://github.com/ctb/2017-tara-binning/raw/master/load_classify_csv.R
curl -O -L https://github.com/ctb/2017-tara-binning/raw/master/plot-metacoder.R

and now

Rscript plot-metacoder.R tully.taxonomy.csv tully.pdf

should produce a plot entitled 'tully.pdf', which looks like the below:

A pretty metacodeR picture, again

Appendix: a recap

To build an LCA database for a private collection of genomes:

  1. Compute sourmash signatures for your genomes.
  2. Link those signatures to their taxonomic IDs in a CSV file.
  3. Run extract.py to build the database.

To classify another set of genomes with an LCA database:

  1. Compute sourmash signatures for your genomes.
  2. Run classify-genome-sigs.py on your new signatures, using one or more LCA databases.

And that's all, folks!

by C. Titus Brown at September 18, 2017 10:00 PM

numfocus

The Econ-ARK joins NumFOCUS Sponsored Projects

​NumFOCUS is pleased to announce the addition of the Econ-ARK to our fiscally sponsored projects. As a complement to the thriving QuantEcon project (also a NumFOCUS sponsored project), the Econ-ARK is creating an open-source resource containing the tools needed to understand how diversity across economic agents (in preferences, circumstances, knowledge, etc) leads to richer and […]

by NumFOCUS Staff at September 18, 2017 05:06 PM

Matthew Rocklin

Dask on HPC - Initial Work

This work is supported by Anaconda Inc. and the NSF EarthCube program.

We recently announced a collaboration between the National Center for Atmospheric Research (NCAR), Columbia University, and Anaconda Inc to accelerate the analysis of atmospheric and oceanographic data on high performance computers (HPC) with XArray and Dask. The full text of the proposed work is available here. We are very grateful to the NSF EarthCube program for funding this work, which feels particularly relevant today in the wake (and continued threat) of the major storms Harvey, Irma, and Jose.

This is a collaboration of academic scientists (Columbia), infrastructure stewards (NCAR), and software developers (Anaconda and Columbia and NCAR) to scale current workflows with XArray and Jupyter onto big-iron HPC systems and peta-scale datasets. In the first week after the grant closed a few of us focused on the quickest path to get science groups up and running with XArray, Dask, and Jupyter on these HPC systems. This blogpost details what we achieved and some of the new challenges that we’ve found in that first week. We hope to follow this blogpost with many more to come in the future. Today we cover the following topics:

  1. Deploying Dask with MPI
  2. Interactive deployments on a batch job scheduler, in this case PBS
  3. The virtues of JupyterLab in a remote system
  4. Network performance and 3GB/s infiniband
  5. Modernizing XArray’s interactions with Dask’s distributed scheduler

A video walkthrough deploying Dask on XArray on an HPC system is available on YouTube and instructions for atmospheric scientists with access to the Cheyenne Supercomputer is available here.

Now lets start with technical issues:

Deploying Dask with MPI

HPC systems use job schedulers like SGE, SLURM, PBS, LSF, and others. Dask has been deployed on all of these systems before either by academic groups or financial companies. However every time we do this it’s a little different and generally tailored to a particular cluster.

We wanted to make something more general. This started out as a GitHub issue on PBS scripts that tried to make a simple common template that people could copy-and-modify. Unfortunately, there were significant challenges with this. HPC systems and their job schedulers seem to focus on, and easily support, only two common use cases:

  1. Embarrassingly parallel “run this script 1000 times” jobs. This is too simple for what we have to do.
  2. MPI jobs. This seemed like overkill, but is the approach that we ended up taking.

Deploying Dask is somewhere between these two. It falls into the master-slave pattern (or perhaps more appropriately coordinator-workers). We ended up building an MPI4Py program that launches Dask. MPI is well supported, and more importantly consistently supported, by all HPC job schedulers, so depending on MPI provides a level of stability across machines. Now dask.distributed ships with a new dask-mpi executable:

mpirun --np 4 dask-mpi

To be clear, Dask isn’t using MPI for inter-process communication. It’s still using TCP. We’re just using MPI to launch a scheduler and several workers and hook them all together. In pseudocode the dask-mpi executable looks something like this:

# (pseudocode) rank 0 becomes the Dask scheduler, every other MPI rank a worker
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    start_dask_scheduler()
else:
    start_dask_worker()

Socially this is useful because every cluster management team knows how to support MPI, so anyone with access to such a cluster has someone they can ask for help. We’ve successfully translated the question “How do I start Dask?” to the question “How do I run this MPI program?” which is a question that the technical staff at supercomputer facilities are generally much better equipped to handle.

Working Interactively on a Batch Scheduler

Our collaboration is focused on interactive analysis of big datasets. This means that people expect to open up Jupyter notebooks, connect to clusters of many machines, and compute on those machines while they sit at their computer.

Unfortunately most job schedulers were designed for batch scheduling. They will try to run your job quickly, but don’t mind waiting for a few hours for a nice set of machines on the supercomputer to open up. As you ask for more time and more machines, waiting times can increase drastically. For most MPI jobs this is fine because people aren’t expecting to get a result right away and they’re certainly not interacting with the program, but in our case we really do want some results right away, even if they’re only part of what we asked for.

Handling this problem long term will require both technical work and policy decisions. In the short term we take advantage of two facts:

  1. Many small jobs can start more quickly than a few large ones. These take advantage of holes in the schedule that are too small to be used by larger jobs.
  2. Dask doesn’t need to be started all at once. Workers can come and go.

And so I find that if I ask for several single machine jobs I can easily cobble together a sizable cluster that starts very quickly. In practice this looks like the following:

$ qsub start-dask.sh      # only ask for one machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine

Our main job has a wall time of about an hour. The workers have shorter wall times. They can come and go as needed throughout the computation as our computational needs change.
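
From the notebook side, nothing special is needed to take advantage of this elasticity. A small sketch, assuming (as in the script below) that the scheduler wrote its connection information to scheduler.json:

from dask.distributed import Client

# connect to the scheduler started by the main batch job
client = Client(scheduler_file='scheduler.json')

# workers submitted later with qsub show up here as their jobs start
print(len(client.scheduler_info()['workers']), 'workers connected')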

Jupyter Lab and Web Frontends

Our scientific collaborators enjoy building Jupyter notebooks of their work. This allows them to manage their code, scientific thoughts, and visual outputs all at once and for them serves as an artifact that they can share with their scientific teams and collaborators. To help them with this we start a Jupyter server on the same machine in their allocation that is running the Dask scheduler. We then provide them with SSH-tunneling lines that they can copy-and-paste to get access to the Jupyter server from their personal computer.

We’ve been using the new Jupyter Lab rather than the classic notebook. This is especially convenient for us because it provides much of the interactive experience that they lost by not working on their local machine. They get a file browser, terminals, easy visualization of textfiles and so on without having to repeatedly SSH into the HPC system. We get all of this functionality on a single connection and with an intuitive Jupyter interface.

For now we give them a script to set all of this up. It starts Jupyter Lab using Dask and then prints out the SSH-tunneling line.

from dask.distributed import Client
client = Client(scheduler_file='scheduler.json')

import socket
host = client.run_on_scheduler(socket.gethostname)  # the node running the scheduler

def start_jlab(dask_scheduler):
    # launch Jupyter Lab on the scheduler's node and keep a handle to the process
    import subprocess
    proc = subprocess.Popen(['jupyter', 'lab', '--ip', host, '--no-browser'])
    dask_scheduler.jlab_proc = proc

client.run_on_scheduler(start_jlab)

print("ssh -N -L 8787:%s:8787 -L 8888:%s:8888 -L 8789:%s:8789 cheyenne.ucar.edu" % (host, host, host))

Long term we would like to switch to an entirely point-and-click interface (perhaps something like JupyterHub), but this will require additional thinking about deploying distributed resources along with the Jupyter server instance.

Network Performance on Infiniband

The intended computations move several terabytes across the cluster. On this cluster Dask gets about 1GB/s simultaneous read/write network bandwidth per machine using the high-speed Infiniband network. For any commodity or cloud-based system this is very fast (about 10x faster than what I observe on Amazon). However for a super-computer this is only about 30% of what’s possible (see hardware specs).

I suspect that this is due to byte-handling in Tornado, the networking library that Dask uses under the hood. The following image shows the diagnostic dashboard for one worker after a communication-heavy workload. We see 1GB/s for both read and write. We also see 100% CPU usage.

Network performance is a big question for HPC users looking at Dask. If we can get near MPI bandwidth then that may help to reduce concerns for this performance-oriented community.

How do I use Infiniband network with Dask?

XArray and Dask.distributed

XArray was the first major project to use Dask internally. This early integration was critical to prove out Dask’s internals with user feedback. However it also means that some parts of XArray were designed well before some of the newer parts of Dask, notably the asynchronous distributed scheduling features.

XArray can still use Dask on a distributed cluster, but only with the subset of features that are also available with the single machine scheduler. This means that persisting data in distributed RAM, parallel debugging, publishing shared datasets, and so on all require significantly more work today with XArray than they should.

To address this we plan to update XArray to follow a newly proposed Dask interface. This is complex enough to handle all Dask scheduling features, but light weight enough not to actually require any dependence on the Dask library itself. (Work by Jim Crist.)

We will also eventually need to look at reducing overhead for inspecting several NetCDF files, but we haven’t yet run into this, so I plan to wait.

Future Work

We think we’re at a decent point for scientific users to start playing with the system. We have a Getting Started with Dask on Cheyenne wiki page that our first set of guinea pig users have successfully run through without much trouble. We’ve also identified a number of issues that the software developers can work on while the scientific teams spin up.

  1. Zero copy Tornado writes to improve network bandwidth
  2. Enable Dask.distributed features in XArray by formalizing dask’s expected interface
  3. Dynamic deployments on batch job schedulers

We would love to engage other collaborators throughout this process. If you or your group work on related problems we would love to hear from you. This grant isn’t just about serving the scientific needs of researchers at Columbia and NCAR, but about building long-term systems that can benefit the entire atmospheric and oceanographic community. Please engage on the Pangeo GitHub issue tracker.

September 18, 2017 12:00 AM

September 15, 2017

numfocus

2017 GSoC Students Complete Work on NumFOCUS Projects

The summer has come to an end and so have the internships of our GSoC students. Ten students have finished their projects successfully and now continue to contribute to our open source community! FEniCS: Ivan Yashchuk has implemented quadrilateral and hexahedral meshes for finite element methods. Michal Habera has added support for the XDMF format.   […]

by NumFOCUS Staff at September 15, 2017 09:12 PM

September 14, 2017

Enthought

Webinar: Python for MATLAB Users: What You Need To Know

What:  A guided walkthrough and Q&A about how to migrate from MATLAB® to Python with Enthought Lead Instructor, Dr. Alexandre Chabot-Leclerc.

Who Should Watch: MATLAB® users who are considering migrating to Python, either partially or completely.

View the Webinar


Python has a lot of momentum. Many high profile projects use it and more are migrating to it all the time. Why? One reason is that Python is free, but more importantly, it is because Python has a thriving ecosystem of packages that allow developers to work faster and more efficiently. They can go from prototyping to production to scale on hardware ranging from a Raspberry Pi (or maybe micro controller) to a cluster, all using the same language. A large part of Python’s growth is driven by its excellent support for work in the fields of science, engineering, machine learning, and data science.

You and your organization might be thinking about migrating from MATLAB to Python to get access to this ecosystem and increase your productivity, but you might also have some outstanding questions and concerns, such as: How do I get started? Will any of my knowledge transfer? How different are Python and MATLAB? How long will it take me to become proficient? Is it too big of a shift? Can I transition gradually, or do I have to do it all at once? These are all excellent questions.

We know people put a lot of thought into the tools they select and that changing platforms is a big deal. We created this webinar to help you make the right choice.

In this webinar, we’ll give you the key information and insight you need to quickly evaluate whether Python is the right choice for you, your team, and your organization, including:

  • How to get started
  • What you need in order to replicate the MATLAB experience
  • Important conceptual differences between MATLAB and Python (a short illustrative sketch follows this list)
  • Important similarities between MATLAB and Python: What MATLAB knowledge will transfer
  • Strategies for converting existing MATLAB code to Python
  • How to accelerate your transition
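
As a small taste of the conceptual differences involved, here is an illustrative sketch (not taken from the webinar itself) comparing a few common MATLAB idioms with their NumPy equivalents:

import numpy as np

# MATLAB: A = [1 2 3; 4 5 6];  A(1, 2)  -> 2   (1-based indexing)
A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(A[0, 1])      # -> 2, NumPy indexing is 0-based

# MATLAB: A .* A  -> elementwise product;  A * A' -> matrix product
print(A * A)        # elementwise product
print(A @ A.T)      # matrix product (Python 3.5+ '@' operator)

# MATLAB: B = A(1, :); B(1) = 99;  leaves A unchanged (copy semantics)
B = A[0, :]         # NumPy slices are views, so modifying B also modifies A
B[0] = 99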

View the Webinar


Presenter: Dr. Alexandre Chabot-Leclerc, Enthought Lead Instructor

Ph.D., Electrical Engineering, Technical University of Denmark

 


Python for Scientists & Engineers Training: The Quick Start Approach to Turbocharging Your Work

If you are tired of running repeatable processes manually and want to (semi-)automate them to increase your throughput and decrease pilot error; if you want to spend less time debugging code and more time writing clean code in the first place; or if you are simply tired of using a multitude of tools and languages for different parts of a task and want to replace them with one comprehensive language, then Enthought’s Python for Scientists and Engineers is definitely for you!

This class has been particularly appealing to people who have been using other tools like MATLAB or even Excel for their computational work and want to start applying their skills using the Python toolset. And it’s no wonder: Python has been identified as the most popular coding language for five years in a row, for good reason.

One reason for its broad popularity is its efficiency and ease of use. Many people consider Python more fun to work in than other languages (and we agree!). Another reason for its popularity among scientists, engineers, and analysts in particular is Python’s support for rapid application development and its extensive (and growing) open-source library of powerful tools for preparing, visualizing, analyzing, and modeling data, as well as for simulation.

Python is also an extraordinarily comprehensive toolset: it supports everything from interactive analysis to automation, software engineering, and web app development within a single language. It also plays very well with other languages like C/C++ or FORTRAN, so you can continue leveraging your existing code libraries written in those languages.
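
For instance, one common way to reuse an existing compiled routine from Python is the standard-library ctypes module; the library name and function signature below are hypothetical placeholders, shown only to sketch the pattern.

import ctypes
import numpy as np

# Hypothetical legacy C/Fortran library exposing: double total(double *x, int n)
lib = ctypes.CDLL('./liblegacy.so')
lib.total.restype = ctypes.c_double
lib.total.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int]

x = np.arange(10, dtype=np.float64)
result = lib.total(x.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), x.size)
print(result)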

Many organizations are moving to Python so they can consolidate all of their technical work streams under a single comprehensive toolset. In the first part of this class we’ll give you the fundamentals you need to switch from another language to Python, and then we’ll cover the core tools that will enable you to do in Python what you were doing with other tools, only faster and better!

Additional Resources

Upcoming Open Python for Scientists & Engineers Sessions:

Washington, DC, Sept 25-29, 2017
Los Alamos, NM, Oct 2-6, 2017
Cambridge, UK, Oct 16-20, 2017
San Diego, CA, Oct 30-Nov 3, 2017
Albuquerque, NM, Nov 13-17, 2017
Los Alamos, NM, Dec 4-8, 2017
Austin, TX, Dec 11-15, 2017

Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.

Learn More

Download Enthought’s MATLAB to Python White Paper

Additional Webinars in the Training Series:

Python for Scientists & Engineers: A Tour of Enthought’s Professional Technical Training Course

Python for Data Science: A Tour of Enthought’s Professional Technical Training Course

Python for Professionals: The Complete Guide to Enthought’s Technical Training Courses

An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop

Download Enthought’s Machine Learning with Python’s Scikit-Learn Cheat Sheets

The post Webinar: Python for MATLAB Users: What You Need To Know appeared first on Enthought Blog.

by admin at September 14, 2017 06:00 PM

September 12, 2017

Bruno Pinho

Integrating & Exploring 2: Predicting Spatial Data with Machine Learning

Using Machine Learning (ML) algorithms to predict airborne geophysics. A simple but powerful solution that uses Rotation Gradients to add complexity to the model, with an Automated Machine Learning (AutoML) implementation.

by Bruno Ruas de Pinho at September 12, 2017 01:00 AM