October 21, 2017

Titus Brown

Grokking "custom" taxonomies.

(Some more Saturday/Sunday Morning Bioinformatics...)

A while back I posted some hacky code to classify genome bins using a custom reference database. Much to my delight, Sarah Stevens (a grad student in Trina McMahon's lab) actually tried using it! And of course it didn't do what she wanted.

The problem is that all of my least-common-ancestor scripts rely utterly and completely on the taxonomic IDs from NCBI, but anyone who wants to do classification of new genome bins against their own old genome bins doesn't have NCBI taxonomic IDs. (This should have been obvious to me but I was avoiding thinking about it, because I have a whole nice pile of Python code that has to change. Ugh.)

So basically my code ends up doing a classification with custom genomes that is only useful insofar as the custom genomes match NCBI's taxonomy. NOT SO USEFUL.

I'd already noticed that my code for getting taxonomic IDs from the Tara binned genomes barfed on lots of assignments (see script links after "used the posted lineages to assign NCBI taxonomic IDs" in this blog post). This was because the Tully and Delmont genome bins contained a lot of novel species. But I didn't know what to do about that so I punted.

Unfortunately I can punt no more. So I'm working on fixing that, somehow. Today's blog post is about the fallout from some initial work.


Statement of Problem, revisited

We want to classify NEW genome bins using Genbank as well as a custom collection of OLD genome bins that contains both NCBI taxonomy (where available) and custom taxonomic elements.

My trial solution

The approach I took was to build custom extensions to the NCBI taxonomy by rooting the custom taxonomies to NCBI taxids.

Basically, given a custom spreadsheet like this that contains unique identifiers and a taxonomic lineage, I walk through each row, and for each row, find the least common ancestor that shares a name and rank with NCBI. Then the taxonomy for each row consists of triples (rank, name, taxid) where the taxid below the least common ancestor is set to None.

This yields assignments like this, one for each row.

[('superkingdom', 'Bacteria', 2),
 ('phylum', 'Proteobacteria', 1224),
 ('class', 'Alphaproteobacteria', 28211),
 ('order', 'unclassifiedAlphaproteobacteria', None),
 ('family', 'SAR116cluster', None)]

So far so good!

You can see the code here if you are interested.
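
If you're curious how the rooting works, here is a minimal sketch of the idea. This is not the actual parse-free-tax.py code; the helper name and the (rank, name) -> taxid lookup dict are hypothetical:

def root_to_ncbi(lineage, ncbi_taxids):
    """Attach NCBI taxids to a custom lineage, stopping at the last name NCBI knows.

    lineage:     list of (rank, name) pairs, most general rank first.
    ncbi_taxids: dict mapping (rank, name) -> taxid, loaded from names.dmp/nodes.dmp.
    """
    rooted = []
    in_ncbi = True
    for rank, name in lineage:
        taxid = ncbi_taxids.get((rank, name)) if in_ncbi else None
        if taxid is None:
            in_ncbi = False   # everything below the last NCBI match gets taxid=None
        rooted.append((rank, name, taxid))
    return rooted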

Problems and challenges with my trial solution

So I ran this on the Delmont and Tully spreadsheets, and by and large I could compute NCBI-rooted custom taxonomies for most of the rows. (You can try it for yourself if you like; install stuff, then run

python parse-free-tax.py db/genbank/{names,nodes}.dmp.gz tara-delmont-SuppTable3.csv > tara-delmont-ncbi-disagree.txt
python parse-free-tax.py db/genbank/{names,nodes}.dmp.gz tara-tully-Table4.csv > tara-tully-ncbi-disagree.txt

in the 2017-sourmash-lca directory.)

But! There were a few places where there were apparent disagreements between what was in NCBI and what was in the spreadsheet.

Other than typos and parsing errors, these disagreements came in two flavors.

First, there were things that looked like minor alterations in spelling - e.g., below, the custom taxonomy file had "Flavobacteria", while NCBI spells it "Flavobacteriia".

confusing lineage #15
    CSV:  Bacteria, Bacteroidetes, Flavobacteria, Flavobacteriales, Flavobacteriaceae, Xanthomarina
    NCBI: Bacteria, Bacteroidetes, Flavobacteriia, Flavobacteriales, Flavobacteriaceae, Xanthomarina
(2 rows in spreadsheet)

Second, there were just plain ol' disagreements, e.g. below the custom taxonomy file says that genus "Haliea" belongs under order "Alteromonadales_3" while NCBI claims that it belongs under order "Cellvibrionales". This could be a situation where the researchers reused the genus name "Haliea" under a new order, OR (and I think this is more likely) perhaps the researchers are correcting the NCBI taxonomy.

confusing lineage #12
    CSV:  Bacteria, Proteobacteria, Gammaproteobacteria, Alteromonadales_3, Alteromonadaceae, Haliea
    NCBI: Bacteria, Proteobacteria, Gammaproteobacteria, Cellvibrionales, Halieaceae, Haliea
(2 rows in spreadsheet)

See tara-tully-ncbi-disagree.txt and tara-delmont-ncbi-disagree.txt for the full list.

I have no conclusion.

I'm not sure what to do at this point. For now, I'll plan to override the NCBI taxonomy with the researcher-specific taxonomy, but these disagreements seem like useful diagnostic output. It would be interesting to know the root cause, though; do these names come from using CheckM to assign taxonomy, as in Delmont et al's workflow? And if so, can I resolve a lot of this by just using that taxonomy? :) Inquiring minds must know!!

Regardless, I'm enthusiastic about the bit where my internal error checking in the script catches these issues. (And hopefully I'll never have to write that code again. It's very unpleasant.)

Onwards! Upwards!

--titus

by C. Titus Brown at October 21, 2017 10:00 PM

Bruno Pinho

A simple guide to calculate the Hoek-Brown Failure Criteria in Python

The presence of discontinuities usually reduces the strength of a rock mass below that of the intact rock. Engineers and geologists must take this effect into account to predict failure in slope faces and tunnel excavations. In this post, I will take you through the process of calculating the Hoek-Brown Failure Criteria in Python.
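
As a taste of what the full post walks through, here is a minimal sketch of the generalized Hoek-Brown criterion (2002 edition). The parameter values below are illustrative placeholders, not values taken from the tutorial:

import numpy as np

def hoek_brown(sigma3, sigma_ci=100.0, GSI=75.0, mi=10.0, D=0.0):
    """Major principal stress at failure for a given confining stress sigma3 (MPa)."""
    mb = mi * np.exp((GSI - 100.0) / (28.0 - 14.0 * D))          # reduced material constant
    s = np.exp((GSI - 100.0) / (9.0 - 3.0 * D))                   # rock mass constant
    a = 0.5 + (np.exp(-GSI / 15.0) - np.exp(-20.0 / 3.0)) / 6.0   # curvature exponent
    return sigma3 + sigma_ci * (mb * sigma3 / sigma_ci + s) ** a

sigma3 = np.linspace(0.0, 10.0, 50)   # confining stresses in MPa (illustrative)
sigma1 = hoek_brown(sigma3)           # failure envelope to plot or compare against data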

by Bruno Ruas de Pinho at October 21, 2017 01:01 AM

October 17, 2017

Matthieu Brucher

Announcement: Audio TK 2.2.0

ATK is updated to 2.2.0 with the major introduction of vectorized filters. This means that some filters (EQ for now) can use vectorization for maximum performance. More filters will be introduced later, as well as Python support. Vector lanes of size 4 and 8 are supported, as well as instruction sets from SSE2 to AVX512.

This is also the first major release that officially supports the JUCE framework. This means that ATK can be added as modules (source code directly, without requiring any binaries) in the Projucer. The caveat is that SIMD filters are not available in this configuration, because building them requires CMake support.

Download link: ATK 2.2.0

Changelog:
2.2.0
* Introduced SIMD filters with libsimdpp
* Refactored EQ filters to work with SIMD filters
* Added module files for JUCE Projucer
2.1.1
* Added a Gain Max Compressor filter with wrappers
* Added a dry run call on BaseFilter to setup maximum sizes on a pipeline
* Added an IIR TDF2 (Transposed Direct Form 2) filter implementation (no Python wrappers for now)
* Fixed max gain reduction in the expanders to use 20 log10 instead of 10 log10 (as it is applied to the amplitude and not power)
* Fixed a bug in OutCircularPointerFilter with offset handling
* Fixed a bug in RIAA inverse filters


by Matt at October 17, 2017 07:21 AM

October 16, 2017

Matthew Rocklin

Streaming Dataframes

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation

This post is about experimental software. This is not ready for public use. All code examples and API in this post are subject to change without warning.

Summary

This post describes a prototype project to handle continuous data sources of tabular data using Pandas and Streamz.

Introduction

Some data never stops. It arrives continuously in a constant, never-ending stream. This happens in financial time series, web server logs, scientific instruments, IoT telemetry, and more. Algorithms to handle this data are slightly different from what you find in libraries like NumPy and Pandas, which assume that they know all of the data up-front. It’s still possible to use NumPy and Pandas, but you need to combine them with some cleverness and keep enough intermediate data around to compute marginal updates when new data comes in.

Example: Streaming Mean

For example, imagine that we have a continuous stream of CSV files arriving and we want to print out the mean of our data over time. Whenever a new CSV file arrives we need to recompute the mean of the entire dataset. If we’re clever we keep around enough state so that we can compute this mean without looking back over the rest of our historical data. We can accomplish this by keeping running totals and running counts as follows:

import pandas as pd

total = 0
count = 0

for filename in filenames:  # filenames is an infinite iterator
    df = pd.read_csv(filename)
    total = total + df.sum()
    count = count + df.count()
    mean = total / count
    print(mean)

Now as we add new files to our filenames iterator our code prints out new means that are updated over time. We don't have a single mean result; we have a continuous stream of mean results, each valid for the data up to that point. Our output data is an infinite stream, just like our input data.

When our computations are linear and straightforward like this a for loop suffices. However when our computations have several streams branching out or converging, possibly with rate limiting or buffering between them, this for-loop approach can grow complex and difficult to manage.

Streamz

A few months ago I pushed a small library called streamz, which handled control flow for pipelines, including linear map operations, operations that accumulated state, branching, joining, as well as back pressure, flow control, feedback, and so on. Streamz was designed to handle all of the movement of data and signaling of computation at the right time. This library was quietly used by a couple of groups and now feels fairly clean and useful.

Streamz was designed to handle the control flow of such a system, but did nothing to help you with streaming algorithms. Over the past week I’ve been building a dataframe module on top of streamz to help with common streaming tabular data situations. This module uses Pandas and implements a subset of the Pandas API, so hopefully it will be easy to use for programmers with existing Python knowledge.

Example: Streaming Mean

Our example above could be written as follows with streamz

from streamz import Stream

source = Stream.filenames('path/to/dir/*.csv')  # stream of filenames
sdf = (source.map(pd.read_csv)                  # stream of Pandas dataframes
             .to_dataframe(example=...))        # logical streaming dataframe

sdf.mean().stream.sink(print)                   # printed stream of mean values

This example is no more clear than the for-loop version. On its own this is probably a worse solution than what we had before, just because it involves new technology. However it starts to become useful in two situations:

  1. You want to do more complex streaming algorithms

    sdf = sdf[sdf.name == 'Alice']
    sdf.x.groupby(sdf.y).mean().sink(print)
    
    # or
    
    sdf.x.rolling('300ms').mean()
    

    It would require more cleverness to build these algorithms with a for loop as above.

  2. You want to do multiple operations, deal with flow control, etc..

    sdf.mean().sink(print)
    sdf.x.sum().rate_limit(0.500).sink(write_to_database)
    ...
    

    Branching off computations, routing data correctly, and handling time can all be challenging to accomplish consistently.

Jupyter Integration and Streaming Outputs

During development we’ve found it very useful to have live updating outputs in Jupyter.

Usually when we evaluate code in Jupyter we have static inputs and static outputs:

However now both our inputs and our outputs are live:

We accomplish this using a combination of ipywidgets and Bokeh plots both of which provide nice hooks to change previous Jupyter outputs and work well with the Tornado IOLoop (streamz, Bokeh, Jupyter, and Dask all use Tornado for concurrency). We’re able to build nicely responsive feedback whenever things change.

In the following example we build our CSV to dataframe pipeline that updates whenever new files appear in a directory. Whenever we drag files to the data directory on the left we see that all of our outputs update on the right.

What is supported?

This project is very young and could use some help. There are plenty of holes in the API. That being said, the following works well:

Elementwise operations:

sdf['z'] = sdf.x + sdf.y
sdf = sdf[sdf.z > 2]

Simple reductions:

sdf.sum()
sdf.x.mean()

Groupby reductions:

sdf.groupby(sdf.x).y.mean()

Rolling reductions by number of rows or time window:

sdf.rolling(20).x.mean()
sdf.rolling('100ms').x.quantile(0.9)

Real-time plotting with Bokeh (one of my favorite features):

sdf.plot()

What’s missing?

  1. Parallel computing: The core streamz library has an optional Dask backend for parallel computing. I haven’t yet made any attempt to attach this to the dataframe implementation.
  2. Data ingestion from common streaming sources like Kafka. We’re in the process now of building asynchronous-aware wrappers around Kafka Python client libraries, so this is likely to come soon.
  3. Out-of-order data access: soon after parallel data ingestion (like reading from multiple Kafka partitions at once) we’ll need to figure out how to handle out-of-order data access. This is doable, but will take some effort. This is where more mature libraries like Flink are quite strong.
  4. Performance: Some of the operations above (particularly rolling operations) do involve non-trivial copying, especially with larger windows. We’re relying heavily on the Pandas library which wasn’t designed with rapidly changing data in mind. Hopefully future iterations of Pandas (Arrow/libpandas/Pandas 2.0?) will make this more efficient.
  5. Filled out API: Many common operations (like variance) haven't yet been implemented. Some of this is due to laziness and some is due to wanting to find the right algorithm (see the sketch after this list).
  6. Robust plotting: Currently this works well for numeric data with a timeseries index but not so well for other data.
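
On that last algorithmic point, a variance can be tracked with the same running-totals trick as the mean example earlier, but the naive sum-of-squares update below loses precision on badly scaled data, which is exactly why the right algorithm deserves some thought. This is only a sketch, not the planned implementation:

import pandas as pd

total = 0    # running sum
total2 = 0   # running sum of squares
count = 0    # running count

for filename in filenames:   # filenames is an infinite iterator, as in the mean example
    df = pd.read_csv(filename)
    total = total + df.sum()
    total2 = total2 + (df ** 2).sum()
    count = count + df.count()
    mean = total / count
    var = total2 / count - mean ** 2   # numerically fragile; Welford-style updates behave better
    print(var)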

But most importantly this needs use by people with real problems to help us understand what here is valuable and what is unpleasant.

Help would be welcome with any of this.

You can install this from GitHub:

pip install git+https://github.com/mrocklin/streamz.git

Documentation and code are here:

Current work

Current and upcoming work is focused on data ingestion from Kafka and parallelizing with Dask.

October 16, 2017 12:00 AM

October 15, 2017

Titus Brown

A brief introduction to osfclient, a command line client for the Open Science Framework

Over the last few months, Tim Head has been pushing forward the osfclient project, an effort to build a simple and friendly command-line interface to the Open Science Framework's file storage. This project was funded by a gift to my lab through the Center for Open Science (COS) to the tune of about $20k, given by an anonymous donor.

The original project was actually to write an OSF integration for Galaxy, but that project was first delayed by my move to UC Davis and then suffered from Michael Crusoe's move to work on the Common Workflow Language. After talking with the COS folk, we decided to repurpose the money to something that addresses a need in my lab - using the Open Science Framework to share files.

Our (Tim's) integration effort resulted in osfclient, a combination Python API and command-line program. The project is still in its early stages, but a few people have found it useful - in addition to increasing usage within my lab, @zkamvar has used it to transfer "tens of thousands of files", and @danudwary found it "just worked" for grabbing some big files. And new detailed use cases are emerging regularly.

Most exciting of all, we've had contributions from a number of other people already, and I'm looking forward to this project growing to meet the needs of the open science community!

Taking a step back: why OSF, and why a command-line client?

I talked a bit about "why OSF?" in a previous blog post, but the short version is that it's a globally accessible place to store files for science, and it works well for that! It fits a niche that we haven't found any other solutions for - free storage for medium size genomics files - and we're actively exploring its use in about a dozen different projects.

Our underlying motivations for building a command-line client for OSF were several:

  • we often need to retrieve full folder/directory hierarchies of files for research and training purposes;

  • frequently, we want to retrieve those file hierarchies on remote (cloud or HPC) systems;

  • we're often grabbing files that are larger than GitHub supports;

  • sometimes these files are from private projects that we cannot (or don't want to) publicize.

Here, the Open Science Framework was already an 80% solution (supporting folder hierarchies, large file storage, and a robust permissions system), but it didn't have a command-line client - we were reduced to using curl or wget on individual files, or (in theory) writing our own REST queries.

Enter osfclient!

Using osfclient, a quickstart

(See "Troubleshooting osfclient installs" at the bottom if you run into any troubles running these commands!)

In a Python 3 environment, do:

pip install osfclient

and then execute:

osf -p fuqsk clone

This will go to the osfclient test project on http://osf.io, and download all the files that are part of that project -- if you execute:

find fuqsk

you should see:

fuqsk
fuqsk/figshare
fuqsk/figshare/this is a test text file
fuqsk/figshare/this is a test text file/hello.txt
fuqsk/googledrive
fuqsk/googledrive/google test file.gdoc
fuqsk/googledrive/googledrive-hello.txt
fuqsk/osfstorage
fuqsk/osfstorage/hello.txt
fuqsk/osfstorage/test-subfolder
fuqsk/osfstorage/test-subfolder/hello-from-subfolder.txt

which showcases a particularly nice feature of the OSF that I'll talk about below.

A basic overview of what osfclient did

If you go to the project URL, http://osf.io/fuqsk, you will see a file storage hierarchy that looks like so:

OSF folders screenshot

What osfclient is doing is grabbing all of the different storage files and downloading them to your local machine. Et voila!

What's with the 'figshare' and 'googledrive' stuff? Introducing add-ons/integrations.

In the above, you'll notice that there are these subdirectories named figshare and googledrive. What are those?

The Open Science Framework can act as an umbrella integration for a variety of external storage services - see the docs. They support Amazon S3, Dropbox, Google Drive, Figshare, and a bunch of others.

In the above project, I linked in my Google Drive and Figshare accounts to OSF, and connected specific remote folders/projects into the OSF project (this one from Google Drive, and this one from figshare). This allows me (and others with permissions on the project) to access and manage those files from within a single Web UI on the OSF.

osfclient understands some of these integrations (and it's pretty trivial to add a new one to the client, at least), and it does the most obvious thing possible with them when you do an osfclient clone: it grabs the files and downloads them! (It should also be able to push to those remote storages, but I haven't tested that today.)

Interestingly, this appears to be a good simple way to layer OSF's project hierarchy and permission system on top of more complex and/or less flexible and/or non-command-line-friendly systems. For example, Luiz Irber recently uploaded a very large file to google drive via rclone and it showed up in his OSF project just fine.

This reasonably flexible imposition of an overall namespace on a disparate collection of storages is pretty nice, and could be a real benefit for large, complex projects.

Other things you can do with osfclient

osfclient also has file listing and file upload functionality, along with some configurability in terms of providing a default project and permissions within specific directories. The osfclient User Guide has some brief instructions along these lines.

osfclient also contains a Python API for OSF, and you can see a bit more about that here, in Tim Head and Erin Braswell's webinar materials.
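
As a rough illustration of the Python side, listing every file in a public project looks something like this. Check the current osfclient docs for the exact names; this is a sketch based on my reading of them:

from osfclient import OSF

osf = OSF()                      # anonymous access is enough for public projects
project = osf.project('fuqsk')   # the same test project as above

for storage in project.storages:     # osfstorage, figshare, googledrive, ...
    for f in storage.files:
        print(f.path)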

What's next?

There are a few inconveniences about the OSF that could usefully be worked around, and a lot of features to be added in osfclient. In no particular order, here are a few of the big ones that require significant refactoring or design decisions or even new REST API functionality on the OSF side --

  • we want to make osf behave a bit more like git - see the issue. This would make it easier to teach and use, we think. In particular we want to avoid having to specify the project name every time.
  • speaking of project names, I don't think the project UIDs on the OSF (fuqsk above) are particularly intuitive or type-able, and it would be great to have a command line way of discovering the project UID for your project of interest.
  • I'd also like to add project creation and maybe removal via the command line, as well as project registration - more on that later.
  • the file storage hierarchy above, with osfstorage/ and figshare/ as top level directories, isn't wonderful for command line folk - there are seemingly needless hierarchies in there. I'm not sure how to deal with this but there are a couple of possible solutions, including adding a per-project 'remapping' configuration that would move the files around.

Concluding thoughts

The OSF offers a simple, free, Web friendly, and convenient way to privately and publicly store collections of files under 5 GB in size on a Web site. osfclient provides a simple and reasonably functional way to download files from and upload files to the OSF via the command line. Give it a try!

Appendix: Troubleshooting osfclient installs

  1. If you can't run pip install on your system, you may need to either run the command as root, OR establish a virtual environment -- something like

python -m virtualenv -p python3.5 osftest
. osftest/bin/activate
pip install osfclient

will create a virtualenv, activate it, and install osfclient.

  2. If you get a requests.exceptions.SSLError, you may be on a Mac and using an old version of openssl. You can try pip install -U pyopenssl. If that doesn't work, please add a comment to this issue.

  3. Note that a conda install for osfclient exists, and you should be able to do conda install -c conda-forge osfclient.

by C. Titus Brown at October 15, 2017 10:00 PM

October 10, 2017

Matthew Rocklin

Notes on Kafka in Python

Summary

I recently investigated the state of Python libraries for Kafka. This blogpost contains my findings.

Both PyKafka and confluent-kafka have mature implementations and are maintained by invested companies. Confluent-kafka is generally faster while PyKafka is arguably better designed and documented for Python usability.

Conda packages are now available for both. I hope to extend one or both to support asynchronous workloads with Tornado.

Disclaimer: I am not an expert in this space. I have no strong affiliation with any of these projects. This is a report based on my experience of the past few weeks. I don’t encourage anyone to draw conclusions from this work. I encourage people to investigate on their own.

Introduction

Apache Kafka is a common data system for streaming architectures. It manages rolling buffers of byte messages and provides a scalable mechanism to publish or subscribe to those buffers in real time. While Kafka was originally designed within the JVM space the fact that it only manages bytes makes it easy to access from native code systems like C/C++ and Python.

Python Options

Today there are three independent Kafka implementations in Python, two of which are optionally backed by a C implementation, librdkafka, for speed:

  • kafka-python: The first on the scene, a pure Python Kafka client with robust documentation and an API that is fairly faithful to the original Java API. This implementation has the most stars on GitHub and the most active development team (by number of committers), but it also lacks a connection to the fast C library. I'll admit that I didn't spend enough time on this project to judge it well because of this.

  • PyKafka: The second implementation chronologically. This library is maintained by Parse.ly, a web analytics company that heavily uses both streaming systems and Python. PyKafka's API is more creative and designed to follow common Python idioms rather than the Java API. PyKafka has both a pure Python implementation and connections to the low-level librdkafka C library for increased performance.

  • Confluent-kafka: The final implementation chronologically. It is maintained by Confluent, the primary for-profit company that supports and maintains Kafka. This library is the fastest, but also the least accessible from a Python perspective. It is written as a CPython extension, and the documentation is minimal. However, if you are coming from the Java API then this is entirely consistent with that experience, so that documentation probably suffices.

Performance

Confluent-kafka message-consumption bandwidths are around 50% higher and message-production bandwidths are around 3x higher than PyKafka, both of which are significantly higher than kafka-python. I’m taking these numbers from this blogpost which gives benchmarks comparing the three libraries. The primary numeric results follow below:

Note: It’s worth noting that this blogpost was moving smallish 100 byte messages around. I would hope that Kafka would perform better (closer to network bandwidths) when messages are of a decent size.

Producer Throughput

                            time_in_seconds   MB/s    msgs/s
confluent_kafka_producer                5.4     17    183000
pykafka_producer_rdkafka                 16    6.1     64000
pykafka_producer                         57    1.7     17000
python_kafka_producer                    68    1.4     15000

Consumer Throughput

                            time_in_seconds   MB/s    msgs/s
confluent_kafka_consumer                3.8     25    261000
pykafka_consumer_rdkafka                6.1     17    164000
pykafka_consumer                         29    3.2     34000
python_kafka_consumer                    26    3.6     38000

Note: I discovered this article via parsely/pykafka #559, which has a good conversation about the three libraries.

I profiled PyKafka in these cases and it doesn’t appear that these code paths have yet been optimized. I expect that modest effort could close that gap considerably. This difference seems to be more from lack of interest than any hard design constraint.

It’s not clear how critical these speeds are. According to the PyKafka maintainers at Parse.ly they haven’t actually turned on the librdkafka optimizations in their internal pipelines, and are instead using the slow Pure Python implementation, which is apparently more than fast enough for common use. Getting messages out of Kafka just isn’t their bottleneck. It may be that these 250,000 messages/sec limits are not significant in most applications. I suspect that this matters more in bulk analysis workloads than in online applications.

Pythonic vs Java APIs

It took me a few tries to get confluent-kafka to work. It wasn't clear what information I needed to pass to the constructor to connect to Kafka, and when I gave the wrong information I received no message that I had done anything incorrectly. Docstrings and documentation were both minimal. In contrast, PyKafka's API and error messages quickly led me to correct behavior and I was up and running within a minute.

However, I persisted with confluent-kafka, found the right Java documentation, and eventually did get things up and running. Once this happened everything fell into place and I was able to easily build applications with Confluent-kafka that were both simple and fast.
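
To give a sense of the two APIs, here are roughly equivalent minimal consumers. The broker address, group id, and topic name are placeholders, so treat this as a sketch rather than canonical usage for either library:

# PyKafka: connect, pick a topic, iterate over messages
from pykafka import KafkaClient

client = KafkaClient(hosts="localhost:9092")
consumer = client.topics[b"my-topic"].get_simple_consumer()
for message in consumer:
    if message is not None:
        print(message.offset, message.value)

# confluent-kafka: configure with librdkafka-style keys, then poll
from confluent_kafka import Consumer

c = Consumer({"bootstrap.servers": "localhost:9092",
              "group.id": "my-group",
              "auto.offset.reset": "earliest"})
c.subscribe(["my-topic"])
while True:
    msg = c.poll(1.0)                 # block for up to one second
    if msg is None or msg.error():
        continue
    print(msg.offset(), msg.value())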

Development experience

I would like to add asynchronous support to one or both of these libraries so that they can read or write data in a non-blocking fashion and play nicely with other asynchronous systems like Tornado or Asyncio. I started investigating this with both libraries on GitHub.

Developers

Both libraries have a maintainer who is somewhat responsive and whose time is funded by the parent company. Both maintainers seem active on a day-to-day basis and handle contributions from external developers.

Both libraries are fully active with a common pattern of a single main dev merging work from a number of less active developers. Distributions of commits over the last six months look similar:

confluent-kafka-python$ git shortlog -ns --since "six months ago"
38  Magnus Edenhill
5  Christos Trochalakis
4  Ewen Cheslack-Postava
1  Simon Wahlgren

pykafka$ git shortlog -ns --since "six months ago"
52  Emmett Butler
23  Emmett J. Butler
20  Marc-Antoine Parent
18  Tanay Soni
5  messense
1  Erik Stephens
1  Jeff Widman
1  Prateek Shrivastava
1  aleatha
1  zpcui

Codebase

In regards to the codebases I found that PyKafka was easier to hack on for a few reasons:

  1. Most of PyKafka is written in Python rather than C extensions, and so it is more accessible to a broader development base. I find that Python C extensions are not pleasant to work with, even if you are comfortable with C.
  2. PyKafka appears to be much more extensively tested. PyKafka actually spins up a local Kafka instance to do comprehensive integration tests while Confluent-kafka seems to only test the API without actually running against a real Kafka instance.
  3. For what it's worth, PyKafka maintainers responded quickly to an issue on Tornado. Confluent-kafka maintainers still have not responded to a comment on an existing Tornado issue, even though that comment had significantly more content (a working prototype).

To be clear, no maintainer has any responsibility to answer my questions on github. They are likely busy with other things that are of more relevance to their particular mandate.

Conda packages

I’ve pushed/updated recipes for both packages on conda-forge. You can install them as follows:

conda install -c conda-forge pykafka                 # Linux, Mac, Windows
conda install -c conda-forge python-confluent-kafka  # Linux, Mac

In both cases these are built against the fast librdkafka C library (except on Windows) and install that library as well.

Future plans

I’ve recently started work on streaming systems and pipelines for Dask, so I’ll probably continue to investigate this space. I’m still torn between the two implementations. There are strong reasons to use either of them.

Culturally I am drawn to Parse.ly’s PyKafka library. They’re clearly Python developers writing for Python users. However the costs of using a non-Pythonic system here just aren’t that large (Kafka’s API is small), and Confluent’s interests are more aligned with investing in Kafka long term than are Parse.ly’s.

October 10, 2017 12:00 AM

October 09, 2017

Continuum Analytics

Strata Data Conference Grows Up

The Strata conference will always hold a place in my heart, as it's one of the events that inspired Travis and me to found Anaconda. We listened to open source-driven talks about data lakes and low-cost storage and knew there would be a demand for tools to help organizations and data scientists derive value from these mountains of information.

by Team Anaconda at October 09, 2017 02:31 PM

October 05, 2017

October 04, 2017

October 03, 2017

Continuum Analytics

Seven Things You Might Not Know About Numba

In this post, I want to dive deeper and demonstrate several aspects of using Numba on the GPU that are often overlooked. I’ll quickly breeze through a number of topics, but I’ll provide links throughout for additional reading.

by Team Anaconda at October 03, 2017 06:00 PM

September 26, 2017

Continuum Analytics

Anaconda and Microsoft Partner to Deliver Python-Powered Machine Learning

Strata Data Conference, NEW YORK––September 26, 2017––Anaconda, Inc., the most popular Python data science platform provider, today announced it is partnering with Microsoft to embed Anaconda into Azure Machine Learning, Visual Studio and SQL Server to deliver data insights in real time.

by Sheyna Webster at September 26, 2017 01:00 PM

Matthieu Brucher

Book Review: Mastering The Challenges Of Leading Change: Inspire the People and Succeed Where Others Fail

I like change. More precisely, I like improving things. And as some of the people in my entourage would say, I can be a bull in a china shop. So this book sounded interesting.

Discussion

The book is split into four parts that are explained in the introduction: priorities, politics, people and perseverance.
At the end of each chapter, there is a small questionnaire with actions to do to improve your ability to change.

Let’s try with priorities. Obviously, this is all about how you want to start the change. If you stay put, you are not going forward. So the first chapter is about looking for people who will help you to change. The second chapter deals with setting up a team ready for leading the change, and the last one of this part is about the type of changes we are aiming for.

The second part tackles politics, or more precisely what happens behind the change. Communication is crucial (that's the fourth chapter), but I think communication is also about getting the message that lies behind a situation. It's also about being ready to listen to solutions other than yours to improve a situation. The bad part is that there are also people who just want to mess with you. I do remember some cases where that happened, and I wish I had had all the tools from the first two parts in place; it would probably have solved my problems with the Machiavellis, as the author calls them!

Then we move to handling people. We go back to communication, the importance of direct rather than indirect communication, and then how to handle a group and move it in the same direction. I also liked the last chapter of this part, about not doing things just for your job, but also for making new friends.

The last part is about perseverance. There are going to be issues, fires to put out, and you need a proper team. Perseverance is also about keeping moving and being ready for the next change. And to do so, you also need to cultivate the people who may lead that next change!

Conclusion

Great book. Even if part of it is just common sense, it's also about noticing that some of the things you would do for your friends, you also need to do when leading change. For me, this proves that communication, in all its aspects, is everything.

So now, let’s apply this book to my professional life!

by Matt at September 26, 2017 07:31 AM

September 25, 2017

Enthought

Enthought at the 2017 Society of Exploration Geophysicists (SEG) Conference

2017 will be Enthought’s 11th year at the SEG (Society of Exploration Geophysicists) Annual Meeting, and we couldn’t be more excited to be at the leading edge of the digital transformation in oil & gas being driven by the capabilities provided by machine learning and artificial intelligence.

Now in its 87th year, the Annual SEG (Society of Exploration Geophysicists) Meeting will be held in Houston, Texas on September 24-27, 2017 at the George R. Brown Convention Center. The SEG Annual Meeting will be the first large conference to take place in Houston since Hurricane Harvey and its devastating floods, and we’re so pleased to be a small part of getting Houston back “open for business.”

Pre-Event Kickoff: The Machine Learning Geophysics Hackathon

We had such a great experience at the EAGE Subsurface Hackathon in Paris in June that when we heard our friends at Agile Geoscience were planning a machine learning in geophysics hackathon for the US, we had to join! Brendon Hall, Enthought's Energy Solutions Group Director, will be there as a participant and coach, and Enthought CEO Eric Jones will be on the judging panel.

Come Meet Us on the SEG Expo Floor & Learn About Our AI-Enabled Solutions for Oil & Gas

Presentations in Enthought Booth #318 (just to the left of the main entrance, before the main aisle):

  • Monday, Sept 25, 12-12:45 PM: Lessons Learned From the Front Line: Moving AI From Research to Application
  • Tues, Sept 26, 1-1:45 PM: Canopy Geoscience: Building Innovative, AI-Enabled Geoscience Applications
  • Wed, Sept 27, 12-12:45 PM: Applying Artificial Intelligence to CT, Photo, and Well Log Analysis with Virtual Core

Hart Energy’s E&P Magazine Features Canopy Geoscience

Canopy Geoscience, Enthought’s cross-domain AI platform for oil & gas, was featured in the September 2017 edition of E&P magazine. See the coverage in the online SEG Technology Showcase, in the September print edition, or in the online E&P Flipbook.


Enthought's Canopy Geoscience featured in E&P's September 2017 edition

The post Enthought at the 2017 Society of Exploration Geophysicists (SEG) Conference appeared first on Enthought Blog.

by admin at September 25, 2017 05:14 PM

September 24, 2017

Matthew Rocklin

Dask Release 0.15.3

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.15.3. This release contains stability enhancements and bug fixes. This blogpost outlines notable changes since the 0.15.2 release on August 30th.

You can conda install Dask:

conda install -c conda-forge dask

or pip install from PyPI

pip install dask[complete] --upgrade

Conda packages are available on the conda-forge channel. They will be on defaults in a few days.

Full changelogs are available here:

Some notable changes follow.

Masked Arrays

Dask.array now supports masked arrays similar to NumPy.

In [1]: import dask.array as da

In [2]: x = da.arange(10, chunks=(5,))

In [3]: mask = x % 2 == 0

In [4]: m = da.ma.masked_array(x, mask)

In [5]: m
Out[5]: dask.array<masked_array, shape=(10,), dtype=int64, chunksize=(5,)>

In [6]: m.compute()
Out[6]:
masked_array(data = [-- 1 -- 3 -- 5 -- 7 -- 9],
             mask = [ True False  True False  True False  True False  True False],
       fill_value = 999999)

This work was primarily done by Jim Crist and partially funded by the UK Met office in support of the Iris project.

Constants in atop

Dask.array experts will be familiar with the atop function, which powers a non-trivial amount of dask.array and is commonly used by people building custom algorithms. This function now supports constants when the index given is None.

atop(func, 'ijk', x, 'ik', y, 'kj', CONSTANT, None)

Memory management for workers

Dask workers spill excess data to disk when they reach 60% of their allotted memory limit. Previously we only measured memory use by adding up the memory use of every piece of data produced by the worker. This could fail in a few situations:

  1. Our per-data estimates were faulty
  2. User code consumed a large amount of memory without our tracking it

To compensate we now also periodically check the memory use of the worker using system utilities with the psutil module. We dump data to disk if process memory rises above 70%, stop running new tasks if it rises above 80%, and restart the worker if it rises above 95% (assuming that the worker has a nanny process).
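
The check itself is conceptually simple. This is not Dask's actual code, just a sketch of the psutil call involved, with thresholds mirroring the ones described above:

import psutil

def memory_fraction(memory_limit):
    """Fraction of the configured memory limit used by the current process."""
    rss = psutil.Process().memory_info().rss   # resident set size, in bytes
    return rss / memory_limit

# e.g. with a 100 GB limit: spill at 0.70, pause new tasks at 0.80, restart at 0.95
print(memory_fraction(100e9))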

Breaking Change: Previously the --memory-limit keyword to the dask-worker process specified the 60% “start pushing to disk” limit. So if you had 100GB of RAM then you previously might have started a dask-worker as follows:

dask-worker ... --memory-limit 60e9  # before: specify the 60% spill target

And the worker would start pushing to disk once it had 60GB of data in memory. However, now we are changing this meaning to be the full amount of memory given to the process.

dask-worker ... --memory-limit 100e9  # now: specify the full memory amount

Of course, you don't have to specify this limit (many don't). It will be chosen for you automatically. If you've never cared about this then you shouldn't start caring now.

More about memory management here: http://distributed.readthedocs.io/en/latest/worker.html?highlight=memory-limit#memory-management

Statistical Profiling

Workers now poll their worker threads every 10ms and keep a running count of which functions are being used. This information is available on the diagnostic dashboard as a new "Profile" page. It provides information that is orthogonal to, and generally more detailed than, the typical task-stream plot.

These plots are available on each worker, and an aggregated view is available on the scheduler. The timeseries on the bottom allows you to select time windows of your computation to restrict the parallel profile.
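
As an aside, the core idea of a statistical profiler fits in a few lines of pure Python by sampling sys._current_frames(). The sketch below only illustrates the approach; it is not the distributed scheduler's actual implementation:

import sys
import threading
import time
from collections import Counter

counts = Counter()

def sampler(interval=0.010):
    """Every 10 ms, note which function each thread is currently executing."""
    while True:
        for frame in sys._current_frames().values():
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)

threading.Thread(target=sampler, daemon=True).start()
# ... run the workload, then inspect counts.most_common(10)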

More information about diagnosing performance available here: http://distributed.readthedocs.io/en/latest/diagnosing-performance.html

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.2 release on August 30th

  • Adonis
  • Christopher Prohm
  • Danilo Horta
  • jakirkham
  • Jim Crist
  • Jon Mease
  • jschendel
  • Keisuke Fujii
  • Martin Durant
  • Matthew Rocklin
  • Tom Augspurger
  • Will Warner

The following people contributed to the dask/distributed repository since the 1.18.3 release on September 2nd:

  • Casey Law
  • Edrian Irizarry
  • Matthew Rocklin
  • rbubley
  • Tom Augspurger
  • ywangd

September 24, 2017 12:00 AM

September 23, 2017

Bruno Pinho

Integrating & Exploring 3: Download and Process DEMs in Python

This tutorial shows how to automate downloading and processing DEM files. It shows a one-liner code to download SRTM (30 or 90 m) data and how to use rasterio to reproject the downloaded data into a desired CRS, spatial resolution or bounds.
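
For a flavor of the rasterio part, reprojecting a downloaded DEM into a new CRS usually follows the standard recipe below. The file names and target CRS are placeholders, not values taken from the tutorial:

import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling

dst_crs = "EPSG:32723"   # placeholder target CRS (a UTM zone)

with rasterio.open("srtm_dem.tif") as src:   # placeholder input DEM
    transform, width, height = calculate_default_transform(
        src.crs, dst_crs, src.width, src.height, *src.bounds)
    profile = src.profile.copy()
    profile.update(crs=dst_crs, transform=transform, width=width, height=height)

    with rasterio.open("srtm_dem_utm.tif", "w", **profile) as dst:
        reproject(
            source=rasterio.band(src, 1),
            destination=rasterio.band(dst, 1),
            src_transform=src.transform,
            src_crs=src.crs,
            dst_transform=transform,
            dst_crs=dst_crs,
            resampling=Resampling.bilinear)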

by Bruno Ruas de Pinho at September 23, 2017 01:00 AM

September 22, 2017

numfocus

Open Journals joins NumFOCUS Sponsored Projects

NumFOCUS is pleased to welcome the Open Journals as a fiscally sponsored project. Open Journals is a collection of open source, open access journals. The team behind Open Journals believes that code review and high-quality, reusable software are a critical–but often overlooked–part of academic research. The primary goal of Open Journals is to provide venues […]

by NumFOCUS Staff at September 22, 2017 02:00 PM

William Stein

DataDog's pricing: don't make the same mistake I made

(I wrote a 1-year followup post here.)

I stupidly made a mistake recently by choosing to use DataDog for monitoring the infrastructure for my startup (SageMathCloud).

I got bit by their pricing UI design that looks similar to many other sites, but is different in a way that caused me to spend far more money than I expected.

I'm writing this post so that you won't make the same mistake I did.  As a product, DataDog is of course a lot of hard work to create, and they can try to charge whatever they want. However, my problem is that what they are going to charge was confusing and misleading to me.

I wanted to see some nice web-based data about my new autoscaled Kubernetes cluster, so I looked around at options. DataDog looked like a new and awesomely-priced service for seeing live logging. And when I looked (not carefully enough) at the pricing, it looked like only $15/month to monitor a bunch of machines. I'm naive about the cost of cloud monitoring -- I've been using Stackdriver on Google cloud platform for years, which is completely free (for now, though that will change), and I've also used self hosted open solutions, and some quite nice solutions I've written myself. So my expectations were way out of whack.

Ever busy, I signed up for the "$15/month plan":


One of the people on my team spent a little time and installed datadog on all the VM's in our cluster, and also made DataDog automatically start running on any nodes in our Kubernetes cluster. That's a lot of machines.

Today I got the first monthly bill, which is for the month that just happened. The cost was $639.19 USD charged to my credit card. I was really confused for a while, wondering if I had bought a year subscription.



After a while I realized that the cost is per host! When I looked at the pricing page the first time, I had just seen in big letters "$15", and "$18 month-to-month" and "up to 500 hosts". I completely missed the "Per Host" line, because I was so naive that I didn't think the price could possibly be that high.

I tried immediately to delete my credit card and cancel my plan, but the "Remove Card" button is greyed out, and it says you can "modify your subscription by contacting us at success@datadoghq.com":



So I wrote to success@datadoghq.com:

Dear Datadog,

Everybody on my team was completely mislead by your
horrible pricing description.

Please cancel the subscription for wstein immediately
and remove my credit card from your system.

This is the first time I've wasted this much money
by being misled by a website in my life.

I'm also very unhappy that I can't delete my credit
card or cancel my subscription via your website. It's
like one more stripe API call to remove the credit card
(I know -- I implemented this same feature for my site).


And they responded:

Thanks for reaching out. If you'd like to cancel your
Datadog subscription, you're able to do so by going into
the platform under 'Plan and Usage' and choose the option
downgrade to 'Lite', that will insure your credit card
will not be charged in the future. Please be sure to
reduce your host count down to the (5) allowed under
the 'Lite' plan - those are the maximum allowed for
the free plan.

Also, please note you'll be charged for the hosts
monitored through this month. Please take a look at
our billing FAQ.


They were right -- I was able to uninstall the daemons, downgrade to Lite, remove my card, etc. all through the website without manual intervention.

When people have been confused by billing for my site, I have apologized, immediately refunded their money, and opened a ticket to make the UI clearer. DataDog didn't do any of that.

I wish DataDog would at least clearly state that when you use their service you are potentially on the hook for an arbitrarily large charge for any month. Yes, if they had made that clear, they wouldn't have had me as a customer, so they are not incentivized to do so.

A fool and their money are soon parted. I hope this post reduces the chances you'll be a fool like me. If you choose to use DataDog (and their monitoring tools are very impressive), I hope you'll be aware of the cost.


ADDED:

On Hacker News somebody asked: "How could their pricing page be clearer? It says per host in fairly large letters underneath it. I'm asking because I will be designing a similar page soon (that's also billed per host) and I'd like to avoid the same mistakes."  My answer:

[EDIT: This pricing page by the top poster in this thread is way better than I suggest below -- https://www.serverdensity.com/pricing/]

1. VERY clearly state that when you sign up for the service, then you are on the hook for up to $18*500 = $9000 + tax in charges for any month. Even Google compute engine (and Amazon) don't create such a trap, and have a clear explicit quota increase process.
2. Instead of "HUGE $15" newline "(small light) per host", put "HUGE $18 per host" all on the same line. It would easily fit. I don't even know how the $15/host datadog discount could ever really work, given that the number of hosts might constantly change and there is no prepayment.
3. Inform users clearly in the UI at any time how much they are going to owe for that month (so far), rather than surprising them at the end. Again, Google Cloud Platform has a very clear running total in their billing section, and any time you create a new VM it gives the exact amount that VM will cost per month.
4. If one works with a team, 3 is especially important. The reason that I had monitors on 50+ machines is that another person working on the project, who never looked at pricing or anything, just thought, hey, I'll just set this up everywhere. He had no idea there was a per-machine fee.

by William Stein (noreply@blogger.com) at September 22, 2017 01:03 PM

DataDog: Don't make the same mistake I did -- a followup and thoughts about very unhappy customers

This is a followup to my previous blog post about DataDog billing.

TL;DR:
- I don't recommend DataDog,
- dealing with unhappy customers is hard,
- monitoring for data science nerds?

Hacker News Comments

DataDog at Google Cloud Summit

I was recently at the Seattle Google Cloud Summit and DataDog was well represented, with the biggest booth and top vendor billing during the keynote. Clearly they are doing something right. I had a past unpleasant experience with them, and I had just been auditing my records and discovered that last year DataDog had actually charged me a lot more than I thought, so I was kind of annoyed. Nonetheless, they kept coming up and talking to me, server monitoring is life-and-death important to me, and their actual software is very impressive in some ways.

Nick Parisi started talking with me about DataDog. He sincerely wanted to know about my past experience with monitoring and the DataDog product, which he was clearly very enthusiastic about. So I told him a bit, and he encouraged me to tell him more, explaining that he did not work at DataDog last year and that he would like to know what happened. So he gave me his email, and he was very genuinely concerned and helpful, so I sent him an email with a link to my post, etc. I didn't receive a response, so a week later I asked why, and received a followup email from Jay Robichau, who is DataDog's Director of Sales.

Conference Call with DataDog

Jay set up a conference call with me today at 10am (September 22, 2017). Before the call, I sent him a summary of my blog post, and also requested a refund, especially for the surprise bill they sent me nearly 6 weeks after my post.

During the call, Jay explained that he was "protecting" Nick from me, and that I would mostly talk with Michelle Danis, who is in charge of customer success. My expectation for the call was that we would find some common ground, and that they would at least appreciate the chance to make things right and talk with an unhappy customer. I was also curious about how a successful startup company addresses the concerns of an unhappy customer (me).

I expected the conversation to be difficult but go well, with me writing a post singing the praises of the charming DataDog sales and customer success people. A few weeks ago CoCalc.com (my employer) had a very unhappy customer who got (rightfully) angry over a miscommunication, told us he would no longer use our product, and would definitely not recommend it to anybody else. I wrote to him wanting to at least continue the discussion and help, but he was completely gone. I would do absolutely anything I could to ensure he is satisfied, if only he would give me the chance. Also, there was a recent blog post from somebody unhappy with using CoCalc/Sage for graphics, and I reached out to them as best I could to at least clarify things...

In any case, here's what DataDog charged us as a result of us running their daemon on a few dozen containers in our Kubernetes cluster (a contractor who is not a native English speaker actually setup these monitors for us):

07/22/2016  449215JWJH87S8N4  DATADOG 866-329-4466 NY  $639.19
08/29/2016 2449215L2JH87V8WZ DATADOG 866-329-4466 NY $927.22

I was shocked by the 07/22 bill, which spurred my post, and discovered the 8/29 one only later. We canceled our subscription on July 22 (cancelling was difficult in itself).

Michelle started the conference call by explaining that the 08/29 bill was for charges incurred before 07/22, and that their billing system has over a month lag (and it does even today, unlike Google Cloud Platform, say). Then Michelle explained at length many of the changes that DataDog has made to their software to address the issues I (and others?) have pointed out with their pricing description and billing. She was very interested in whether I would be a DataDog customer in the future, and when I said no, she explained that they would not refund my money since the bill was not a mistake.

I asked if they now provide a periodic summary of the upcoming bill, as Google cloud platform (say) does. She said that today they don't, though they are working on it. They do now provide a summary of usage so far in an admin page.

Finally, I explained in no uncertain terms that I felt misled by their pricing. I expected that they would understand, and pointed out that they had just described to me many ways in which they were addressing this very problem. Very surprisingly, Michelle's response was that she absolutely would not agree that there was any problem with their pricing description a year ago, and they definitely would not refund my money. She kept bringing up the terms of service. I agreed that I didn't think legally they were in the wrong, given what she had explained, just that -- as they had just pointed out -- their pricing and billing was unclear in various ways. They would not agree at all.

I can't recommend doing business with DataDog. I had very much hoped to write the opposite in this updated post. Unfortunately, their pricing and terms are still confusing today compared to competitors, and they are unforgiving of mistakes. This dog bites.    

(Disclaimer: I took notes during the call, but most of the above is from memory, and I probably misheard or misunderstood something. I invite comments from DataDog to set the record straight.)

Also, for what it is worth, I definitely do recommend Google Cloud Platform.  They put in the effort to do many things right regarding clear billing.

How do Startups Deal with Unhappy Customers?

I am very curious about how other startups deal with unhappy customers. At CoCalc we have had very few "major incidents" yet... but I want to be as prepared as possible. At the Google Cloud Summit, I went to an amazing "war stories by SREs" session in which they talked about situations they had been in years ago in which their decisions meant the difference between whether they would have a company or job tomorrow or not. These guys clearly have amazing instincts for when a problem is do-or-die serious and when it isn't. And their deep preparation "in depth" was WHY they were on that stage, and a big reason why older companies like Google are still around. Having a strategy for addressing very angry customers is surely just as important.


Google SRE's: these guys are serious.

I mentioned my DataDog story to a long-time Google employee there (16 years!) and he said he had recently been involved in a similar situation with Google's Stackdriver monitoring, where the bill to a customer was $85K in a month just for Stackdriver. I asked what Google did, and he said they refunded the money, then worked with the customer to better use their tools.

There is of course no way to please all of the people all of the time. However, I genuinely feel that I was ripped off and misled by DataDog, but I have the impression that Jay and Michelle honestly view me as some jerk trying to rip them off for $1500.   And they probably hate me for telling you about my experiences.

So far, with CoCalc we charge customers in advance for any service we provide, so fewer people are surprised by a bill. Sometimes there are problems with recurring subscriptions: when a person is charged for the upcoming month of a subscription and doesn't want to continue (e.g., because their course is over), we always fully refund the charge. What does your company do? Why? I do worry that our billing model means that we miss out on potential revenue.

We all know what successful huge consumer companies like Amazon and Wal-Mart do.

Monitoring for Data Science Nerds?

I wonder if there is interest in a service like DataDog, but targeted at Data Science Nerds, built on CoCalc, which provides hosted collaborative Jupyter notebooks with Pandas, R, etc., pre-installed. This talk at PrometheusCon 2017 mostly discussed the friction that people face moving data from Prometheus to analyze using data science tools (e.g., R, Python, Jupyter). CoCalc provides a collaborative data science environment, so if we were to smooth over those points of friction, perhaps it could be useful to certain people. And much more efficient...

by William Stein (noreply@blogger.com) at September 22, 2017 12:32 PM

September 21, 2017

Continuum Analytics

Back to School: Data Science & AI Open New Doors for Students

School is back in session and now is the time students are thinking about their future. When considering options, science-oriented students are likely thinking about what is arguably today’s hottest technology: artificial intelligence (AI).

by Sheyna Webster at September 21, 2017 06:40 PM

Anaconda to Present at Strata Data Conference, New York

Anaconda, the most popular Python data science platform provider, today announced that several company experts will present two sessions and one tutorial at The Strata Data Conference on September 26 and 27 at the Javits Center in New York City.

by Sheyna Webster at September 21, 2017 02:11 PM

Matthew Rocklin

Fast GeoSpatial Analysis in Python

This work is supported by Anaconda Inc., the Data Driven Discovery Initiative from the Moore Foundation, and NASA SBIR NNX16CG43P

This work is a collaboration with Joris Van den Bossche. This blogpost builds on Joris's EuroSciPy talk (slides) on the same topic. You can also see Joris's blogpost on the subject.

TL;DR:

Python’s Geospatial stack is slow. We accelerate the GeoPandas library with Cython and Dask. Cython provides 10-100x speedups. Dask gives an additional 3-4x on a multi-core laptop. Everything is still rough, please come help.

We start by reproducing a blogpost published last June, but with 30x speedups. Then we talk about how we achieved the speedup with Cython and Dask.

All code in this post is experimental. It should not be relied upon.

Experiment

In June Ravi Shekhar published a blogpost Geospatial Operations at Scale with Dask and GeoPandas in which he counted the number of rides originating from each of the official taxi zones of New York City. He read, processed, and plotted 120 million rides, performing an expensive point-in-polygon test for each ride, and produced a figure much like the following:

This took about three hours on his laptop. He used Dask and a bit of custom code to parallelize Geopandas across all of his cores. Using this combination he got close to the speed of PostGIS, but from Python.

Today, using an accelerated GeoPandas and a new dask-geopandas library, we can do the above computation in around eight minutes (half of which is reading CSV files) and so can produce a number of other interesting images with faster interaction times.

A full notebook producing these plots is available below:

The rest of this article talks about GeoPandas, Cython, and speeding up geospatial data analysis.

Background in Geospatial Data

The Shapely User Manual begins with the following passage on the utility of geospatial analysis to our society.

Deterministic spatial analysis is an important component of computational approaches to problems in agriculture, ecology, epidemiology, sociology, and many other fields. What is the surveyed perimeter/area ratio of these patches of animal habitat? Which properties in this town intersect with the 50-year flood contour from this new flooding model? What are the extents of findspots for ancient ceramic wares with maker’s marks “A” and “B”, and where do the extents overlap? What’s the path from home to office that best skirts identified zones of location based spam? These are just a few of the possible questions addressable using non-statistical spatial analysis, and more specifically, computational geometry.

Shapely is part of Python’s GeoSpatial stack which is currently composed of the following libraries:

  1. Shapely: Manages shapes like points, linestrings, and polygons. Wraps the GEOS C++ library
  2. Fiona: Handles data ingestion. Wraps the GDAL library
  3. Rasterio: Handles raster data like satellite imagery
  4. GeoPandas: Extends Pandas with a column of shapely geometries to intuitively query tables of geospatially annotated data.

These libraries provide intuitive Python wrappers around the OSGeo C/C++ libraries (GEOS, GDAL, …) which power virtually every open source geospatial library, like PostGIS, QGIS, etc.. They provide the same functionality, but are typically much slower due to how they use Python. This is acceptable for small datasets, but becomes an issue as we transition to larger and larger datasets.

In this post we focus on GeoPandas, a geospatial extension of Pandas which manages tabular data that is annotated with geometry information like points, paths, and polygons.

GeoPandas Example

GeoPandas makes it easy to load, manipulate, and plot geospatial data. For example, we can download the NYC taxi zones, load and plot them in a single line of code.

import geopandas

(geopandas.read_file('taxi_zones.shp')
          .to_crs({'init': 'epsg:4326'})
          .plot(column='borough', categorical=True))

Cities are now doing a wonderful job publishing data into the open. This provides transparency and an opportunity for civic involvement to help analyze, understand, and improve our communities. Here are a few fun geospatially-aware datasets to make you interested:

  1. Chicago Crimes from 2001 to present (one week ago)
  2. Paris Velib (bikeshare) in real time
  3. Bike lanes in New Orleans
  4. New Orleans Police Department incidents involving the use of force

Performance

Unfortunately GeoPandas is slow. This limits interactive exploration on larger datasets. For example the Chicago crimes data (the first dataset above) has seven million entries and is several gigabytes in memory. Analyzing a dataset of this size interactively with GeoPandas is not feasible today.

This slowdown is because GeoPandas wraps each geometry (like a point, line, or polygon) with a Shapely object and stores all of those objects in an object-dtype column. When we compute a GeoPandas operation on all of our shapes we just iterate over these shapes in Python. As an example, here is how one might implement a distance method in GeoPandas today.

import pandas as pd

def distance(self, other):
    # one Python-level call into Shapely/GEOS per geometry in the column
    result = [geom.distance(other)
              for geom in self.geometry]
    return pd.Series(result)

Unfortunately this just iterates over elements in the series, each of which is an individual Shapely object. This is inefficient for two reasons:

  1. Iterating through Python objects is slow relative to iterating through those same objects in C.
  2. Shapely Python objects consume more memory than the GEOS Geometry objects that they wrap.

This results in slow performance.

Cythonizing GeoPandas

Fortunately, we’ve rewritten GeoPandas with Cython to directly loop over the underlying GEOS pointers. This provides a 10-100x speedup depending on the operation. So instead of using a Pandas object-dtype column that holds shapely objects we instead store a NumPy array of direct pointers to the GEOS objects.

Before/after diagrams: an object-dtype column holding Shapely objects versus a NumPy array of pointers directly into GEOS.

As an example, our function for distance now looks like the following Cython implementation (some liberties taken for brevity):

cpdef distance(self, other):
    cdef int n = self.size
    cdef GEOSGeometry *left_geom
    cdef GEOSGeometry *right_geom = other.__geom__  # a geometry pointer
    geometries = self._geometry_array

    with nogil:
        for idx in xrange(n):
            left_geom = <GEOSGeometry *> geometries[idx]
            if left_geom != NULL:
                # liberties taken: the real GEOSDistance_r also takes a context
                # handle and writes the distance through an out-parameter
                distance = GEOSDistance_r(left_geom, right_geom)
            else:
                distance = NaN
For fast operations we see speedups of 100x. For slower operations we’re closer to 10x. Now these operations run at full C speed.

In his EuroSciPy talk Joris compares the performance of GeoPandas (both before and after Cython) with PostGIS, the standard geospatial plugin for the popular PostgreSQL database (original notebook with the comparison). I’m stealing some plots from his talk below:

Cythonized GeoPandas and PostGIS run at almost exactly the same speed. This is because they use the same underlying C library, GEOS. These algorithms are not particularly complex, so it is not surprising that everyone implements them in exactly the same way.

This is great. The Python GIS stack now has a full-speed library that operates as fast as any other open GIS system is likely to manage.

Problems

However, this is still a work in progress, and there is still plenty of work to do.

First, we need Pandas to track our arrays of GEOS pointers differently from how it tracks a normal integer array. This is both for usability reasons (we want to render them differently, and we don't want users to be able to perform numeric operations like sum and mean on these arrays) and for stability reasons, because we need to track these pointers and release their allocated GEOSGeometry objects from memory at the appropriate times. Currently, this goal is pursued by creating a new block type, the GeometryBlock ('blocks' are the internal building blocks of pandas that hold the data of the different columns). This will require some changes to Pandas itself to enable custom block types (see this issue on the pandas issue tracker).

Second, data ingestion is still quite slow. This relies not on GEOS but on GDAL/OGR, which is handled in Python today by Fiona. Fiona is optimized for consistency and usability rather than raw speed. Previously, when GeoPandas was slow, this made sense because no one was operating on particularly large datasets. However, we now observe that data loading is often several times more expensive than all of our manipulations, so this will probably need some effort in the future.
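For context, here is a minimal sketch of what that ingestion path looks like by hand. It is conceptually similar to (though not literally) what geopandas.read_file does, and 'taxi_zones.shp' is the shapefile from the example above.

import fiona
import geopandas

# Fiona/OGR yields one GeoJSON-like dict per feature; building the
# GeoDataFrame out of those Python dicts is where much of the time goes.
with fiona.open('taxi_zones.shp') as src:
    crs = src.crs
    features = list(src)

gdf = geopandas.GeoDataFrame.from_features(features, crs=crs)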

Third, there are some algorithms within GeoPandas that we haven’t yet Cythonized. This includes both particular features like overlay and dissolve operations as well as small components like GeoJSON output.

Finally as with any rewrite on a codebase that is not exhaustively tested (we’re trying to improve testing as we do this) there are probably several bugs that we won’t detect until some patient and forgiving user runs into them first.

Still though, all linear geospatial operations work well and are thoroughly tested. Also spatial joins (a backbone of many geospatial operations) are up and running at full speed. If you work in a non-production environment then Cythonized GeoPandas may be worth your time to investigate.
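As a toy sketch of the kind of spatial join meant here (using the plain GeoPandas API with made-up data, not the Cythonized internals; requires the rtree package), assigning each point the zone polygon that contains it looks like this:

import geopandas
from shapely.geometry import Point, Polygon

# Two unit-square "zones" and two points, one inside each zone.
zones = geopandas.GeoDataFrame({
    'zone': ['left', 'right'],
    'geometry': [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
                 Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])],
})
points = geopandas.GeoDataFrame({
    'geometry': [Point(0.5, 0.5), Point(1.5, 0.5)],
})

# Each point picks up the attributes of the polygon containing it.
joined = geopandas.sjoin(points, zones, how='inner', op='within')
print(joined[['zone']])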

You can track future progress on this effort at geopandas/geopandas #473 which includes installation instructions.

Parallelize with Dask

Cythonizing gives us speedups in the 10x-100x range. We use a single core as effectively as is possible with the GEOS library. Now we move on to using multiple cores in parallel. This gives us an extra 3-4x on a standard 4 core laptop. We can also scale to clusters, though I’ll leave that for a future blogpost.

To parallelize we need to split apart our dataset into multiple chunks. We can do this naively by placing the first million rows in one chunk, the second million rows in another chunk, etc. or we can partition our data spatially, for example by placing all of the data for one region of our dataset in one chunk and all of the data for another region in another chunk, and so on. Both approaches are implemented in a rudimentary dask-geopandas library available on GitHub.
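Here is a minimal sketch of only the naive row-based chunking, using plain dask.dataframe with random points standing in for real data; I am deliberately not sketching the spatially partitioned variant, since that lives behind the dask-geopandas prototype's own API.

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Fake pickup locations; the naive approach just slices off the first
# chunk of rows, then the next chunk, and so on, with no spatial locality.
n = 1_000_000
df = pd.DataFrame({
    'pickup_longitude': np.random.uniform(-74.05, -73.75, n),
    'pickup_latitude': np.random.uniform(40.60, 40.90, n),
})
ddf = dd.from_pandas(df, npartitions=8)   # eight row-based partitions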

So just as dask-array organizes many NumPy arrays along a grid and dask-dataframe organizes many Pandas dataframes along a linear index, the dask-geopandas library organizes many GeoPandas dataframes into spatial regions. In the example below we might partition data in the city of New York into its different boroughs. Data for each borough would be handled separately by a different thread or, in a distributed situation, might live on a different machine.

This gives us two advantages:

  1. Even without geospatial partitioning, we can use many cores (or many machines) to accelerate simple operations.
  2. For spatially aware operations, like spatial joins or subselections, we can engage only those parts of the parallel dataframe that we know are relevant for various parts of the computation.

However this is also expensive and not always necessary. In our initial exercise with the NYC Taxi data we didn't do this, and we still got significant speedups just from normal multicore operation.

Exercise

And so, to produce the images at the top of this post, we used a combination of dask.dataframe to load in CSV files, dask-geopandas to perform the spatial join, and then dask.dataframe and normal pandas to perform the actual computations. Our code looked something like the following:

import dask.dataframe as dd
import dask_geopandas as dg

df = dd.read_csv('yellow_tripdata_2015-*.csv')
gf = dg.set_geometry(df, geometry=df[['pickup_longitude', 'pickup_latitude']],
                     crs={'init': 'epsg:4326'})
# zones is the GeoDataFrame of NYC taxi zones loaded earlier with geopandas.read_file
gf = dg.sjoin(gf, zones[['zone', 'borough', 'geometry']])
# keep passenger_count so we can count rides per zone below
full = gf[['zone', 'passenger_count', 'payment_type', 'tip_amount', 'fare_amount']]

full.to_parquet('nyc-zones.parquet')  # compute and cache result on disk
full = dd.read_parquet('nyc-zones.parquet')

And then we can do the usual groupbys and joins on pandas-like data that is now properly annotated with zones.

import pandas as pd
import geopandas

result = full.passenger_count.groupby(full.zone).count().compute()
result.name = 'count'
joined = pd.merge(result.to_frame(), zones,
                  left_index=True, right_on='zone')
joined = geopandas.GeoDataFrame(joined)  # convert back for plotting

We've replaced most of Ravi's custom analysis with a few lines of new standard code. This maxes out our CPUs when doing spatial joins. Everything here releases the GIL well, and the entire computation runs in under a couple of gigabytes of RAM.

Problems

The dask-geopandas project is currently a prototype. It will easily break for non-trivial applications (and indeed many trivial ones). It was designed to see how hard it would be to implement some of the trickier operations like spatial joins, repartitioning, and overlays. This is why, for example, it supports a fully distributed spatial join, but lacks simple operations like indexing. There are other longer-term issues as well.

Serialization costs are manageable, but decently high. We currently use the standard “well known binary” WKB format common in other geospatial applications but have found it to be fairly slow, which bogs down inter-process parallelism.
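To make that concrete, here is a minimal sketch of the WKB round trip that happens whenever geometries move between processes; the point coordinates are arbitrary.

from shapely.geometry import Point
import shapely.wkb

# Encode on the sending side, decode on the receiving side; doing this
# for millions of geometries is what bogs down inter-process parallelism.
geom = Point(-73.98, 40.75)
blob = shapely.wkb.dumps(geom)    # bytes that get shipped between workers
geom2 = shapely.wkb.loads(blob)   # reconstructed geometry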

Similarly, distributed and spatially partitioned data stores don't seem to be common (or at least I haven't run across them yet).

It's not clear how dask-geopandas dataframes and normal dask dataframes should interact. It would be very convenient to reuse all of the algorithms in dask.dataframe, but the index structures of the two libraries are very different. This may require some clever software engineering on the part of the Dask developers.

Still though, these seem surmountable and generally this process has been easy so far. I suspect that we can build an intuitive and performant parallel GIS analytics system with modest effort.

The notebook for the example at the start of the blogpost shows using dask-geopandas with good results.

Conclusion

With established technologies in the PyData space like Cython and Dask we’ve been able to accelerate and scale GeoPandas operations above and beyond industry standards. However this work is still experimental and not ready for production use. This work is a bit of a side project for both Joris and Matthew and they would welcome effort from other experienced open source developers. We believe that this project can have a large social impact and are enthusiastic about pursuing it in the future. We hope that you share our enthusiasm.

September 21, 2017 12:00 AM

September 20, 2017

Titus Brown

Using the Open Science Framework to share files, for Science.

Sharing files in the 100 MB-5 GB range has been a perennial problem for my lab - we want to post them someplace central, perhaps publicly or perhaps privately, and we have several use cases that complicate things:

  • sometimes we need to share files with collaborators who are not so command-line savvy, so an easy-to-use Web interface is important;
  • other times we want to provide them as part of documentation or protocols (e.g. see my recent blog post), and provide simple download links that can be used with curl or wget;
  • when we're doing benchmarking, it's nice to be able to save the input and output files somewhere, but here we want to be able to use the command line to save (and sometimes to retrieve) the files to/from servers;
  • occasionally we need to share a collection of medium-sized files with dozens or hundreds of people for a workshop or a course, and it's nice to have a high performance always-up site for that purpose;
  • if we publish something, we'd like a location that we can "freeze" into an archive;

and of course it'd be nice to have a single system that does all of this!

Over the past decade we've used a variety of systems (figshare, AWS S3, Dropbox, mega.nz, Zenodo) and ignored a number of other systems (Box, anyone?), but none of them have really met our needs. GitHub is the closest, but GitHub itself is too restrictive with file sizes, and Git LFS was frustratingly bad the one time we engaged closely with it.

But now we have a solution that has been working out pretty well!

Enter the Open Science Framework, osf.io, which is a ...well, it's complicated, but it's a site that brings together some project management tools with user accounts and data storage.

I first encountered OSF in 2015, when we visited the Center for Open Science as part of the Open Source/Open Science meeting. COS does many things, including running various reproducibility projects, being engaged with research study preregistration and hosting a number of preprint sites. osf.io underpins many of these efforts and is an open source platform that is under continued development. They are serious about their software engineering, in it for the long run, and to the best of my understanding simply want to enable #openscience.

The osf.io site addresses essentially all of our use cases above:

  • it supports both public and private projects;
  • it has a simple and reasonably robust permissions model;
  • it is quite usable via the Web interface;
  • with only a little bit of trouble you can get download URLs for individual files (see the sketch just after this list);
  • we (by which I mean mostly Tim Head) have a ~beta version of a command line utility for interacting with the OSF storage system, called osfclient, which actually just works;
  • it's free! and will store individual files up to 5 GB, with no project limit;
  • they support "registering" projects which freezes your project at a particular point in time and gives you an archival location for that version (think GitHub + Zenodo style DOI generation);
  • they have a sustainability fund that guarantees that even if COS itself shuts down, the osf.io site will continue serving files for an extended period of time (last time I checked, it was 50 years);
  • you can even do things like host UCSC genome tracks on osf.io (and when you can't, because of bugs, they are really responsive and fix things quickly!)

and, coolest of all, they have "community advocates" like Natalie Meyers who are actively looking for use cases to support, which is kind of how we got involved with them.
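For the download-URL case mentioned above, a minimal sketch in Python looks like this; 'abcde' is a placeholder for the short file identifier you can copy from a file's page on osf.io.

import requests

# Every file stored on osf.io gets a short identifier, and
# https://osf.io/<id>/download serves the file itself.
url = 'https://osf.io/abcde/download'   # placeholder id
resp = requests.get(url)
resp.raise_for_status()
with open('myfile.tar.gz', 'wb') as fp:
    fp.write(resp.content)

(The same URLs work with curl or wget, which is what we use in protocols and documentation.)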

So... give it a try!

In the past six months, we've used osf.io for --

and more.

There's a bunch more things that osf.io can in theory do that I haven't explored, but, at least for now, it's meeting our file redistribution and project file sharing needs quite handily.

Where next? I'd suggest trying it out; it only takes about 5 minutes to figure out most of the basic stuff. You can ask them questions on Twitter at @OSFramework and there are FAQs and Guides ready to hand.

--titus

p.s. I've had extended conversations with them about whether they really want the genomics community descending upon them. They said "sure!" So, you know, ...go for it. :)

by C. Titus Brown at September 20, 2017 10:00 PM

Jarrod Millman

NetworkX 2.0 released

I am happy to announce the release of NetworkX 2.0!  NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

This release supports Python 2.7 and 3.4-3.6 and is the result of over two years of work with 1212 commits and 193 merges by 86 contributors. We have made major changes to the methods in the Multi/Di/Graph classes. There is a migration guide for people moving from 1.X to 2.0.
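As a rough sketch of the kind of change the migration guide covers, many graph methods now return view objects instead of lists; wrap them in list() or dict() where your 1.x code expected concrete containers.

import networkx as nx

G = nx.path_graph(4)          # 0 - 1 - 2 - 3

print(G.nodes)                # NodeView((0, 1, 2, 3))
print(list(G.edges))          # [(0, 1), (1, 2), (2, 3)]
print(dict(G.degree))         # {0: 1, 1: 2, 2: 2, 3: 1}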

For more information, please visit our website and our gallery of examples.  Please send comments and questions to the networkx-discuss mailing list.

by Jarrod Millman (noreply@blogger.com) at September 20, 2017 01:43 PM

September 19, 2017

Continuum Analytics

How to Troubleshoot Python Software in Anaconda Distribution

Below is a question that was recently asked on StackOverflow and I decided it would be helpful to publish an answer explaining the various ways in which to troubleshoot a problem you may be having in Anaconda.

by Sheyna Webster at September 19, 2017 02:00 PM

Gaël Varoquaux

Beyond computational reproducibility, let us aim for reusability

Note

Scientific progress calls for reproducing results. Due to limited resources, this is difficult even in computational sciences. Yet, reproducibility is only a means to an end. It is not enough by itself to enable new scientific results. Rather, new discoveries must build on reuse and modification of the state of the art. As time goes by, this state of the art must be consolidated in software libraries, just as scientific knowledge has been consolidated on the bookshelves of brick-and-mortar libraries.

I am reposting an essay that I wrote on reproducible science and software libraries. The full discussion is in IEEE CIS TC Cognitive and Developmental Systems, but I’ve been told that it is hard to find.

Science is based on the ability to falsify claims. Thus, reproduction or replication of published results is central to the progress of science. Researchers failing to reproduce a result will raise questions: Are these investigators not skilled enough? Did they misunderstand the original scientific endeavor? Or is the scientific claim unfounded? For this reason, the quality of the methods description in a research paper is crucial. Beyond papers, computers —central to science in our digital era— bring the hope of automating reproduction. Indeed, computers excel at doing the same thing several times.

However, there are many challenges to computational reproducibility. To begin with, computers enable reproducibility only if all steps of a scientific study are automated. In this sense, interactive environments —productivity-boosters for many— are detrimental unless they enable easy recording and replay of the actions performed. Similarly, as a computational-science study progresses, it is crucial to keep track of changes to the corresponding data and scripts. With a software-engineering perspective, version control is the solution. It should be in the curriculum of today's scientists. But it does not suffice. Automating a computational study is difficult. This is because it comes with a large maintenance burden: operations change rapidly, straining limited resources —processing power and storage. Saving intermediate results helps. As does devising light experiments that are easier to automate. These are as crucial to the progress of science as laboratory classes or thought experiments are in physics. A software engineer would relate them to unit tests, elementary operations checked repeatedly to ensure the quality of a program.

Archiving computers in thermally-regulated nuclear-proof vaults?

Once a study is automated and published, ensuring reproducibility should be easy; just a matter of archiving the computer used, preferably in a thermally-regulated nuclear-proof vault. Maybe, dear reader, the scientist in you frowns at this solution. Indeed, studies should also be reproduced by new investigators. Hardware and software variations then get in the way. Portability, i.e. achieving identical results across platforms, is well known by the software industry to be a difficult problem. It faces great hurdles due to incompatibilities in compilers, libraries, or operating systems. Beyond these issues, portability also faces numerical and statistical stability issues in scientific computing. Hiding instability problems with heavy restrictions on the environment is like rearranging deck chairs on the Titanic. While enough freezing will recover reproducibility, unstable operations cast doubt upon the scientific conclusions they might lead to. Computational reproducibility is more than a software engineering challenge; it must build upon solid numerical and statistical methods.

Reproducibility is not enough. It is only a means to an end, scientific progress. Setting in stone a numerical pipeline that produces a figure is of little use to scientific thinking if it is a black box. Researchers need to understand the corresponding set of operations to relate them to modeling assumptions. New scientific discoveries will arise from varying those assumptions, or applying the methodology to new questions or new data. Future studies build upon past studies, standing on the shoulders of giants, as Isaac Newton famously wrote. In this process, published results need to be modified and adapted, not only reproduced. Enabling reuse is an important goal.

Libraries as reusable computational experiments

To a software architect, a reusable computational experiment may sound like a library. Software libraries are not only a good analogy, but also an essential tool. The demanding process of designing a good library involves isolating elementary steps, ensuring their quality, and documenting them. It is akin to the editorial work needed to assemble a textbook from the research literature.

Science should value libraries made of code, and not only bookshelves. But they are expensive to develop, and even more so to maintain. Where do we draw the line? It is clear that in physics not every experimental setup can be stored for later reuse. Costs are less tangible in computational science, but they should not be underestimated. In addition, the race to publish creates legions of studies. As an example, Google Scholar lists 28,000 publications concerning compressive sensing in 2015. Arguably many are incremental, and research could do with fewer publications. Yet the very nature of research is to explore new ideas, not all of which are here to stay.

Identifying and consolidating major results for reuse

Computational research will best create scientific progress by identifying and consolidating the major results. It is a difficult but important task. These studies should be made reusable. Limited resources imply that the remainder will suffer from “code rot”, with results becoming harder and harder to reproduce as their software environment becomes obsolete. Libraries, curated and maintained, are the building blocks that can enable progress.


If you want to cite this essay in an academic publication, please cite the version in IEEE CIS TC Cognitive and Developmental Systems (volume 32, number 2, 2016).

by Gaël Varoquaux at September 19, 2017 10:10 AM

Matthieu Brucher

Book review: Audio Effects: Theory, Implementation and Application

I love reading books on signal processing, especially on audio signal processing. From books on using effects to so-called reference texts, I still enjoy reading them, even if they are frustrating. The only one that wasn't frustrating is DAFX: Digital Audio Effects, but I haven't written a review of it!

I picked up this book after reading a book on JUCE and starting to use JUCE myself. As this book uses JUCE for its plugins, I thought it would be a good one.

Discussion

The book starts with a great introduction on signal processing. It actually gives all the theory for linear time invariant filters. Once this is set, I’m not even sure you need more DSP theory!

The second chapter dials back and tackles delays, but with a theoretical approach. It would actually have been better IMHO to introduce the theory bit by bit, as we don't need to know anything about the z-transform for delays. I still think that Zölzer's approach to chorus (low-frequency random noise instead of an LFO) is the best…

After that, there are two chapters on general LTI effects. First filter design, with all the equations to switch between different types of filters, and then their application to EQs, wah-wah and others. It was interesting, but there is a lot of talk and the code quality is… bad. Really, you don't call modulo in the DSP loop.

The next chapter is yet another mystery: amplitude modulation. The concepts are so basic that it should have come before delays… I won't even go into it.

OK, the chapter on dynamic processing did introduce compression properly, but the authors forgot that compressors and expanders are more or less the same thing, and they confused compressors of type I (compressing above the threshold) and expanders of type II (expanding below the threshold) with compressors and expanders in general. The only thing you can't build is a compressor of type II, but expanders of type I do exist.

The chapter on overdrive is “cute”. I mean, waveshaping and oversampling, that’s great, but a proper book should tackle analog modeling as well in this kind of chapter…

The following chapter almost tackles vocoders. I say almost because, while it starts talking about the theory and the short-term FFT, it completely fails at the theory of the FFT. That is quite astonishing compared with the first chapter, but seriously, what the authors present as processing assumes a periodic signal. And even then, depending on the FFT window, the result is different. Why didn't they warn about the fact that this kind of effect is not causal? And then people go and design effects directly in the frequency domain and completely mess it up. Seriously!

After that chapter, it was all downhill. Spatial audio effects (pan, more or less, with a mention of HRTF), the Doppler effect, which is a rehash of the delay chapter with some more information, reverberation (which should have come after delays), a chapter on audio production (the signal path in a console or a DAW) and then a short chapter on JUCE (so short that I'm not really sure it matters). I suppose that with my tone, you understand that the second half of the book was a big disappointment.

Conclusion

So there is still no better DSP book than DAFX, and definitely no proper book on DSP programming… Avoid this book; it is too expensive for what it brings to the table.

by Matt at September 19, 2017 07:05 AM

September 18, 2017

Titus Brown

Classifying genome bins using a custom reference database, part II

Note: this blog post uses a pile of alpha code, which is not systematically tested yet. It will eventually be introduced into sourmash, probably as sourmash lca. I'll put a link to a new tutorial here when that happens.

See updated discussion here - this code probably doesn't really do what you want ;).


This is Part II of a blog post on some tools to taxonomically classify a set of genome assemblies against a private reference database. Part I is here.

The commands below should be a complete walkthrough of the process.

We'll be using data sets from two different binning approaches applied to the Tara ocean data set (Sunagawa et al., 2015). The two approaches were posted to bioRxiv as Tully et al., 2017 and Delmont et al., 2017. Below, I will refer to them as 'Tully' and 'Delmont'.

Installing sourmash

The below commands use a bunch of commands from the sourmash Python API, so you'll need to have sourmash installed. You can do that by following the beginning of the sourmash tutorial, or, if you are comfortable installing Python packages, you can set up your own virtualenv and go from there using the following commands (which do not require root privileges):

python3.5 -m venv ~/py3
. ~/py3/bin/activate
pip install -U pip
pip install -U Cython
pip install -U jupyter jupyter_client ipython pandas matplotlib scipy scikit-learn khmer

pip install https://github.com/dib-lab/sourmash/archive/master.zip

Grabbing the 2017-sourmash-lca repository

Most of the scripts below are in the 2017-sourmash-lca github repository. So, after changing to a workspace directory, clone the repo:

git clone https://github.com/ctb/2017-sourmash-lca

Getting the data

I've selected 100 genomes at random from both tully and delmont (although all the commands below should work with the full data sets, too). Below, we'll grab the data.

cd 2017-sourmash-lca/
curl -L https://osf.io/73yfz/download?version=1 -o subsample.tar.gz
tar xzf subsample.tar.gz
cd subsample

After you get and unpack the data, the command

ls -FC

should show you delmont/ and tully/ directories with 100 genomes in each.

Computing sourmash signatures for your "reference" genomes

Let's use the 'delmont' genomes as the reference genomes, against which you want to classify the 'tully' genomes; for that, we'll have to build a least-common-ancestor database for all the genomes in delmont/.

First, we need to compute sourmash signatures for the delmont genomes.

cd delmont
sourmash compute -k 21,31,51 --scaled 10000 --name-from-first *.fa.gz

This will take about 2 minutes.

(Note that sourmash compute will autodetect gzipped and bzipped files, as well as straight FASTA or FASTQ - you don't need to name the files in any special way.)

You should now have .sig files for every .fa.gz file in your delmont/ directory:

ls -1 *.fa.gz | wc -l
ls -1 *.sig | wc -l

should both yield 100.

These signature files contain the k-mers to which we will link taxonomic information.

Connecting genome names to NCBI taxonomic IDs

Next, we need to connect each signature to an NCBI taxonomic ID; the next step (Building a least-common-ancestor database) requires a CSV file with the following information:

genomename,taxid,lineage

here, genomename needs to be the first word (space-separated) of the first contig in each of the genome files above, and taxid needs to be the integer taxonomic ID from NCBI -- i.e. the number at the end of NCBI taxonomy database URLs. (It doesn't need to point to a specific species or strain, if the species or strain hasn't been entered into the database yet.)
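As a purely made-up illustration of the shape of that file (this row is not from the real spreadsheet; see tara_delmont_taxids.csv in the repository for the authoritative format), a row might look like:

genomename,taxid,lineage
TARA_ANE_MAG_00011,28211,Bacteria;Proteobacteria;Alphaproteobacteria

(28211 is the NCBI taxid for Alphaproteobacteria.)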

If you're like most bioinformaticians, you've probably developed your own custom way of doing taxonomy, so getting it into proper shape for this step might involve a bit of scripting.

For the Delmont et al. data, the predicted taxonomy of each genome bin is provided in Supplemental Table 3 of Delmont et al., which I saved as a CSV file here. However, we face another problem here, which is that the contigs are named things like TARA_ANE_MAG_00011_00000000001, but that's not what's in that spreadsheet. So to connect the spreadsheet to the signatures, we need to "fix" each signature name.

(Where did these names come from? The argument --name-from-first to sourmash compute tells sourmash to take the name of the signature from the first sequence in the FASTA file. Hey, it's better than setting it manually for 100 individual files!)

"Fixing" the names of the signatures

The following Python script will do the trick; it walks through every signature file provided on the command line, and updates the name fields to have the format that's in the spreadsheet.

#! /usr/bin/env python
import sys
import sourmash_lib, sourmash_lib.signature

for filename in sys.argv[1:]:
    print('...', filename)
    outsigs = []
    for sig in sourmash_lib.signature.load_signatures(filename):
        # keep only the first four underscore-separated fields, e.g.
        # 'TARA_ANE_MAG_00011_00000000001' -> 'TARA_ANE_MAG_00011'
        name = sig.d['name'].split('_')[:4]
        sig.d['name'] = "_".join(name)
        outsigs.append(sig)

    # write the renamed signatures alongside the originals, as '<name>.sig.fixed'
    with open(filename + '.fixed', 'wt') as fp:
        sourmash_lib.signature.save_signatures(outsigs, fp)

This script is available as delmont-fix-sig-names.py in the directory above the delmont directory, and to run it, execute:

python ../delmont-fix-sig-names.py *.sig

This will produce a bunch of .sig.fixed files, and you can rename them to .sig by doing the following:

for i in *.sig
do
    mv $i.fixed $i
done

OK! Now we've got the names in our signature files fixed up properly. Next, let's correlate them with the taxonomic IDs published in the paper!

Running a custom script to get taxids into the spreadsheet.

Next, we need to run the script to produce the CSV file that maps signature names to taxonomic IDs. This is checked into the 2017-sourmash-lca repository as script get_taxid_tara_delmont.py. This script is a hacky mess that searches the downloaded NCBI Files for the names in the spreadsheet, and does some disambiguation if there are duplicate names.

For this, you'll need the names.dmp and nodes.dmp files from the NCBI taxonomy dump database, so download it, unpack the two files into a subdirectory, and then gzip them so they take up a bit less space.

cd ../
mkdir db/
cd db/
curl -O -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xzf taxdump.tar.gz names.dmp nodes.dmp
gzip names.dmp nodes.dmp
rm taxdump.tar.gz
cd ../

You'll also need the taxonomy spreadsheet provided with Delmont et al.; let's put it in the subsample/ directory above delmont/, which should be our current working directory:

curl -O -L https://github.com/ctb/2017-sourmash-lca/raw/master/tara-delmont-SuppTable3.csv

Now, run the script to convert the taxonomy in the Delmont et al. paper into NCBI taxids:

python ../get_taxid_tara_delmont.py db/names.dmp.gz db/nodes.dmp.gz \
    tara-delmont-SuppTable3.csv tara_delmont_taxids.csv

It will take a few minutes to run, but it should produce a file named tara_delmont_taxids.csv that should look preeeeeeetty similar to the one in github. This is the file you'll need for the next step.

Building a least-common-ancestor database

Now we get to put all this information together!

To build the least-common-ancestor (LCA) database, we'll need to specify: a k-mer size (here 21), a scaled value (here 10,000 - more on what this does in another blog post), the output name of the LCA database (here delmont.k21.lca), the taxonomic spreadsheet we just created (tara_delmont_taxids.csv), the NCBI nodes dump table which we downloaded above (db/nodes.dmp.gz), and the signatures -- here, all in the subdirectory delmont, and found by specifying --traverse-directory (so the extract script will go find all the signatures under delmont/).

Finally, we'll specify an LCA metadata database delmont.lca.json that ties together all of the files and k-mer size/scaled values for later use.

To run the extract script that produces the LCA database, then, we have this ungainly command:

../extract.py -k 21 --scaled 10000 \
    --traverse-directory \
    db/delmont.k21.lca tara_delmont_taxids.csv \
    db/nodes.dmp.gz delmont \
    --lca-json db/delmont.lca.json

If that worked without complaint, you should see some output text talking about how it "found root 2431 times".

Run extract.py two more times, once for each k-mer value we used in building the signatures above with sourmash compute (k=31 and k=51):

../extract.py -k 31 --scaled 10000 \
    --traverse-directory \
    db/delmont.k31.lca tara_delmont_taxids.csv \
    db/nodes.dmp.gz delmont \
    --lca-json db/delmont.lca.json

../extract.py -k 51 --scaled 10000 \
    --traverse-directory \
    db/delmont.k51.lca tara_delmont_taxids.csv \
    db/nodes.dmp.gz delmont \
    --lca-json db/delmont.lca.json

At the end of this, you should have a directory db/ containing three *.lca files, along with names.dmp.gz and nodes.dmp.gz and two cache files. You will also have a delmont.lca.json file. Collectively, the contents of db/ are self-contained and can be packaged up and distributed.

Classifying a collection of genome bins with your newly constructed reference database.

The original goal of this two-part blog post was to be able to classify new genome bins against existing private collections of genomes, not just Genbank. So let's do that!

Let's suppose your new collection of genome bins is in the directory tully/. First, compute signatures for all of them:

cd tully
sourmash compute -k 21,31,51 --scaled 10000 --name-from-first *.fna.gz
cd ../

and now... classify!

To do this, we will use the classify-genome-sigs.py script that is part of the github.com/ctb/2017-tara-binning repository. Rather than grabbing that entire repository, let's just download the one script --

curl -L -O https://github.com/ctb/2017-tara-binning/raw/master/classify-genome-sigs.py
chmod +x classify-genome-sigs.py
mv classify-genome-sigs.py ../

and now run it:

../classify-genome-sigs.py --lca db/delmont.lca.json -k 21 tully

This uses our just-created LCA database, delmont.lca.json, with a k-mer size of 21, to classify all of the signatures in the tully/ subdirectory.

You should see the following output:

loading LCA database from db/delmont.lca.json
loading taxonomic nodes from: db/nodes.dmp.gz
loading taxonomic names from: db/names.dmp.gz
loading k-mer DB from: delmont.k21.lca
loading all signatures in directory: tully
saving to cache: tully.pickle
...loaded 100 signatures at k=21

classified sourmash signatures in directory: 'tully'
LCA databases used: delmont.lca.json

total signatures found: 100
no classification information: 95

could classify 5
of those, 0 disagree at some rank.

number classified unambiguously, by lowest classification rank:
        class: 3
        family: 1
        genus: 1

disagreements by rank:

classification output as CSV, here: tully.taxonomy.csv

which indicates that only 5 out of the 100 signatures could be classified -- more detail on those 5 can be seen by looking in tully.taxonomy.csv:

grep found tully.taxonomy.csv

which should yield

TOBG_SP-33.fna.gz,2742,found,genus,Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter
TOBG_EAC-53.fna.gz,246874,found,family,Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Cryomorphaceae
TOBG_MED-644.fna.gz,28211,found,class,Bacteria;Proteobacteria;Alphaproteobacteria
TOBG_CPC-18.fna.gz,1236,found,class,Bacteria;Proteobacteria;Gammaproteobacteria
TOBG_SP-120.fna.gz,1236,found,class,Bacteria;Proteobacteria;Gammaproteobacteria

(You can look at the first part of this blog post to see how to interpret the 3rd column.)

Using the genbank LCA, too.

We've already prepared the genbank LCA database (which takes about 2 hours total), and you can download it and use it, along with the delmont data, to classify the tully sequences; all you need to do is specify multiple lca.json databases on the command line.

To do this, first download and unpack the genbank file:

curl -L https://osf.io/zfmbd/download?version=1 -o genbank-lca-2017.08.26.tar.gz
cd db/
tar xzf ../genbank-lca-2017.08.26.tar.gz
cd ../

and then re-run classify-genome-sigs.py:

../classify-genome-sigs.py \
    --lca db/delmont.lca.json db/genbank.lca.json \
    -k 21 tully

-- now you should see that you can classify 56 genomes, 51 of them unambiguously.

Using other k-mer sizes than 21

Both the genbank LCA database and the one we calculated from Delmont et al. were built for three k-mer sizes: 21, 31, and 51. So you can set the -k parameter to any of those, e.g. k=51:

../classify-genome-sigs.py \
    --lca db/delmont.lca.json db/genbank.lca.json \
    -k 51 tully -o tully-k51.taxonomy.csv

Bonus: plotting your new classification using metacodeR

Let's generate a plot. I'm going to use metacodeR but Taylor Reiter from our lab has some other options here.

Note: you'll need to install metacoder in R for the below to work.

First, generate a classification:

../classify-genome-sigs.py \
    --lca db/delmont.lca.json db/genbank.lca.json \
    -k 21 tully

Grab two scripts, plot-metacoder.R and load_classify_csv.R:

curl -O -L https://github.com/ctb/2017-tara-binning/raw/master/load_classify_csv.R
curl -O -L https://github.com/ctb/2017-tara-binning/raw/master/plot-metacoder.R

and now

Rscript plot-metacoder.R tully.taxonomy.csv tully.pdf

should produce a plot entitled 'tully.pdf', which looks like the below:

A pretty metacodeR picture, again

Appendix: a recap

To build an LCA database for a private collection of genomes:

  1. Compute sourmash signatures for your genomes.
  2. Link those signatures to their taxonomic IDs in a CSV file.
  3. Run extract.py to build the database.

To classify another set of genomes with an LCA database:

  1. Compute sourmash signatures for your genomes.
  2. Run classify-genome-sigs.py on your new signatures, using one or more LCA databases.

And that's all, folks!

by C. Titus Brown at September 18, 2017 10:00 PM

numfocus

The Econ-ARK joins NumFOCUS Sponsored Projects

NumFOCUS is pleased to announce the addition of the Econ-ARK to our fiscally sponsored projects. As a complement to the thriving QuantEcon project (also a NumFOCUS sponsored project), the Econ-ARK is creating an open-source resource containing the tools needed to understand how diversity across economic agents (in preferences, circumstances, knowledge, etc) leads to richer and […]

by NumFOCUS Staff at September 18, 2017 05:06 PM

Matthew Rocklin

Dask on HPC - Initial Work

This work is supported by Anaconda Inc. and the NSF EarthCube program.

We recently announced a collaboration between the National Center for Atmospheric Research (NCAR), Columbia University, and Anaconda Inc to accelerate the analysis of atmospheric and oceanographic data on high performance computers (HPC) with XArray and Dask. The full text of the proposed work is available here. We are very grateful to the NSF EarthCube program for funding this work, which feels particularly relevant today in the wake (and continued threat) of the major storms Harvey, Irma, and Jose.

This is a collaboration of academic scientists (Columbia), infrastructure stewards (NCAR), and software developers (Anaconda and Columbia and NCAR) to scale current workflows with XArray and Jupyter onto big-iron HPC systems and peta-scale datasets. In the first week after the grant closed a few of us focused on the quickest path to get science groups up and running with XArray, Dask, and Jupyter on these HPC systems. This blogpost details what we achieved and some of the new challenges that we’ve found in that first week. We hope to follow this blogpost with many more to come in the future. Today we cover the following topics:

  1. Deploying Dask with MPI
  2. Interactive deployments on a batch job scheduler, in this case PBS
  3. The virtues of JupyterLab in a remote system
  4. Network performance and 3GB/s infiniband
  5. Modernizing XArray’s interactions with Dask’s distributed scheduler

A video walkthrough deploying Dask on XArray on an HPC system is available on YouTube and instructions for atmospheric scientists with access to the Cheyenne Supercomputer is available here.

Now let's start with the technical issues:

Deploying Dask with MPI

HPC systems use job schedulers like SGE, SLURM, PBS, LSF, and others. Dask has been deployed on all of these systems before either by academic groups or financial companies. However every time we do this it’s a little different and generally tailored to a particular cluster.

We wanted to make something more general. This started out as a GitHub issue on PBS scripts that tried to make a simple common template that people could copy and modify. Unfortunately, there were significant challenges with this. HPC systems and their job schedulers seem to focus on, and easily support, only two common use cases:

  1. Embarrassingly parallel “run this script 1000 times” jobs. This is too simple for what we have to do.
  2. MPI jobs. This seemed like overkill, but is the approach that we ended up taking.

Deploying Dask is somewhere between these two. It falls into the master-slave pattern (or perhaps more appropriately coordinator-workers). We ended up building an MPI4Py program that launches Dask. MPI is well supported, and more importantly consistently supported, by all HPC job schedulers, so depending on MPI provides a level of stability across machines. Now dask.distributed ships with a new dask-mpi executable:

mpirun --np 4 dask-mpi

To be clear, Dask isn’t using MPI for inter-process communication. It’s still using TCP. We’re just using MPI to launch a scheduler and several workers and hook them all together. In pseudocode the dask-mpi executable looks something like this:

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    start_dask_scheduler()   # placeholder: rank 0 runs the scheduler
else:
    start_dask_worker()      # placeholder: every other rank runs a worker

Socially this is useful because every cluster management team knows how to support MPI, so anyone with access to such a cluster has someone they can ask for help. We’ve successfully translated the question “How do I start Dask?” to the question “How do I run this MPI program?” which is a question that the technical staff at supercomputer facilities are generally much better equipped to handle.

Working Interactively on a Batch Scheduler

Our collaboration is focused on interactive analysis of big datasets. This means that people expect to open up Jupyter notebooks, connect to clusters of many machines, and compute on those machines while they sit at their computer.

Unfortunately most job schedulers were designed for batch scheduling. They will try to run your job quickly, but don't mind waiting for a few hours for a nice set of machines on the supercomputer to open up. As you ask for more time and more machines, waiting times can increase drastically. For most MPI jobs this is fine because people aren't expecting to get a result right away and they're certainly not interacting with the program, but in our case we really do want some results right away, even if they're only part of what we asked for.

Handling this problem long term will require both technical work and policy decisions. In the short term we take advantage of two facts:

  1. Many small jobs can start more quickly than a few large ones. These take advantage of holes in the schedule that are too small to be used by larger jobs.
  2. Dask doesn’t need to be started all at once. Workers can come and go.

And so I find that if I ask for several single machine jobs I can easily cobble together a sizable cluster that starts very quickly. In practice this looks like the following:

$ qsub start-dask.sh      # only ask for one machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine

Our main job has a wall time of about an hour. The workers have shorter wall times. They can come and go as needed throughout the computation as our computational needs change.
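A small sketch of what this looks like from the Python side: connect to the scheduler started by start-dask.sh via its scheduler file (the same scheduler.json used below) and watch workers join as the add-one-worker.sh jobs clear the queue.

from dask.distributed import Client

client = Client(scheduler_file='scheduler.json')

# The number of connected workers grows as queued jobs start,
# and shrinks again as their wall times expire.
print(len(client.scheduler_info()['workers']))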

Jupyter Lab and Web Frontends

Our scientific collaborators enjoy building Jupyter notebooks of their work. This allows them to manage their code, scientific thoughts, and visual outputs all at once and for them serves as an artifact that they can share with their scientific teams and collaborators. To help them with this we start a Jupyter server on the same machine in their allocation that is running the Dask scheduler. We then provide them with SSH-tunneling lines that they can copy-and-paste to get access to the Jupyter server from their personal computer.

We’ve been using the new Jupyter Lab rather than the classic notebook. This is especially convenient for us because it provides much of the interactive experience that they lost by not working on their local machine. They get a file browser, terminals, easy visualization of textfiles and so on without having to repeatedly SSH into the HPC system. We get all of this functionality on a single connection and with an intuitive Jupyter interface.

For now we give them a script to set all of this up. It starts Jupyter Lab using Dask and then prints out the SSH-tunneling line.

from dask.distributed import Client
client = Client(scheduler_file='scheduler.json')

import socket
host = client.run_on_scheduler(socket.gethostname)

def start_jlab(dask_scheduler):
    import subprocess
    proc = subprocess.Popen(['jupyter', 'lab', '--ip', host, '--no-browser'])
    dask_scheduler.jlab_proc = proc

client.run_on_scheduler(start_jlab)

print("ssh -N -L 8787:%s:8787 -L 8888:%s:8888 -L 8789:%s:8789 cheyenne.ucar.edu" % (host, host, host))

Long term we would like to switch to an entirely point-and-click interface (perhaps something like JupyterHub) but this will require additional thinking about deploying distributed resources along with the Jupyter server instance.

Network Performance on Infiniband

The intended computations move several terabytes across the cluster. On this cluster Dask gets about 1GB/s simultaneous read/write network bandwidth per machine using the high-speed Infiniband network. For any commodity or cloud-based system this is very fast (about 10x faster than what I observe on Amazon). However for a super-computer this is only about 30% of what’s possible (see hardware specs).

I suspect that this is due to byte-handling in Tornado, the networking library that Dask uses under the hood. The following image shows the diagnostic dashboard for one worker after a communication-heavy workload. We see 1GB/s for both read and write. We also see 100% CPU usage.

Network performance is a big question for HPC users looking at Dask. If we can get near MPI bandwidth then that may help to reduce concerns for this performance-oriented community.

How do I use Infiniband network with Dask?

XArray and Dask.distributed

XArray was the first major project to use Dask internally. This early integration was critical to prove out Dask’s internals with user feedback. However it also means that some parts of XArray were designed well before some of the newer parts of Dask, notably the asynchronous distributed scheduling features.

XArray can still use Dask on a distributed cluster, but only with the subset of features that are also available with the single machine scheduler. This means that persisting data in distributed RAM, parallel debugging, publishing shared datasets, and so on all require significantly more work today with XArray than they should.
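The basic path does already work, though. Here is a minimal sketch (file pattern and chunk sizes are placeholders) of pointing XArray's dask-backed arrays at the distributed scheduler; creating the Client registers it as the default scheduler, so ordinary .compute() calls run on the cluster.

import xarray as xr
from dask.distributed import Client

client = Client(scheduler_file='scheduler.json')  # becomes the default scheduler

# Placeholder paths and chunking; each chunk becomes one Dask task.
ds = xr.open_mfdataset('/glade/scratch/ocean/*.nc', chunks={'time': 100})
time_mean = ds.mean(dim='time').compute()         # runs on the cluster workers

The missing pieces are the more advanced distributed features mentioned above.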

To address this we plan to update XArray to follow a newly proposed Dask interface. This is complex enough to handle all Dask scheduling features, but light weight enough not to actually require any dependence on the Dask library itself. (Work by Jim Crist.)

We will also eventually need to look at reducing overhead for inspecting several NetCDF files, but we haven’t yet run into this, so I plan to wait.

Future Work

We think we’re at a decent point for scientific users to start playing with the system. We have a Getting Started with Dask on Cheyenne wiki page that our first set of guinea pig users have successfully run through without much trouble. We’ve also identified a number of issues that the software developers can work on while the scientific teams spin up.

  1. Zero copy Tornado writes to improve network bandwidth
  2. Enable Dask.distributed features in XArray by formalizing dask’s expected interface
  3. Dynamic deployments on batch job schedulers

We would love to engage other collaborators throughout this process. If you or your group work on related problems we would love to hear from you. This grant isn’t just about serving the scientific needs of researchers at Columbia and NCAR, but about building long-term systems that can benefit the entire atmospheric and oceanographic community. Please engage on the Pangeo GitHub issue tracker.

September 18, 2017 12:00 AM

September 15, 2017

numfocus

2017 GSoC Students Complete Work on NumFOCUS Projects

The summer has come to an end and so have the internships of our GSoC students. Ten students have finished their projects successfully and now continue to contribute to our open source community! FEniCS: Ivan Yashchuk has implemented quadrilateral and hexahedral meshes for finite element methods. Michal Habera has added support for the XDMF format.   […]

by NumFOCUS Staff at September 15, 2017 09:12 PM

September 14, 2017

Enthought

Webinar: Python for MATLAB Users: What You Need To Know

What:  A guided walkthrough and Q&A about how to migrate from MATLAB® to Python with Enthought Lead Instructor, Dr. Alexandre Chabot-Leclerc.

Who Should Watch: MATLAB® users who are considering migrating to Python, either partially or completely.

View the Webinar


Python has a lot of momentum. Many high profile projects use it and more are migrating to it all the time. Why? One reason is that Python is free, but more importantly, it is because Python has a thriving ecosystem of packages that allow developers to work faster and more efficiently. They can go from prototyping to production to scale on hardware ranging from a Raspberry Pi (or maybe a microcontroller) to a cluster, all using the same language. A large part of Python's growth is driven by its excellent support for work in the fields of science, engineering, machine learning, and data science.

You and your organization might be thinking about migrating from MATLAB to Python to get access to the ecosystem and increase your productivity, but you might also have some outstanding questions and concerns, such as: How do I get started? Will any of my knowledge transfer? How different are Python and MATLAB? How long will it take me to become proficient? Is it too big of a shift? Can I transition gradually or do I have to do it all at once? These are all excellent questions.

We know people put a lot of thought into the tools they select and that changing platforms is a big deal. We created this webinar to help you make the right choice.

In this webinar, we’ll give you the key information and insight you need to quickly evaluate whether Python is the right choice for you, your team, and your organization, including:

  • How to get started
  • What you need in order to replicate the MATLAB experience
  • Important conceptual differences between MATLAB and Python
  • Important similarities between MATLAB and Python: What MATLAB knowledge will transfer
  • Strategies for converting existing MATLAB code to Python
  • How to accelerate your transition

View the Webinar


Presenter: Dr. Alexandre Chabot-Leclerc, Enthought Lead Instructor

Ph.D, Electrical Engineering, Technical University of Denmark

 


Python for Scientists & Engineers Training: The Quick Start Approach to Turbocharging Your Work

If you are tired of running repeatable processes manually and want to (semi-) automate them to increase your throughput and decrease pilot error, or you want to spend less time debugging code and more time writing clean code in the first place, or you are simply tired of using a multitude of tools and languages for different parts of a task and want to replace them with one comprehensive language, then Enthought’s Python for Scientists and Engineers is definitely for you!

This class has been particularly appealing to people who have been using other tools like MATLAB or even Excel for their computational work and want to start applying their skills using the Python toolset.  And it’s no wonder — Python has been identified as the most popular coding language for five years in a row for good reason.

One reason for its broad popularity is its efficiency and ease-of-use. Many people consider Python more fun to work in than other languages (and we agree!). Another reason for its popularity among scientists, engineers, and analysts in particular is Python’s support for rapid application development and extensive (and growing) open source library of powerful tools for preparing, visualizing, analyzing, and modeling data as well as simulation.

Python is also an extraordinarily comprehensive toolset – it supports everything from interactive analysis to automation to software engineering to web app development within a single language and plays very well with other languages like C/C++ or FORTRAN so you can continue leveraging your existing code libraries written in those other languages.

Many organizations are moving to Python so they can consolidate all of their technical work streams under a single comprehensive toolset. In the first part of this class we’ll give you the fundamentals you need to switch from another language to Python and then we cover the core tools that will enable you to do in Python what you were doing with other tools, only faster and better!

Additional Resources

Upcoming Open Python for Scientists & Engineers Sessions:

Washington, DC, Sept 25-29, 2017
Los Alamos, NM, Oct 2-6, 2017
Cambridge, UK, Oct 16-20, 2017
San Diego, CA, Oct 30-Nov 3, 2017
Albuquerque, NM, Nov 13-17, 2017
Los Alamos, NM, Dec 4-8, 2017
Austin, TX, Dec 11-15, 2017

Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.

Learn More

Download Enthought’s MATLAB to Python White Paper

Additional Webinars in the Training Series:

Python for Scientists & Engineers: A Tour of Enthought’s Professional Technical Training Course

Python for Data Science: A Tour of Enthought’s Professional Technical Training Course

Python for Professionals: The Complete Guide to Enthought’s Technical Training Courses

An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop

Download Enthought’s Machine Learning with Python’s Scikit-Learn Cheat Sheets

The post Webinar: Python for MATLAB Users: What You Need To Know appeared first on Enthought Blog.

by admin at September 14, 2017 06:00 PM

September 12, 2017

Bruno Pinho

Integrating & Exploring 2: Predicting Spatial Data with Machine Learning

Using Machine Learning (ML) algorithms to predict Airborne Geophysics. A simple but powerful solution using Rotation Gradients to add complexity to the model, with an Automated Machine Learning (AutoML) implementation.

by Bruno Ruas de Pinho at September 12, 2017 01:00 AM

September 08, 2017

Titus Brown

Classifying genome bins using a custom reference database, part I (Saturday Morning Bioinformatics)

Last week, Trina McMahon sent up a Twitter flare, asking for a tool that would taxonomically classify genome bins using a custom reference database. Based on having watched her excellent JGI talk, my understanding was that Trina was interested in taking single-cell and metagenome assembled genomes (SAGs and MAGs, in the parlance of the day - I'll just call them "bins" below) and identifying those using private databases. One particular use case of interest was using in-house databases of SAGs and MAGs that had already been classified.

I don't know how connected Trina's comment was to my previous blogging, but last Monday I put together a short blog post on taxonomic exploration of two sets of genome bins extracted from the Tara ocean data. That blog post focused rather narrowly on identifying genome bins that might be "confused" in the sense that they contained k-mers with disagreeing taxonomic assignments (see my blog post on k-mers and taxonomy for info here). Using the code in that blog post, it was in theory quite easy to do taxonomic assignments of unassigned genome bins.

Separately, Tom Delmont (the lead author on one of the genome binning studies I'd used for the blog post) implied that they had not compared their genome bins to GenBank genomes. That was basically the same problem that Trina wanted solved.

So, I put together a script to do something simple but hopefully useful, and (bonus!!) I added some pretty viz using metacodeR. The end result of applying taxonomic classification to a directory (with plotting) looks something like this, for the Tully et al. study:

A pretty metacodeR picture


When reading the stuff below, please bear in mind that I am doing nucleotide-level classification, which is only going to work at the ~species level. If you are interested in an approach that is more sensitive for more distant evolutionary relationships, see Meren's blog on their classification approach using CheckM.

All of the below was run on my laptop in under ~2 GB of RAM and less than 10 minutes of compute.


Starting at the end

If you want to use the ~100,000 Genbank genomes to classify your genome bins, and you already have a directory of genome signatures computed with sourmash, then you can run the classify-genome-sigs.py script like so:

# download genbank DB
curl -L https://osf.io/zfmbd/download?version=1 -o genbank-lca-2017.08.26.tar.gz

# unpack LCA db
mkdir db
cd db
tar xzf ../genbank-lca-2017.08.26.tar.gz
cd ../

# Run classifier on the whole set of genomes
./classify-genome-sigs.py tully-genome-sigs --lca db/genbank.lca.json

# Plot!
Rscript plot-metacoder.R tully-genome-sigs.taxonomy.csv tully.metacoder.pdf

and you should get a much sparser version of the figure above.

The reason you're getting only a sparse set of classifications is that you're using genbank, and most of the MAGs in the Tully and Delmont studies are unknown. What you really want to use is a custom database - in this case, we'll use custom LCA databases built from the delmont and tully data sets, but you can use any LCA databases you've built.

To classify using custom databases, you've got to download and unpack the databases, and then add them to the end of the command above:

# grab the Tully et al. classifications
curl -L https://osf.io/3pc7j/download -o tully-lca-2017.09.04.tar.gz 
curl -L https://osf.io/bw2dj/download -o delmont-lca-2017.09.04.tar.gz

# unpack 'em
tar xzf tully-lca-2017.09.04.tar.gz
tar xzf delmont-lca-2017.09.04.tar.gz

# Run classifier on the whole set of genomes with all three LCA dbs
./classify-genome-sigs.py tully-genome-sigs \
    --lca db/genbank.lca.json db/tully.lca.json db/delmont.lca.json

# Plot!
Rscript plot-metacoder.R tully-genome-sigs.taxonomy.csv tully.metacoder.pdf

After a bunch of diagnostic output, you should see some summary output:

classified sourmash signatures in directory: 'tully-genome-sigs'
LCA databases used: db/genbank.lca.json, db/tully.lca.json, db/delmont.lca.json

total signatures found: 2631
no classification information: 624

could classify 2007
of those, 73 disagree at some rank.

number classified unambiguously, by lowest classification rank:
        superkingdom: 35
        phylum: 285
        class: 527
        family: 656
        genus: 339
        species: 91

disagreements by rank:
        superkingdom: 1
        phylum: 11
        order: 4
        class: 8
        family: 10
        genus: 10
        species: 29

classification output as CSV, here: tully-genome-sigs.taxonomy.csv

This gives you an output file tully-genome-sigs.taxonomy.csv containing the various assignments, and you should also have a tully.metacoder.pdf that looks like the image above.

Interpreting the CSV file

The CSV file has five columns: name, taxid, status, rank_info, and lineage. name is the name in the signature file; taxid and lineage are the NCBI taxonomic ID assigned to that name, and the taxonomic lineage of that ID.

The status and rank_info require a bit more explanation.

There are three possible status values at present: nomatch, found, and disagree.

nomatch is hopefully obvious - no (or insufficient) taxonomic information was found for k-mers in that signature, so you get a taxid of 0, and an empty rank and lineage:

TARA_IOS_MAG_00005,0,nomatch,,

found is hopefully also pretty obvious: it says that a simple straightforward lineage was found across all of the databases, and gives you that lineage down to the most detailed rank found:

TARA_PSW_MAG_00129,28108,found,species,Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Alteromonas;Alteromonas macleodii

disagree is the least clear. Here, the assigned rank is the rank immediately above where there is a taxonomic disagreement, and the taxid & lineage refer to the name at that rank (the least-common-ancestor at which an assignment can be made).

For example, look at this line in the CSV file:

TARA_ASW_MAG_00029,1224,disagree,phylum,Bacteria;Proteobacteria

TARA_ASW_MAG_00029 has k-mers that are shared between different orders: 'Pseudomonadales' and 'Rhodobacterales'. Therefore, the classifier status is disagree, and the classified taxid is at rank phylum - just above order.
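If you want to tally the status values yourself, a few lines of Python over the CSV will do it - this is my quick sketch, not part of classify-genome-sigs.py:

import csv
from collections import Counter

with open('tully-genome-sigs.taxonomy.csv') as fp:
    statuses = Counter(row['status'] for row in csv.DictReader(fp))

print(statuses)   # tallies of 'found', 'disagree', and 'nomatch' rows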

Plotting with metacodeR

I like pretty pictures as much as the next person, but I rarely spend the time to make them - I'm just not good at plotting!

However, when I was last visiting Oregon State (to give a talk at the Oregon State Microbiome Initiative) I met Zachary Foster, a student of Nik Grunwald, and saw their metacodeR poster. metacodeR is an R package for making pretty taxonomic figures, and since I was deep in the throes of sourmash taxonomic foo I asked him for some help, which he promptly gave.

Fast forward to today, and I thought, "hey, Zach said that I should be able to use metacodeR with this kind of output - why don't I give it a try?"

At this point I should confess that I am not a very good R coder, so this filled me with trepidation. But leveling up is what Saturday Morning Bioinformatics is for! So I opened up a new window for all the R syntax google searches I was going to be doing, started up RStudio, installed metacodeR, and ... after about an hour of struggling with R, got a script working!

You can see the script in all its glory here, or I've pasted it below - mainly because I need some help understanding the line that says names(csvnames) <- csvnames, which is Magic that extract_taxonomy needs but that I don't understand.

# load the output of 'classify-genome-sigs.py'
library('metacoder')
library('dplyr')

load_classify_csv <- function(filename) {
  csvdata = read.csv(filename, header=TRUE)

  # eliminate empty lineages
  csvdata <- filter(csvdata, lineage != "")

  # create a mashup of name and lineage to satisfy extract_taxonomy
  csvnames <- paste(csvdata$name, csvdata$lineage, sep='\t')

  # heck if I know why I have to do this :), but set list names to list values
  names(csvnames) <- csvnames

  taxdata <- extract_taxonomy(csvnames, regex = "^(.*)\\\t(.*)",
                              key = c(id = "obs_info", "class"), class_sep=';')

  taxdata
}

Suggestions welcome on cleaning this up :)

Part 2 is yet to come

OK, I'm officially out of time for the day, so I'll stop here. In part 2, I'll discuss how to create your own directory of genome signatures, show you how to build your own LCA databases, and give a fully repeatable workflow for everything. (If you're super enthusiastic to try the stuff above, you can probably ad lib it from this blog post).

In the meantime, comments welcome!

--titus

by C. Titus Brown at September 08, 2017 10:00 PM

Continuum Analytics

Securely Connecting Anaconda Enterprise to a Remote Spark Cluster

With the release of Anaconda Enterprise 5, we are introducing a more robust method of connecting JupyterLab, the interactive data science notebook environment, to an Apache Spark cluster.

by Sheyna Webster at September 08, 2017 01:00 PM

September 07, 2017

numfocus

NumFOCUS is Now Hiring: Events Coordinator

NumFOCUS seeks applicants for the position of Events Coordinator. Join our small but mighty team and help support data science communities all over the world! This is a junior-level position with the opportunity to make a big impact — you’ll be working on community management of our global open source scientific computing community! Why Work […]

by NumFOCUS Staff at September 07, 2017 03:56 PM

September 06, 2017

Pierre de Buyl

Testing the compact Hilbert index

In 2015, I worked to implement the so-called "compact" Hilbert index. You can read about it here. After some feedback on the code, I decided that I wanted to investigate "how good" the compact Hilbert index actually is. In this post, I compare my code to Chris Hamilton's original code (using a program by Peter Colberg) and verify the basic promise of the compact Hilbert index (that it keeps the ordering of the full Hilbert index).

What is already out there

I am aware of the following software for computing the compact Hilbert index:

  1. Chris Hamilton published libhilbert. His academic webpage has been removed but I had a copy of the library. I provide a copy of his "0.2.1" version on my GitHub account. The code is LGPL v2.1 so it can be redistributed.
  2. Peter Colberg has an implementation in his simulation software nano-dimer. This is what motivated me to use the compact Hilbert index in the first place. Peter Colberg has a test against Chris Hamilton's implementation. This code is MIT and implemented in OpenCL C.
  3. I have implemented the code in Python for my own understanding here and in Fortran for my simulation code RMPCDMD.

Are jumps "allowed" in the compact Hilbert curve?

By construction of the original Hilbert curve, consecutive points along the curve are at a distance of 1 from each other.

One question that came up several times (I can still count the occurrences on one hand, though) is about the existence of jumps in the curve. The only promise made in the compact Hilbert index paper is that it preserves the ordering of the Hilbert curve HR08. By graphical observation, some non-cubic curves can be seen as a "repeat" of the cubic curve. But this simple situation does not hold for arbitrary shapes.

For "M = [2 2 3]", I show below an example where the dimension of first traversal of the curve is 0. One can see that around the middle of the curve, there is a jump in the z direction that is larger than one. This is reflected in the step size of the lower-right panel.

A compact Hilbert curve

In the next illustration, I have only changed the parameter vd to 1 and now there is no jump anymore.

A compact Hilbert curve
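The jump check itself is easy to script. A small sketch (mine, not part of the test programs described below), assuming a NumPy array points of integer curve coordinates with shape (n, d): the step size between consecutive points should be exactly 1 everywhere for a jump-free curve.

import numpy as np

def step_sizes(points):
    # Manhattan distance between consecutive points of an (n, d) array
    return np.abs(np.diff(points, axis=0)).sum(axis=1)

# toy example: the order-1 Hilbert curve in 2D has unit steps everywhere
curve = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])
assert (step_sizes(curve) == 1).all()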

Actual testing

To perform the testing, I compiled the library by Chris Hamilton and used Peter Colberg's hilbert test program, which dumps the coordinates of a Hilbert curve into an HDF5 file.

Then, I dumped my code for the Hilbert curve in a Python file. The code relies on knowing the spatial dimension (in the code it is N) up front and was not well suited to becoming a proper library. The dimensions of the curve are given as a set of integers ("2 2 3" in the example above) and there is the option --vd to specify the dimension of first traversal.

A plain cubic Hilbert curve is generated and the relative ordering of the compact indices with respect to the original ones is checked (with success, phew) at every run of the program. So this settles the fulfillment of the ordering promise.

When the option --compare is given, the program will load the data generated by the hilbert program and compare it against the Python data. So far, all combinations of dimensions did work.

The test program is called test_compact_hilbert.py and is available in the repository that I set up specifically for these tests. To display a compact Hilbert curve, simply run, for instance

python3 test_compact_hilbert.py 1 2 3 --plot --vd 2

To perform preset tests with the code, simply issue the command

make HDF5_HOME=/path/to/hdf5lib

(The variable HDF5_HOME is only necessary on the first invocation when the program hilbert is built.)

Summary

To gain confidence in my code, I have

  1. Made a backup of the original code by the author of the algorithm. That was just in time, as its homepage disappeared.
  2. Made a specific Python test program to test my algorithm against the reference one.
  3. Checked numerically the ordering of the curve.
  4. Adapted my program to also allow easy graphical inspection of the compact Hilbert curve and to pass the direction of traversal as an argument.

Bibliography

  • HR08 Chris H. Hamilton and Andrew Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths, Information Processing Letters 105, 155-163 (2008).

by Pierre de Buyl at September 06, 2017 12:00 PM

September 03, 2017

Titus Brown

Taxonomic examinations of genome bins from Tara Oceans (Honorary Sunday Morning Bioinformatics)

This is a follow-on to Comparing genome sets extracted from metagenomes.

After my first foray into comparing the genome bins extracted from the Tara Oceans data by Tully et al. (2017) and Delmont et al. (2017), I got curious about the taxonomic assignments of the bins and decided to investigate those on this fine Labor Day morning.

In this, I could take advantage of a simple reimplementation of the Kraken least-common-ancestor taxonomic classification algorithm described here that works on MinHash signatures - more on that in another, later blog post.

Briefly, I did the following -

  • used the posted lineages to assign NCBI taxonomic IDs to many of the tully samples and the delmont samples using really ugly scripts.
  • found the least-common-ancestor LCA for each hash value across all the MinHash signatures in each data set and stored them in an LCA database;
  • wrote a custom script that loaded in all of the genbank, tully, and delmont LCA assignments together with the NCBI taxonomy and classified each of the tully and delmont genome bins, looking for disagreements.

This last step is the most interesting one for the purposes of this morning's blog post, so let's go into it in a bit more detail.

Building the taxonomy databases for delmont and tully

I used an approach lifted from Kraken to build taxonomy classifiers for delmont and tully.

Kraken first links all k-mers in each genome to a list of taxonomic IDs, based on the taxonomic assignment for that genome. Then, Kraken looks for the least-common-ancestor in the taxonomic tree for each k-mer.

I did this for each k-mer in each MinHash signature from the full genbank data set (100,000 genomes), as well as the ~3000 tully and delmont data sets. (You have to use our super special opensauce if you want to do this on your own MinHash signatures.)

These databases are available for download, below.

Computing taxonomic assignments for each genome bin

Now I wanted to get an estimate of taxonomic IDs for each genome bin.

So, I wrote a script that loaded in all three LCA databases (genbank, tully, delmont) and, for each genome bin, extracted all of the taxonomic IDs for every hash value in the MinHash signature. And, for each taxonomic ID, it computed the LCA across all three databases, obviating the need to build a single gigantic database of genbank+tully+delmont.

This can all be boiled down to "this script computed the most specific taxonomic ID for a randomly chosen subset of k-mers in each bin."

Finding disagreements in taxonomy in each bin

Next, I wanted to figure out where (if anywhere) there were disagreements in taxonomy for each bin.

So, in the script above, I took all taxonomic IDs for which there were five or more counts (i.e. five or more hash values with that specific assignment - randomly chosen but IMO conservative), and then found the lowest taxonomic rank at which there was a disagreement across all thresholded taxonomic IDs.
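In rough Python, the check looks something like the sketch below (my illustration of the logic, not the actual script); here counts is assumed to map lineage tuples to the number of hash values supporting them, and the five-or-more threshold from above is the min_count parameter.

RANKS = ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']

def lowest_disagreement(counts, min_count=5):
    lineages = [lin for lin, n in counts.items() if n >= min_count]
    for level, rank in enumerate(RANKS):
        names = {lin[level] for lin in lineages if len(lin) > level}
        if len(names) > 1:
            return rank, names      # first rank at which the thresholded taxids disagree
    return None, set()              # taxonomically homogeneous bin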

And then I summarized all of this :)

Some output!

The full output can be seen for delmont and for tully.

For each genome bin, we either get no output or a line like this:

TOBG_MED-638 has multiple LCA at rank 'phylum': Proteobacteria, Bacteroidetes

and then we get summary information like so:

for tully-genome-sigs found 2631 signatures;
out of 2631 that could be classified, 73 disagree at some rank.
    superkingdom: 1
    phylum: 11
    order: 4
    class: 8
    family: 10
    genus: 10
    species: 29

and

for delmont-genome-sigs found 957 signatures;
out of 957 that could be classified, 4 disagree at some rank.
    class: 3
    species: 1

(Note, for the 50 E. coli signatures in the sourmash tutorial, we get no disagreements. Good!)

Checking the results

I took a closer look at a few of the disagreements.

For TOBG_ARS-21 from tully, which was classified as both Archaea and Bacteria, the hash value taxids seem to be ~evenly split between:

cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Flavobacteriia;Flavobacteriales
cellular organisms;Archaea

Interestingly, TOBG_ARS-21 has no assignment in the associated spreadsheet, as far as I can tell. Instead, the Flavobacterium assignment comes from the delmont data set, where a few overlapping genome bins (TARA_ANW_MAG_00073 and TARA_ANW_MAG_00082) are assigned to Flavobacterium. And the archaeal assignment comes from the tully data set, but not from TOBG_ARS-21 - rather, there is a ~10% overlap with another tully genome bin, TOBG_RS-366, which is assigned to Archaea (with no more refined assignment).

Verdict: genuinely confusing. My script works!


For TOBG_IN-33, we had a disagreement at phylum: Chlorophyta and Euryarchaeota. Here the hash value taxids are split between Euryarchaeota and genus Ostreococcus.

cellular organisms;Archaea;Euryarchaeota
cellular organisms;Eukaryota;Viridiplantae;Chlorophyta;prasinophytes;Mamiellophyceae;Mamiellales;Bathycoccaceae;Ostreococcus

Again, TOBG_IN-33 has no direct taxonomic assignment in tully. The Euryarchaeota assignment comes from a match to TARA_IOS_MAG_00045 in the delmont data set, as well as a match to delmont genomes TOBG_EAC-675 and TOBG_RS-356 (both Euryarchaeota). So where does the assignment to Eukaryota come from?

If you dig a bit deeper, it turns out that there is ~5.8 - 7.6% similarity between TOBG_IN-33 (from tully) and two delmont genome bins, TARA_PSW_MAG_00136 and TARA_ANE_MAG_00093, both of which are assigned to Eukaryota;Chlorophyta;Mamiellophyceae;Mamiellales;Bathycoccaceae;Ostreococcus. Maybe TOBG_IN-33 is a chimeric bin?

Anyway, the verdict: genuinely confusing. My script works (again :)!

Concluding thoughts

As with most bioinformatics analyses, this gives us some directions to pursue, rather than (m)any hard answers :).

But, with the caveat that this is an analysis that I did in about 3 hours this morning and so the code is probably broken in a myriad of ways:

The vast majority of both the tully and delmont genomes are taxonomically homogeneous under this analysis. In many cases this may be due to limited taxonomic assignment, and it also definitely reflects a lack of information about these genomes!

It seems like the delmont bins have been curated more than the tully bins, with the end effect of removing taxonomically confusing information. Whether this is good or bad will have to await a broader and deeper investigation of the genome bins - right now there are clearly some confounding issues within and between the data sets.

Appendix: Running everything yourself

(This should take about 10-15 minutes to run on a laptop with ~10 GB of disk space and ~8 GB of RAM. This includes the downloads.)

Along with a sourmash install, you'll need two github repos: https://github.com/ctb/2017-tara-binning and https://github.com/ctb/2017-sourmash-lca.

git clone https://github.com/ctb/2017-sourmash-lca
git clone https://github.com/ctb/2017-tara-binning

Then set PYTHONPATH and change into the 2017-tara-binning directory:

export PYTHONPATH=$(pwd)/2017-sourmash-lca:$PYTHONPATH
cd 2017-tara-binning

Download the LCA databases:

curl -L https://osf.io/3pc7j/download -o tully-lca-2017.09.04.tar.gz 
curl -L https://osf.io/bw2dj/download -o delmont-lca-2017.09.04.tar.gz

curl -L https://osf.io/zfmbd/download?version=1 -o genbank-lca-2017.08.26.tar.gz

Unpack them:

tar xzf delmont-lca-2017.09.04.tar.gz
tar xzf tully-lca-2017.09.04.tar.gz
cd db
tar xzf ../genbank-lca-2017.08.26.tar.gz
cd ../

Now, grab the signature collections and unpack them, too:

curl -L https://osf.io/bh2jr/download -o delmont-genome-sigs.tar.gz
curl -L https://osf.io/28r6m/download -o tully-genome-sigs.tar.gz
tar xzf delmont-genome-sigs.tar.gz 
tar xzf tully-genome-sigs.tar.gz 

And we are ready!

Now, run:

./examine-taxonomy.py delmont-genome-sigs -k 31 > delmont.taxinfo.txt
./examine-taxonomy.py tully-genome-sigs -k 31 > tully.taxinfo.txt

and you should see the same results as in the github links above.

by C. Titus Brown at September 03, 2017 10:00 PM

September 02, 2017

Titus Brown

Comparing genome sets extracted from metagenomes (Sunday Morning Bioinformatics).

(Because I'm a nerd who loves my research, I frequently do short-form bioinformatics analyses on weekends. I may start posting more of them. But an important note is that these will necessarily be incomplete and speculative because I put a hard limit on the time spent doing them - e.g. the below excursion took me about 2.5 hours total, starting from having the signatures calculated. Most of the 2 hours was before the kids woke up :)


I’ve been watching the metagenome-assembled genomics community for a while with some envy. The lure of extracting single genomes from the messy complex mixture of stuff in a metagenome is seductive! But the environment with which I have the most experience, agricultural soil, doesn’t lend itself to this process (because coverage is almost always too low, and complexity is too high), so I've been relegated to the sidelines.

I've also been a bit wary of binning. First, almost all binning approaches operate post-assembly, and I generally think all metagenome assemblies are wrong (although of course some are useful!). Second, many binning approaches involve examining the clusters and applying cutoffs chosen by eyeball, which introduces a degree of variability that concerns me. And third, we just posted a paper suggesting that strain variation (which is almost certainly present to a massive degree in any sufficiently deep sequencing data) really gets in the way of metagenome assembly. This suggests that most metagenome assemblies are going to be somewhat incomplete.

But I'm always happy to take advantage of other people's work and so I've been on the lookout for good binned data sets to play with.

Conveniently, two different groups recently applied their binning approaches to the Tara ocean data set (Sunagawa et al., 2015), and I thought I'd take the opportunity to compare 'em. While a comparison won't necessarily say anything about the completeness of the binning, it might help characterize variability in the approaches.

The two approaches were posted to bioRxiv as Tully et al., 2017 and Delmont et al., 2017. Below, I will refer to them as 'Tully' and 'Delmont'.

Tara Oceans: the bacterial fraction

(These stats are mostly taken from Tully et al., 2017)

Tara Oceans generated metagenomic sequencing data from 234 samples, size-filtered to enrich for "prokaryotic" (bacterial + archaeal) organisms. This ended up being a ~terabase data set containing 102 billion paired-end reads.

The Tully data

Tully et al. produced 2631 genomes, and estimated that 1491 were more than 70% complete and 603 were more than 90% complete. (Wow!)

The process used was (roughly) as follows --

  1. Assemble each of 234 samples with MEGAHIT, yielding 562m contigs.
  2. After length filtering the contigs at 2kb, use CD-HIT to eliminate contigs that are more than 99% similar.
  3. Co-assemble contigs from each of the 61 stations (geographical locations) with MINIMUS2, yielding 7.2m contigs.
  4. Apply BinSanity to cluster these contigs into genome bins, and use CheckM to evaluate completeness.

The Delmont data

Delmont et al. produced 957 genomes from 93 samples (presumably a subset of the 234 samples above?), using this very well documented approach - briefly,

  1. Co-assemble samples from 12 geographical regions with MEGAHIT.
  2. Bin results with CONCOCT.
  3. Manually refine CONCOCT results using anvi'o.
  4. Extract bins with > 70% completion or 2 Mbp of contigs into 1077 genome bins.
  5. Collapse redundant genomes from across the different regions into 957 bins based on average nucleotide identity (> 98% ANI).

Some initial thoughts

Quite different approaches were taken here: in particular, Tully et al. assembled each of the 234 samples individually and then co-assembled the resulting contigs by station (61 stations), while Delmont et al. divided 93 samples into 12 groups based on geographical location and co-assembled those 12.

Both studies used MEGAHIT to assemble the primary contigs. It would be interesting to compare these initial assemblies directly, as this would place bounds on the similarity of the eventual binning outcomes.

Getting to the data

I downloaded both full sets of binned genomes, and then ran the following sourmash command to compute sourmash signatures from the data:

sourmash compute -k 21,31,51 \
                         --scaled 2000 \
                         --track-abundance

This took about an hour for the Tully collection, and less for the Delmont collection.

You can download the data sets from the Open Science Framework if you'd like to follow along --

curl -L https://osf.io/vngdz/download -o delmont-genome-sigs.tar.gz
curl -L https://osf.io/28r6m/download -o tully-genome-sigs.tar.gz

The first file is about 30 MB, the second file is about 90 MB.

I then unpacked the two collections on my laptop,

tar xzf delmont-genome-sigs.tar.gz 
tar xzf tully-genome-sigs.tar.gz 

and built search trees (SBTs) for each of them at k=31 --

sourmash index -k 31 delmont31 --traverse-directory delmont-genome-sigs
sourmash index -k 31 tully31 --traverse-directory tully-genome-sigs

Comparing binned genomes directly

I wrote a custom script that computed matches between genomes in one directory vs genomes in the other directory.

Here, 'similarity' is Jaccard similarity, 'identity' is more than 80% Jaccard similarity, and 'containment' is more than 95% of a genome's signature included in the other.
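In plain Python over sets of hash values, those definitions boil down to the following (a sketch only; compare-two-directories.py itself works on sourmash signatures):

def jaccard(a, b):
    # Jaccard similarity of two sets of hash values
    return len(a & b) / len(a | b)

def containment(a, b):
    # fraction of a's hashes that also appear in b
    return len(a & b) / len(a)

# thresholds used below: "identical" if jaccard >= 0.8, "contained" if containment >= 0.95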

% ./compare-two-directories.py delmont-genome-sigs delmont31.sbt.json tully-genome-sigs tully31.sbt.json 

After lots of detailed output, you get the following summary:

thresholds:
min Jaccard similarity for any match: 0.05
to score as identical, similarity must be >= 0.8
to score as contained, containment must be >= 0.95
----
delmont-genome-sigs vs tully-genome-sigs: 957 signatures
identical count: 22
containment count: 66
matches: 591
no match: 366
no contain: 891
----
tully-genome-sigs vs delmont-genome-sigs: 2631 signatures
identical count: 27
containment count: 54
matches: 850
no match: 1781
no contain: 2577

Some initial thoughts here:

  • the tully genomes probably have more "redundancy" in this context because of the way Delmont et al. removed redundancy.
  • the data sets are strikingly different!

Getting overall stats for overlap/disjointness

After all this, I got to wondering about the overall similarity of the genomes. So, I wrote another custom script, compare-two-directories-hashvals.py, to compare the k-mer composition of tully and delmont directly. (For those of you familiar with MinHash, no, you cannot do this normally based on MinHash signatures; but we're using super-special opensauce in sourmash that lets us do this. Trust us.)

% ./compare-two-directories-hashvals.py delmont-genome-sigs tully-genome-sigs 
loading all signatures: delmont-genome-sigs
loading from cache: delmont-genome-sigs.pickle
...loaded 957 signatures at k=31
loading all signatures: tully-genome-sigs
loading from cache: tully-genome-sigs.pickle
...loaded 2631 signatures at k=31
total hashvals: 2837161
both: 17.2%
unique to delmont-genome-sigs: 22.1%
unique to tully-genome-sigs: 60.7%

Here we estimate that the binned genomes, in aggregate, share about 17.2% of the union, while Delmont et al. has about 22.1% of the unique content, and Tully has about 60.7% of the unique content.

This is probably largely due to the extra data that was taken into account by Tully et al. (234 samples in tully vs 93 samples in delmont.)

It is reassuring that some genomes (~20) are identical between the data sets, albeit at a low threshold of 80%; and that some genomes from one are mostly contained in bins from the other (~50-60 in each direction).
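For reference, the aggregate percentages above come from pooling all hash values per directory and comparing against the union, roughly like this (my sketch, not compare-two-directories-hashvals.py verbatim):

def overlap_report(hashes_a, hashes_b):
    # fractions of the union that are shared or unique to each collection
    union = hashes_a | hashes_b
    both = len(hashes_a & hashes_b) / len(union)
    only_a = len(hashes_a - hashes_b) / len(union)
    only_b = len(hashes_b - hashes_a) / len(union)
    return both, only_a, only_b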

Concluding thoughts

It is not surprising that two different binning procedures run on different subsamples produce different results :).

I must admit that it is not immediately obvious to me where I go from here. What stats or distributions should I be looking at? How do people compare the output of different binning procedures when they're working on the same samples? Thoughts welcome!

If you want to play with the data yourself, you should be able to repeat my entire analysis on your own laptop in ~30 minutes, starting from a basic sourmash install and using the commands above. You'll need to clone the repo, of course.

--titus

by C. Titus Brown at September 02, 2017 10:00 PM

Matthieu Brucher

Book review: Fail U.: The False Promise of Higher Education

American universities have some reputation, in all kind of terms, and the amount of student debt is something I also found baffling. So a book on the failure of US universities was obviously of interest to me.

Discussion

The book is divided into six parts, and many of them resonate with my experience of French and British universities. So it is scary.

Let's start with the beginning. The first part is just a reminder that the situation was already bad several decades ago. The author tends to delve into "I told you so" and "If only they had listened to me", but it is true that lots of people have said that the situation was deteriorating. It is happening now in France and the United Kingdom as well, and I think the solution they are trying is the one that is failing in the US, so it will be interesting to see whether there are other options. Let's move on to the other parts.

The second part addresses universities' need to teach students. The interesting part is that researchers don't like teaching, and those who would like to are not encouraged to do so. It's so rooted in our culture that even in The Big Bang Theory, none of the protagonists actually teaches, and when they are forced to, it is an opportunity for jokes. This is an issue; it shouldn't happen. And then you hear that the really big deal at US universities is the bachelor's degree, not the master's degree (the PhD thesis may still be alright). But if the teachers are not the best in their field, how can these degrees be recognized as valuable? Well, they soon won't be. Managers are starting to notice that their new hires are not that effective anymore. It shows.

The third part tackles the expenditures of these universities. I didn't know that they didn't really balance their books. I did read in Weapons of Math Destruction that tuition was rising and that universities were spending lots of money to attract new students. This part confirms it. And I can't understand how big universities don't think about balancing their books and debts. At some point, when your debt gets too big, you have a problem, and for people who are supposed to be knowledgeable, this is just baffling. And let's be clear, it is starting to happen here as well. When will people understand that this cannot go on?

The fourth part deals with several scandals in universities. The first is the failure of research. This is well known: researchers are ranked according to their publications, and it doesn't work because you can publish more or less anything, even if it doesn't make sense. There are some efforts to change this with public reviews, and I'm not even talking about the research publication scams. And there are also the sexual scandals that keep coming up all the time. There shouldn't be any scandals like that, period.

OK, let's move to a part that just baffled me at first, until I saw that it is now everywhere in our society. Universities are supposed to be a place where you can exchange and confront opinions. It seems that this is no longer the case, and it's true for life after university as well. Can you tell someone that you don't agree with their position on, say, homosexual marriage? No, there is no debate. But people have the right to a different opinion. You may not impose being for or against on someone else, but you can discuss it. Today, though, if you are against, you are an extremist. Let's go further: if you say that you don't think democracy is a system that works, you may be in big trouble. And in some way, that may be why Trump won. Faced with the intolerance of some people, those who didn't agree with the majority may have seen an opportunity. And now we have two groups of intolerant people fighting each other.

Final part, and some opportunities for improvement. As I've said, students are heavily in debt (and with people like Dame Glynis Breakwell, universities have to raise their tuition fees just so that some people full of themselves can be paid far too much compared to what they bring to the table), and it's going to be a big problem in the future. It's not by injecting more money or lowering the bar that the situation will get better. Then there is a chapter on MOOCs, which I have also talked about before. I don't agree with the book's conclusion here: MOOCs are targeted towards professionals, and we still rely on degrees. Perhaps this will change as the bachelor's degree loses its worth.

Conclusion

There are many more examples in the book, but the tendency is worrying. We seem to be living in a society that is less and less tolerant, one that says it tolerates everything except intolerant people (and becomes intolerant in the process).

Universities in the world in general are in a bad position, and it gets worse. More people “have” to graduate, and in the end you don’t solve the problem, you diminish the quality of the degree. Not everyone can graduate. The sooner we accept that, the sooner we can cure our universities.

by Matt at September 02, 2017 09:19 PM

August 31, 2017

Titus Brown

Is your bioinformatics analysis package Big Data ready?

(Apologies for the Big Data buzzword, but it actually fits!)

I find it increasingly frustrating to use a broad swath of bioinformatics tools (and given how inherently frustrating the bioinformatics ecosystem is already, that's saying something!) My bugaboo this time is bioinformatics "prepared databases" - but not the need for multiple database installs, or their size. Rather, I'm frustrated by the cost of building them. Similarly, I'm frustrated at how many analysis pipelines require you to start from the very beginning if you add data to your experiment or sample design.

We are increasingly living in a world of what I call "infinite data", where, for many purposes, getting the data is probably the least of your problems - or, at least, the time and cost of generating sequence data is dwarfed by sample acquisition, data analysis, hypothesis generation/testing, and paper writing.

What this means for databases is that no sooner does someone publish a massive amount of data from a global ocean survey (Tara Oceans, Sunagawa et al. 2015) than someone else analyzes the data in a new and interesting way (Tully et al., 2017) and produces data sets that I, at least, really want to include in my own reference data sets for environmental metagenome data. (See Hug et al., 2016 and Stewart et al., 2017 for even more such data sets.)

Unfortunately, most bioinformatics tools don't let me easily update my databases or analyses with information from new references or new data. I'm going to pick on taxonomic approaches first because they're on my mind but this applies to any number of software packages commonly used in bioinformatics.

Let's take Kraken, an awesome piece of software that founded a whole class of approaches. The Kraken database and toolset is constructed in such a way that I cannot easily do two specific things:

  • add just one reference genome to it;
  • ask Kraken to search two different reference databases and report integrated results;

Similarly, none of the assemblers I commonly use lets me add just one more sequencing data set into the assembly - I have to recalculate the entire assembly from scratch, after adding the additional data set. Annotation approaches don't let me update the database with new sequences and then run an abbreviated analysis that improves annotations. And so on.

What's the problem?

There are two classes of problems underneath this - one is the pragmatic engineering issue. For example, with Kraken, it is theoretically straightforward, but tricky from an engineering perspective, to support the addition of a single genome into the database; "all" you need to do is update the least-common-ancestor assignments in the database for each k-mer in your new genome. Likewise, the Kraken approach should be able to support search across databases, where you give it two different databases and it combines them in the analysis - but it doesn't seem to. And, last but not least, Kraken could in theory support taking an existing classified set of reads and "updating" their classification from a new database, although in practice there's not much point in doing this because Kraken's already pretty fast.

The second kind of challenge is theoretical. We don't really have the right algorithmic approaches to assembly to support fully online assembly in all its glory, although you could imagine doing things like error trimming in an incremental way (I mean, just imagine...) and saving that for later re-use. But we generally don't save the full information needed to add a single sequence and then quickly output an updated assembly.

And that last statement highlights an important question - for which approaches do we have the theory and the engineering to quickly update an analysis with a new reference database, or a new read data set? I can guess but I don't know.

Note, I'm explicitly not interested in heuristic approaches where you e.g. map the reads to the assembly to see if there's new information - I'm talking about using algorithms and data structures and software where you get identical results from doing analysis(n + m) and analysis(n) + analysis(m).
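A toy illustration of the property I mean, using k-mer counting (which is trivially mergeable): counting two read sets separately and combining the counts gives exactly the same answer as counting the pooled reads. This is my own sketch, not any particular tool's code.

from collections import Counter

def count_kmers(sequences, k=5):
    # tally every k-length window across all sequences
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

reads_a = ["AGGATGAGACAGATAG"]
reads_b = ["TGAGACAGATAGGATTGC"]

# analysis(n + m) == analysis(n) + analysis(m)
assert count_kmers(reads_a + reads_b) == count_kmers(reads_a) + count_kmers(reads_b)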

Even if we do have this theoretical ability, it's not clear that many tools support it, or at least they don't seem to advertise that support. This also seems like an area where additional theoretical development (new data structures! better algorithms!) would be useful.

I'm curious if I'm missing something; have I just not noticed that this is straightforward with many tools? It seems so useful...

Anyway, I'm going to make a conscious effort to support this in our tools (which we sort of already were planning, but now can be more explicit about).

--titus

p.s. This post partly inspired by Meren, who asked for this with sourmash databases - thanks!

by C. Titus Brown at August 31, 2017 10:00 PM

Continuum Analytics

Introducing Anaconda Enterprise 5

Anaconda Enterprise 5 is built for your entire organization and serves the needs of Data Scientists, IT Administrators, Business Analysts & Managers, Developers and Machine Learning Engineers.

by Team Anaconda at August 31, 2017 01:57 PM

Anaconda Enterprise 5 Introduces Secure Collaboration to Amplify the Impact of Enterprise Data Scientists

Anaconda, the Python data science leader, today introduced Anaconda Enterprise 5 software to help organizations respond to customers and stakeholders faster, deliver strategic insight for rapid decision-making and take advantage of cutting edge machine learning.

by Team Anaconda at August 31, 2017 01:00 PM

August 30, 2017

Continuum Analytics

Anaconda: A Coming of Age Story

No one tells our story better than Continuum co-founders, Travis Oliphant and Peter Wang. Read on for their thoughts and feelings on what’s next for Anaconda.

by Sheyna Webster at August 30, 2017 01:00 PM

Titus Brown

A post about k-mers - this time for taxonomy!

I love k-mers. Unlike software engineering or metagenomics, this isn't a love-hate relationship - this is a pure "gosh aren't they wonderful" kind of love and admiration. They "just work" in so many situations that it helps validate the last 10 years of my research life (which, let's be honest, has been pretty k-mer focused!).

DNA k-mers underlie much of our assembly work, and we (along with many others!) have spent a lot of time thinking about how to store k-mer graphs efficiently, discard redundant data, and count them efficiently.

More recently, we've been enthused about using k-mer based similarity measures and computing and searching k-mer-based sketch search databases for all the things.

But I haven't spent too much talking about using k-mers for taxonomy, although that has become an ahem area of interest recently, if you read into our papers a bit.

In this blog post I'm going to fix this by doing a little bit of a literature review and waxing enthusiastic about other people's work. Then in a future blog post I'll talk about how we're building off of this work in fun! and interesting? ways!

A brief introduction to k-mers

(Borrowed from the ANGUS 2017 sourmash tutorial!)

K-mers are a fairly simple concept that turn out to be tremendously powerful.

A "k-mer" is a word of DNA that is k long:

ATTG - a 4-mer
ATGGAC - a 6-mer

Typically we extract k-mers from genomic assemblies or read data sets by running a k-length window across all of the reads and sequences -- e.g. given a sequence of length 16, you could extract 11 k-mers of length six from it like so:

AGGATGAGACAGATAG

becomes the following set of 6-mers:

AGGATG
 GGATGA
  GATGAG
   ATGAGA
    TGAGAC
     GAGACA
      AGACAG
       GACAGA
        ACAGAT
         CAGATA
          AGATAG

k-mers are most useful when they're long, because then they're specific. That is, if you have a 31-mer taken from a human genome, it's pretty unlikely that another genome has that exact 31-mer in it. (You can calculate the probability if you assume genomes are random: there are 4^31 possible 31-mers, and 4^31 = 4,611,686,018,427,387,904. So, you know, a lot.)

The important concept here is that long k-mers are species specific. We'll go into a bit more detail later.
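The windowing above is a one-liner in code; here's a minimal sketch (mine, not sourmash's implementation):

def kmers(sequence, k):
    # yield every k-length window of the sequence
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

print(len(list(kmers("AGGATGAGACAGATAG", 6))))   # 11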

K-mers and assembly graphs

We've already run into k-mers before, as it turns out - when we were doing genome assembly. One of the three major ways that genome assembly works is by taking reads, breaking them into k-mers, and then "walking" from one k-mer to the next to bridge between reads. To see how this works, let's take the 16-base sequence above, and add another overlapping sequence:

AGGATGAGACAGATAG
    TGAGACAGATAGGATTGC

One way to assemble these together is to break them down into k-mers -- e.g. the sequences above become the following set of 6-mers:

AGGATG
 GGATGA
  GATGAG
   ATGAGA
    TGAGAC
     GAGACA
      AGACAG
       GACAGA
        ACAGAT
         CAGATA
          AGATAG -> off the end of the first sequence
           GATAGG <- beginning of the second sequence
            ATAGGA
             TAGGAT
              AGGATT
               GGATTG
                GATTGC

and if you walk from one 6-mer to the next based on 5-mer overlap, you get the assembled sequence:

AGGATGAGACAGATAGGATTGC

voila, an assembly!
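Here's a toy version of that walk in Python - purely illustrative; real De Bruijn assemblers handle branches, reverse complements, and sequencing errors far more carefully:

def walk(kmer_set, start):
    # greedily extend `start` by unique 5-mer overlaps through the 6-mer set
    kmer_set = set(kmer_set) - {start}
    contig = start
    while True:
        nexts = [contig[-5:] + b for b in "ACGT" if contig[-5:] + b in kmer_set]
        if len(nexts) != 1:          # stop at a dead end or a branch
            break
        contig += nexts[0][-1]
        kmer_set.discard(nexts[0])
    return contig

seqs = ["AGGATGAGACAGATAG", "TGAGACAGATAGGATTGC"]
sixmers = {s[i:i + 6] for s in seqs for i in range(len(s) - 5)}
print(walk(sixmers, "AGGATG"))       # AGGATGAGACAGATAGGATTGC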

Graphs of many k-mers together are called De Bruijn graphs, and assemblers like MEGAHIT and SOAPdenovo are De Bruijn graph assemblers - they use k-mers underneath.

Why k-mers, though? Why not just work with the full read sequences?

Computers love k-mers because there's no ambiguity in matching them. You either have an exact match, or you don't. And computers love that sort of thing!

Basically, it's really easy for a computer to tell if two reads share a k-mer, and it's pretty easy for a computer to store all the k-mers that it sees in a pile of reads or in a genome.

So! On to some papers!

MetaPalette

About a year and a half ago, I had the pleasure of reviewing the MetaPalette paper by David Koslicki and Daniel Falush. This paper did some really cool things and I'm going to focus on two of the least cool things (the cooler things are deeper and worth exploring, but mathematical enough that I haven't fully understood them).

The first cool thing: the authors show that k-mer similarity between genomes at different k approximates various degrees of taxonomic similarity. Roughly, high similarity at k=21 defines genus-level similarity, k=31 defines species-level, and k=51 defines strain-level. (This is an ad-lib from what is much more precisely and carefully stated within the paper.) This relationship is something that we already sorta knew from Kraken and similar k-mer-based taxonomic efforts, but the MetaPalette paper was the first paper I remember reading where the specifics really stuck with me.

Figure 2 from Koslicki and Falush, 2016

This figure, taken from the MetaPalette paper, shows clearly how overall k-mer similarity between pairs of genomes matches the overall structure of genera.


The second cool thing: the authors show that you can look at what I think of as the "k-mer size response curve" to infer the presence of strain/species variants of known species in a metagenomic data set. Briefly, in a metagenomic data set you would expect to see stronger similarity to a known genome at k=21 than at k=31, and k=31 than at k=51, if a variant of that known species is present. (You'd see more or less even similarity across the k-mer sizes if the exact genome was present.)

k-mer response curve

In this figure (generated by code from our metagenome assembly preprint), you can see that this specific Geobacter genome's inclusion in the metagenome stays the same as we increase the k-mer size, while Burkholderia's decreases. This suggests that we know precisely which Geobacter is present, while only a strain variant of that Burkholderia is present.

I'm not sure whether this idea has been so clearly explicated before MetaPalette - it may not be novel because it sort of falls out of De Bruijn/assembly graph concepts, for example - but again it was really nicely explained in the paper, and the paper did an excellent job of then using these two observations to build a strain-aware taxonomic classification approach for metagenomic data sets.

The two major drawbacks of MetaPalette were that the implementation (though competitive with other implementations) was quite heavyweight and couldn't run on e.g. my laptop; and the math was complicated enough that reimplementing it myself was not straightforward.

MinHash and mash

(This next paper is less immediately relevant to taxonomy, but we'll get there, promise!)

At around the same time as I reviewed the MetaPalette paper, I also reviewed the mash paper by Ondov et al., from Adam Phillipy's lab. (Actually, what really happened was two people from my lab reviewed it and I blessed their review & submitted it, and then when the revisions came through I read the paper more thoroughly & got into the details to the point where I reimplemented it for fun. My favorite kind of review!)

The mash paper showed how a simple technique for fast comparison of sets, called MinHash - published in 1998 by Broder - could be applied to sets of k-mers. The specific uses were things like comparing two genomes, or two metagenomes. The mash authors also linked the Jaccard set similarity measure to the better known (in genomics) measure, the Average Nucleotide Identity between two samples.

The core idea of MinHash is that you subsample any set of k-mers in the same way, by first establishing a common permutation of k-mer space (using a hash function) and then subselecting from that permutation in a systematic way, e.g. by taking the lowest N hashes. This lets you prepare a subsample from a collection of k-mers once, and then forever more use that same subsampled collection of k-mers to compare with other subsampled collections. (In sourmash, we call these subsampled sets of k-mers "signatures".)
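A bare-bones version of the bottom-N flavor of this idea looks something like the following - my sketch, not the mash or sourmash code:

import hashlib

def sketch(kmer_set, num=500):
    # hash every k-mer with the same (stable) hash function, keep the num smallest
    hashes = {int(hashlib.md5(k.encode()).hexdigest(), 16) for k in kmer_set}
    return set(sorted(hashes)[:num])

def estimate_jaccard(sig_a, sig_b, num=500):
    # bottom-N estimator: what fraction of the num smallest hashes of the union
    # is present in both sketches?
    union_bottom = set(sorted(sig_a | sig_b)[:num])
    return len(union_bottom & sig_a & sig_b) / len(union_bottom)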

What really blew my mind was how fast and lightweight MinHash was, and how broadly applicable it was to various problems we were working on. The mash paper started what is now an 18 month odyssey that has led in some pretty interesting directions. (Read here and here for some of them.)

The one big drawback of mash/MinHash for me stems from its strength: MinHash has limited resolution when comparing two sets of very different sizes, e.g. a genome and a metagenome. The standard MinHash approach assumes that the sets to be compared are of similar sizes, and that you are calculating Jaccard similarity across the entire set; in return for these assumptions, you get a fixed-size sketch with guarantees on memory usage and execution time for both preparation and comparisons. However, these assumptions prevent something I want to do very much, which is to compare complex environmental metagenomes with potential constituent genomes.

A blast from the recent past: Kraken

The third paper I wanted to talk about is Kraken, a metagenome taxonomic classifier by Wood and Salzberg (2014) - see manual, and paper. This is actually a somewhat older paper, and the reason I dug back into the literature is that there has been somewhat of a renaissance in k-mer classification systems as the CS crowd has discovered that storing and searching large collections of k-mers is an interesting challenge. More specifically, I've recently reviewed several papers that implement variations on the Kraken approach (most recently MetaOthello, by Liu et al (2017)). My lab has also been getting more and more office hour requests from people who want to use Kraken and Kaiju for metagenomic classification.

Briefly, Kraken uses k-mers to apply the least-common-ancestor approach to taxonomic classification. The Kraken database is built by linking every known k-mer to a taxonomic ID. In cases where a k-mer belongs to only a single known genome, the assignment is easy: that k-mer gets the taxonomic ID of its parent genome. In the fairly common case that a k-mer belongs to multiple genomes, you assign the taxonomic ID of the least-common-ancestor taxonomic level to that k-mer.

Then, once you have built the database, you classify sequences by looking at the taxonomic IDs of constituent k-mers. Conveniently this step is blazingly fast!

Figure from Wood and Salzberg, 2014

This figure (from Wood and Salzberg, 2014) shows the process that Kraken uses to do taxonomic identification. It's fast because (once you've computed the k-mer-to-LCA mapping) k-mer lookups are super fast, albeit rather memory intensive.
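To make the LCA bookkeeping concrete, here is a cartoon of the scheme in Python (my sketch, not Kraken's code). Lineages are tuples from superkingdom on down, and the LCA of a set of lineages is just their longest common prefix.

def lca(lineages):
    # longest common prefix of a list of lineage tuples
    common = []
    for names in zip(*lineages):
        if len(set(names)) > 1:
            break
        common.append(names[0])
    return tuple(common)

def add_genome(db, kmer_set, lineage):
    # database build: each k-mer gets the LCA of every genome it appears in
    for k in kmer_set:
        db[k] = lineage if k not in db else lca([db[k], lineage])

def classify(db, kmer_set):
    # classification sketch: take the LCA of all hits; real Kraken scores
    # root-to-leaf paths weighted by hit counts, which is more precise
    hits = [db[k] for k in kmer_set if k in db]
    return lca(hits) if hits else None

db = {}
add_genome(db, {"ATTGCA"}, ("Bacteria", "Proteobacteria", "Gammaproteobacteria"))
add_genome(db, {"ATTGCA", "GGGAAA"}, ("Bacteria", "Proteobacteria", "Alphaproteobacteria"))
print(classify(db, {"ATTGCA"}))   # ('Bacteria', 'Proteobacteria')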

What you end up with (when using Kraken) is a taxonomic ID for each read, which can then be post-processed with something like Bracken (Lu et al., 2017) to give you an estimated species abundance.

The drawback to Kraken (and many similar approaches) is shared by MetaPalette: the databases are large, and building the databases is time- and CPU-intensive. In our experience we're seeing what the manual says: you need 30+ GB of RAM to run the software, and much more to build the databases.

(At the end of the day, it may be that many people actually want to use something like Centrifuge, another tool from the inimitable Salzberg group (see Kim et al., 2016). I'm only slowly developing a deeper understanding of the opportunities, limitations, and challenges of the various approaches, and I may blog about that later, but for now let's just say that I have some reasons to prefer the Kraken-style approach.)

More general thoughts

There are a number of challenges that are poorly addressed by current k-mer based classification schemes. One is scalability of classification: I really want to be able to run this stuff on my laptop! Another is scalability of the database build step: I'm OK with running that on bigger hardware than my laptop, but I want to be able to update, recombine, and customize the databases. Here, large RAM requirements are a big problem, and the tooling for database building is rather frustrating as well - more on that below.

I also very much want a library implementation of these things - specifically, a library in Python. Basically, you lose a lot when you communicate between programs through files. (See this blog post for the more general argument.) This would let us intermix k-mer classification with other neat techniques.

More generally, in the current era of "sequence all the things" and the coming era of "ohmigod we have sequenced so many things now what" we are going to be in need of a rich, flexible ecosystem of tools and libraries. This ecosystem will (IMO) need to be:

  • decentralized and locally installable & usable, because many labs will have large internal private data sets that they want to explore;
  • scalable in memory and speed, because the sheer volume of the data is so ...voluminous;
  • customizable and programmable (see above) so that we can try out cool new ideas more easily;
  • making use of databases that can be incrementally (and routinely) updated, so that we can quickly add new reference information without rebuilding the whole database;

and probably some other things I'm not thinking of. The ecosystem aspect here is increasingly important and something I've been focusing on more and more: approaches that don't work together well are simply not that useful.

Another goal we are going to need to address is classification and characterization of unknowns in metagenomic studies. We are making decent progress in certain areas (metagenome-resolved genomics!!) but there are disturbing hints that we are largely acting like drunks looking for their keys under the streetlight. I believe that we remain in need of systematic, scalable, comprehensive approaches for characterizing environmental metagenome data sets.

This means that we will need to be thinking more and more about reference independent analyses. Of the three above papers, only mash is reference independent; MetaPalette and Kraken both rely on reference databases. Of course, those two tools address the flip side of the coin, which is to properly make use of the reference databases we do have for pre-screening and cross-validation.

--titus

by C. Titus Brown at August 30, 2017 09:25 AM

Matthew Rocklin

Dask Release 0.15.2

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.15.2. This release contains stability enhancements and bug fixes. This blogpost outlines notable changes since the 0.15.0 release on June 11th.

You can conda install Dask:

conda install dask

or pip install from PyPI

pip install dask[complete] --upgrade

Conda packages are available both on the defaults and conda-forge channels.

Full changelogs are available here:

Some notable changes follow.

New dask-core and dask conda packages

On conda there are now three relevant Dask packages:

  1. dask-core: Package that includes only the core Dask package. This has no dependencies other than the standard library. This is primarily intended for downstream libraries that depend on certain parts of Dask.
  2. distributed: Dask’s distributed scheduler, depends on Tornado, cloudpickle, and other libraries.
  3. dask: Metapackage that includes dask-core, distributed, and all relevant libraries like NumPy, Pandas, Bokeh, etc. This is intended for users to install.

This organization is designed to both allow downstream libraries to only depend on the parts of Dask that they need while also making the default behavior for users all-inclusive.

Downstream libraries may want to change conda dependencies from dask to dask-core. They will then need to be careful to include the necessary libraries (like numpy or cloudpickle) based on their user community.

Improved Deployment

Due to increased deployment on Docker and other systems with complex networking rules, dask-worker processes now include separate --contact-address and --listen-address keywords that can be used to specify the addresses they advertise and the addresses on which they listen. This is especially helpful when the perspective of one's network can shift dramatically.

# advertise 192.168.0.100:9000 to the scheduler, but listen on 172.142.0.100:9000
dask-worker scheduler-address:8786 \
            --contact-address 192.168.0.100:9000 \
            --listen-address 172.142.0.100:9000

Additionally other services like the HTTP and Bokeh servers now respect the hosts provided by --listen-address or --host keywords and will not be visible outside of the specified network.

Avoid memory, file descriptor, and process leaks

There were a few occasions where Dask would leak resources in complex situations. Many of these have now been cleaned up. We’re grateful to all those who were able to provide very detailed case studies that demonstrated these issues and even more grateful to those who participated in resolving them.

There is undoubtedly more work to do here and we look forward to future collaboration.

Array and DataFrame APIs

As usual, Dask array and dataframe have a new set of functions that fill out their API relative to NumPy and Pandas.

See the full APIs for further reference:

Deprecations

Officially deprecated dask.distributed.Executor; users should use dask.distributed.Client instead. Previously Executor was simply an alias for Client.

Removed Bag.concat, users should use Bag.flatten instead.

Removed magic tuple unpacking in Bag.map like bag.map(lambda x, y: x + y). Users should unpack manually instead.
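
As a quick sketch of those three migrations (using a local cluster and toy data purely for illustration):

import dask.bag as db
from dask.distributed import Client   # replaces the deprecated Executor alias

client = Client()                      # local cluster, just for this example

# Bag.concat -> Bag.flatten
nested = db.from_sequence([[1, 2], [3, 4]])
flat = nested.flatten()

# no more magic tuple unpacking in Bag.map
pairs = db.from_sequence([(1, 10), (2, 20)])
# old: pairs.map(lambda x, y: x + y)
summed = pairs.map(lambda xy: xy[0] + xy[1])

print(flat.compute(), summed.compute())   # [1, 2, 3, 4] [11, 22]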

Julia

Developers from Invenia have been building Julia workers and clients that operate with the Dask.distributed scheduler. They have been helpful in raising issues necessary to ensure cross-language support.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.0 release on June 11th

  • Bogdan
  • Elliott Sales de Andrade
  • Bruce Merry
  • Erik Welch
  • Fabian Keller
  • James Bourbeau
  • Jeff Reback
  • Jim Crist
  • John A Kirkham
  • Luke Canavan
  • Mark Dunne
  • Martin Durant
  • Matthew Rocklin
  • Olivier Grisel
  • Søren Fuglede Jørgensen
  • Stephan Hoyer
  • Tom Augspurger
  • Yu Feng

The following people contributed to the dask/distributed repository since the 1.17.1 release on June 14th:

  • Antoine Pitrou
  • Dan Brown
  • Elliott Sales de Andrade
  • Eric Davies
  • Erik Welch
  • Evan Welch
  • John A Kirkham
  • Jim Crist
  • James Bourbeau
  • Jeremiah Lowin
  • Julius Neuffer
  • Martin Durant
  • Matthew Rocklin
  • Paul Anton Letnes
  • Peter Waller
  • Sohaib Iftikhar
  • Tom Augspurger

Additionally we’re happy to announce that John Kirkham (@jakirkham) has accepted commit rights to the Dask organization and become a core contributor. John has been active throughout the Dask project, and particularly active in Dask.array.

August 30, 2017 12:00 AM

August 29, 2017

Matthieu Brucher

Something I realized on scale temperament

When I started music almost thirty years ago, one of my music teachers told us that there was a difference between a flat and a sharp note. I didn’t really understand, as on a trumpet both would be the same! I forgot about it until, a few years ago, I was introduced to the concept of temperament.

It started with the fact that the fifth in a scale has a mathematical relationship to the root note, and all other notes were built from there. At least for the 12 notes we have in the occidental world. Then, I read that at some point, this imperfect scale was replaced by another one, where all 12 notes were evenly split. Of course, that means that the fifth is not the fifth anymore…
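
A small numerical aside (my own illustration, not from the video): stacking twelve pure fifths with ratio 3/2 overshoots seven octaves by the so-called Pythagorean comma, which is exactly why equal temperament flattens every fifth slightly to 2**(7/12).

pure_fifth = 3 / 2
equal_fifth = 2 ** (7 / 12)

twelve_pure_fifths = pure_fifth ** 12      # ~129.75
seven_octaves = 2 ** 7                     # 128

print(twelve_pure_fifths / seven_octaves)  # ~1.0136, the Pythagorean comma
print(equal_fifth)                         # ~1.4983, slightly flat of 3/2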

Then, I watched a French video on it again, with the proper scientific explanation for different scales, and it got me thinking again. That was what my music teacher was talking about. If you build your scale with the harmonics, you end up with a difference between flats and sharps. But even more, I thought that a melody with flats was more melancholic than one with sharps. But this doesn’t make sense for an evenly tempered scale! They all feel the same way, no matter which scale you use. It’s only if you do like Bach and use an imperfect temperament that you can feel a difference.

So a musical instrument like strings, or voices, can actually work in a pure temperament. For winds, you can adjust notes on the fly and also get a kind of pure temperament. On a piano, you can’t; you need to tune it right (BTW, I still don’t know how piano tuners manage to tune pianos “imperfectly”! Their job is just amazing), or it will feel different from a symphonic orchestra that can use a pure temperament in any scale.

And this is something quite difficult to achieve with a digital instrument. We expect equal temperament. I wonder how Celemony handles these differences when they tune a voice track.

Anyway, maybe the pure scale is imperfect because the 13th note is not exactly the same as the original note, and maybe equal temperament makes everything sound similar, but they each bring something. We probably shouldn’t forget about these differences, and we should use them in modern music as well.

by Matt at August 29, 2017 07:13 AM

August 28, 2017

Continuum Analytics

Continuum Analytics Officially Becomes Anaconda

Today, Continuum Analytics changed its name to Anaconda. Not surprised? We agree. The name change reinforces our commitment to the Anaconda open source development community, the 4.5 million Anaconda users across the globe and customers of our Anaconda Enterprise Platform. The name Anaconda points to our long-term commitment to supporting the Anaconda open source ecosystem …
Read more →

by Scott Collison at August 28, 2017 02:35 AM

August 24, 2017

Continuum Analytics news

Continuum Analytics Welcomes Mathew Lodge as SVP Products and Marketing

Thursday, August 24, 2017

Former VMware VP Joins Anaconda’s Executive Leadership Team to Help Accelerate Data Science Adoption across the Enterprise

AUSTIN, TEXAS—August 24, 2017—Continuum Analytics, the company behind Anaconda, the leading Python data science platform, today announced Mathew Lodge as the company’s new senior vice president (SVP) of products and marketing. Lodge brings extensive experience in the B2B product space, including software and SaaS, to help the company further expand adoption of Anaconda across the enterprise.

“With a proven history of leading product and marketing strategies for some of the biggest brands and hottest startups in technology, Mathew brings a unique perspective that will help take Continuum Analytics to the next level and extend our position as the leading Python data science platform,” said Scott Collison, CEO at Continuum Analytics. “Mathew will lead our product and marketing efforts, helping to ensure that Anaconda continues to meet and exceed the requirements of today’s enterprise, empowering organizations across the globe to build data science-driven applications that deliver measurable business impact.”
 
The Python community is estimated at more than 30 million members and, according to the most recent O’Reilly Data Science Survey, 72 percent of data scientists prefer Python as their main tool. Anaconda has more than 4.5 million users and is the world’s most popular and trusted Python data science platform.

“Data science is foundational to digital and AI strategies, allowing organizations to deliver new products and services faster and cheaper, and respond more quickly to customers and competitors,” said Lodge. “Python is the open source ecosystem of choice for data scientists, and Anaconda is the gold standard platform for Python data science. I look forward to working with customers and partners to realize the huge potential of Anaconda to deliver actionable, automated insight and intelligence to every organization.”

This summer, Continuum Analytics was included in Gartner’s “Cool Vendors in Data Science and Machine Learning, 2017” report and Gartner’s “Hype Cycle for Data Science and Machine Learning for 2017.”
 
Mathew Lodge comes from Weaveworks, a container and microservices start-up, where he was the chief operating officer. Previously, he was vice president in VMware’s Cloud Services group; notably he was co-founder for what became its vCloud Air IaaS service. Early in his career, Lodge built compilers and distributed systems for projects like the International Space Station, helped connect six countries to the Internet for the first time and managed a $630M product line at Cisco. Lodge holds a Master of Engineering from University of York, where he graduated with honors.
 
About Anaconda Powered by Continuum Analytics
Anaconda is the leading data science platform powered by Python, the fastest growing data science language with more than 30 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io

 

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

by swebster at August 24, 2017 03:56 PM

Continuum Analytics

Continuum Analytics Welcomes Mathew Lodge as SVP Products and Marketing

Continuum Analytics, the company behind Anaconda, the leading Python data science platform, today announced Mathew Lodge as the company’s new senior vice president (SVP) of products and marketing.

by Team Anaconda at August 24, 2017 04:47 AM

August 20, 2017

numfocus

How to Compare Photos of the Solar Eclipse using Python & SunPy

The following post was written by Steven Christe of SunPy. A Rare Opportunity to View the Solar Corona A solar eclipse presents a unique opportunity for us to view the solar corona or solar atmosphere. In visible light, the corona is about one million times fainter than the sun, or about as bright as the moon. […]

by NumFOCUS Staff at August 20, 2017 05:39 PM

August 17, 2017

Titus Brown

Thoughts on my Salmon review

I was one of the reviewers of the Salmon paper by Patro et al., 2017, Salmon provides fast and bias-aware quantification of transcript expression, and I posted my review in large part because of Lior Pachter's blog post levying charges of intellectual theft and dishonesty against the Salmon authors. More recently, the Salmon authors posted a strong rebuttal that is worth reading.

I posted my review verbatim, without any editing, and with the charge for the review laid out by the journal. Part of my goal was to provide some insight into why I was enthusiastic about Salmon, and why I reviewed it positively for publication.


Looking back on my review, I have a few thoughts - especially in light of Lior's hit piece and Rob et al's response.

First, it is always a pleasure reviewing software papers when you have used the tool for ages. My lab uses Salmon, trains people in running Salmon, and passes around blog posts and papers discussing Salmon results. I think we had been using Salmon for at least a year when I was asked to review the paper. So I was already pretty confident that Salmon performed well in general and I did not need to try to run it, evaluate it, etc.

Second, my review hit on a number of high points about Salmon: generally, that it was an important new tool of interest and that it was a significant improvement to the tried-and-tested Kallisto technique. More specifically, I felt that the GC bias correction was likely to be critical in the future (see esp the isoform switching observation!), and that the open-source license was an important aspect of the Salmon software. To my mind this meant that it warranted publication, a case I made succinctly but tried to make strongly.

Third, my review missed at least one very important aspect, which was the streaming nature of the Salmon implementation. The rebuttal covers this nicely...

Last, my review uses the term "pseudoalignment" and not "quasimapping." The distinction between these two terms is part of the brouhaha; but to my mind the basic idea of finding equivalence classes for read mapping is pretty similar no matter what you call it, and it was clear (both from the citations to Kallisto and the discussion in the paper) that the core concept had been published as part of the Kallisto work. I don't think it matters what it's called - it's a good idea, I'm glad that there are multiple implementations that agree well with each other, and I think it was adequately, appropriately, and generously cited in the Salmon paper.

So, anyway, there are my thoughts. Yours welcome below!

(Before you comment, please do see the code of conduct for this blog - it's linked below. In particular, I place myself under no obligation to retain posted comments, and I will delete comments that are useless, or dishonest.)

--titus

by C. Titus Brown at August 17, 2017 10:00 PM

Continuum Analytics

Continuum Analytics to Share Insights at JupyterCon 2017

Continuum Analytics, the creator and driving force behind Anaconda, the leading Python data science platform, today announced that the team will present one keynote, three talks and two tutorials at JupyterCon on August 23 and 24 in NYC, NY.

by Team Anaconda at August 17, 2017 04:41 AM

August 16, 2017

Continuum Analytics news

Continuum Analytics to Share Insights at JupyterCon 2017

Thursday, August 17, 2017

Presentation topics include Jupyter and Anaconda in the enterprise; open innovation in a data-centric world; building an Excel-Python bridge; encapsulating data science using Anaconda Project and JupyterLab; deploying Jupyter dashboards for datapoints; JupyterLab

NEW YORK, August 17, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading Python data science platform, today announced that the team will present one keynote, three talks and two tutorials at JupyterCon on August 23 and 24 in NYC, NY. The event is designed for the data science and business analyst community and offers in-depth trainings, insightful keynotes, networking events and talks exploring the Project Jupyter platform.

Peter Wang, co-founder and CTO of Continuum Analytics, will present two sessions on August 24. The first is a keynote at 9:15 am, titled “Jupyter & Anaconda: Shaking Up the Enterprise.” Peter will discuss the co-evolution of these two major players in the new open source data science ecosystem and next steps to a sustainable future. The other is a talk, “Fueling Open Innovation in a Data-Centric World,” at 11:55 am, offering Peter’s perspectives on the unique challenges of building a company that is fundamentally centered around sustainable open source innovation.

The second talk features Christine Doig, senior data scientist, product manager, and Fabio Pliger, software engineer, of Continuum Analytics, “Leveraging Jupyter to build an Excel-Python Bridge.” It will take place on August 24 at 11:05 am and Christine and Fabio will share how they created a native Microsoft Excel plug-in that provides a point-and-click interface to Python functions, enabling Excel analysts to use machine learning models, advanced interactive visualizations and distributed compute frameworks without needing to write any code. Christine will also be holding a talk on August 25 at 11:55 am on “Data Science Encapsulation and Deployment with Anaconda Project & JupyterLab.” Christine will share how Anaconda Project and JupyterLab encapsulate data science and how to deploy self-service notebooks, interactive applications, dashboards and machine learning.

James Bednar, senior solutions architect, and Philipp Rudiger, software developer, of Continuum Analytics, will give a tutorial on August 23 at 1:30 pm titled, “Deploying Interactive Jupyter Dashboards for Visualizing Hundreds of Millions of Datapoints.” This tutorial will explore an overall workflow for building interactive dashboards, visualizing billions of data points interactively in a Jupyter notebook, with graphical widgets allowing control over data selection, filtering and display options, all using only a few dozen lines of code.

The second tutorial, “JupyterLab,” will be hosted by Steven Silvester, software engineer at Continuum Analytics and Jason Grout, software developer at Bloomberg, on August 23 at 1:30 pm. They will walk through JupyterLab as a user and as an extension author, exploring its capabilities and offering a demonstration on how to create a simple extension to the environment.

Keynote:
WHO: Peter Wang, co-founder and CTO, Anaconda Powered by Continuum Analytics
WHAT: Jupyter & Anaconda: Shaking Up the Enterprise
WHEN: August 24, 9:15am-9:25am ET
WHERE: Grand Ballroom

Talk #1:
WHO: Peter Wang, co-founder and CTO, Anaconda Powered by Continuum Analytics
WHAT: Fueling Open Innovation in a Data-Centric World
WHEN: August 24, 11:55am–12:35pm ET
WHERE: Regent Parlor

Talk #2:
WHO: 

  • Christine Doig, senior data scientist, product manager, Anaconda Powered by Continuum Analytics
  • Fabio Pliger, software engineer, Anaconda Powered by Continuum Analytics

WHAT: Leveraging Jupyter to Build an Excel-Python Bridge
WHEN: August 24, 11:05am–11:45am ET
WHERE: Murray Hill

Talk #3:
WHO: Christine Doig, senior data scientist, product manager, Anaconda Powered by Continuum Analytics
WHAT: Data Science Encapsulation and Deployment with Anaconda Project & JupyterLab
WHEN: August 25, 11:55am–12:35pm ET
WHERE: Regent Parlor

Tutorial #1:
WHO: 

  • James Bednar, senior solutions architect, Anaconda Powered By Continuum Analytics 
  • Philipp Rudiger, software developer, Anaconda Powered By Continuum Analytics 

WHAT: Deploying Interactive Jupyter Dashboards for Visualizing Hundreds of Millions of Datapoints
WHEN: August 23, 1:30pm–5:00pm ET
WHERE: Concourse E

Tutorial #2:
WHO: 

  • Steven Silvester, software engineer, Anaconda Powered By Continuum Analytics 
  • Jason Grout, software developer of Bloomberg

WHAT: JupyterLab Tutorial
WHEN: August 23, 1:30pm–5:00pm ET
WHERE: Concourse A

###

About Anaconda Powered by Continuum Analytics
Anaconda is the leading data science platform powered by Python, the fastest growing data science language with more than 30 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

 

by swebster at August 16, 2017 03:12 PM

August 15, 2017

Titus Brown

My Salmon review

I was one of the reviewers of the Salmon paper by Patro et al., 2017, Salmon provides fast and bias-aware quantification of transcript expression. I was asked to review the paper on September 14, 2016, and submitted my review (or at least stopped getting reminders :) soon after October 20th.

The review request stated,

... I would be grateful for your views on both the technical merits of this work and its interest to a broad readership.

Please try the tool and report your experience. Let me know if you have issues with access.

Brief Communications are intended to report concise studies of high quality and broad interest but are expected to be less substantial than Articles. Typically, a Brief Communication will describe a significant improvement to a tried-and-tested technique, its adaptation for a new application, or an important new tool of interest to the scientific community.

and I reviewed the paper as requested.

Below is the full text of my review as I submitted it. (Note that I have not been posting my reviews lately because I've been busy, and I don't have a simple automated system for doing it. In this case the review seems like it might be of general interest.)


The authors present salmon, a new RNAseq quantification tool that uses pseudoalignment and is highly performant. The significant factors recommending salmon over other tools include speed, high sensitivity, correction for technical biases in sequencing, an open source license, and a robust implementation that is straightforward to use. The paper makes the argument that salmon is the premier tool for RNAseq quantification.

The paper is well written, if unavoidably dense due to the nonsensical space constraints of this journal. The supporting and supplemental material is well written and seems comprehensive. The comparisons with kallisto and express seem fine to me although I note that the x axes of the various figures are zoomed in to a slightly absurd extent; nonetheless, the comparisons support the conclusions reached in the paper.

There are some interesting gems buried deep within the paper. For example, I was struck by the observation that apparent isoform switching may be caused by technical artifacts (supp fig 5). While the format and journal chosen by the authors doesn't permit proper exploration of this, I suspect that the technical bias correction is going to be a very important aspect of this software.

I should also say that various versions of the preprint and software have been out for well over a year, and we and others have been using it since very early on. Others have done independent benchmarks that agree with the conclusions reached in the paper.

Requested changes:

  • the authors should provide an archival location for salmon, and include a DOI to the specific version used in this paper. This can be easily accomplished with e.g. zenodo/github integration.
  • as the authors are putting this forward for general use, I would like at least a brief discussion of how they do software testing and validation. In particular, this can answer the important question of how users and developers can know that future versions of salmon will produce results at least as correct as the results in this paper.

Requested additions:

  • optional but recommended: the documentation should be expanded to include a "getting started" tutorial. Right now a naive user would have no idea where to start.

Typos/corrections:

p3, at the same false discovery rate than those => as those

probabilistic is mis-spelled on p5

p7, reference 5 has "page" in it inappropriately.

by C. Titus Brown at August 15, 2017 10:00 PM

Continuum Analytics

Keeping Anaconda Up To Date

Below is a question that gets asked so often that I decided it would be helpful to publish an answer explaining the various ways in which Anaconda can be kept up to date. The question was originally asked on StackOverflow.  I have Anaconda installed on my computer and I’d like to update it. In Navigator I can …
Read more →

by Team Anaconda at August 15, 2017 04:39 AM

Five Organizations Successfully Fueling Innovation with Data Science

Data science innovation requires availability, transparency and interoperability. But what does that mean in practice? At Anaconda, it means providing data scientists with open source tools that facilitate collaboration; moving beyond analytics to intelligence. Open source projects are the foundation of modern data science and are popping up across industries, making it more accessible, more …
Read more →

by Team Anaconda at August 15, 2017 04:28 AM

August 14, 2017

Continuum Analytics news

Five Organizations Successfully Fueling Innovation with Data Science

Tuesday, August 15, 2017
Christine Doig
Sr. Data Scientist, Product Manager

Data science innovation requires availability, transparency and interoperability. But what does that mean in practice? At Anaconda, it means providing data scientists with open source tools that facilitate collaboration; moving beyond analytics to intelligence. Open source projects are the foundation of modern data science and are popping up across industries, making it more accessible, more interactive and more effective. So, who’s leading the open source charge in the data science community? Here are five organizations to keep your eye on:

1. TaxBrain. TaxBrain is a platform that enables policy makers and the public to simulate and study the effects of tax policy reforms using open source economic models. Using the open source platform, anyone can plug in elements of the administration’s proposed tax policy to get an idea of how it would perform in the real world.

 

2. Recursion Pharmaceuticals. Recursion is a pharmaceutical company dedicated to finding the remedies for rare genetic diseases. Its drug discovery assay is built on an open source software platform, combining biological science with machine learning techniques to visualize cell data and test drugs efficiently. This approach shortens the research and development process, reducing time to market for remedies to these rare genetic diseases. Their goal is to treat 100 diseases by 2026 using this method.

3. The U.S. Government. Under the previous administration, the U.S. government launched Data.gov, an open data initiative that offers more than 197K datasets for public use. This database exists, in part, thanks to the former U.S. chief data scientist, DJ Patil. He helped drive the government’s data science projects forward at the city, state and federal levels. Recently, concerns have been raised over the Data.gov portal, as certain information has started to disappear. Data scientists are keeping a sharp eye on the portal to ensure that these resources are updated and preserved for future innovative projects.

4. Comcast. Telecom and broadcast giant Comcast runs its projects on open source platforms to drive data science innovation in the industry.

For example, earlier this month, Comcast’s advertising branch announced they were creating a Blockchain Insights Platform to make the planning, targeting, execution and measurement of video ads more efficient. This data-driven, secure approach would be a game changer for the advertising industry, which eagerly awaits its launch in 2018.

5. DARPA. The Defense Advanced Research Projects Agency (DARPA) is behind the Memex project, a program dedicated to fighting human trafficking, which is a top mission for the defense department. DARPA estimates that in two years, traffickers spent $250 million posting the temporary advertisements that fuel the human trafficking trade. Using an open source platform, Memex is able to index and cross-reference interactive and social media, text, images and video across the web. This allows them to find the patterns in web data that indicate human trafficking. Memex’s data science approach is already credited with generating at least 20 active cases and nine open indictments.

These are just some of the examples of open source-fueled data science turning industries on their head, bringing important data to the public and generally making the world a better place. What will be the next open source project to put data science in the headlines? Let us know what you think in the comments below!

by swebster at August 14, 2017 06:12 PM

Continuum Analytics

Anaconda Rides its Way into Gartner’s Hype Cycle

If you’re an Anaconda user and/or frequent reader of our blog, then you know how passionate we are about empowering our community (and future community!) with all the resources needed to bring data science to life. And while it will never stop us from tooting the data science horn, it’s always nice to know we …
Read more →

by Team Anaconda at August 14, 2017 04:22 AM

August 11, 2017

numfocus

How rOpenSci uses Code Review to Promote Reproducible Science

This post was co-authored by the rOpenSci Editorial Board: Noam Ross, Scott Chamberlain, Karthik Ram, and Maëlle Salmon. — At rOpenSci, we create and curate software to help scientists with the data life cycle. These tools access, download, manage, and archive scientific data in open, reproducible ways. Early on, we realized this could only be […]

by NumFOCUS Staff at August 11, 2017 05:30 PM

August 08, 2017

Continuum Analytics news

Anaconda Rides its Way into Gartner’s Hype Cycle

Monday, August 14, 2017
Scott Collison
Chief Executive Officer

If you’re an Anaconda user and/or frequent reader of our blog, then you know how passionate we are about empowering our community (and future community!) with all the resources needed to bring data science to life. And while it will never stop us from tooting the data science horn, it’s always nice to know we are in good company. Continuum Analytics was recently included in Gartner’s Hype Cycle for Data Science and Machine Learning for 2017, Peter Krensky and Jim Hare, July 28, 2017.

Per the Gartner report, “The hype around data science and machine learning has increased from already high levels in the past year. Data and analytics leaders should use this Hype Cycle to understand technologies generating excitement and inflated expectations, as well as significant movements in adoption and maturity.”

We couldn’t agree more. 

This report comes on the heels of our inclusion in Gartner’s Cool Vendors in Data Science and Machine Learning, 2017 report, by Peter Krensky, Alexander Linden, Svetlana Sicular and Cindi Howson, May 11, 2017.

One important point is that this was the first time Machine Learning was included in Gartner’s Hype Cycle for Emerging Technologies, which we believe clearly validates the growing importance of data science across the enterprise. While this isn’t news to us, it is important that influencers are acknowledging the shift to a more data-driven enterprise and moving data science from an ‘emerging’ category to ‘established.’ 

We witness this shift everyday as the Anaconda community swells to 4.5M active users with more new members joining daily. Our excitement grows along with every new user because we know the impact data science can have on so many areas––from enabling new and cutting edge innovations to solving some of the world’s biggest challenges and uncovering answers to questions that haven’t even been asked yet.

We’re grateful to our incredible community for bringing our vision to life and truly embracing the superpowers we offer for people who change the world. 

Gartner Disclaimer
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

by swebster at August 08, 2017 05:37 PM