January 18, 2017

Titus Brown

Computational postdoc opening at UC Davis!

We are currently soliciting applications for computational postdoctoral fellows to undertake exciting projects in computational biology/bioinformatics jointly supervised by Dr. Titus Brown (http://ivory.idyll.org/lab/) and Dr. Fereydoun Hormozdiari (http://www.hormozdiarilab.org/) at UC Davis.

UC Davis is a world-class research institution with a strong genomics faculty. In addition to being part of Dr. Brown and Dr. Hormozdiari's labs, the postdoc will be able to participate in Genome Center activities. Potential collaborators include Megan Dennis, Alex Norde, and Paul and Randi Hagerman. UC Davis is close to the Bay Area and there will be opportunities to connect and collaborate with researchers at Berkeley, Stanford, and UCSF as well.

Davis, CA is an excellent place to live with good food, great schools, nice weather, non-Bay-Area housing prices, and a bike-friendly culture.

---

The successful candidate will develop computational methods and tools for better understanding how genetic variation (especially structural variation) alters genome structure. In collaboration with members of both labs, the postdoc will also build models for predicting changes in gene expression from variants (especially CNVs) and perform a comparative study of genome structures across multiple tissues/samples using HiC data.

This opportunity requires developing novel computational algorithms and machine learning methods to solve emerging biological problems. The candidate will need a strong computational background for developing novel combinatorial, machine learning (ML), or statistical inference algorithms, strong programming skills, and a general understanding of concepts in genomics and genetics.

Candidates are guaranteed funding for two years and will be strongly encouraged to apply for external funding in the second year of their postdoc to make a successful transition to an independent investigator position.

Some of the projects to work on include but are not limited to:

• Computational methods to discover and predict structural variations (SVs) that significantly modify genome structure. It has been shown recently that structural variation which modifies TADs (Topologically Associating Domains) can result in genetic disease. As part of this project we are trying to develop methods to predict which SVs will cause such significant modifications, and potentially build a method for ranking/scoring SVs by their pathogenicity in diseases such as autism and cancer.
• Study SVs/CNVs that significantly changed genome structure during (great ape) evolution, and how gene expression changed in each of these species as a result of such variants.
• Develop computational tools for finding conserved and significantly differentiated TADs in two or more samples (from different cell types or species) using HiC data, with application to data from different tissues and/or species.

The start date for this position is flexible, although we hope the successful candidate can start before Sep 1, 2017.

Suggested candidate background:

• Ph.D. in computer science, computational biology or related fields
• Excellent programming skills in at least one language (C/C++, Java or Python)
• Strong written/oral presentation skills
• Enthusiasm for genomics-related problems
• Knowledge of next-generation sequencing technologies and HiC data is a plus.

Interested candidates should send their CV and a research statement to Fereydoun Hormozdiari (email: fhormozd[at]ucdavis.edu) and Titus Brown (email: ctbrown[at]ucdavis.edu).

We will begin review of applications on Feb 1, 2017.

Matthew Rocklin

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2017-01-01 and 2017-01-17. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

1. Stability enhancements for the distributed scheduler and micro-release
2. NASA grant writing
3. Dataframe categorical flexibility (work in progress)
4. Communication refactor (work in progress)

Stability enhancements and micro-release

We’ve released dask.distributed version 1.15.1, which includes important bugfixes after the recent 1.15.0 release. There were a number of small issues that conspired to remove tasks erroneously. This was generally OK because the Dask scheduler was able to heal the missing pieces (using the same machinery that makes Dask resilient), and so we didn’t notice the flaw until the system was deployed in some of the more serious Dask deployments in the wild. PR dask/distributed #804 contains a full writeup in case anyone is interested. The writeup ends with the following line:

This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.

This release also includes other fixes, like one for a compatibility issue with the new Bokeh 0.12.4 release.

NASA Grant Writing

I’ve been writing a proposal to NASA to help fund distributed Dask+XArray work for atmospheric and oceanographic science at the 100TB scale. Many thanks to our scientific collaborators who are offering support here.

The Dask-EC2 project deploys Anaconda, a Dask cluster, and Jupyter notebooks on Amazon’s Elastic Compute Cloud (EC2) with a small command line interface:

pip install dask-ec2 --upgrade

dask-ec2 up --keypair /path/to/ssh-key \
    --type m4.2xlarge \
    --count 8


This project can be very useful both for people just getting started and for Dask developers when we run benchmarks, or it can be horribly broken if AWS or Dask interfaces change and we don’t keep it maintained. Thanks to a great effort from Ben Zaitlen, dask-ec2 is again in the very useful state, where I’m hoping it will stay for some time.

If you’ve always wanted to try Dask on a real cluster and if you already have AWS credentials then this is probably the easiest way.

This already seems to be paying dividends. There have been a few unrelated pull requests from new developers this week.

Dataframe Categorical Flexibility

Categoricals can significantly improve performance on text-based data. Currently Dask’s dataframes support categoricals, but they expect to know all of the categories up-front. This is easy if this set is small, like the ["Healthy", "Sick"] categories that might arise in medical research, but requires a full dataset read if the categories are not known ahead of time, like the names of all of the patients.

Jim Crist is changing this so that Dask can operate on categorical columns with unknown categories at dask/dask #1877. The constituent pandas dataframes all have possibly different categories that are merged as necessary. This distinction may seem small, but it limits performance in a surprising number of real-world use cases.
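The pandas machinery this builds on can be sketched with union_categoricals, which merges category sets that were discovered independently per partition (the data below is made up):

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two partitions that learned different category sets independently,
# as happens when categorizing without a full dataset read.
a = pd.Categorical(["Alice", "Bob"])
b = pd.Categorical(["Bob", "Carol"])

# Merge without knowing the global category set up front.
merged = union_categoricals([a, b])
print(list(merged.categories))  # ['Alice', 'Bob', 'Carol']
```

This is the merge-as-necessary step the paragraph above describes, applied whenever partitions with differing categories meet.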

Communication Refactor

Since the recent worker refactor and optimizations it has become clear that inter-worker communication has become a dominant bottleneck in some intensive applications. Antoine Pitrou is currently refactoring Dask’s network communication layer, making room for more communication options in the future. This is an ambitious project. I for one am very happy to have someone like Antoine looking into this.

January 17, 2017

Titus Brown

Categorizing 400,000 microbial genome shotgun data sets from the SRA

A few months ago I was at the Woods Hole MBL Microbial Diversity course, and I ran across Mihai Pop, who was teaching at the STAMPS Microbial Ecology course. Mihai is an old friend who shares my interest in microbial genomes and assembly and other such stuff, and during our conversation he pointed out that there were many unassembled microbial genomes sitting in the Sequence Read Archive.

The NCBI Sequence Read Archive is one of the primary archives for biological sequencing data, but it generally holds only the raw sequencing data; assemblies and analysis products go elsewhere. It's also largely unsearchable by sequence: you can search an individual data set with BLAST, I think, but you can't search multiple data sets (because each data set is large, and the search functionality to handle it doesn't really exist). There have been some attempts to make it searchable, including most notably Brad Solomon and Carl Kingsford's Sequence Bloom Tree paper (also on biorxiv, and see my review), but it's still not straightforward.

Back to Mihai - Mihai told me that there were several hundred thousand microbial WGS samples in the SRA for which assemblies were not readily available. That got me kind of interested, and -- combined with my interest in indexing all the microbial genomes for MinHash searching -- led to... well, read on!

How do you dance lightly across the surface of 400,000 data sets?

tl;dr? To avoid downloading all the things, we're sipping from the beginning of each SRA data set only.

The main problem we faced in looking at the SRA is that whole genome shotgun data sets are individually rather large (typically at least 500 MB to 1 GB), and we have no special access to the SRA, so we were looking at a 200-400 TB download. Luiz Irber found that NCBI seems to throttle downloads to about 100 Mbps, so we calculated that grabbing the 400k samples would significantly extend his PhD.
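The back-of-envelope, using the rough per-sample sizes above:

```python
# 400,000 samples at roughly 0.5-1 GB each.
samples = 400_000
low_tb = samples * 0.5e9 / 1e12    # 200 TB at 500 MB/sample
high_tb = samples * 1.0e9 / 1e12   # 400 TB at 1 GB/sample

# At ~100 Mbps sustained, even the low estimate is about half a year
# of continuous downloading.
seconds = low_tb * 1e12 * 8 / 100e6
days = seconds / 86400             # ~185 days
```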

But, not only is the data volume quite large, the samples themselves are mildly problematic: they're not assembled or error trimmed, so we had to develop a way to error trim them in order to minimize spurious k-mer presence.

We tackled these problems in several ways:

• Luiz implemented a distributed system to grab SRA samples and compute MinHash sketch signatures on them with sourmash; he then ran this 50x across Rackspace, Google Compute Engine, and the MSU High Performance Compute cluster (see blog post);

To quote, "Just to keep track: we are posting Celery tasks from a Rackspace server to Amazon SQS, running workers inside Docker managed by Kubernetes on GCP, putting results on Amazon S3 and finally reading the results on Rackspace and then posting it to IPFS."

This meant we were no longer dependent on a single node, or even on a single compute solution. w00t!

• We needed a way to quickly and efficiently error trim the WGS samples. In MinHash land, this means walking through reads and finding "true" k-mers based on their abundance in the read data set.

Thanks to khmer, we already have ways of doing this on a low-memory streaming basis, so we started with that (using trim-low-abund.py).

• Because whole-genome shotgun data is generally pretty high coverage, we guessed that we could get away with computing signatures on only a small subset of the data. After all, if you have 100x coverage sample, and you only need 5x coverage to build a MinHash signature, then you only need to look at 5% of the data!

The fastq-dump program has a streaming output mode, and both khmer and sourmash support streaming I/O, so we could do all this computing progressively. The question was, how do we know when to stop?

Our first attempt was to grab the first million reads from each sample, and then abundance-trim them, and MinHash them. Luiz calculated that (with about 50 workers going over the holiday break) this would take about 3 weeks to run on the 400,000 samples.

Fortunately, due to a bug in my categorization code, we thought that this approach wasn't working. I say "fortunately" because in attempting to fix the wrong problem, we came across a much better solution :).

For mark 2 of streaming, some basic experimentation suggested that we could get a decent match when searching a sample against known microbial genomes with only about 20% of the genome. For E. coli, this is about 1m bases, which is about 1m k-mers.

So I whipped together a program called syrah that reads FASTA/FASTQ sequences and outputs high-abundance regions of the sequences until it has seen 1m k-mers. Then it exits, terminating the stream.
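In outline (this is a toy version of the idea, not syrah's actual code), the filter looks like:

```python
from collections import Counter

def syrah_like(reads, k=21, min_abund=2, max_kmers=1_000_000):
    """Toy sketch of the syrah idea: stream reads, track k-mer
    abundances, emit reads whose k-mers have been seen before (and so
    are likely error-free), and stop once enough high-abundance
    k-mers have gone by."""
    counts = Counter()
    emitted = 0
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        counts.update(kmers)
        solid = [km for km in kmers if counts[km] >= min_abund]
        if solid:
            yield read
            emitted += len(solid)
        if emitted >= max_kmers:
            return  # terminate the stream early

# A read seen at high coverage passes; a one-off (likely error) does not.
reads = ["ACGTACGTACGTACGTACGTACGT"] * 3 + ["TTTTGGGGCCCCAAAATTTTGGGG"]
passed = list(syrah_like(reads))
```

Terminating the generator early is what lets the upstream fastq-dump download stop before pulling the whole sample.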

This is nice and simple to use with fastq-dump and sourmash --

fastq-dump -A {sra_id} -Z | syrah | \
    sourmash compute -k 21 --dna - -o {output} --name {sra_id}


and when Luiz tested it out we found that it was 3-4x faster than our previous approach, because it tended to terminate much earlier in the stream and hence downloaded less data. (See the final command here.)

At this point we were down to an estimated 5 days for computing about 400,000 sourmash signatures on the microbial genomes section of the SRA. That was fast enough even for grad students in a hurry :).

Categorizing 400,000 sourmash signatures... quickly!

tl;dr? We sped up the sourmash Sequence Bloom Tree search functionality, like, a lot.

Now we had the signatures! Done, right? We just categorize 'em all! How long can that take!?

Well, no. It turns out when operating at this scale even the small things take too much time!

We knew from browsing the SRA metadata that most of the samples were likely to be strain variants of human pathogens, which are very well represented in the microbial RefSeq. Conveniently, we already had prepared those for search. So my initial approach to looking at the signatures was to compare them to the 52,000 microbial RefSeq genomes, and screen out those that could be identified at k=21 as something known. This would leave us with the cool and interesting unknown/unidentifiable SRA samples.

I implemented a new sourmash subcommand, categorize, that took in a list (or a directory) full of sourmash signatures and searched them individually against a Sequence Bloom Tree of signatures. The output was a CSV file of categorized signatures, with each entry containing the best match to a given signature against the entire SBT.
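Ignoring the tree structure that makes it fast, the search is a best-match-by-Jaccard over reference signatures; a toy flat version (all names and hash values below are illustrative, and sourmash's categorize walks an SBT rather than a list):

```python
def jaccard(sig_a, sig_b):
    """Estimated Jaccard similarity between two MinHash signatures,
    represented here simply as sets of hash values."""
    a, b = set(sig_a), set(sig_b)
    return len(a & b) / len(a | b)

def categorize_flat(query, references, threshold=0.08):
    """Brute-force stand-in for the SBT search: report the best
    reference match at or above the threshold, else (None, 0.0)."""
    best_name, best_score = None, 0.0
    for name, ref in references.items():
        score = jaccard(query, ref)
        if score >= threshold and score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

refs = {"ecoli": {1, 2, 3, 4, 5}, "staph": {10, 11, 12, 13, 14}}
name, score = categorize_flat({1, 2, 3, 9, 10}, refs)
print(name)  # ecoli
```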

The command looks like this:

sourmash categorize --csv categories.csv \
    -k 21 --dna --traverse-directory syrah microbes.sbt.json


and the default threshold is 8%, which is just above random background.

This worked great! It took about 1-3 seconds per genome. For 400,000 signatures that would take... 14 days. Sigh. Even if we parallelized that it was annoyingly slow.

So I dug into the source code and found that the problem was our YAML signature format, which was slow as a dog. When searching the SBT, each leaf node was stored in YAML and loading this was consuming something like 80% of the time.

My first solution was to cache all the signatures, which worked great but consumed about a GB of RAM. Now we could search each signature in about half a second.
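The caching fix amounts to memoizing the leaf-loading step; a minimal sketch (not sourmash's actual code, and the file layout here is made up):

```python
import functools
import json
import os
import tempfile

@functools.lru_cache(maxsize=None)
def load_signature(path):
    """Parse a leaf signature once; repeat lookups hit the in-memory
    cache instead of re-reading and re-parsing the file."""
    with open(path) as f:
        return json.load(f)

# Demo: the second load of the same leaf does no file I/O.
path = os.path.join(tempfile.mkdtemp(), "leaf.sig")
with open(path, "w") as f:
    json.dump({"mins": [1, 2, 3]}, f)

sig = load_signature(path)
hits_before = load_signature.cache_info().hits
load_signature(path)
assert load_signature.cache_info().hits == hits_before + 1
```

The trade-off is exactly the one mentioned above: everything loaded stays resident in RAM.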

In the meantime, Laurent Gautier had discovered the same problem in his work and he came along and reimplemented signature storage in JSON, which was 10-20x faster and was a way better permanent solution. So now we have JSON as the default sourmash signature format, huzzah!

At this point I could categorize about 200,000 signatures in 1 day on an AWS m4.xlarge, when running 8 categorize tasks in parallel (on a single machine). That was fast enough for me.

It's worth noting that we explicitly opted for separating the signature creation from the categorization, because (a) the signatures themselves are valuable, and (b) we were sure the signature generation code was reasonably bug free but we didn't know how much iteration we would have to do on the categorization code. If you're interested in calculating and categorizing signatures directly from streaming FASTQ, see sourmash watch. But Buyer Beware ;).

Results! What are the results?!

For 361,077 SRA samples, we cannot identify 8707 against the 52,000 RefSeq microbial genomes. That's about 2.4%.

Most of the 340,000+ samples are human pathogens. I can do a breakdown later, but it's all E. coli, staph, tuberculosis, etc.

From the 8707 unidentified, I randomly chose and downloaded 34 entire samples. I ran them all through the MEGAHIT assembler, and 27 of them assembled (the rest looked like PacBio, which MEGAHIT doesn't assemble). Of the 27, 20 could not be identified against the RefSeq genomes. This suggests that about 60% of the 8707 samples (5200 or so) are samples that are (a) Illumina sequence, (b) assemble-able, and (c) not identifiable.

You can get the CSV of categorized samples here (it's about 5 MB, .csv.gz).

What next?

Well, there are a few directions --

• we have about 350,000 SRA samples identified based on sequence content now. We should cross-check that against the SRA metadata to see where the metadata is wrong or incomplete.
• we could do bulk strain analyses of a variety of human pathogens at this point, if we wanted.
• we can pursue the uncategorized/uncategorizable samples too, of course! There are a few strategies we can try here but I think the best strategy boils down to assembling them, annotating them, and then using protein-based comparisons to identify nearest known microbes. I'm thinking of trying phylosift. (See Twitter conversation 1 and Twitter conversation 2.)
• we should cross-compare uncategorized samples!

At this point I'm not 100% sure what we'll do next - we have some other fish to fry in the sourmash project first, I think - but we'll see. Suggestions welcome!

A few points based partly on reactions to the Twitter conversations (1) and (2) about what to do --

• mash/MinHash comparisons aren't going to give us anything interesting, most likely; that's what's leading to our list of uncategorizables, after all.
• I'm skeptical that nucleotide level comparisons of any kind (except perhaps of SSU/16s genes) will get us anywhere.
• functional analysis seems secondary to figuring out what branch of bacteria they are, but maybe I'm just guilty of name-ism here. Regardless, if we were to do any functional analysis for e.g. metabolism, I'd want to do it on all of 'em, not just the identified ones.

Backing up -- why would you want to do any of this?

No, I'm not into doing this just for the sake of doing it ;). Here's some of my (our) motivations:

• It would be nice to make the entire SRA content searchable. This is particularly important for non-model genomic/transcriptomic/metagenomic folk who are looking for resources.
• I think a bunch of the tooling we're building around sourmash is going to be broadly useful for lots of people who are sequencing lots of microbes.
• Being able to scale sourmash to hundreds of thousands (and millions and eventually billions) of samples is going to be, like, super useful.
• More generally, this is infrastructure to support data-intensive biology, and I think this is important. Conveniently the Moore Foundation has funded me to develop stuff like this.
• I'm hoping I can tempt the grey (access restricted, etc.) databases into indexing their (meta)genomes and transcriptomes and making the signatures available for search. See e.g. "MinHash signatures as ways to find samples, and collaborators?".

Also, I'm starting to talk to some databases about getting local access to do this to their data. If you are at, or know of, a public database that would like to cooperate with this kind of activity, let's chat -- titus@idyll.org.

--titus

Continuum Analytics news

Announcing General Availability of conda 4.3

Wednesday, January 18, 2017
Kale Franz
Continuum Analytics

We're excited to announce that conda 4.3 has been released for general availability. The 4.3 release series has several new features and substantial improvements. Below is a summary.

To get the latest, just run conda update conda.

New Features

• Unlink and Link Packages in a Single Transaction: In the past, conda hasn't always been safe and defensive with its disk-mutating actions. It has gleefully clobbered existing files; mid-operation failures left environments completely broken. In some of the most severe examples, conda can appear to "uninstall itself." With this release, the unlinking and linking of packages for an executed command is done in a single transaction. If a failure occurs for any reason while conda is mutating files on disk, the environment will be returned to its previous state. While we've implemented some pre-transaction checks (verifying package integrity for example), it's impossible to anticipate every failure mechanism. In some circumstances, OS file permissions cannot be fully known until an operation is attempted and fails, and conda itself is not without bugs. Moving forward, unforeseeable failures won't be catastrophic.
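The pattern can be illustrated with a toy undo-log transaction (an illustration only, not conda's implementation):

```python
class Transaction:
    """Record an undo action for every mutation; replay the undo log
    in reverse if any step fails, restoring the previous state."""
    def __init__(self):
        self.undo_log = []

    def do(self, action, undo):
        action()
        self.undo_log.append(undo)

    def rollback(self):
        for undo in reversed(self.undo_log):
            undo()

env = {"pkg-a": 1}   # stand-in for files in an environment
txn = Transaction()
try:
    txn.do(lambda: env.pop("pkg-a"), lambda: env.__setitem__("pkg-a", 1))
    txn.do(lambda: env.__setitem__("pkg-b", 2), lambda: env.pop("pkg-b"))
    raise OSError("permission denied mid-transaction")
except OSError:
    txn.rollback()   # the environment returns to its previous state

assert env == {"pkg-a": 1}
```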

• Progressive Fetch and Extract Transactions: Like package unlinking and linking, the download and extract phases of package handling have also been given transaction-like behavior. The distinction is that the rollback on error is limited to a single package. Rather than rolling back the download and extract operation for all packages, the single-package rollback avoids re-downloading every package when an error is encountered.

• Generic- and Python-Type Noarch/Universal Packages: Along with conda-build 2.1, a noarch/universal type for Python packages is officially supported. These are much like universal Python wheels. Files in a Python noarch package are linked into a prefix just like any other conda package, with the following additional features:

1. conda maps the site-packages directory to the correct location for the Python version in the environment,
2. conda maps the python-scripts directory to either $PREFIX/bin or $PREFIX/Scripts depending on platform,
3. conda creates the Python entry points specified in the conda-build recipe, and
4. conda compiles pyc files at install time when prefix write permissions are guaranteed.

Python noarch packages must be "fully universal." They cannot have OS- or Python version-specific dependencies. They cannot have OS- or Python version-specific "scripts" files. If these features are needed, traditional conda packages must be used.
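A sketch of the path mapping in points 1 and 2 (illustrative only, not conda's code; the helper name is hypothetical):

```python
def map_noarch_path(path, prefix, platform="linux", py="3.6"):
    """Map a file path inside a Python noarch package to its final
    location in an environment prefix."""
    if path.startswith("site-packages/"):
        rest = path[len("site-packages/"):]
        if platform.startswith("win"):
            return f"{prefix}/Lib/site-packages/{rest}"
        return f"{prefix}/lib/python{py}/site-packages/{rest}"
    if path.startswith("python-scripts/"):
        rest = path[len("python-scripts/"):]
        bindir = "Scripts" if platform.startswith("win") else "bin"
        return f"{prefix}/{bindir}/{rest}"
    return f"{prefix}/{path}"

print(map_noarch_path("site-packages/mod.py", "/opt/env"))
# /opt/env/lib/python3.6/site-packages/mod.py
```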

• Multi-User Package Caches: While the on-disk package cache structure has been preserved, the core logic implementing package cache handling has had a complete overhaul. Writable and read-only package caches are fully supported.

• Python API Module: An oft-requested feature is the ability to use conda as a Python library, obviating the need to "shell out" to another Python process. Conda 4.3 includes a conda.cli.python_api module that facilitates this use case. While we maintain the user-facing command-line interface, conda commands can be executed in-process. There is also a conda.exports module to facilitate longer-term usage of conda as a library across conda releases. However, conda's Python code is considered internal and private, subject to change at any time across releases. At the moment, conda will not install itself into environments other than its original install environment.

• Remove All Locks: Locking has never been fully effective in conda, and it often created a false sense of security. In this release, multi-user package cache support has been implemented for improved safety by hard-linking packages in read-only caches to the user's primary user package cache. Still, users are cautioned that undefined behavior can result when conda is running in multiple processes and operating on the same package caches and/or environments.

Deprecations/Breaking Changes

• Conda now has the ability to refuse to clobber existing files that are not within the unlink instructions of the transaction. This behavior is configurable via the path_conflict configuration option, which has three possible values: clobber, warn, and prevent. In 4.3, the default value is clobber. This preserves existing behaviors, and it gives package maintainers time to correct current incompatibilities within their package ecosystem. In 4.4, the default will switch to warn, which means these operations continue to clobber, but the warning messages are displayed. In 4.5, the default value will switch to prevent. As we tighten up the path_conflict constraint, a new command line flag --clobber will loosen it back up on an ad hoc basis. Using --clobber overrides the setting for path_conflict to effectively be clobber for that operation.

• Conda signed packages have been removed in 4.3. Vulnerabilities existed, and an illusion of security is worse than not having the feature at all. We will be incorporating The Update Framework (TUF) into conda in a future feature release.

• Conda 4.4 will drop support for older versions of conda-build.

Other Notable Improvements

• A new "trace" log level is added, with output that is extremely verbose. To enable it, use -v -v -v or -vvv as a command-line flag, set a verbose: 3 configuration parameter, or set a CONDA_VERBOSE=3 environment variable.

• The r channel is now part of the default channels.

• Package resolution/solver hints have been improved with better messaging.

Matthew Rocklin

Distributed NumPy on a Cluster with Dask Arrays

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

This page includes embedded large profiles. It may look better on the actual site than through syndicated pages like planet.python, and it may take a while to load on non-broadband connections (total size is around 20MB).

Summary

We analyze a stack of images in parallel with NumPy arrays distributed across a cluster of machines on Amazon’s EC2 with Dask array. This is a model application shared among many image analysis groups ranging from satellite imagery to bio-medical applications. We go through a series of common operations:

1. Inspect a sample of images locally with Scikit Image
2. Construct a distributed Dask.array around all of our images
3. Process and re-center images with Numba
4. Transpose data to get a time-series for every pixel, compute FFTs

This last step is quite fun. Even if you skim through the rest of this article I recommend checking out the last section.

Inspect Dataset

I asked a colleague at the US National Institutes of Health (NIH) for a biggish imaging dataset. He came back with the following message:

Electron microscopy may be generating the biggest ndarray datasets in the field - terabytes regularly. Neuroscience needs EM to see connections between neurons, because the critical features of neural synapses (connections) are below the diffraction limit of light microscopes. This type of research has been called “connectomics”. Many groups are looking at machine vision approaches to follow small neuron parts from one slice to the next.

This data is from drosophila: http://emdata.janelia.org/. Here is an example 2d slice of the data http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_5000.

import skimage.io
import matplotlib.pyplot as plt

sample = skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_5000')
skimage.io.imshow(sample)


The last number in the URL is an index into a large stack of about 10000 images. We can change that number to get different slices through our 3D dataset.

samples = [skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % i)
           for i in [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]]

fig, axarr = plt.subplots(1, 9, sharex=True, sharey=True, figsize=(24, 2.5))
for i, sample in enumerate(samples):
    axarr[i].imshow(sample, cmap='gray')


We see that our field of interest wanders across the frame over time and drops off in the beginning and at the end.

Create a Distributed Array

Even though our data is spread across many files, we still want to think of it as a single logical 3D array. We know how to get any particular 2D slice of that array using Scikit-image. Now we’re going to use Dask.array to stitch all of those Scikit-image calls into a single distributed array.

import dask.array as da
from dask import delayed

imread = delayed(skimage.io.imread, pure=True)  # Lazy version of imread

urls = ['http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % i
        for i in range(10000)]                  # A list of our URLs

lazy_values = [imread(url) for url in urls]     # Lazily evaluate imread on each url

arrays = [da.from_delayed(lazy_value,           # Construct a small Dask array
                          dtype=sample.dtype,   # for every lazy value
                          shape=sample.shape)
          for lazy_value in lazy_values]

stack = da.stack(arrays, axis=0)                # Stack all small Dask arrays into one

>>> stack
dask.array<shape=(10000, 2000, 2000), dtype=uint8, chunksize=(1, 2000, 2000)>

>>> stack = stack.rechunk((20, 2000, 2000))     # combine chunks to reduce overhead
>>> stack
dask.array<shape=(10000, 2000, 2000), dtype=uint8, chunksize=(20, 2000, 2000)>


So here we’ve constructed a lazy Dask.array from 10 000 delayed calls to skimage.io.imread. We haven’t done any actual work yet, we’ve just constructed a parallel array that knows how to get any particular slice of data by downloading the right image if necessary. This gives us a full NumPy-like abstraction on top of all of these remote images. For example we can now download a particular image just by slicing our Dask array.

>>> stack[5000, :, :].compute()
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

>>> stack[5000, :, :].mean().compute()
11.49902425


However we probably don’t want to operate too much further without connecting to a cluster. That way we can just download all of the images once into distributed RAM and start doing some real computations. I happen to have ten m4.2xlarges on Amazon’s EC2 (8 cores, 30GB RAM each) running Dask workers. So we’ll connect to those.

from dask.distributed import Client, progress

client = Client('scheduler-address:8786')

>>> client
<Client: scheduler="scheduler-address:8786" processes=10 cores=80>


I’ve replaced the actual address of my scheduler (something like 54.183.180.153) with scheduler-address. Let’s go ahead and bring in all of our images, persisting the array into concrete data in memory.

stack = client.persist(stack)


This starts downloads of our 10 000 images across our 10 workers. When this completes we have 10 000 NumPy arrays spread around on our cluster, coordinated by our single logical Dask array. This takes a while, about five minutes. We’re mostly network bound here (Janelia’s servers are not co-located with our compute nodes). Here is a parallel profile of the computation as an interactive Bokeh plot.

There will be a few of these profile plots throughout the blogpost, so you might want to familiarize yourself with them now. Every horizontal rectangle in this plot corresponds to a single Python function running somewhere in our cluster over time. Because we called skimage.io.imread 10 000 times there are 10 000 purple rectangles. Their position along the y-axis denotes which of the 80 cores in our cluster they ran on and their position along the x-axis denotes their start and stop times. You can hover over each rectangle (function) for more information on what kind of task it was, how long it took, etc. In the image below, purple rectangles are skimage.io.imread calls and red rectangles are data transfer between workers in our cluster. Click the magnifying glass icons in the upper right of the image to enable zooming tools.

Now that we have persisted our Dask array in memory our data is based on hundreds of concrete in-memory NumPy arrays across the cluster, rather than based on hundreds of lazy scikit-image calls. Now we can do all sorts of fun distributed array computations more quickly.

For example we can easily see our field of interest move across the frame by averaging across time:

skimage.io.imshow(stack.mean(axis=0).compute())


Or we can see when the field of interest is actually present within the frame by averaging across x and y

plt.plot(stack.mean(axis=[1, 2]).compute())


By looking at the profile plots for each case we can see that averaging over time involves much more inter-node communication, which can be quite expensive in this case.

Recenter Images with Numba

In order to remove the spatial offset across time we’re going to compute a centroid for each slice and then crop the image around that center. I looked up centroids in the Scikit-Image docs and came across a function that did way more than what I was looking for, so I just quickly coded up a solution in Pure Python and then JIT-ed it with Numba (which makes this run at C-speeds).

from numba import jit

@jit(nogil=True)
def centroid(im):
    n, m = im.shape
    total_x = 0
    total_y = 0
    total = 0
    for i in range(n):
        for j in range(m):
            total += im[i, j]
            total_x += i * im[i, j]
            total_y += j * im[i, j]

    if total > 0:
        total_x /= total
        total_y /= total
    return total_x, total_y

>>> centroid(sample)  # this takes around 9ms
(748.7325324581344, 802.4893005160851)
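As a quick sanity check (not from the original post), the same centroid can be computed with vectorized NumPy; the Numba loop above avoids the temporary arrays this version allocates:

```python
import numpy as np

# Vectorized cross-check of the centroid loop above (illustrative only).
def centroid_np(im):
    total = im.sum()
    if total == 0:
        return 0.0, 0.0
    i = np.arange(im.shape[0])
    j = np.arange(im.shape[1])
    x = float((i[:, None] * im).sum() / total)  # weighted mean row index
    y = float((j[None, :] * im).sum() / total)  # weighted mean column index
    return x, y

im = np.zeros((9, 9))
im[6, 2] = 1.0  # single bright pixel
print(centroid_np(im))  # (6.0, 2.0)
```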

def recenter(im):
    x, y = centroid(im.squeeze())
    x, y = int(x), int(y)
    if x < 500:
        x = 500
    if y < 500:
        y = 500
    if x > 1500:
        x = 1500
    if y > 1500:
        y = 1500

    return im[..., x-500:x+500, y-500:y+500]

plt.figure(figsize=(8, 8))
skimage.io.imshow(recenter(sample))


Now we map this function across our distributed array.

import numpy as np

def recenter_block(block):
    """ Recenter a short stack of images """
    return np.stack([recenter(block[i]) for i in range(block.shape[0])])

recentered = stack.map_blocks(recenter_block,
                              chunks=(20, 1000, 1000),  # chunk size changes
                              dtype=stack.dtype)
recentered = client.persist(recentered)


This profile provides a good opportunity to talk about a scheduling failure; things went a bit wrong here. Towards the beginning we quickly recenter several images (Numba is fast), taking around 300-400ms for each block of twenty images. However as some workers finish all of their allotted tasks, the scheduler erroneously starts to load balance, moving images from busy workers to idle workers. Unfortunately the network at this time appeared to be much slower than expected and so the move + compute elsewhere strategy ended up being much slower than just letting the busy workers finish their work. The scheduler keeps track of expected compute times and transfer times precisely to avoid mistakes like this one. These sorts of issues are rare, but do occur on occasion.

We check our work by averaging our re-centered images across time and displaying that to the screen. We see that our images are better centered with each other as expected.

skimage.io.imshow(recentered.mean(axis=0).compute())


This shows how easy it is to create fast in-memory code with Numba and then scale it out with Dask.array. The two projects complement each other nicely, giving us near-optimal performance with intuitive code across a cluster.

Rechunk to Time Series by Pixel

We’re now going to rearrange our data from being partitioned by time slice, to being partitioned by pixel. This will allow us to run computations like Fast Fourier Transforms (FFTs) on each time series efficiently. Switching the chunk pattern back and forth like this is generally a very difficult operation for distributed arrays because every slice of the array contributes to every time-series. We have N-squared communication.

This analysis may not be appropriate for this data (we won’t learn any useful science from doing this), but it represents a very frequently asked question, so I wanted to include it.

Currently our Dask array has chunk shape (20, 1000, 1000), meaning that our data is collected into 500 NumPy arrays across the cluster, each of shape (20, 1000, 1000).

>>> recentered
dask.array<shape=(10000, 1000, 1000), dtype=uint8, chunksize=(20, 1000, 1000)>


But we want to change this shape so that the chunks cover the entire first axis. We want all data for any particular pixel to be in the same NumPy array, not spread across hundreds of different NumPy arrays. We could solve this by rechunking so that each pixel is its own block like the following:

>>> rechunked = recentered.rechunk((10000, 1, 1))


However this would result in one million chunks (there are one million pixels) which will result in a bit of scheduling overhead. Instead we’ll collect our time-series into 10 x 10 groups of one hundred pixels. This will help us to reduce overhead.

>>> # rechunked = recentered.rechunk((10000, 1, 1))  # Too many chunks
>>> rechunked = recentered.rechunk((10000, 10, 10))  # Use larger chunks
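The chunk counts above can be verified with a little arithmetic; this helper is an illustration, not part of Dask's API:

```python
from math import prod, ceil

# Number of chunks a given chunk shape produces for an array shape.
def n_chunks(shape, chunks):
    return prod(ceil(s / c) for s, c in zip(shape, chunks))

shape = (10000, 1000, 1000)
print(n_chunks(shape, (10000, 1, 1)))    # 1000000 chunks -- too many
print(n_chunks(shape, (10000, 10, 10)))  # 10000 chunks -- manageable
```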


Now we compute the FFT of each pixel, take the absolute value and square to get the power spectrum. Finally to conserve space we’ll down-grade the dtype to float32 (our original data is only 8-bit anyway).

import dask.array as da

x = da.fft.fft(rechunked, axis=0)
power = abs(x ** 2).astype('float32')

power = client.persist(power, optimize_graph=False)
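The per-pixel computation is an ordinary one-dimensional power spectrum. As a plain-NumPy sanity check (illustrative, with synthetic data rather than the microscopy images), the spectrum of a pure sine wave peaks at its frequency:

```python
import numpy as np

# Power spectrum of a sine with exactly 25 cycles over 1000 samples.
t = np.arange(1000)
series = np.sin(2 * np.pi * 25 * t / 1000)
power = np.abs(np.fft.fft(series)) ** 2

# Look only at the positive-frequency half of the spectrum.
peak = int(np.argmax(power[:500]))
print(peak)  # 25
```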


This is a fun profile to inspect; it includes both the rechunking and the subsequent FFTs. We’ve included a real-time trace during execution, the full profile, as well as some diagnostics plots from a single worker. These plots total up to around 20MB. I sincerely apologize to those without broadband access.

Here is a real time plot of the computation finishing over time:

And here is a single interactive plot of the entire computation after it completes. Zoom with the tools in the upper right. Hover over rectangles to get more information. Remember that red is communication.

Screenshots of the diagnostic dashboard of a single worker during this computation.

This computation starts with a lot of communication while we rechunk and realign our data (recent optimizations here by Antoine Pitrou in dask #417). Then we transition into doing thousands of small FFTs and other arithmetic operations. All of the plots above show a nice transition from heavy communication to heavy processing with some overlap each way (once some complex blocks are available we get to start overlapping communication and computation). Inter-worker communication was around 100-300 MB/s (typical for Amazon’s EC2) and CPU load remained high. We’re using our hardware.

Finally we can inspect the results. We see that the power spectrum is very boring in the corner, and has typical activity towards the center of the image.

plt.semilogy(1 + power[:, 0, 0].compute())


plt.semilogy(1 + power[:, 500, 500].compute())


Final Thoughts

This blogpost showed a non-trivial image processing workflow, emphasizing the following points:

1. Construct a Dask array from lazy SKImage calls.
2. Use NumPy syntax with Dask.array to aggregate distributed data across a cluster.
3. Build a centroid function with Numba. Use Numba and Dask together to clean up an image stack.
4. Rechunk to facilitate time-series operations. Perform FFTs.

Hopefully this example has components that look similar to what you want to do with your data on your hardware. We would love to see more applications like this out there in the wild.

What we could have done better

As always with all computationally focused blogposts we’ll include a section on what went wrong and what we could have done better with more time.

1. Communication is too expensive: Inter-worker communications that should be taking 200ms are taking up to 10 or 20 seconds. We need to take a closer look at our communications pipeline (which normally performs just fine on other computations) to see if something is acting up. Discussion here dask/distributed #776 and early work here dask/distributed #810.
2. Faulty load balancing: We discovered a case where our load-balancing heuristics misbehaved, incorrectly moving data between workers when it would have been better to leave everything alone. This is likely related to the oddly low bandwidth observed above.
3. Loading from disk blocks network I/O: While doing this we discovered an issue where loading large amounts of data from disk can block workers from responding to network requests (dask/distributed #774)
4. Larger datasets: It would be fun to try this on a much larger dataset to see how the solutions here scale.

January 12, 2017

Enthought

Webinar: An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop

Enthought’s Pandas Mastery Workshop is designed to accelerate the development of skill and confidence with Python’s Pandas data analysis package — in just three days, you’ll look like an old pro! This course was created from the ground up by our training experts, based on insights from the science of human learning as well as what we’ve learned from over a decade of practical experience teaching thousands of scientists, engineers, and analysts to use Python effectively in their everyday work.

In this webinar, we’ll give you the key information and insight you need to evaluate whether the Pandas Mastery Workshop is the right solution to advance your data analysis skills in Python, including:

• Who will benefit most from the course
• A guided tour through the course topics
• What skills you’ll take away from the course, and how the instructional design supports that
• What the experience is like, and why it is different from other training alternatives (with a sneak peek at actual course materials)
• What previous workshop attendees say about the course

Date and Registration Info:
January 26, 2017, 11-11:45 AM CT
Register (if you can’t attend, register and we’ll be happy to send you a recording of the session)


Presenter: Dr. Michael Connell, VP, Enthought Training Solutions

Ed.D, Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT

Why Focus on Pandas:

Python has been identified as the most popular coding language for five years in a row. One reason for its popularity, especially among data analysts, data scientists, engineers, and scientists across diverse industries, is its extensive library of powerful tools for data manipulation, analysis, and modeling. For anyone working with tabular data (perhaps currently using a tool like Excel, R, or SAS), Pandas is the go-to tool in Python that not only makes the bulk of your work easier and more intuitive, but also provides seamless access to more specialized packages like statsmodels (statistics), scikit-learn (machine learning), and matplotlib (data visualization). Anyone looking for an entry point into the general scientific and analytic Python ecosystem should start with Pandas!

Who Should Attend:

Whether you’re a training or learning development coordinator who wants to learn more about our training options and approach, a member of a technical team considering group training, or an individual looking for engaging and effective Pandas training, this webinar will help you quickly evaluate how the Pandas Mastery Workshop can meet your needs.

Upcoming Open Pandas Mastery Workshop Sessions:

London, UK, Feb 22-24
Chicago, IL, Mar 8-10
Albuquerque, NM, Apr 3-5
Washington, DC, May 10-12
Los Alamos, NM, May 22-24
New York City, NY, Jun 7-9

Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.

Continuum Analytics news

Continuum Analytics Appoints Scott Collison as Chief Executive Officer

Tuesday, January 17, 2017


AUSTIN, TEXAS—January 17, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced Scott Collison as the company’s new chief executive officer (CEO). Collison, a successful entrepreneur and former executive at VMware and Salesforce.com, also joins the Board of Directors to drive the strategy and operations of the company. Collison succeeds co-founder and fellow Board member, Travis Oliphant, who will shift his focus to accelerating innovation within the dynamic Open Data Science community and managing customer solutions as chief data scientist.

“During the past year, our company’s Anaconda product and services grew more than 100 percent. As the company’s co-founders, Peter Wang and I started to look for an executive to help us strategically guide our growth. We are delighted to welcome Scott; his entrepreneurial experience and open source background make him the perfect fit for our company’s mission,” said Oliphant. “My new role as chief data scientist frees me up to further our investment in open source technologies to advance the Open Data Science market and ensure customer success with Anaconda.”

Anaconda downloads from inception through the end of 2016 totaled more than 11 million, an increase of more than eight million from the previous year. The Python community is estimated at more than 30 million members and according to the most recent O’Reilly Data Science Survey, among data scientists, 72 percent prefer Python as their main tool.

“Continuum Analytics has experienced great success as evidenced by the millions of downloads, extraordinary product and services growth in 2016 and Anaconda becoming the de facto Open Data Science platform for tech giants including Intel, IBM, Cloudera and Microsoft,” said Collison. “The data science market opportunity is pushing the boundaries at $140 billion and I’m excited to join the company and capitalize on my previous experience to manage this explosive growth and support its continued momentum.” Collison previously held the position of vice president of Hybrid Platform at VMware and lifted the company’s high-growth cloud services business. Prior to that, he was vice president of Platform Go to Market at Salesforce.com. He was also instrumental in the sale of Signio (now a division of PayPal) to Verisign for $1.3 billion in 1999 and has held a variety of executive positions at both large software companies and startups, including Microsoft, SourceForge and Geeknet.

Scott Collison is a former Fulbright scholar and holds a Ph.D. from the University of California, Berkeley, a Master of Arts from the University of Freiburg (Germany) and a Bachelor of Arts from the University of Texas, Austin.

Join CEO Scott Collison and the Anaconda team at AnacondaCON 2017, Feb. 7-9th in Austin, Texas––the two-day event will bring together innovative enterprises on the journey to Open Data Science. Please register here to take advantage of our current two-for-one promotion and discounted hotel room block.

Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 11 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

Matthew Rocklin

Distributed Pandas on a Cluster with Dask DataFrames

This work is supported by Continuum Analytics the XDATA Program and the Data Driven Discovery Initiative from the Moore Foundation

Summary

Dask Dataframe extends the popular Pandas library to operate on big data-sets on a distributed cluster. We show its capabilities by running through common dataframe operations on a common dataset. We break up these computations into the following sections:

1. Introduction: Pandas is intuitive and fast, but needs Dask to scale
2. Read CSV and Basic operations
3. Basic Aggregations and Groupbys
4. Joins and Correlations
5. Shuffles and Time Series
6. Parquet I/O
7. Final thoughts
8. What we could have done better

Accompanying Plots

Throughout this post we accompany computational examples with profiles of exactly what task ran where on our cluster and when. These profiles are interactive Bokeh plots that include every task that every worker in our cluster runs over time. For example the following read_csv computation produces the following profile:

>>> df = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv')


If you are reading this through a syndicated website like planet.python.org or through an RSS reader then these plots will not show up. You may want to visit http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes directly.

Introduction

Pandas provides an intuitive, powerful, and fast data analysis experience on tabular data. However, because Pandas uses only one thread of execution and requires all data to be in memory at once, it doesn’t scale well to datasets much beyond the gigabyte scale. Generally people move to Spark DataFrames on HDFS or a proper relational database to resolve this scaling issue. Dask is a Python library for parallel and distributed computing that aims to fill this need for parallelism among the PyData projects (NumPy, Pandas, Scikit-Learn, etc.). Dask dataframes combine Dask and Pandas to deliver a faithful “big data” version of Pandas operating in parallel over a cluster.

CSV Data and Basic Operations

I have an eight node cluster on EC2 of m4.2xlarges (eight cores, 30GB RAM each). Dask is running on each node with one process per core.

We have the 2015 Yellow Cab NYC Taxi data as 12 CSV files on S3. We look at that data briefly with s3fs

>>> import s3fs
>>> s3 = s3fs.S3FileSystem()


This data is too large to fit into Pandas on a single computer. However, it can fit in memory if we break it up into many small pieces and load these pieces onto different computers across a cluster.

We connect a client to our Dask cluster, composed of one centralized dask-scheduler process and several dask-worker processes running on each of the machines in our cluster.

from dask.distributed import Client

client = Client('scheduler-address:8786')  # placeholder address of the dask-scheduler


And we load our CSV data using dask.dataframe which looks and feels just like Pandas, even though it’s actually coordinating hundreds of small Pandas dataframes. This takes about a minute to load and parse.

import dask.dataframe as dd

df = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv',
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
                 storage_options={'anon': True})
df = client.persist(df)


This cuts up our 12 CSV files on S3 into a few hundred blocks of bytes, each 64MB large. On each of these 64MB blocks we then call pandas.read_csv to create a few hundred Pandas dataframes across our cluster, one for each block of bytes. Our single Dask Dataframe object, df, coordinates all of those Pandas dataframes. Because we’re just using Pandas calls it’s very easy for Dask dataframes to use all of the tricks from Pandas. For example we can use most of the keyword arguments from pd.read_csv in dd.read_csv without having to relearn anything.

This data is about 20GB on disk or 60GB in RAM. It’s not huge, but is also larger than we’d like to manage on a laptop, especially if we value interactivity. The interactive image above is a trace over time of what each of our 64 cores was doing at any given moment. By hovering your mouse over the rectangles you can see that cores switched between downloading byte ranges from S3 and parsing those bytes with pandas.read_csv.
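The sizes quoted above are consistent with each other; a quick back-of-the-envelope check:

```python
# 20 GB of CSV on disk split into 64 MB byte blocks (numbers from the text).
total_bytes = 20 * 1024**3
block_bytes = 64 * 1024**2
print(total_bytes // block_bytes)  # 320 blocks -- "a few hundred"
```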

Our dataset includes every cab ride in the city of New York in the year of 2015, including when and where it started and stopped, a breakdown of the fare, etc.

>>> df.head()

VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2015-01-15 19:05:39 2015-01-15 19:23:42 1 1.59 -73.993896 40.750111 1 N -73.974785 40.750618 1 12.0 1.0 0.5 3.25 0.0 0.3 17.05
1 1 2015-01-10 20:33:38 2015-01-10 20:53:28 1 3.30 -74.001648 40.724243 1 N -73.994415 40.759109 1 14.5 0.5 0.5 2.00 0.0 0.3 17.80
2 1 2015-01-10 20:33:38 2015-01-10 20:43:41 1 1.80 -73.963341 40.802788 1 N -73.951820 40.824413 2 9.5 0.5 0.5 0.00 0.0 0.3 10.80
3 1 2015-01-10 20:33:39 2015-01-10 20:35:31 1 0.50 -74.009087 40.713818 1 N -74.004326 40.719986 2 3.5 0.5 0.5 0.00 0.0 0.3 4.80
4 1 2015-01-10 20:33:39 2015-01-10 20:52:58 1 3.00 -73.971176 40.762428 1 N -74.004181 40.742653 2 15.0 0.5 0.5 0.00 0.0 0.3 16.30

Basic Aggregations and Groupbys

As a quick exercise, we compute the length of the dataframe. When we call len(df) Dask.dataframe translates this into many len calls on each of the constituent Pandas dataframes, followed by communication of the intermediate results to one node, followed by a sum of all of the intermediate lengths.

>>> len(df)
146112989


This takes around 400-500ms. You can see that a few hundred length computations happened quickly on the left, followed by some delay, then a bit of data transfer (the red bar in the plot), and a final summation call.
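The tree of len calls described above is a simple map-reduce; here is a toy sketch with plain Python lists standing in for the constituent Pandas dataframes (illustrative only):

```python
# Three "partitions" standing in for the per-worker Pandas dataframes.
partitions = [list(range(30)), list(range(50)), list(range(20))]

# "Map" step: each worker computes len() of its local partition.
partial_lengths = [len(p) for p in partitions]

# "Reduce" step: one node sums the intermediate results.
total = sum(partial_lengths)
print(total)  # 100
```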

More complex operations like simple groupbys look similar, although sometimes with more communication. Throughout this post we’re going to do more and more complex computations and our profiles will similarly become more and more rich with information. Here we compute the average trip distance, grouped by number of passengers. We find that single and double person rides go far longer distances on average. We achieve this one big-data-groupby by performing many small Pandas groupbys and then cleverly combining their results.

>>> df.groupby(df.passenger_count).trip_distance.mean().compute()
passenger_count
0     2.279183
1    15.541413
2    11.815871
3     1.620052
4     7.481066
5     3.066019
6     2.977158
9     5.459763
7     3.303054
8     3.866298
Name: trip_distance, dtype: float64
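Combining the small per-partition groupbys correctly, as described above, requires carrying partial sums and counts rather than partial means; a minimal sketch of the idea with made-up data (not the taxi dataset):

```python
from collections import defaultdict

# Per-partition (passenger_count, trip_distance) pairs -- made-up numbers.
partitions = [
    [(1, 2.0), (1, 4.0), (2, 10.0)],
    [(1, 6.0), (2, 20.0)],
]

sums = defaultdict(float)
counts = defaultdict(int)
for part in partitions:          # per-partition partial aggregates
    for key, value in part:
        sums[key] += value
        counts[key] += 1

# Final combine step: means only computed once all partials are merged.
means = {k: sums[k] / counts[k] for k in sums}
print(means)  # {1: 4.0, 2: 15.0}
```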


As a more complex operation we see how well New Yorkers tip by hour of day and by day of week.

df2 = df[(df.tip_amount > 0) & (df.fare_amount > 0)]    # filter out bad rows
df2['tip_fraction'] = df2.tip_amount / df2.fare_amount  # make new column

dayofweek = (df2.groupby(df2.tpep_pickup_datetime.dt.dayofweek)
                .tip_fraction
                .mean())
hour      = (df2.groupby(df2.tpep_pickup_datetime.dt.hour)
                .tip_fraction
                .mean())


We see that New Yorkers are generally pretty generous, tipping around 20%-25% on average. We also notice that they become very generous at 4am, tipping an average of 38%.

This more complex operation uses more of the Dask dataframe API (which mimics the Pandas API). Pandas users should find the code above fairly familiar. We remove rows with zero fare or zero tip (not every tip gets recorded), make a new column which is the ratio of the tip amount to the fare amount, and then groupby the day of week and hour of day, computing the average tip fraction for each hour/day.

Dask evaluates this computation with thousands of small Pandas calls across the cluster (try clicking the wheel zoom icon in the upper right of the image above and zooming in). The answer comes back in about 3 seconds.

Joins and Correlations

To show off more basic functionality we’ll join this Dask dataframe against a smaller Pandas dataframe that includes names of some of the more cryptic columns. Then we’ll correlate two derived columns to determine if there is a relationship between paying Cash and the recorded tip.

>>> payments = pd.Series({1: 'Credit Card',
                          2: 'Cash',
                          3: 'No Charge',
                          4: 'Dispute',
                          5: 'Unknown',
                          6: 'Voided trip'})

>>> df2 = df.merge(payments, left_on='payment_type', right_index=True)
>>> df2.groupby(df2.payment_name).tip_amount.mean().compute()
payment_name
Cash           0.000217
Credit Card    2.757708
Dispute       -0.011553
No charge      0.003902
Unknown        0.428571
Name: tip_amount, dtype: float64


We see that while the average tip for a credit card transaction is $2.75, the average tip for a cash transaction is very close to zero. At first glance it seems like cash tips aren’t being reported. To investigate this a bit further let’s compute the Pearson correlation between paying cash and having zero tip. Again, this code should look very familiar to Pandas users.

zero_tip = df2.tip_amount == 0
cash = df2.payment_name == 'Cash'

dd.concat([zero_tip, cash], axis=1).corr().compute()

              tip_amount  payment_name
tip_amount      1.000000      0.943123
payment_name    0.943123      1.000000


So we see that standard operations like row filtering, column selection, groupby-aggregations, joining with a Pandas dataframe, correlations, etc. all look and feel like the Pandas interface. Additionally, we’ve seen through profile plots that most of the time is spent just running Pandas functions on our workers, so Dask.dataframe is, in most cases, adding relatively little overhead. These little functions represented by the rectangles in these plots are just pandas functions. For example the plot above has many rectangles labeled merge if you hover over them. This is just the standard pandas.merge function that we love and know to be very fast in memory.

Shuffles and Time Series

Distributed dataframe experts will know that none of the operations above require a shuffle. That is, we can do most of our work with relatively little inter-node communication. However not all operations can avoid communication like this and sometimes we need to exchange most of the data between different workers. For example if our dataset is sorted by customer ID but we want to sort it by time then we need to collect all the rows for January over to one Pandas dataframe, all the rows for February over to another, etc. This operation is called a shuffle and is the basis of computations like groupby-apply, distributed joins on columns that are not the index, etc.
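The Pearson correlation between two boolean indicators, as in the zero-tip/cash comparison above, can be reproduced on toy data with plain NumPy (the numbers below are made up, not the taxi data):

```python
import numpy as np

# Made-up per-ride indicators: did the ride have zero tip? was it cash?
zero_tip = np.array([1, 1, 0, 0, 1, 1, 0, 1], dtype=float)
cash     = np.array([1, 1, 0, 0, 1, 0, 0, 1], dtype=float)

# Pearson correlation matrix; off-diagonal entry is the correlation.
r = np.corrcoef(zero_tip, cash)[0, 1]
print(round(r, 4))  # about 0.775 for this toy data
```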
You can do a lot with dask.dataframe without performing shuffles, but sometimes it’s necessary. In the following example we sort our data by pickup datetime. This will allow fast lookups, fast joins, and fast time series operations, all common cases. We do one shuffle ahead of time to make all future computations fast.

We set the index as the pickup datetime column. This takes anywhere from 25-40s and is largely network bound (60GB, some text, eight machines with eight cores each on AWS non-enhanced network). This also requires running something like 16000 tiny tasks on the cluster. It’s worth zooming in on the plot below.

>>> df = client.persist(df.set_index('tpep_pickup_datetime'))


This operation is expensive, far more expensive than it was with Pandas when all of the data was in the same memory space on the same computer. This is a good time to point out that you should only use distributed tools like Dask.dataframe and Spark after tools like Pandas break down. We should only move to distributed systems when absolutely necessary. However, when it does become necessary, it’s nice knowing that Dask.dataframe can faithfully execute Pandas operations, even if some of them take a bit longer.

As a result of this shuffle our data is now nicely sorted by time, which will keep future operations close to optimal. We can see how the dataset is sorted by pickup time by quickly looking at the first entries, last entries, and entries for a particular day.
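The shuffle that set_index performs can be pictured as routing every row to the partition that owns its key range, after which each partition sorts locally; a toy pure-Python sketch (not Dask's actual implementation):

```python
# Made-up (pickup_date, payload) rows arriving in arbitrary order.
rows = [('2015-02-03', 'b'), ('2015-01-15', 'a'),
        ('2015-01-02', 'c'), ('2015-02-20', 'd')]

# Route each row to the partition that owns its month.
partitions = {}
for date, payload in rows:
    partitions.setdefault(date[:7], []).append((date, payload))

# Each partition can now be sorted locally, independently of the others.
for month in partitions:
    partitions[month].sort()
print(sorted(partitions))  # ['2015-01', '2015-02']
```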
>>> df.head()  # has the first entries of 2015

VendorID tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
tpep_pickup_datetime
2015-01-01 00:00:00  2 2015-01-01 00:00:00 3 1.56 -74.001320 40.729057 1 N -74.010208 40.719662 1 7.5 0.5 0.5 0.0 0.0 0.3 8.8
2015-01-01 00:00:00  2 2015-01-01 00:00:00 1 1.68 -73.991547 40.750069 1 N 0.000000 0.000000 2 10.0 0.0 0.5 0.0 0.0 0.3 10.8
2015-01-01 00:00:00  1 2015-01-01 00:11:26 5 4.00 -73.971436 40.760201 1 N -73.921181 40.768269 2 13.5 0.5 0.5 0.0 0.0 0.0 14.5

>>> df.tail()  # has the last entries of 2015

VendorID tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
tpep_pickup_datetime
2015-12-31 23:59:56  1 2016-01-01 00:09:25 1 1.00 -73.973900 40.742893 1 N -73.989571 40.750549 1 8.0 0.5 0.5 1.85 0.0 0.3 11.15
2015-12-31 23:59:58  1 2016-01-01 00:05:19 2 2.00 -73.965271 40.760281 1 N -73.939514 40.752388 2 7.5 0.5 0.5 0.00 0.0 0.3 8.80
2015-12-31 23:59:59  2 2016-01-01 00:10:26 1 1.96 -73.997559 40.725693 1 N -74.017120 40.705322 2 8.5 0.5 0.5 0.00 0.0 0.3 9.80

>>> df.loc['2015-05-05'].head()  # has the entries for just May 5th

VendorID tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
tpep_pickup_datetime
2015-05-05  2 2015-05-05 00:00:00 1 1.20 -73.981941 40.766460 1 N -73.972771 40.758007 2 6.5 1.0 0.5 0.00 0.00 0.3 8.30
2015-05-05  1 2015-05-05 00:10:12 1 1.70 -73.994675 40.750507 1 N -73.980247 40.738560 1 9.0 0.5 0.5 2.57 0.00 0.3 12.87
2015-05-05  1 2015-05-05 00:07:50 1 2.50 -74.002930 40.733681 1 N -74.013603 40.702362 2 9.5 0.5 0.5 0.00 0.00 0.3 10.80

Because we know exactly which Pandas dataframe holds which data we can execute row-local queries like this very quickly. The total round trip from pressing enter in the interpreter or notebook is about 40ms. For reference, 40ms is the delay between two frames in a movie running at 25 Hz. This means that it’s fast enough that human users perceive this query to be entirely fluid.

Time Series

Additionally, once we have a nice datetime index all of Pandas’ time series functionality becomes available to us. For example we can resample by day:

>>> (df.passenger_count
       .resample('1d')
       .mean()
       .compute()
       .plot())


We observe a strong periodic signal here. The number of passengers is reliably higher on the weekends.

We can perform a rolling aggregation in about a second:

>>> s = client.persist(df.passenger_count.rolling(10).mean())


Because Dask.dataframe inherits the Pandas index all of these operations become very fast and intuitive.

Parquet

Pandas’ standard “fast” recommended storage solution has generally been the HDF5 data format. Unfortunately the HDF5 file format is not ideal for distributed computing, so most Dask dataframe users have had to switch down to CSV historically. This is unfortunate because CSV is slow, doesn’t support partial queries (you can’t read in just one column), and also isn’t supported well by the other standard distributed Dataframe solution, Spark. This makes it hard to move data back and forth.

Fortunately there are now two decent Python readers for Parquet, a fast columnar binary store that shards nicely on distributed data stores like the Hadoop File System (HDFS, not to be confused with HDF5) and Amazon’s S3.
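The partial-query advantage of columnar formats like Parquet comes from their layout; a toy illustration of row versus column storage (made-up data, not Parquet's actual encoding):

```python
# The same three records in a row layout and a columnar layout.
row_store = [(1, 2.5, 'N'), (2, 3.3, 'Y'), (1, 0.5, 'N')]
column_store = {'passenger_count': [1, 2, 1],
                'trip_distance': [2.5, 3.3, 0.5],
                'flag': ['N', 'Y', 'N']}

# Row layout: must touch every tuple to extract one field.
counts_from_rows = [r[0] for r in row_store]

# Columnar layout: one contiguous list, other fields never touched.
counts_from_columns = column_store['passenger_count']
print(counts_from_rows == counts_from_columns)  # True
```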
The already fast Parquet-cpp project has been growing Python and Pandas support through Arrow, and the Fastparquet project, which is an offshoot from the pure-python parquet library, has been growing speed through use of NumPy and Numba.

Using Fastparquet under the hood, Dask.dataframe users can now happily read and write to Parquet files. This increases speed, decreases storage costs, and provides a shared format that both Dask dataframes and Spark dataframes can understand, improving the ability to use both computational systems in the same workflow.

Writing our Dask dataframe to S3 can be as simple as the following:

df.to_parquet('s3://dask-data/nyc-taxi/tmp/parquet')


However there are also a variety of options we can use to store our data more compactly through compression, encodings, etc. Expert users will probably recognize some of the terms below.

df = df.astype({'VendorID': 'uint8',
                'passenger_count': 'uint8',
                'RateCodeID': 'uint8',
                'payment_type': 'uint8'})

df.to_parquet('s3://dask-data/nyc-taxi/tmp/parquet',
              compression='snappy',
              has_nulls=False,
              object_encoding='utf8',
              fixed_text={'store_and_fwd_flag': 1})


We can then read our nicely indexed dataframe back with the dd.read_parquet function:

>>> df2 = dd.read_parquet('s3://dask-data/nyc-taxi/tmp/parquet')


The main benefit here is that we can quickly compute on single columns. The following computation runs in around 6 seconds, even though we don’t have any data in memory to start (recall that we started this blogpost with a minute-long call to read_csv and client.persist).

>>> df2.passenger_count.value_counts().compute()
1    102991045
2     20901372
5      7939001
3      6135107
6      5123951
4      2981071
0        40853
7          239
8          181
9          169
Name: passenger_count, dtype: int64


Final Thoughts

With the recent addition of faster shuffles and Parquet support, Dask dataframes become significantly more attractive. This blogpost gave a few categories of common computations, along with precise profiles of their execution on a small cluster.
Hopefully people find this combination of Pandas syntax and scalable computing useful. Now would also be a good time to remind people that Dask dataframe is only one module among many within the Dask project. Dataframes are nice, certainly, but Dask's main strength is its flexibility to move beyond just plain dataframe computations to handle even more complex problems.

Learn More

If you'd like to learn more about Dask dataframe, the Dask distributed system, or other components you should look at the following documentation: The workflows presented here are captured in the following notebooks (among other examples):

What we could have done better

As always with computational posts, we include a section on what went wrong, or what could have gone better.

1. The 400ms computation of len(df) is a regression from previous versions, where this was closer to 100ms. We're getting bogged down somewhere in many small inter-worker communications.
2. It would be nice to repeat this computation at a larger scale. Dask deployments in the wild are often closer to 1000 cores rather than the 64-core cluster we have here, and datasets are often on the terabyte scale rather than our 60 GB NYC Taxi dataset. Unfortunately, representative large open datasets are hard to find.
3. The Parquet timings are nice, but there is still room for improvement. We seem to be making many small, expensive queries of S3 when reading Thrift headers.
4. It would be nice to support both Python Parquet readers: the Numba solution fastparquet and the C++ solution parquet-cpp.

January 10, 2017

Fabian Pedregosa

Optimization inequalities cheatsheet

Most proofs in optimization consist in using inequalities for a particular function class in some creative way. This is a cheatsheet with the inequalities that I use most often. It considers the classes of functions that are convex, strongly convex, and $L$-smooth.

Setting. $f$ is a function $\mathbb{R}^p \to \mathbb{R}$.
Below is a set of inequalities that hold when $f$ belongs to a particular class of functions and $x, y \in \mathbb{R}^p$ are arbitrary elements in its domain.

$f$ is $L$-smooth. This is the class of differentiable functions whose gradient is Lipschitz continuous with constant $L$.

• $\|\nabla f(y) - \nabla f(x)\| \leq L\|x - y\|$
• $|f(x) - f(y) - \langle \nabla f(x), y - x\rangle| \leq \frac{L}{2}\|y - x\|^2$
• $\|\nabla^2 f(x)\| \leq L \qquad \text{(assuming $f$ is twice differentiable)}$

$f$ is convex.

• $f(x) \leq f(y) + \langle \nabla f(x), x - y\rangle$
• $0 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$
• $f(\mathbb{E}X) \leq \mathbb{E}[f(X)]$, where $X$ is a random variable (Jensen's inequality).

$f$ is both $L$-smooth and convex:

• $\frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$
• $0 \leq f(y) - f(x) - \langle \nabla f(x), y - x\rangle \leq \frac{L}{2}\|x - y\|^2$
• $f(x) \leq f(y) + \langle \nabla f(x), x - y\rangle - \frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|^2$

$f$ is $\mu$-strongly convex. The set of functions $f$ such that $f - \frac{\mu}{2}\|\cdot\|^2$ is convex. With $\mu = 0$ this reduces to the set of convex functions.

• $f(x) \leq f(y) + \langle \nabla f(x), x - y\rangle - \frac{\mu}{2}\|x - y\|^2$
• $\mu\|x - y\|^2 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$
• $\frac{\mu}{2}\|x - x^*\|^2 \leq f(x) - f(x^*)$, where $x^*$ is the minimizer of $f$.

$f$ is both $L$-smooth and $\mu$-strongly convex.

• $\frac{\mu L}{\mu + L}\|x - y\|^2 + \frac{1}{\mu + L}\|\nabla f(x) - \nabla f(y)\|^2 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$

References

Most of these inequalities appear in the book "Introductory Lectures on Convex Optimization: A Basic Course" by Yurii Nesterov (Springer Science & Business Media, 2013). Another good source (and freely available for download) is the book "Convex Optimization" by Stephen Boyd and Lieven Vandenberghe.
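Inequalities like these are easy to sanity-check numerically before using them in a proof. Here is a quick sketch for the one-dimensional quadratic $f(x) = \frac{a}{2}x^2$, which is $L$-smooth and $\mu$-strongly convex with $L = \mu = a$ (a toy choice of mine for illustration, not from the cheatsheet itself):

```python
import random

a = 3.0                  # f(x) = (a/2) x^2, so f'(x) = a x
L = mu = a

def f(x):
    return 0.5 * a * x * x

def grad(x):
    return a * x

random.seed(0)
for _ in range(1000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    # L-smoothness: |f'(x) - f'(y)| <= L |x - y|
    assert abs(grad(x) - grad(y)) <= L * abs(x - y) + 1e-9
    # co-coercivity (L-smooth + convex)
    assert (grad(x) - grad(y)) ** 2 / L <= (grad(x) - grad(y)) * (x - y) + 1e-9
    # strong convexity: quadratic lower bound with curvature mu
    assert f(y) >= f(x) + grad(x) * (y - x) + 0.5 * mu * (y - x) ** 2 - 1e-9
print("all inequalities hold")
```

For a quadratic, several of these hold with equality, which is a good reminder that they are tight.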
Titus Brown How I learned to stop worrying and love the coming archivability crisis in scientific software Note: This is the fifth post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future. This post was put together after the event and benefited greatly from conversations with Victoria Stodden, Yolanda Gil, Monya Baker, Gail Peretsman-Clement, and Kristin Antelman! Archivability is a disaster in the software world In The talk I didn't give at Caltech, I pointed out that our current software stack is connected, brittle, and non-repeatable. This affects our ability to store and recover science from archives. Basically, in our lab, we find that our executable papers routinely break over time because of minor changes to dependent packages or libraries. Yes, the software stack is constantly changing! Why?! Let me back up -- Our analysis routines usually depend on an extensive hierarchy of packages. We may be writing bespoke scripts on top of our own library, but those scripts and that library sit on top of other libraries, which in turn use the Python language, the GNU ecosystem, Linux, and a bunch of firmware. All of this rests on a not-always-that-sane hardware implementation that occasionally throws up errors because x was compiled on y processor but is running on z processor. We've had every part of this stack cause problems for us. Three examples: • many current repeatability stacks are starting to rely on Docker. But Docker changes routinely, and it's not at all clear that the images you save today will work tomorrow. Dockerfiles (which provide the instructions for building images) should be more robust, but there is a tendency to have Dockerfiles run complex shell scripts that may themselves break due to software stack changes. But the bigger problem is that Docker just isn't that robust. Don't believe me? For more, read this and weep: Docker in Production: A history of Failure. 
• software stacks are astoundingly complex in ways that are mostly hidden when things are working (i.e. in the moment) but that block any kind of longitudinal robustness. Perhaps the best illustration of this in recent time is the JavaScript debacle where the author of "left-pad" pulled it from the packaging system, throwing the JavaScript world into temporary insanity. • practically, we can already see the problem - go sample from A gallery of interesting Jupyter Notebooks. Pick five. Try to run them. Try to install the stuff needed to run them. Weep in despair. (This is also true of mybinder repos, just in case you're wondering; many of my older ones simply don't work, for a myriad of reasons.) These are big, real issues that affect any scientific software that relies on any code written outside their project (which is everyone - see "Linux operating system" and/or "firmware" above.) My conclusion is that, on a decadal time scale, we cannot rely on software to run repeatably. This connects to two other important issues. First, since data implies software, we're rapidly moving into a space where the long tail of data is going to become useless because the software needed to interpret it is vanishing. (We're already seeing this with 454 sequence data, which is less than 10 years old; very few modern bioinformatics tools will ingest it, but we have an awful lot of it in the archives.) Second, it's not clear to me that we'll actually know if the software is running robustly, which is far worse than simply having it break. (The situation above with Jupyter Notebooks is hence less problematic than the subtle changes in behavior that will come from e.g. Python 5.0 fixing behavioral bugs that our code relied on in Python 3.) I expect that in situations where language specs have changed, or library bugs have been fixed, there will simply be silent changes in output. Detecting this behavior is hard. 
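One pragmatic (if partial) guard against such silent drift is to record checksums of known-good outputs and compare against them on every run. A stdlib-only sketch, where the file contents and digest are made up for illustration:

```python
import hashlib
import os
import tempfile

def md5_of_file(path):
    """Return the hex md5 digest of a file, streamed in chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as fp:
        for chunk in iter(lambda: fp.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

# Digest recorded from a known-good run of the pipeline
# (here the "pipeline output" is just the bytes b'foobar').
EXPECTED = '3858f62230ac3c915f300c664312c63f'

with tempfile.NamedTemporaryFile(delete=False) as fp:
    fp.write(b'foobar')
    path = fp.name

# The regression check: today's output must hash to yesterday's digest.
assert md5_of_file(path) == EXPECTED
print("output matches known-good digest")
os.remove(path)
```

This catches bit-level changes but says nothing about whether a changed output is wrong, which is exactly the limitation discussed below.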
(In our own khmer project, we've started including tests that compare the md5sum of running the software on small data sets to stored md5sums, which gets us part of the way there, but is by no means sufficient.) If archivability is a problem, what's the solution? So I think we're heading towards a future where even perfectly repeatable research will not have any particular longevity, unless it's constantly maintained and used (which is unrealistic for most research software - heck, we can't even maintain the stuff we're using right now this very instant.) Are there any solutions? First, some things that definitely aren't solutions: • Saving ALL THE SOFTWARE is not a solution; you simply can't, because of the reliance on software/firmware/hardware interactions. • Blobbing it all up in a gigantic virtual machine image simply pushes the turtle one stack frame down: now you've got to worry about keeping VM images running consistently. I suppose it's possible but I don't expect to see people converge on this solution anytime soon. More importantly, VMs and docker images may let you reach bitwise reproducibility, but they're not scientifically useful because they're big black boxes that don't really let you reuse or remix the contents; see Virtual machines considered harmful for reproducibility and The post-apocalyptic world of binary containers. • Not using or relying on other software isn't a practical solution: first, good luck with that ;). Second, see "firmware", above. And, third, while there is definitely a crowd of folk who like to reimplement everything themselves, there is every likelihood that their code is wronger and/or buggier than widely used community software; Gael Varoquaux makes this point very well in his blog post, Software for reproducible science. I don't think trading archivability for incorrectness is a good trade :). The two solutions that I do see are these: • run everything all the time. 
This is essentially what the software world does with continuous integration. They run all their tests and pipelines all the time, just to check that it's all working. (See "Continuous integration at Google Scale".) Recently, my #MooreData colleagues Brett Beaulieu-Jones and Casey Greene proposed exactly this for scientific papers, in their preprint "Reproducible Computational Workflows with Continuous Analysis". While this is a potential solution, it's rather heavyweight to set up, and (more importantly) it gets kind of expensive -- Google runs many compute-years of code each day -- and I worry that the cost-to-utility ratio is not in science's favor. This is especially true when you consider that most research ends up being a dead end - unread, uncited, and unimportant - but of course you don't know which until much later...

• acknowledge that exact repeatability has a half-life of utility, and that this is OK.

I've only just started thinking about this in detail, but it is at least plausible to argue that we don't really care about our ability to exactly re-run a decade-old computational analysis. What we do care about is our ability to figure out what was run and what the important decisions were -- something that Yolanda Gil refers to as "inspectability." But exact repeatability has a short shelf-life. This has a couple of interesting implications that I'm just starting to unpack mentally:

• maybe repeatability for science's sake can be thought of as a short-term aid in peer review, to make sure that the methods are suitably explicit and not obviously incorrect. (Another use for exact repeatability is enabling reuse and remixing, of course, which is important for scientific progress.)
• as we already knew, closed source software is useless crap because it satisfies neither repeatability nor inspectability. But maybe it's not that important (for inspectability) to allow software reuse with an F/OSS license? (That license is critical for reuse and remixing, though.)
• maybe we could and should think of articulating "half lives" for research products, and acknowledge explicitly that most research won't pass the test of time. • but perhaps this last point is a disaster for the kind of serendipitous reuse of old data that Christie Bahlai and Amanda Whitmire have convinced me is important. Huge (sincere) thanks to Gail for arguing both sides of this, including saying that (a) archive longevity is super important because everything has to be saved or else it's a disaster for humanity, and (b) maybe we don't care about saving everything because after all we can still read Latin even if we don't actually get the full cultural context and don't know how to pronounce the words, and (c) hey maybe the full cultural context is important and we should endeavor to save it all after all. <exasperation>Librarians!</exasperation> Lots for me to think on. --titus The talk I didn't give at Caltech (Paper of the Future) Note: This is the fourth post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future. This is an outline of the talk I didn't give at Caltech, because I decided that Victoria Stodden and Yolanda Gil were going to cover most of it and I would rather talk about a random collection of things that they might not talk about. (I think I was 7 for 10 on that. ;) This is in outline-y form, but I think it's fairly understandable. Ask questions in the comments if not! What will the paper of the future look like? A few assertions about the scientific paper of the future: • The paper of the future will be open - open access, open data, and open source. • The paper of the future will be highly repeatable. • The paper of the future will be linked. • The paper of the future will not depend on expensive infrastructure. • The paper of the future will be commonplace. • The paper of the future will be archivable (or will it? Read on.) What's our experience with the paper of the future been? 
My lab (and many, many others) have been doing things like: • Automating the entire analysis from raw data to conclusion. • Publishing data narratives and notebooks. • Using version control for paper and data notebook and source code. • Anointing data sets with DOIs. • Posting virtual environments & execution specifications for papers. We've been doing parts of this for many years, and while we're not always that systematic about certain parts, I can say that everything works fairly smoothly. The biggest issues we have often seem to be about the small details, such as choice of workflow engine, whether we're using AWS or an HPC as our "reference location" to run stuff, etc. From this experience, I see two problems: The two big problems I see • Adoption! We need community use & experience & training; we also need funder and journal buy-in. The training aspect is what Software Carpentry and Data Carpentry focus on, and it's one of the reasons I'm involved with them. • Archivability! Our software stack is anything but robust, static, or archivable. This is a huge problem that I don't think is accorded enough attention. This last issue, archivability, is both somewhat technical and important - so I decided to move that to a new blog post, "How I learned to stop worrying and love the coming archivability crisis in scientific software". Concluding thoughts In which I summarize the above :) --titus Enthought Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way Data exploration, manipulation, and visualization start with loading data, be it from files or from a URL. Pandas has become the go-to library for all things data analysis in Python, but if your intention is to jump straight into data exploration and manipulation, the Canopy Data Import Tool can help, instead of having to learn the details of programming with the Pandas library. 
The Data Import Tool leverages the power of Pandas while providing an interactive UI, allowing you to visually explore and experiment with the DataFrame (the Pandas equivalent of a spreadsheet or a SQL table), without having to know the details of the Pandas-specific function calls and arguments. The Data Import Tool keeps track of all of the changes you make (in the form of Python code). That way, when you are done finding the right workflow for your data set, the Tool has a record of the series of actions you performed on the DataFrame, and you can apply them to future data sets for even faster data wrangling in the future. At the same time, the Tool can help you pick up how to use the Pandas library, while still getting work done. For every action you perform in the graphical interface, the Tool generates the appropriate Pandas/Python code, allowing you to see and relate the tasks to the corresponding Pandas code. With the Data Import Tool, loading data is as simple as choosing a file or pasting a URL. If a file is chosen, it automatically determines the format of the file, whether or not the file is compressed, and intelligently loads the contents of the file into a Pandas DataFrame. It does so while taking into account various possibilities that often throw a monkey wrench into initial data loading: that the file might contain lines that are comments, it might contain a header row, the values in different columns could be of different types e.g. DateTime or Boolean, and many more possibilities as well. The Data Import Tool makes loading data into a Pandas DataFrame as simple as choosing a file or pasting a URL. 
A Glimpse into Loading Data into Pandas DataFrames (The Hard Way) The following 4 “inconvenience” examples show typical problems (and the manual solutions) that might arise if you are writing Pandas code to load data, which are automatically solved by the Data Import Tool, saving you time and frustration, and allowing you to get to the important work of data analysis more quickly. Let’s say you were to load data from the file by yourself. After searching the Pandas documentation a bit, you will come across the pandas.read_table function which loads the contents of a file into a Pandas DataFrame. But it’s never so easy in practice: pandas.read_table and other functions you might find assume certain defaults, which might be at odds with the data in your file. Inconvenience #1: Data in the first row will automatically be used as a header. Let’s say that your file (like this one: [wind.data]) uses whitespace as the separator between columns and doesn’t have a row containing column names. pandas.read_table assumes by default that your file contains a header row and uses tabs for delimiters. If you don’t tell it otherwise, Pandas will use the data from the first row in your file as column names, which is clearly wrong in this case. 
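That default is easy to see with a tiny in-memory sample (made-up numbers; pandas.read_csv with sep='\s+' behaves the same way as the read_table calls discussed here):

```python
import io
import pandas as pd

raw = "61 1 1 15.04 14.96\n61 1 2 14.71 16.88\n"

# Default: the first data row is (wrongly) promoted to column names.
bad = pd.read_csv(io.StringIO(raw), sep=r'\s+')

# header=None: pandas numbers the columns and keeps every row as data.
good = pd.read_csv(io.StringIO(raw), sep=r'\s+', header=None)

print(bad.shape)    # (1, 5): one row "lost" to the header
print(good.shape)   # (2, 5)
```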
From the docs, you can discover that this behavior can be turned off by passing header=None to pandas.read_table, and that sep='\s+' tells it to use (varying amounts of) whitespace as the separator. Without the header=None kwarg, you can see that the first row of data is being treated as column names:

In [1]: df = pandas.read_table('wind.data', sep='\s+')
In [2]: df.head()
Out[2]:
   61  1  1.1  15.04  14.96  13.17   9.29  13.96  9.87  13.67  10.25  10.83  \
0  61  1    2  14.71  16.88  10.83   6.50  12.62  7.67  11.50  10.04   9.79
1  61  1    3  18.50  16.88  12.33  10.13  11.17  6.17  11.25   8.04   8.50

   12.58  18.50  15.04.1
0   9.67  17.54    13.83
1   7.67  12.75    12.71

After we tell Pandas that the file does not contain a row of column names (header=None) and specify the separator, we get the behavior we expected:

In [3]: df = pandas.read_table('wind.data', header=None, sep='\s+')
In [4]: df.head()
Out[4]:
    0  1  2      3      4      5      6      7     8      9     10     11  \
0  61  1  1  15.04  14.96  13.17   9.29  13.96  9.87  13.67  10.25  10.83
1  61  1  2  14.71  16.88  10.83   6.50  12.62  7.67  11.50  10.04   9.79

      12     13     14
0  12.58  18.50  15.04
1   9.67  17.54  13.83

[File : test_data_comments.txt]

Inconvenience #2: Commented lines cause the data load to fail.

Next, let's say that your file contains commented lines which start with a #. Pandas doesn't understand this by default, and trying to load the data into a DataFrame will either fail with an error or, worse, succeed without notifying you that one row in the DataFrame might contain erroneous data from the commented line. (This might also prevent correct inference of column types.)
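Both failure modes are easy to reproduce with an in-memory buffer (made-up sample rows; in recent pandas the error class is pandas.errors.ParserError, where older versions said CParserError):

```python
import io
import pandas as pd

raw = ("# this file was generated on 2015-01-01\n"
       "1,False,one\n"
       "2,True,two\n")

# Without comment='#', the comment line is parsed as data; here it even
# fails outright because it has a different number of fields.
try:
    pd.read_csv(io.StringIO(raw), sep=',', header=None)
except pd.errors.ParserError as err:
    print("ParserError:", err)

# Telling pandas about the comment character skips those lines entirely.
df = pd.read_csv(io.StringIO(raw), sep=',', header=None, comment='#')
print(df.shape)   # (2, 3)
```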
Again, you can tell pandas.read_table that commented lines exist in your file and to skip them using comment='#':

In [1]: df = pandas.read_table('test_data_comments.txt', sep=',', header=None)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-10-b5cd8eee4851> in <module>()
----> 1 df = pandas.read_table('test_data_comments.txt', sep=',', header=None)
(traceback)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 5

As mentioned earlier, if you are lucky, Pandas will fail with a CParserError, complaining that each row contains a different number of columns in the data file. Needless to say, it's not obvious that the culprit is an unidentified comment line. We can read the file contents correctly when we tell Pandas that '#' is the character that commented lines in the file start with:

In [2]: df = pandas.read_table('test_data_comments.txt', sep=',', comment='#', header=None)
In [3]: df
Out[3]:
   0      1    2      3           4
0  1  False  1.0    one  2015-01-01
1  2   True  2.0    two  2015-01-02
2  3  False  3.0  three  2015-01-03
3  4   True  4.0   four  2015-01-04

[File : ncaa_basketball_2016.txt]

Inconvenience #3: Fixed-width formatted data will cause the data load to fail.

Now let's say that your file contains data in a fixed-width format. Trying to load this data using pandas.read_table will fail. Dig around a little and you will come across the function pandas.read_fwf, which is the suggested way to load data from fixed-width files, not pandas.read_table.

In [1]: df = pandas.read_table('ncaa_basketball_2016.txt', header=None)
In [2]: df.head()
Out[2]:
                                                 0
0  2016-02-25  @Ark Little Rock   72  UT Ar...
1  2016-02-25  ULM                66  @South...

Those of you familiar with Pandas will recognize that the above DataFrame, created from the file, contains only one column, labelled 0. This is clearly wrong, because there are 4 distinct columns in the file.
In [3]: df = pandas.read_table('ncaa_basketball_2016.txt', header=None, sep='\s+')
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-28-db4f2f128b37> in <module>()
----> 1 df = pandas.read_table('ncaa_basketball_2016.txt', header=None, sep='\s+')
(Traceback)
CParserError: Error tokenizing data. C error: Expected 8 fields in line 55, saw 9

If we didn't know better, we would've assumed that the delimiter/separator character used in the file was whitespace, represented as '\s+'. But, as you can clearly see above, that raises a CParserError, complaining that it noticed more columns of data in one row than in a previous one.

In [4]: df = pandas.read_fwf('ncaa_basketball_2016.txt', header=None)
In [5]: df.head()
Out[5]:
            0                 1   2               3   4   5
0  2016-02-25  @Ark Little Rock  72    UT Arlington  60 NaN
1  2016-02-25               ULM  66  @South Alabama  59 NaN

And finally, using pandas.read_fwf instead of pandas.read_table gives us a DataFrame that is close to what we expected, given the data in the file.

Inconvenience #4: NA is not recognized as text; automatically converted to NaN.

Finally, let's assume that you have raw data containing the string NA, which in this specific case is used to represent North America. By default, Pandas interprets such string values as missing data and silently converts them to NaN, without informing the user. One of the things that the Zen of Python says is that explicit is better than implicit. In that spirit, the Tool explicitly lists the values which will be interpreted as None/NaN.
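The NA behavior is again easy to demonstrate with an in-memory buffer (made-up sample data):

```python
import io
import math
import pandas as pd

raw = "NA,1,True\nNA,2,False\n"

# Default: the string 'NA' is swallowed into NaN.
df_default = pd.read_csv(io.StringIO(raw), header=None)

# Disable the default NA list so 'NA' survives as text.
df_text = pd.read_csv(io.StringIO(raw), header=None,
                      keep_default_na=False, na_values=[])

print(math.isnan(df_default.iloc[0, 0]))  # True
print(df_text.iloc[0, 0])                 # NA
```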
The user can remove NA (or any of the other values) from this list, to prevent it from being interpreted as missing, as shown with the following file: [File : test_data_na_values.csv]

In [2]: df = pandas.read_table('test_data_na_values.csv', sep=',', header=None)
In [3]: df
Out[3]:
    0  1      2
0 NaN  1   True
1 NaN  2  False
2 NaN  3  False
3 NaN  4   True

In [4]: df = pandas.read_table('test_data_na_values.csv', sep=',', header=None, keep_default_na=False, na_values=[])
In [5]: df
Out[5]:
    0  1      2
0  NA  1   True
1  NA  2  False
2  NA  3  False
3  NA  4   True

If your intention was to jump straight into data exploration and manipulation, then the above points are some of the inconveniences you will have to deal with, requiring you to learn the various arguments that need to be passed to pandas.read_table before you can load your data correctly and get to your analysis.

Loading Data with the Data Import Tool (The Easy Way)

The Canopy Data Import Tool automatically circumvents several common data-loading inconveniences and errors by simply setting up the correct file assumptions in the Edit Command dialog box. The Data Import Tool takes care of all of these problems for you, allowing you to fast-forward to the important work of data exploration and manipulation. It automatically:

1. Infers if your file contains a row of column names or not;
2. Intelligently infers if your file contains any commented lines and what the comment character is;
3. Infers what delimiter is used in the file, or if the file contains data in a fixed-width format.

Download Canopy (free) and start a free trial of the Data Import Tool to see just how much time and frustration you can save!

The Data Import Tool as a Learning Resource: Using Auto-Generated Python/Pandas Code

So far, we talked about how the Tool can help you get started with data exploration, without the need for you to understand the Pandas library and its intricacies. But what if you were also interested in learning about the Pandas library?
That’s where the Python Code pane in the Data Import Tool can help. As you can see from the screenshot below, the Data Import Tool generates Pandas/Python code for every command you execute. This way, you can explore and learn about the Pandas library using the Tool. View the underlying Python / Pandas code in the Data Import Tool to help learn Pandas code, without slowing down your work. Finally, once you are done loading data from the file and manipulating the DataFrame, you can export the DataFrame to Canopy’s IPython console for further analysis and visualization. Simply click Use DataFrame at the bottom-right corner and the Tool will export the DataFrame to Canopy’s IPython pane, as you can see below. Import the cleaned data into the Canopy IPython console for further data analysis and visualization. Ready to try the Canopy Data Import Tool? Download Canopy (free) and click on the icon to start a free trial of the Data Import Tool today Additional resources: Watch a 2-minute demo video to see how the Canopy Data Import Tool works: See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging: The post Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way appeared first on Enthought Blog. January 08, 2017 Titus Brown Topics and concepts I'm excited about (Paper of the Future) Note: This is the third post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future. I've been struggling to put together an interesting talk for the workshop, and last night Gail Clement (our host, @Repositorian) and Justin Bois helped me convinced myself (using red wine) that I should do something other than present my vision for #futurepaper. So, instead, here is a set of things that I'm pretty excited about in the world of scholarly communication! 
I've definitely left off a few, and I'd welcome pointers and commentary to things I've missed; please comment!

1. The wonderful ongoing discussion around significance and reproducibility.

In addition to Victoria Stodden, Brian Nosek and John Ioannidis have been leaders in banging various drums (and executing various research agendas) that are showing us that we're not thinking very clearly about issues fundamental to science. For me, the blog post that really blew my mind was Dorothy Bishop's summary of the situation in psychology. To quote:

Nosek et al have demonstrated that much work in psychology is not reproducible in the everyday sense that if I try to repeat your experiment I can be confident of getting the same effect. Implicit in the critique by Gilbert et al is the notion that many studies are focused on effects that are both small and fragile, and so it is to be expected they will be hard to reproduce. They may well be right, but if so, the solution is not to deny we have a problem, but to recognize that under those circumstances there is an urgent need for our field to tackle the methodological issues of inadequate power and p-hacking, so we can distinguish genuine effects from false positives.

Read the whole thing. It's worth it.

Relevance to #futurepaper: detailed methods and exact repeatability are a prerequisite for conversations about what we really care about, which is: "is this effect real?"

2. Blogging as a way to explore issues without prior approval from Top People.

I've always hated authority, ever since I noticed the propensity for authorities to protect their position over seeking truth. This manifests in many ways, one of which is through control over peer review and scientific publishing processes. With that in mind, it's worth reading up on Susan Fiske's "methodological terrorism" draft, in which Fiske, a professor at Princeton and an editor at PNAS, "publicly compares some of her colleagues to terrorists" (ref).
Fiske is basically upset that people are daring to critique published papers via social media. There are a bunch of great responses; I'll highlight just one, by Chris Chambers:

So what's really going on here? The truth is that we are in the midst of a power struggle, and it's not between Fiske's "destructo-critics" and their victims, but between reformers who are trying desperately to improve science and a vanguard of traditionalists who, every so often, look down from their thrones to throw a log in the road. As the body count of failed replications continues to climb, a new generation want a different kind of science and they want it now. They're demanding greater transparency in research practices. They want to be rewarded for doing high quality research regardless of the results. They want researchers to be accountable for the quality of the work they produce and for the choices they make as public professionals.

It's all very sensible, constructive, and in the public interest. Yeah!

The long and short of it is that I'm really excited about how science and the scientific process are being discussed openly via blogs and Twitter. (You can also read my "Top 10 reasons why blog posts are better than scientific papers".)

Relevance to #futurepaper: there are many alternate "publishing" models that offer advantages over our current publishing and dissemination system. They also offer potential disadvantages, of course.

3. Open source as a model for open science.

Two or three times every year I come back to this wonderful chapter by K. Jarrod Millman and Fernando Perez entitled "Developing open source scientific practice." It breaks down all the ways in which current computational science practice falls short of the mark and could learn from standard open source software development practices.
For one quote (although the chapter offers far more!), Asking about reproducibility by the time a manuscript is ready for submission to a journal is too late: this problem must be tackled from the start, not as an afterthought tacked-on at publication time. We must therefore look for approaches that allow researchers to fluidly move back and forth between the above stages and that integrate naturally into their everyday practices of research, collaboration, and publishing, so that we can simultaneously address the technical and social aspects of this issue. Please, go read it! Relevance to #futurepaper: tools and processes prior to publication matter! 4. Computational narratives as the engine of collaborative data science. That's the title of the most recent Project Jupyter grant application, authored by Fernando Perez and Brian Granger (and funded!). It's hard to explain to people who haven't seen it, but the Jupyter Notebook is the single most impactful thing to happen to the science side of the data science ecosystem in the last decade. Not content with that, Fernando and Brian lay out a stunning vision for the future of Jupyter Notebook and the software ecosystem around it. Quote: the core problem we are trying to solve is the collaborative creation of reproducible computational narratives that can be used across a wide range of audiences and contexts. The bit that grabs me the most in this grant proposal is the bit on collaboration, but your mileage may vary - the whole thing is a great read! Relevance to #futurepaper: hopefully obvious. 5. mybinder: deploy running Jupyter Notebooks from GitHub repos in Docker containers Another thing that I'm super excited about are the opportunities provided by lightweight composition of many different services. If you haven't seen binder (mybinder.org), you should go play with it! What binder does is let you spin up running Jupyter Notebooks based on the contents of GitHub repositories. 
Even cooler, you can install and configure the execution environment however you want using Dockerfiles. If this all sounds like gobbledygook to you, check out this link to a binder for exploring the LIGO data. Set up by Min Ragan-Kelley, this link spools up an executable environment (in a Jupyter Notebook) for working with the data. Single click, no fuss, no muss. I find this exciting because binder is one example (of several!) where people are building a new publication service by composing several well-supported software packages. Relevance to #futurepaper: ever wanted to give people a chance to play with your publication's analysis pipeline as well as your data? Here you go. 6. Overlay journals. As preprints grow, the question of "why do we have journals anyway?" looms. The trend of overlay journals provides a natural mixing point between preprints and more traditional (and expensive) publishing. An overlay journal is a journal that sits on top of a preprint server. To quote, “The only objection to just putting things on arXiv is that it’s not peer reviewed, so why not have a community-based effort that provides a peer-review service for the arXiv?" [Peter Coles] says — pointing out that academics already carry out peer review for scientific publishers, usually at no cost. Relevance to #futurepaper: many publishers offer very little in the way of services beyond this, so why pay them for it when the preprint server already exists? 7. Bjorn Brembs. Bjorn is one of these people that, if he were less nice, I'd find irritating in his awesomeness. He researches flies or something, and he consistently pushes the boundaries of process in his publications. Two examples -- living figures that integrate data from outside scientists, and systematic openness - to quote from Lenny Teytelman, The paper was posted as a preprint prior to submission and all previous versions of the article are available as biorxiv preprints. The published research paper is open access. 
The raw data are available at figshare. All authors were listed with their ORCID IDs and all materials referenced with RRIDs. All methods are detailed with DOIs on protocols.io. The blog post gives the history and context of the work. It's a fascinating and accessible read for non-fly scientists and non-scientists alike. Beautiful! Bjorn also has a wonderful paper on just how bad the Impact Factor and journal status-seeking system is, and his blog post on what a modern scholarly infrastructure should look like is worth reading. Relevance to #futurepaper: hopefully obvious. 8. Idea futures or prediction markets. There are other ways of reaching consensus than peer review, and idea futures are one of the most fascinating. To quote, Our policy-makers and media rely too much on the "expert" advice of a self-interested insider's club of pundits and big-shot academics. These pundits are rewarded too much for telling good stories, and for supporting each other, rather than for being "right". Instead, let us create betting markets on most controversial questions, and treat the current market odds as our best expert consensus. The real experts (maybe you), would then be rewarded for their contributions, while clueless pundits would learn to stay away. You should have a free-speech right to bet on political questions in policy markets, and we could even base a new form of government on idea futures. Balaji Srinivasan points out that the bitcoin blockchain is another way of reaching consensus, and I think that's worth reading, too. Relevance to #futurepaper: there are other ways of doing peer review and reaching consensus than blocking publication until you agree with the paper. 9. Open peer review by a selected papers network. This proposal by Chris Lee, a friend and colleague at UCLA, outlines how to do peer review via (essentially) a blog chain. 
To quote, A selected-papers (SP) network is a network in which researchers who read, write, and review articles subscribe to each other based on common interests. Instead of reviewing a manuscript in secret for the Editor of a journal, each reviewer simply publishes his review (typically of a paper he wishes to recommend) to his SP network subscribers. Once the SP network reviewers complete their review decisions, the authors can invite any journal editor they want to consider these reviews and initial audience size, and make a publication decision. Since all impact assessment, reviews, and revisions are complete, this decision process should be short. I show how the SP network can provide a new way of measuring impact, catalyze the emergence of new subfields, and accelerate discovery in existing fields, by providing each reader a fine-grained filter for high-impact. I think this is a nice concrete example of an alternate way to do peer review that should actually work. There are a lot of things that could tie into this, including trust metrics; cryptographic signing of papers, reviews, and decisions so that they are verifiable; verifiable computing a la worldmake; etc. Relevance to #futurepaper? Whether or not you believe this could work, figuring out why you think what you think is a good way to explore what the publishing landscape could look like. 10. A call to arms: make outbreak research open access. What would you call a publishing ecosystem that actively encourages withholding of information that could save lives, all in the name of reputation building and job security? Inhumane? Unethical? Just plain wrong? All of that. Read Yozwiak, Schaffner, and Sabeti's article, "Data sharing: make outbreak research open access." 
There are horror stories galore about what bad data sharing does, but one of the most affecting is in this New Yorker article by Seth Mnookin, in which he quotes Daniel MacArthur: The current academic publication system does patients an enormous disservice. The larger context is that our failure to have and use good mechanisms of data publication is killing people. Maybe we should fix that? Relevance to #futurepaper: open access to papers and data and software is critical to society. Anyway, so that's what's on the tip of my brain this fine morning. --titus January 07, 2017 Titus Brown Data implies software. Note: This is the second post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future. An important yet rarely articulated assumption of a lot of my work in biological data analysis is that data implies software: it's not much good gathering data if you don't have the ability to analyze it. For some data, spreadsheet software is good enough. This was the situation molecular biology was in up until the early 2000s - sure, we'd get numbers from instruments and sequences from sequencers, but they'd all fit pretty handily in whatever software we had lying around. Once numerical data sets get big enough -- e.g. I did approximately 50,000 qPCRs in my last two years of grad school, which was unpleasant to handle in Excel -- we need to invest in software like R or Python, which can do bulk and batch processing of the data. Software like OpenRefine can also help with "manual" cleaning and rationalization of the data. But this requires skills that are still relatively specialized. For other data, we need custom software built specifically for that data type. This is true of sequence analysis, where most of my work is focused: when you get 200m DNA sequences, each of length 150 bp, there's no simple, effective way to query or summarize that using general computational tools. 
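As a toy illustration of the kind of specialized-but-simple-in-spirit code this implies, here is a sketch (mine, not from the original post) that tallies read lengths across FASTQ-formatted sequence data. Real tooling has to stream hundreds of millions of records and cope with malformed input, but the shape is the same:

```python
from collections import Counter

def read_lengths(fastq_lines):
    """Tally sequence lengths from FASTQ text, which stores each read
    as four lines: @header, sequence, '+', and quality scores."""
    lengths = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence line of each record
            lengths[len(line.strip())] += 1
    return lengths

# Two made-up records for illustration.
sample = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGT", "+", "IIII",
]
summary = read_lengths(sample)
```

At scale this sort of script quickly grows batch-friendly I/O, quality filtering, and k-mer-level summaries, which is where the specialized bioinformatics tooling comes in.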
We need specialized code to parse, summarize, explore, and investigate these data sets. Using this code doesn't necessarily require serious programming knowledge, but data analysts may need fortitude in dealing with potentially immature software, as well as a duct-tape mentality in terms of tying together software that wasn't designed to integrate, or repurposing software that was meant for a different purpose. There is at least one other category of data analysis software that I can think of but haven't personally experienced - that's the kind of stuff that CERN and Facebook and Google have to deal with, where the data sets are so overwhelmingly large that you need to build deep software and hardware infrastructure to handle them. This becomes (I think) more a matter of systems engineering than anything else, but I bet there is a really strong domain knowledge component that is required of at least some of the systems engineers here. I think some of the cancer sequencing folk are close to this stage, judging from a talk I heard from Lincoln Stein two years ago. Data-intensive research increasingly lives beyond the "spreadsheet" level As data set sizes increase across the board, researchers are increasingly finding that spreadsheets are insufficient. This is for all the reasons articulated in the Data Carpentry spreadsheet lesson, so I won't belabor the point any more, but what does this mean for us? So increasingly our analysis results don't depend on spreadsheets; they depend on custom data processing scripts (in R, MATLAB, Python, etc.) and other people's programs (e.g. in bioinformatics, mappers and assemblers) and on multiple steps of data handling, cleaning, summation, integration, analysis and summarization. And, as is nicely laid out in Stodden et al. (2016), all of these steps are critical components of the data interpretation and belong in the Methods section of any publication! What's your point, Dr. Brown? 
When we talk about "the scientific paper of the future", one of the facets that people are most excited about - and I think this Caltech panel will probably focus on this facet - is that we now possess the technology to readily and easily communicate the details of this data analysis. Not only that, we can communicate it in such a way that it becomes repeatable and explorable and remixable, using virtual environments and data analysis notebooks. I want to highlight something else, though. When I read excellent papers on research data management like "10 aspects of highly effective research data" (or is this a blog post? I can't always tell any more), I falter at section headings that say data should be "comprehensible" and "reviewed" and especially "reusable". This is not because they are incorrect, but rather because these are so dependent on having methods (i.e. software) to actually analyze the data. And that software seems to be secondary for many data-focused folk. For me, however, they are one and the same. If I don't have access to software customized to deal with the data-type specific nuances of this data set (e.g. batch effects of RNAseq data), the data set is much less useful. If I don't know exactly what statistical cutoffs were used to extract information from this data set by others, then the data set is much less useful. (I can make my own determination as to whether those cutoffs were good cutoffs, but if I don't know what they were, I'm stuck.) If I don't have access to the custom software that was used in removing noise, generated the interim results, and did the large-scale data processing, I may not even be able to approximate the same final results. Where does this leave me? I think: Archived data has diminished utility if we do not have the ability to analyze it; for most data, this means we need software. For each data set, we should aim to have at least one fully articulated data processing pipeline (that takes us from data to results). 
Preferably, this would be linked to the data somehow. What I'm most excited about when it comes to the scientific paper of the future is that most visions for it offer an opportunity to do exactly this! In the future, we will increasingly attach detailed (and automated and executable) methods to new data sets. And along with driving better (more repeatable) science, this will drive better data reuse and better methods development, and thereby accelerate science overall. Fini. --titus p.s. For a recent bioinformatics effort in enabling large-scale data reuse, see "The Lair" from the Pachter Lab. p.p.s. No, I did not attempt to analyze 50,000 qPCRs in a spreadsheet. ...but others in the lab did. January 06, 2017 Titus Brown The top 10 reasons why blog posts are better than scientific papers Note: This is the first post in what I hope to be a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future. 1. Blog posts are like preprints, but faster. Even preprints go through some review before they're posted, just to make sure they're not obviously crank papers. Blog posts don't suffer from any prior restraint other than the need to take the time to write them. 2. Blog posts don't end up in PDFs. ...and you don't have to write them in nasty complex formats like Word or LaTeX. Reference: why PDFs suck. 3. Blog posts are like papers, but better written. Blog posts can be colloquial, funny, and sarcastic - unlike scientific papers. Blog posts can also contain narrative in a way that scientific papers simply don't. 4. Blog posts are often opinionated. Papers go through multiple rounds of review and revision, in which the naturally irregular and uneven surface of reality is sanded down and/or bludgeoned into a cuboid that looks and sounds objective and impartial. Blog posts suffer from no such fiction of objectivity and impartiality. (Self-referential case in point.) 5. Blog posts inspire feedback. 
Perhaps in part because blog posts convey personal opinion, blog posts are inherently more social, more interactive, and more open to commentary. (Presumably this will also be a self-referential case in point. Or not, which would be awesomely ironic!) 6. Blog posts are free, open access, and indexed by search engines. Kind of like preprints, but not in a PDF. Very much not like many scientific papers. 7. Blog posts can be versioned. You can have multiple versions of blog posts -- kind of like preprints, but very much unlike papers. Unlike either preprints or papers, blog posts can take advantage of real version control systems like git. (Self-referential case in point.) This also further enables collaboration. 8. Blogs don't have impact factors. Instead of a nonsensical and unrigorous statistic that signals to other scientists how important an editor thinks your paper will eventually be, blog posts are shared freely among an ad hoc self-assembled network of enemies on Twitter and Facebook. 9. Blog posts can be pseudonymous. There are many science blogs that are pseudonymous, and no one cares. (This is actually really important.) This (along with the general lack of prior restraint, above) allows unpleasant truths to be shared. 10. Blog posts are probably more reliable than scientific papers. Because blog posts don't matter for academic reputation, there is little reason to game the blog post system. Therefore, blog posts are inherently more likely to be reliable than scientific papers. I encourage people who disagree with this post to submit a commentary to a respectable high retraction index journal like Science or Nature. 
--titus Continuum Analytics news Using Anaconda and H2O to Supercharge your Machine Learning and Predictive Analytics Monday, January 9, 2017 Kristopher Overholt Continuum Analytics Anaconda integrates with many different providers and platforms to give you access to the data science libraries you love with the tools you use, including Amazon Web Services, Docker, and Cloudera CDH. Today we’re excited to announce our new partnership with H2O and the availability of H2O machine learning packages for Anaconda on Windows, Mac and Linux. h2o-machine-learning-1a.png H2O is an open source, in-memory, distributed, fast and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data. Using in-memory compression, H2O handles billions of data rows in-memory, even with a small cluster. H2O is used by over 60,000 data scientists and more than 7,000 organizations around the world. H2O includes a wide range of data science algorithms and estimators for supervised and unsupervised machine learning such as generalized linear modeling, gradient boosting, deep learning, random forest, naive bayes, ensemble learning, generalized low rank models, k-means clustering, principal component analysis, and others. H2O provides interfaces for Python, R, Java and Scala, and can be run in standalone mode or on a Hadoop/Spark cluster via Sparkling Water or sparklyr. In this blog post, we’ll demonstrate how you can install and use H2O with Python alongside the 720+ packages in Anaconda to perform interactive machine learning workflows with notebooks and visualizations as part of Anaconda’s Open Data Science platform. h2o-machine-learning-1b.png Installing and Using H2O with Anaconda You can install H2O with Anaconda on Windows, Mac or Linux. The following conda command will install the H2O core library and engine, the H2O Python client library and the required Java dependencies (OpenJDK): $ conda install h2o h2o-py


That’s it! After installing H2O with Anaconda, you’re now ready to get started with a wide range of machine learning algorithms and data science modeling techniques.

In the following sections, we’ll demonstrate how to use H2O with Anaconda based on examples from the H2O documentation, including a k-means clustering example, a deep learning example and a gradient boosting example.

K-means Clustering with Anaconda and H2O

K-means clustering is an unsupervised machine learning technique that can be used to group the values in a data set into clusters.
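As a quick sketch of what the algorithm does under the hood (a toy pure-Python version with made-up 2-D points, independent of H2O):

```python
import random

def kmeans(points, k, iters=20, seed=2):
    """Minimal k-means: alternate assigning points to their nearest
    centroid and moving each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # leave an empty cluster's centroid where it was
                centroids[i] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, clusters

# Two well-separated blobs of 2-D points; k=2 should recover them.
points = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(points, k=2)
```

The H2O estimator used below does the same thing at scale (and with smarter initialization), which is the point of handing the work off to a library.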

In this example, we’ll use the k-means clustering algorithm in H2O on the Iris flower data set to classify the measurements into clusters.

First, we’ll start a Jupyter notebook server where we can run the H2O machine learning examples in an interactive notebook environment with access to all of the libraries from Anaconda.

jupyter notebook  h2o-machine-learning-2a.png In the notebook, we can import the H2O client library and initialize an H2O cluster, which will be started on our local machine: >>> import h2o >>> h2o.init() Checking whether there is an H2O instance running at http://localhost:54321..... not found. Attempting to start a local H2O server... Java Version: openjdk version "1.8.0_102"; OpenJDK Runtime Environment (Zulu 8.17.0.3-macosx) (build 1.8.0_102-b14); OpenJDK 64-Bit Server VM (Zulu 8.17.0.3-macosx) (build 25.102-b14, mixed mode) Starting server from /Users/koverholt/anaconda3/h2o_jar/h2o.jar Ice root: /var/folders/5b/1vh3qn2x7_s7mj88zc3nms0m0000gp/T/tmpj9mo8ims JVM stdout: /var/folders/5b/1vh3qn2x7_s7mj88zc3nms0m0000gp/T/tmpj9mo8ims/h2o_koverholt_started_from_python.out JVM stderr: /var/folders/5b/1vh3qn2x7_s7mj88zc3nms0m0000gp/T/tmpj9mo8ims/h2o_koverholt_started_from_python.err Server is running at http://127.0.0.1:54321 Connecting to H2O server at http://127.0.0.1:54321... successful. 
h2o-machine-learning-2b.png After we’ve started the H2O cluster, we can download the Iris data set from the H2O repository on Github and view a summary of the data: >>> iris = h2o.import_file(path="https://github.com/h2oai/h2o-3/raw/master/h2o-r/h2o-package/inst/extdata/iris_wheader.csv") >>> iris.describe()  h2o-machine-learning-2c.png Now that we’ve loaded the data set, we can import and run the k-means estimator from H2O: >>> from h2o.estimators.kmeans import H2OKMeansEstimator >>> results = [H2OKMeansEstimator(k=clusters, init="Random", seed=2, standardize=True) for clusters in range(2,13)] >>> for estimator in results: estimator.train(x=iris.col_names[0:-1], training_frame = iris) kmeans Model Build progress: |████████████████████████████████████████████| 100% We can specify the number of clusters and iteratively compute the cluster locations and data points that are contained within the clusters: >>> clusters = 4 >>> predicted = results[clusters-2].predict(iris) >>> iris["Predicted"] = predicted["predict"].asfactor() kmeans prediction progress: |█████████████████████████████████████████████| 100% Once we’ve generated the predictions, we can visualize the classified data and clusters. Because we have access to all of the libraries in Anaconda in the same notebook as H2O, we can use matplotlib and seaborn to visualize the results: >>> import seaborn as sns >>> %matplotlib inline >>> sns.set() >>> sns.pairplot(iris.as_data_frame(True), vars=["sepal_len", "sepal_wid", "petal_len", "petal_wid"], hue="Predicted"); h2o-machine-learning-2d.png Deep Learning with Anaconda and H2O We can also perform deep learning with H2O and Anaconda. Deep learning is a class of machine learning algorithms that incorporate neural networks and can be used to perform regression and classification tasks on a data set. In this example, we’ll use the supervised deep learning algorithm in H2O on the Prostate Cancer data set stored on Amazon S3. 
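To ground what a network of Tanh units is doing internally, here is a toy single-hidden-layer network trained by plain per-example gradient descent. Everything in it (the XOR data, the architecture, the hyperparameters) is made up for illustration and is unrelated to the Prostate Cancer walkthrough that follows:

```python
import math
import random

def train_tiny_net(data, hidden=4, epochs=3000, lr=0.3, seed=1):
    """A 2-input network with one tanh hidden layer and a sigmoid
    output, trained by per-example gradient descent on squared error."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(2)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = rng.uniform(-1, 1)

    def forward(x):
        h = [math.tanh(x[0] * w1[0][j] + x[1] * w1[1][j] + b1[j])
             for j in range(hidden)]
        z = sum(h[j] * w2[j] for j in range(hidden)) + b2
        return h, 1.0 / (1.0 + math.exp(-z))

    def mse():
        return sum((forward(x)[1] - y) ** 2 for x, y in data) / len(data)

    before = mse()
    for _ in range(epochs):
        for x, y in data:
            h, o = forward(x)
            d_o = (o - y) * o * (1 - o)  # chain rule through the sigmoid
            d_h = [d_o * w2[j] * (1 - h[j] ** 2) for j in range(hidden)]
            for j in range(hidden):      # backpropagate through the tanh units
                w2[j] -= lr * d_o * h[j]
                w1[0][j] -= lr * d_h[j] * x[0]
                w1[1][j] -= lr * d_h[j] * x[1]
                b1[j] -= lr * d_h[j]
            b2 -= lr * d_o
    return before, mse(), forward

# XOR is not linearly separable, so the hidden layer has to do real work.
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
loss_before, loss_after, net = train_tiny_net(xor_data)
```

H2O's estimator is this same idea with many layers, better optimizers, regularization, and distributed execution.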
We’ll use the same H2O cluster that we created using h2o.init() in the previous example. First, we’ll download the Prostate Cancer data set from a publicly available Amazon S3 bucket and view a summary of the data: >>> prostate = h2o.import_file(path="s3://h2o-public-test-data/smalldata/logreg/prostate.csv") >>> prostate.describe() Rows: 380 Cols: 9 h2o-machine-learning-3a.png We can then import and run the deep learning estimator from H2O on the Prostate Cancer data: >>> from h2o.estimators.deeplearning import H2ODeepLearningEstimator >>> prostate["CAPSULE"] = prostate["CAPSULE"].asfactor() >>> model = H2ODeepLearningEstimator(activation = "Tanh", hidden = [10, 10, 10], epochs = 10000) >>> model.train(x = list(set(prostate.columns) - set(["ID","CAPSULE"])), y ="CAPSULE", training_frame = prostate) >>> model.show() deeplearning Model Build progress: |██████████████████████████████████████| 100% Model Details ============= H2ODeepLearningEstimator : Deep Learning Model Key: DeepLearning_model_python_1483417629507_19 Status of Neuron Layers: predicting CAPSULE, 2-class classification, bernoulli distribution, CrossEntropy loss, 322 weights/biases, 8.5 KB, 3,800,000 training samples, mini-batch size 1 h2o-machine-learning-3b.png After we’ve trained the deep learning model, we can generate predictions and view the results, including the model scoring history and performance metrics: >>> predictions = model.predict(prostate) >>> predictions.show() deeplearning prediction progress: |███████████████████████████████████████| 100% h2o-machine-learning-3c.png Gradient Boosting with H2O and Anaconda We can also perform gradient boosting with H2O and Anaconda. Gradient boosting is an ensemble machine learning technique (commonly used in conjunction with decision trees) that can perform regression and classification tasks on a data set. 
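The core trick behind boosting (fitting each new tree to the residual errors of the ensemble so far) can be sketched in plain Python with depth-1 "stumps". The 1-D data below are made up for illustration; the ntrees and learn_rate names deliberately echo the H2O parameters:

```python
def fit_stump(xs, ys):
    """Best single-split regression tree (a 'stump') on 1-D inputs."""
    best = None
    for t in sorted(set(xs))[:-1]:  # splitting above the max is useless
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, ntrees=50, learn_rate=0.1):
    """Each stump is fit to the residuals the ensemble leaves behind."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(ntrees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learn_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learn_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]  # a noisy step function
model = gradient_boost(xs, ys)
```

H2O's H2OGradientBoostingEstimator applies the same residual-fitting loop with full decision trees, a loss appropriate to the task (e.g. bernoulli for binary classification), and distributed tree building.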
In this example, we’ll use the supervised gradient boosting algorithm in H2O on a cleaned version of the Prostate Cancer data from the previous deep learning example. First, we’ll import and run the gradient boosting estimator from H2O on the Prostate Cancer data (here, train refers to an H2O frame containing that cleaned data set): >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> my_gbm = H2OGradientBoostingEstimator(distribution = "bernoulli", ntrees=50, learn_rate=0.1) >>> my_gbm.train(x=list(range(1,train.ncol)), y="CAPSULE", training_frame=train, validation_frame=train) gbm Model Build progress: |███████████████████████████████████████████████| 100% After we’ve trained the gradient boosting model, we can view the resulting model performance metrics: >>> my_gbm_metrics = my_gbm.model_performance(train) >>> my_gbm_metrics.show() ModelMetricsBinomial: gbm ** Reported on test data. ** MSE: 0.07338612348053128 RMSE: 0.2708987328883826 LogLoss: 0.26757238912319825 Mean Per-Class Error: 0.07431401341740806 AUC: 0.9801618150931445 Gini: 0.960323630186289 Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4772353333869793:  h2o-machine-learning-4a.png Additional Resources for Machine Learning with Anaconda and H2O Refer to the H2O documentation for more information about the full set of machine learning algorithms, libraries and examples that are available in H2O, including generalized linear modeling, random forest, naive bayes, ensemble learning, generalized low rank models, principal component analysis and others. Interested in using Anaconda and H2O in your enterprise organization for machine learning, model deployment workflows and scalable analysis with Hadoop and Spark? Get in touch with us if you’d like to learn more about how Anaconda can empower your enterprise with Open Data Science, including an on-premise package repository, collaborative notebooks, cluster deployments and custom consulting/training solutions. 
The complete notebooks for the k-means clustering, deep learning, and gradient boosting examples shown in this blog post can be viewed and downloaded from Anaconda Cloud: January 03, 2017 Matthew Rocklin Dask Release 0.13.0 This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation Summary Dask just grew to version 0.13.0. This is a significant release for arrays, dataframes, and the distributed scheduler. This blogpost outlines some of the major changes since the last release on November 4th. 1. Python 3.6 support 2. Algorithmic and API improvements for DataFrames 3. Dataframe to Array conversions for Machine Learning 4. Parquet support 5. Scheduling Performance and Worker Rewrite 6. Pervasive Visual Diagnostics with Embedded Bokeh Servers 7. Windows continuous integration 8. Custom serialization You can install new versions using Conda or Pip conda install -c conda-forge dask distributed  or pip install dask[complete] distributed --upgrade  Python 3.6 Support Dask and all necessary dependencies are now available on Conda Forge for Python 3.6. Algorithmic and API Improvements for DataFrames Thousand-core Dask deployments have become significantly more common in the last few months. This has highlighted scaling issues in some of the Dask.array and Dask.dataframe algorithms, which were originally designed for single workstations. Algorithmic and API changes can be grouped into the following two categories: 1. Filling out the Pandas API 2. Algorithms that needed to be changed or added due to scaling issues Dask Dataframes now include a fuller set of the Pandas API, including the following: 1. Inplace operations like df['x'] = df.y + df.z 2. The full Groupby-aggregate syntax like df.groupby(...).aggregate({'x': 'sum', 'y': ['min', 'max']}) 3. Resample on dataframes as well as series 4. Pandas’ new rolling syntax df.x.rolling(10).mean() 5. 
And much more Additionally, collaboration with some of the larger Dask deployments has highlighted scaling issues in some algorithms, resulting in the following improvements: 1. Tree reductions for groupbys, aggregations, etc. 2. Multi-output-partition aggregations for groupby-aggregations with millions of groups, drop_duplicates, etc.. 3. Approximate algorithms for nunique 4. etc.. These same collaborations have also yielded better handling of open file descriptors, changes upstream to Tornado, and upstream changes to the conda-forge CPython recipe itself to increase the default file descriptor limit on Windows up from 512. Dataframe to Array Conversions You can now convert Dask dataframes into Dask arrays. This is mostly to support efforts of groups building statistics and machine learning applications, where this conversion is common. For example you can load a terabyte of CSV or Parquet data, do some basic filtering and manipulation, and then convert to a Dask array to do more numeric work like SVDs, regressions, etc.. import dask.dataframe as dd import dask.array as da df = dd.read_csv('s3://...') # Read raw data x = df.values # Convert to dask.array u, s, v = da.linalg.svd(x) # Perform serious numerics  This should help machine learning and statistics developers generally, as many of the more sophisticated algorithms can be more easily implemented with the Dask array model than can be done with distributed dataframes. This change was done specifically to support the nascent third-party dask-glm project by Chris White at Capital One. Previously this was hard because Dask.array wanted to know the size of every chunk of data, which Dask dataframes can’t provide (because, for example, it is impossible to lazily tell how many rows are in a CSV file without actually looking through it). Now that Dask.arrays have relaxed this requirement they can also support other unknown shape operations, like indexing an array with another array. 
y = x[x > 0]  Parquet Support Dask.dataframe now supports Parquet, a columnar binary store for tabular data commonly used in distributed clusters and the Hadoop ecosystem. import dask.dataframe as dd df = dd.read_parquet('myfile.parquet') # Read from Parquet df.to_parquet('myfile.parquet', compression='snappy') # Write to Parquet  This is done through the new fastparquet library, a Numba-accelerated version of the Pure Python parquet-python. Fastparquet was built and is maintained by Martin Durant. It’s also exciting to see the Parquet-cpp project gain Python support through Arrow and work by Wes McKinney and Uwe Korn. Parquet has gone from inaccessible in Python to having multiple competing implementations, which is a wonderful and exciting change for the “Big Data” Python ecosystem. Scheduling Performance and Worker Rewrite The internals of the distributed scheduler and workers are significantly modified. Users shouldn’t experience much change here except for general performance enhancement, more upcoming features, and much deeper visual diagnostics through Bokeh servers. We’ve pushed some of the scheduling logic from the scheduler onto the workers. This lets us do two things: 1. We keep a much larger backlog of tasks on the workers. This allows workers to optimize and saturate their hardware more effectively. As a result, complex computations end up being significantly faster. 2. We can more easily deliver on a rising number of requests for complex scheduling features. For example, GPU users will be happy to learn that you can now specify abstract resource constraints like “this task requires a GPU” and “this worker has four GPUs” and the scheduler and workers will allocate tasks accordingly. This is just one example of a feature that was easy to implement after the scheduler/worker redesign and is now available. Pervasive Visual Diagnostics with Embedded Bokeh Servers While optimizing scheduler performance we built several new visual diagnostics using Bokeh. 
There is now a Bokeh Server running within the scheduler and within every worker. Current Dask.distributed users will be familiar with the existing diagnostic dashboards: These plots provide intuition about the state of the cluster and the computations currently in flight. These dashboards are generally well loved. There are now many more of these, though focused more on internal state and timings that will be of interest to developers and power users than to typical users. Here are a couple of the new pages (of which there are seven) that show various timings and counters of various parts of the worker and scheduler internals. The previous Bokeh dashboards were served from a separate process that queried the scheduler periodically (every 100ms). Now there are new Bokeh servers within every worker and a new Bokeh server within the scheduler process itself rather than in a separate process. Because these servers are embedded, they have direct access to the state of the scheduler and workers, which significantly reduces barriers for us to build out new visuals. However, this also adds some load to the scheduler, which can often be compute bound. These pages are available at new ports, 8788 for the scheduler and 8789 for the worker by default. Custom Serialization This is actually a change that occurred in the last release, but I haven’t written about it and it’s important, so I’m including it here. Previously inter-worker communication of data was accomplished with Pickle/Cloudpickle and optional generic compression like LZ4/Snappy. This was robust and worked mostly fine, but left out some exotic data types and did not provide optimal performance. Now we can serialize different types with special consideration. This allows special types, like NumPy arrays, to pass through without unnecessary memory copies and also allows us to use more exotic data-type specific compression techniques like Blosc. It also allows Dask to serialize some previously unserializable types. 
In particular this was intended to address the Dask.array climate science community's concern about HDF5 and NetCDF files, which (correctly) are unpicklable and so had been restricted to single-machine use. This is also the first step towards two frequently requested features (neither of these exist yet):

1. Better support for GPU-GPU specific serialization options. We are now a large step closer to generalizing away our assumption of TCP sockets as the universal communication mechanism.
2. Passing data between workers of different runtime languages. By embracing protocols other than Pickle we begin to allow for the communication of data between workers of different software environments.

What's Next

So what should we expect to see in the future for Dask?

• Communication: Now that workers are more fully saturated we've found that communication issues are arising more frequently as bottlenecks. This might be because everything else is nearing optimal, or it might be because of the increased contention in the workers now that they are idle less often. Many of our new diagnostics are intended to measure components of the communication pipeline.
• Third Party Tools: We're seeing nice growth of utilities like dask-drmaa for launching clusters on DRMAA job schedulers (SGE, SLURM, LSF) and dask-glm for solvers for GLM-like machine-learning algorithms. I hope that external projects like these become the main focus of Dask development going forward as Dask penetrates new domains.
• Blogging: I'll be launching a few fun blog posts throughout the next couple of weeks. Stay tuned.

Learn More

You can install or upgrade using Conda or Pip:

    conda install -c conda-forge dask distributed

or

    pip install dask[complete] distributed --upgrade

You can learn more about Dask and its distributed scheduler at these websites:

Acknowledgements

Since the last main release the following developers have contributed to the core Dask repository (parallel algorithms, arrays, dataframes, etc.):

• Alexander C. Booth
• Antoine Pitrou
• Christopher Prohm
• Frederic Laliberte
• Jim Crist
• Martin Durant
• Matthew Rocklin
• Mike Graham
• Rolando (Max) Espinoza
• Sinhrks
• Stuart Archibald

And the following developers have contributed to the Dask/distributed repository (distributed scheduling, network communication, etc.):

• Antoine Pitrou
• jakirkham
• Jeff Reback
• Jim Crist
• Martin Durant
• Matthew Rocklin
• rbubley
• Stephan Hoyer
• strets123
• Travis E. Oliphant

December 27, 2016

Continuum Analytics news

A Look Back and a Peek Ahead: A Year in Review at Anaconda

Tuesday, December 27, 2016
Michele Chambers, EVP Anaconda Business Unit & CMO, Continuum Analytics

2016 has been quite the year for all of us at Anaconda! From expanding our strong team to growing our customer and partner rosters and continuing our spirit of innovation, it seems like a perfect time to reflect on the year that was as we gear up to hit the ground running in 2017.

January—March

We started the year off with a bang by expanding our executive leadership team, adding two new members to our rapidly growing company. In February, we welcomed Jon Shepherd, our new senior vice president of sales, and Matt Roberts joined as vice president of product engineering. On the product side, we announced Anaconda 2.5 (coupled with the Intel Math Kernel Library), Anaconda Enterprise Notebooks, Anaconda for Cloudera and Anaconda advancements that bring high-performance advanced analytics to Hadoop. Last but certainly not least, our fearless leaders Travis Oliphant, Peter Wang, and Michele Chambers had a busy March—they presented at the Gartner Business Intelligence & Analytics Summit and Strata + Hadoop World in San Jose. It was a great start to the year!

April—June

Here at Anaconda, April showers don't just bring May flowers—they also bring new products, cool collaborations and awesome events.
Kicking off the quarter in April, we announced an exciting partnership with the American Enterprise Institute's Open Source Policy Center TaxBrain initiative. By leveraging the power of open source, TaxBrain can provide policy makers, journalists and the general public with the information they need to impact and change policy for the better. In May, Intel adopted Anaconda as the basis for their Python distribution. Lastly, Spark Summit 2016 was a real hit with two of our team members presenting on "Connecting Python To The Spark Ecosystem" and "GPU Computing With Apache Spark And Python."

July—September

While the dog days of summer were slightly quieter at Anaconda HQ, our team was still going strong under the hot Austin sun. Partnerships with big players such as Intel and IBM helped propel the quarter forward, and our popular data science capstone—the Anaconda Skills Accelerator Program—launched with Galvanize earlier in 2016 for prospective data scientists. In July, we announced our substantial grant from the Gordon and Betty Moore Foundation to help fund Numba and Dask. Rounding off the quarter, our CTO & co-founder Peter Wang took the stage again at Strata + Hadoop World in NYC to discuss Open Data Science on Hadoop, and we introduced the Journey to Open Data Science with a fun new video that kept SAS on their toes and gave Strata attendees a good chuckle. Bokeh developers Bryan Van de Ven and Sarah Bird also presented their Interactive Data Applications tutorial at Strata + Hadoop World in NYC. It was a successful summer.

October—December

With the end of the year in sight, we launched the AnacondaCrew Partner Program to empower data scientists with superpowers (no, we're not kidding!). We're thrilled to announce that in the last year, we've quickly grown this program to include a dozen of the best-known modern data science partners in the ecosystem, including Cloudera, Intel, Microsoft, IBM, NVIDIA, Docker, DataCamp and many others.
We rounded this out with a new partnership with Esri to help enhance GIS applications with Open Data Science. Add to that our new relationship with Recursion Pharmaceuticals—they adopted Bokeh on Anaconda to make it easy for biologists to identify genetic disease markers and assess drug efficacy when visualizing cell data. We feel great about helping to contribute to the medical community and making real change in people's lives. Finally, AnacondaCON 2017 registration opened in November and we couldn't be more excited for the first event—our cherry on top of a wonderful year!

Wishing a happy and healthy holiday season and New Year to everyone in the Anaconda community; none of this would have been possible without you. As our holiday gift to you, please download our Anaconda Holiday Wallpaper on Dropbox for a festive Python desktop background fit for the season. Cheers to an even more successful 2017!

December 24, 2016

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I'm blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-11 and 2016-12-18. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

1. Cleanup of load balancing
2. Found cause of worker lag
3. Initial Spark/Dask Dataframe comparisons
4. Benchmarks with asv

Load Balancing Cleanup

The last two weeks saw several disruptive changes to the scheduler and workers. This resulted in an overall performance degradation on messy workloads when compared to the most recent release, which stopped bleeding-edge users from using recent dev builds. This has been resolved, and bleeding-edge git-master is back up to the old speed and then some.
As a visual aid, this is what bad (or in this case random) load balancing looks like:

Identified and removed worker lag

For a while there have been significant gaps of 100ms or more between successive tasks in workers, especially when using Pandas. This was particularly odd because the workers had lots of backed-up work to keep them busy (thanks to the nice load balancing from before). The culprit here was the calculation of the size of intermediate results on object dtype dataframes.

Explaining this in greater depth: recall that to schedule intelligently, the workers calculate the size in bytes of every intermediate result they produce. Often this is quite fast; for example, for numpy arrays we can just multiply the number of elements by the dtype itemsize. However, for object dtype arrays or dataframes (which are commonly used for text) it can take a long while to calculate an accurate result. Now we no longer calculate an accurate result, but instead take a fairly pessimistic guess. The gaps between tasks shrink considerably, although there is still a significant bit of lag, around 10ms long, between tasks on these workloads (see the zoomed version on the right). On other workloads we're able to get inter-task lag down to the tens-of-microseconds scale. While 10ms may not sound like a long time, when we perform very many very short tasks this can quickly become a bottleneck. Anyway, this change reduced shuffle overhead by a factor of two. Things are starting to look pretty snappy for many-small-task workloads.

Initial Spark/Dask Dataframe Comparisons

I would like to run a small benchmark comparing Dask and Spark DataFrames. I spent a bit of the last couple of days using Spark locally on the NYC Taxi data and futzing with cluster deployment tools to set up Spark clusters on EC2 for basic benchmarking. I ran across flintrock, which has been highly recommended to me a few times. I've been thinking about how to do benchmarks in an unbiased way.
Comparative benchmarks are useful to have around to motivate projects to grow and learn from each other. However, in today's climate where open source software developers have a vested interest, benchmarks often focus on a project's strengths and hide its deficiencies. Even with the best of intentions and practices, a developer is likely to correct for deficiencies on the fly. They're much more able to do this for their own project than for others'. Benchmarks end up looking more like sales documents than trustworthy research. My tentative plan is to reach out to a few Spark devs and see if we can collaborate on a problem set and hardware before running computations and comparing results.

Benchmarks with airspeed velocity

Rich Postelnik is building on work from Tom Augspurger to build out benchmarks for Dask using airspeed velocity at dask-benchmarks. Building out benchmarks is a great way to get involved if anyone is interested.

Pre-pre-release

I intend to publish a pre-release for a 0.X.0 version bump of dask/dask and dask/distributed sometime next week.

December 18, 2016

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I'm blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-11 and 2016-12-18. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

1. Benchmarking the new scheduler and worker on larger systems
2. Kubernetes and Google Container Engine
3. Fastparquet on S3

Rewriting Load Balancing

In the last two weeks we rewrote a significant fraction of the worker and scheduler. This enables future growth, but also resulted in a loss of our load balancing and work stealing algorithms (the old ones no longer made sense in the context of the new system).
Careful dynamic load balancing is essential to running atypical workloads (which are surprisingly typical among Dask users) so rebuilding this has been all-consuming this week for me personally. Briefly, Dask initially assigns tasks to workers taking into account the expected runtime of the task, the size and location of the data that the task needs, the duration of other tasks on every worker, and where each piece of data sits on all of the workers. Because the number of tasks can grow into the millions and the number of workers can grow into the thousands, Dask needs to figure out a near-optimal placement in near-constant time, which is hard. Furthermore, after the system runs for a while, uncertainties in our estimates build, and we need to rebalance work from saturated workers to idle workers relatively frequently. Load balancing intelligently and responsively is essential to a satisfying user experience. We have a decently strong test suite around these behaviors, but it’s hard to be comprehensive on performance-based metrics like this, so there has also been a lot of benchmarking against real systems to identify new failure modes. We’re doing what we can to create isolated tests for every failure mode that we find to make future rewrites retain good behavior. Generally working on the Dask distributed scheduler has taught me the brittleness of unit tests. As we have repeatedly rewritten internals while maintaining the same external API our testing strategy has evolved considerably away from fine-grained unit tests to a mixture of behavioral integration tests and a very strict runtime validation system. Rebuilding the load balancing algorithms has been high priority for me personally because these performance issues inhibit current power-users from using the development version on their problems as effectively as with the latest release. 
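The placement decision described above can be thought of as a cost model: for each candidate worker, estimate when the task could finish there, accounting for already-queued work and for the cost of moving input data. The following is a deliberately tiny toy model of that idea, not Dask's actual scheduler (all names and the bandwidth figure are illustrative):

```python
def pick_worker(task, workers, data_location, bandwidth=100e6):
    """Choose the worker with the earliest estimated finish time.

    task: dict with 'duration' (seconds) and 'inputs' (list of (key, nbytes))
    workers: dict mapping worker name -> seconds of already-queued work
    data_location: dict mapping data key -> worker currently holding it
    bandwidth: assumed network bandwidth in bytes/second
    """
    def estimated_finish(worker):
        # Pay a transfer cost for every input that lives on a different worker.
        transfer = sum(nbytes / bandwidth
                       for key, nbytes in task['inputs']
                       if data_location.get(key) != worker)
        return workers[worker] + transfer + task['duration']

    return min(workers, key=estimated_finish)

task = {'duration': 1.0, 'inputs': [('x', 500e6)]}
workers = {'a': 0.0, 'b': 3.0}
location = {'x': 'b'}
# Worker a is idle but would spend ~5s fetching 500 MB; worker b holds the
# data but has 3s of queued work, so b still wins here.
print(pick_worker(task, workers, location))  # → b
```

The real scheduler must make this decision in near-constant time over thousands of workers and must revisit it as estimates drift, which is what the rebalancing and work-stealing machinery is for; this sketch only shows the shape of the trade-off.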
I'm looking forward to seeing load-balancing humming nicely again so that users can return to git-master and so that I can return to handling a broader base of issues. (Sorry to everyone I've been ignoring the last couple of weeks.)

Test deployments on Google Container Engine

I've personally started switching over my development cluster from Amazon's EC2 to Google's Container Engine. Here are some pros and cons from my particular perspective. Many of these probably have more to do with how I use each particular tool than with intrinsic limitations of the service itself.

In Google's Favor

1. Native and immediate support for Kubernetes and Docker, the combination of which allows me to more quickly and dynamically create and scale clusters for different experiments.
2. Dynamic scaling from a single node to a hundred nodes and back ten minutes later allows me to more easily run a much larger range of scales.
3. I like being charged by the minute rather than by the hour, especially given the ability to dynamically scale up.
4. Authentication and billing feel simpler.

In Amazon's Favor

1. I already have tools to launch Dask on EC2.
2. All of my data is on Amazon's S3.
3. I have nice data acquisition tools, s3fs, for S3 based on boto3. Google doesn't seem to have a nice Python 3 library for accessing Google Cloud Storage :(

I'm working from Olivier Grisel's repository docker-distributed, although updating to newer versions and trying to use as few modifications from naive deployment as possible. My current branch is here. I hope to have something more stable for next week.

Fastparquet on S3

We gave fastparquet and Dask.dataframe a spin on some distributed S3 data on Friday. I was surprised that everything seemed to work out of the box. Martin Durant, who built both fastparquet and s3fs, has done some nice work to make sure that all of the pieces play nicely together. We ran into some performance issues pulling bytes from S3 itself.
I expect that there will be some tweaking over the next few weeks.

December 14, 2016

Titus Brown

Notes on our lab Code of Conduct

I'm writing this up for the rOpenSci call on Codes of Conduct that I'm participating in today. My lab has a lab Code of Conduct. We adapted it from https://github.com/confcodeofconduct/confcodeofconduct.com. So the "how" was easy enough :).

Key points I want to make:

• develop & post a code of conduct whether or not you know of any problems;
• the code of conduct has to set expectations for everyone, including the boss;
• I provide a specific contact person outside the university hierarchy for complaints about me.

A few notes on why a CoC and what use it's been:

Adoption was not motivated by any one particular incident, although there have been a few incidents of problematic behavior over the years. It was more motivated by our adoption of a Code of Conduct for the khmer software project, which is one of our major projects, and also by the Software Carpentry Code of Conduct. (Note that Michael Crusoe was the originator of the CoC on the khmer project and has been both a strong proponent and an excellent resource for creating friendly workspaces.) The 2013 PyCon incidents helped convince me of the value of CoCs in general. We also attended an excellent Ada Initiative ally workshop at PyCon in 2015 that convinced me of the utility of a CoC for the lab specifically.

Another motivation to adopt a lab code of conduct came from our training efforts, where it is clear that impostor syndrome rules and it takes quite a bit of overt friendliness for people to ask questions. Providing ground rules for interaction helps there tremendously.

I will note that there are a few unfriendly and/or obnoxious people in bioinformatics, and that at least two of these individuals have targeted students in my lab or collaborators. Not much to be done about that, although Twitter's "block" functionality works extremely well for me, or at least so I assume ;).
I won't stand for that sort of behavior in the lab, though.

A lab code differs from a workshop or community code of conduct in a few ways. The primary difference I see is at the intersection of authority and longevity - unlike an online community, there is a de facto authority (the head of the lab), who within some limits can make decisions for the lab; this is like a workshop where someone can be asked to leave by the workshop organizer. But unlike a workshop, labs exist for a long time, and so there are longer-term relationships to consider.

My goal in adopting a code was to make it clear that everyone could speak comfortably, without fear of being targeted for who they were or what they believed. I've come through both very macho and argumentative labs as well as super friendly and uncritical labs, and neither seemed right -- I wanted to maintain both the ability to give and take criticism within the lab together with having a friendly and productive lab atmosphere. I believe this is important for brainstorming and creativity, as well as simply making the lab a nice(r) place to be.

There have definitely been benefits in recruiting: having a code (and following it!) means that people know you are aware of many issues that all too many faculty seem unaware of... this encourages a more diverse applicant pool. This may be one reason the lab is fairly diverse in practice, as well.

A key aspect of our code is that it places expectations on the boss as well (that's me). Part of this is having someone to complain to about me; this has only been used once, and it was super important because I simply hadn't realized what I had said & done, and (long term) it led to me modifying a particular behavior of mine. It also enables labbies to take the initiative when something comes up, which has happened a few times.
This doesn't need to involve me; lab members have felt free to speak up and remind others that what they are saying is inappropriate or hurtful, in part because they know that we have set expectations and I will back them up. This seems to work well, the few times it has been used. (Although I've never had to back anyone up.)

Fundamentally, a code of conduct defines a social contract and sets expectations for everyone in their basic set of interactions. I've found it to be a net positive with no downsides so far.

Final note: I have no idea if it's legally enforceable, but I don't actually think that matters that much, as the university channels for handling harassment are largely useless in practice; this is a social problem too, not "just" a legal problem.

--titus

December 13, 2016

Continuum Analytics news

Counting Down to AnacondaCON 2017

Tuesday, December 13, 2016
Michele Chambers, EVP Anaconda Business Unit & CMO, Continuum Analytics

The entire #AnacondaCREW is busy gearing up for our inaugural user conference, AnacondaCON—and it's less than two months away! To continue the hype and excitement (and to honor the fact that the conference is in just eight weeks), we're sharing eight things to expect at AnacondaCON 2017. Check 'em out!

1. Awesome attendees. AnacondaCON will be filled with Anaconda Enterprise users and the brightest minds in the Open Data Science movement who are harnessing the power and innovation of the Anaconda community.
2. Noteworthy speakers. Our speakers will open your eyes to a whole new world of Open Data Science: Blake Borgeson, co-founder and CTO of Recursion Pharmaceuticals; Eric Jonas, postdoctoral researcher at UC Berkeley; and Travis Oliphant, Continuum Analytics CEO and co-founder, just to name a few. And we're still updating the agenda!
3. Amazing schedule. We've got an agenda that's packed with mind-blowing sessions on Open Data Science and cutting-edge insider information (not to mention delicious food!).
Keep checking back as we're updating the agenda every day.

4. Captivating keynote. Our very own CTO and co-founder, Peter Wang, is kicking off the conference with a scintillating presentation starting at 9AM on Wednesday, February 8. We won't spoil it, but if you've ever heard Peter speak, you know you won't want to miss this.
5. Oversized rooms. We're not kidding when we tell you the JW Marriott Austin has elegant, oversized rooms and extremely plush pillows (we tested them out ourselves). Be sure to book your stay now by calling (844) 473-3959 and mentioning "AnacondaCON," or booking online at our discounted link.
6. Networking offsite. Who doesn't love authentic Texas BBQ? Did we mention unbelievable tacos? Don't miss the AnacondaCON off-site party for networking, authentic Texas BBQ (yes, we said it twice, it's just that good) and much more at Fair Market, an Austin Eastside venue only a short ride away from the JW Marriott. Check out the party venue here.
7. Dynamic teams. Bring everyone to AnacondaCON! At the conference, you'll learn how your entire team—business analysts, data scientists, developers, DevOps, data engineers, anyone—can share, engage and collaborate, from the prototype all the way through production deployment.
8. Awe-inspiring event. Overall, AnacondaCON will be the event of a lifetime—you won't want to miss it. See you there!

Register now for AnacondaCON to take advantage of our Early Bird prices.

December 12, 2016

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I'm blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-05 and 2016-12-12. Nothing here is stable or ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

1. Dask.array without known chunk sizes
2. Import time
3. Fastparquet blogpost and feedback
4. Scheduler improvements for 1000+ worker clusters
5. Channels and inter-client communication
6. New dependencies?

Dask.array without known chunk sizes

Dask arrays can now work even in situations where we don't know the exact chunk size. This is particularly important because it allows us to convert dask.dataframes to dask.arrays in a standard analysis cycle that includes both data preparation and statistical or machine learning algorithms.

    x = df.values
    x = df.to_records()

This work was motivated by the work of Christopher White on building scalable solvers for problems like logistic regression and generalized linear models over at dask-glm. As a pleasant side effect we can now also index dask.arrays with dask.arrays (a previous limitation):

    x[x > 0]

and mutate dask.arrays in certain cases with setitem:

    x[x > 0] = 0

Both of these are frequently requested. However, there are still holes in this implementation, and many operations (like slicing) generally don't work on arrays without known chunk sizes. We're increasing capability here but blurring the lines of what is possible and what is not possible, which used to be very clear.

Import time

Import times had been steadily climbing for a while, rising above one second at times. These were reduced by Antoine Pitrou down to a more reasonable 300ms.

FastParquet blogpost and feedback

Martin Durant has built a nice Python Parquet library here: http://fastparquet.readthedocs.io/en/latest/ and released a blogpost about it last week here: https://www.continuum.io/blog/developer-blog/introducing-fastparquet

Since then we've gotten some good feedback and error reports (non-string column names, etc.). Martin has been optimizing performance and recently added append support.

Scheduler optimizations for 1000+ worker clusters

The recent refactoring of the scheduler and worker exposed new opportunities for performance and for measurement.
One of the 1000+ worker deployments here in NYC was kind enough to volunteer some compute time to run some experiments. It was very fun having all of the Dask/Bokeh dashboards up at once (there are now half a dozen of these things) giving live monitoring information on a thousand-worker deployment. It's stunning how clearly performance issues present themselves when you have the right monitoring system. Anyway, this led to better sequentialization when handling messages, greatly reduced open file handle requirements, and the use of cytoolz over toolz in a few critical areas.

I intend to try this experiment again this week, now with new diagnostics. To aid in that we've made it very easy to turn timings and counters automatically into live Bokeh plots. It now takes literally one line of code to add a new plot to these pages (left: scheduler; right: worker). Already we can see that the time it takes to connect between workers is absurdly high, in the 10ms to 100ms range, highlighting an important performance flaw. This depends on an experimental project, crick, by Jim Crist that provides a fast T-Digest implemented in C (see also Ted Dunning's implementation).

Channels and inter-client communication

I'm starting to experiment with mechanisms for inter-client communication of futures. This enables both collaborative workflows (two researchers sharing the same cluster) and also complex workflows in which tasks start other tasks in a more streaming setting. We added a simple mechanism to share a rolling buffer of futures between clients:

    # Client 1
    c = Client('scheduler:8786')
    x = c.channel('x')
    future = c.submit(inc, 1)
    x.put(future)

    # Client 2
    c = Client('scheduler:8786')
    x = c.channel('x')
    future = next(iter(x))

Additionally, this relatively simple mechanism was built external to the scheduler and client, establishing a pattern we can repeat in the future for more complex inter-client communication systems.
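Stripped of the distributed machinery, a channel is essentially a named, bounded, append-only buffer that any number of clients can write to and iterate over. A single-process sketch of that behavior, purely illustrative and not the distributed implementation:

```python
from collections import deque

class Channel:
    """A named rolling buffer: writers put items, readers iterate over
    the most recent `maxlen` of them."""
    def __init__(self, maxlen=1000):
        self._buffer = deque(maxlen=maxlen)

    def put(self, item):
        self._buffer.append(item)

    def __iter__(self):
        # Snapshot so readers are unaffected by concurrent puts.
        return iter(list(self._buffer))

# The scheduler's role here is just a name -> Channel mapping shared by all.
channels = {}

def channel(name):
    return channels.setdefault(name, Channel())

# "Client 1" publishes; "Client 2" reads the same named channel.
channel('x').put(42)
assert next(iter(channel('x'))) == 42
```

In the real system the items are futures, and the mapping lives on the scheduler so that clients in different processes see the same buffer; the bounded deque captures the "rolling" part.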
Generally I'm on the lookout for other ways to make the system more extensible. The range of extension requests for the scheduler is somewhat large these days and we'd like to find ways to keep these expansions maintainable going forward.

New dependency: Sorted collections

The scheduler is now using the sortedcollections module, which is based on sortedcontainers, a pure-Python library offering sorted containers (SortedList, SortedSet, ValueSortedDict, etc.) at C-extension speeds. So far I'm pretty sold on these libraries. I encourage other library maintainers to consider them.

December 11, 2016

Titus Brown

What metadata should we put in MinHash Sketch signatures?

One of the uses we are most interested in for MinHash sketches is the indexing and search of large public, semi-public, and private databases. There are many specific use cases for this, but the basic goal is to be able to find data sets by content queries, using sequence as the "bait". Think "find me data sets that overlap with my metagenome", or "what should I co-assemble with?"

One particularly interesting feature of MinHash sketches for this purpose is that you can provide indices on closed or private data sets without revealing the actual data - while I'd prefer that most data be open to all, I figure "findable" is at least an advantage over the current situation.

As we start to plan the indexing of larger databases, a couple of other features of MinHash sketches also start to become important. One feature is that they are very small, and they are also very quick to search. For 60,000 microbial genomes the compressed data set of sourmash sketches is under a few GB, and that's with an overly verbose and unoptimized storage format. These 60,000 genomes can be searched in under a few seconds and in less than a GB of RAM; because of the wonder of n-ary trees, it is unlikely that search of much larger databases will be significantly slower.
A third feature (well explored in the mash paper) is that MinHash sketches with large k are both very specific and very sensitive to single genomes, in that you usually recover the right match, and it is rare to recover irrelevant matches.

One consequence of the speed and small footprint of MinHash sketches is that we can easily provide the individual sketches as well as the aggregated Sequence Bloom Tree databases for download and use. Another consequence is that people can search and filter on these databases quite quickly and without a lot of hardware - pretty much everything can be done on laptop-scale hardware. Moreover, the sketches (once calculated) don't really need to be updated - a sketch will change very little even if an assembly is updated.

So while people might be interested in building custom MinHash databases for searching subsets of archives, it seems reasonable to maintain a single database of all the sketches that can be downloaded and searched by anyone. This opinion informed my response to Michael Barton, who is interested in building custom databases for several reasons - my guess is that this will be a somewhat specialized (though perhaps reasonably frequent) use case, compared to simply downloading and using a pre-constructed database. More important to me is the interoperability of different tools, which basically boils down to choosing the same hash functions and (eventually) figuring out what k-mer size and number of MinHash values to store per data set.

Something that I'm more focused on at the moment is another question that Michael asked, which is about metadata. Right now our individual signature files can contain multiple sketches per sample, with different k-mer sizes and molecule types (DNA/protein). These are kept in YAML. Because of this, the format is easily extensible to include a variety of metadata, but I have put very little thought into what metadata to store.
Thinking out loud:

• there will be a few pieces of metadata that every sketch should have; for public data, for example, the URL and an unambiguous database-specific identifier should be there.
• each source database will have its own metadata records; if we index data sets from the Assembly database at NCBI, there will be different fields available than from the SRA database at NCBI, vs the MG-RAST metagenome collection, vs the IMG/M database. I'm not aware of any metadata standards here (but I wouldn't know, either). This means that trying to come up with a single standard is an idea that is doomed to fail.
• we should try to include enough information that there is something human-readable and useful, if possible;
• I'm not sure how much information we need to include beyond database identity and database record ID; it seems like dipping our toes into (e.g.) taxonomy and phylogeny would be a dangerous game, and that information could be pulled out of the databases for whatever specific use case.
• I'm comfortable with the idea of developing out the details over time as we add new data sets, and perhaps updating old records with more complete metadata as we develop new use cases and more robust handling code.

Some examples

For example, looking at Shewanella oneidensis MR-1, the assembly record has the following info:

    ASM14616v2
    Organism name: Shewanella oneidensis MR-1 (g-proteobacteria)
    Infraspecific name: Strain: MR-1
    BioSample: SAMN02604014
    Submitter: TIGR
    Date: 2012/11/02
    Assembly level: Complete Genome
    Genome representation: full
    RefSeq category: reference genome
    Relation to type material: assembly from type material
    GenBank assembly accession: GCA_000146165.2 (latest)
    RefSeq assembly accession: GCF_000146165.2 (latest)
    RefSeq assembly and GenBank assembly identical: yes

Clearly we want to store 'organism name' and probably the strain, and the accession information; and we probably want to include assembly level and genome representation.
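Concretely, a signature's metadata block for this assembly might look something like the following. The field names here are hypothetical, chosen for illustration only, and not an existing sourmash schema:

```python
# Hypothetical metadata block for the Shewanella assembly record above.
# Every field name is illustrative; only the values come from the record.
metadata = {
    'db': 'ncbi-assembly',                 # source database identity
    'db_id': 'GCF_000146165.2',            # unambiguous database-specific ID
    'name': 'Shewanella oneidensis MR-1',  # human-readable organism name
    'strain': 'MR-1',
    'assembly_level': 'Complete Genome',
    'genome_representation': 'full',
}
```

Since the signature files are YAML, a block like this would simply become another mapping alongside the existing per-sketch fields.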
I'd probably also add the URL to download the .fna.gz file. But I don't think we want statistics (included at the bottom of the page), or any of the other information on the Genome page, because we'd end up having to update that regularly for many samples.

Looking at the SRA record for a metagenome from Hu et al., 2016, I'd probably want to include:

• the fact that it is metagenomic FASTQ;

• the description at the top: "Illumina MiSeq paired end sequencing; metagenome SB1 from not soured petroleum reservoir, Schrader bluffer formation, Alaska North Slope";

• whatever error trimming/correction commands I used before minhashing it;

• a link to the ENA FASTQ files for download;

and that's about it. Other records would presumably vary in similar ways, ranging from really minimal information ("this kind of sample, this kind of sequencing, have fun") to much more fleshed-out metadata.

Your thoughts on how to go about this?

--titus

December 08, 2016

Enthought

Webinar: Solving Enterprise Python Deployment Headaches with the New Enthought Deployment Server

See a recording of the webinar:

Built on 15 years of experience of Python packaging and deployment for Fortune 500 companies, the new Enthought Deployment Server provides the enterprise-grade tools that groups and organizations using Python need, including:

1. Secure, onsite access to a private copy of the proven 450+ package Enthought Python Distribution
2. Centralized management and control of packages and Python installations
3. Private repositories for sharing and deployment of proprietary Python packages
4. Support for the software development workflow with Continuous Integration and development, testing, and production repositories

In this webinar, Enthought's product team demonstrates the key features of the Enthought Deployment Server and how it can take the pain out of Python deployment and management at your organization.
Who Should Watch this Webinar:

If you answer "yes" to any of the questions below, then you (or someone at your organization) should watch this webinar:

1. Are you using Python in a high-security environment (firewalled or air gapped)?
2. Are you concerned about how to manage open source software licenses or compliance management?
3. Do you need multiple Python environment configurations, or do you need consistent, standardized environments across a group of users?
4. Are you producing or sharing internal Python packages and spending a lot of effort on distribution?
5. Do you have a "guru" (or are you the guru?) who spends a lot of time managing Python package builds and/or distribution?

In this webinar, we demonstrate how the Enthought Deployment Server can help your organization address these situations and more.

December 07, 2016

Enthought

Using the Canopy Data Import Tool to Speed Cleaning and Transformation of Data & New Release Features

Download Canopy to try the Data Import Tool.

In November 2016, we released Version 1.0.6 of the Data Import Tool (DIT), an addition to the Canopy data analysis environment. With the Data Import Tool, you can quickly import structured data files as Pandas DataFrames, clean and manipulate the data using a graphical interface, and create reusable Python scripts to speed future data wrangling. For example, the Data Import Tool lets you delete rows and columns containing Null values or replace the Null values in the DataFrame with a specific value. It also allows you to create new columns from existing ones. All operations are logged and reversible in the Data Import Tool, so you can experiment with various workflows with safeguards against errors or forgetting steps.

What's New in the Data Import Tool November 2016 Release

Pandas 0.19 support, re-usable templates for data munging, and more. Over the last couple of releases, we added a number of new features and enhanced a number of existing ones.
A few notable changes are:

1. The Data Import Tool now supports the recently released Pandas version 0.19.0. With this update, the Tool now supports Pandas versions 0.16 through 0.19.
2. The Data Import Tool now allows you to delete empty columns in the DataFrame, similar to the existing option to delete empty rows.
3. The Data Import Tool allows you to choose how to delete rows or columns containing Null values: "Any" or "All" methods are available.
4. The Data Import Tool automatically generates a corresponding Python script for data manipulations performed in the GUI and saves it in your home directory for re-use in future data wrangling. Every time you successfully import a DataFrame, the Data Import Tool automatically saves a generated Python script in your home directory. This way, you can easily review and reproduce your earlier work.
5. The Data Import Tool generates a Template with every successful import. A Template is a file that contains all of the commands or actions you performed on the DataFrame, and a unique Template file is generated for every unique data file. With this feature, when you load a data file, if a Template file exists corresponding to that data file, the Data Import Tool will automatically perform the operations you performed the last time. This way, you can save progress on a data file and resume your work.

Along with the feature additions discussed above, based on continued user feedback, we implemented a number of UI/UX improvements and bug fixes in this release. For a complete list of changes introduced in Version 1.0.6 of the Data Import Tool, please refer to the Release Notes page in the Tool's documentation.

Example Use Case: Using the Data Import Tool to Speed Data Cleaning and Transformation

Now let's take a look at how the Data Import Tool can be used to speed up the process of cleaning up and transforming data sets. As an example data set, let's take a look at the Employee Compensation data from the city of San Francisco.
NOTE: You can follow the example step-by-step by downloading Canopy and starting a free 7-day trial of the Data Import Tool.

Step 1: Load data into the Data Import Tool

First we'll download the data as a .csv file from the San Francisco Government data website, then open it from the File -> Import Data -> From File... menu item in the Canopy Editor (see screenshot at right). After loading the file, you should see the DataFrame below in the Data Import Tool.

The Data Import Tool automatically detects and converts column types: as you can see at the right, it detected and converted the columns "Job Code", "Job Family Code" and "Union Code" to an Integer column type. If the Tool inferred erroneously, you can simply remove a specific column conversion by deleting it from the Edit Command window, or remove all conversions by removing the command by clicking on the "X" in the Command History window.

Step 2: Use the Data Import Tool to quickly assess data by sorting in the GUI

Using the Employee Compensation data set, let's answer a few questions. For example, let's see which Job Families get the highest Salary, the highest Overtime, the highest Total Salary and the highest Compensation. Further, let's also determine what the highest and mean Total Compensation for a Job Family is.

Let's start with the question "Which Job Family contains the highest Salary?" We can get this information easily by clicking on the right end of the "Salaries" column to sort the column in ascending or descending order. Doing so, we can see that the highest-paid Job Family is "Administrative & Mgmt (Unrep)" and, specifically, the Job is Chief Investment Officer. In fact, 4 out of the 5 top Salaries are paid to Chief Investment Officers. Similarly, we can sort the "Overtime" column (see screenshot at right) to see which Job Family gets paid the most Overtime (it turns out to be the "Deputy Sheriff" job family).
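The same sorting step can of course be done directly in pandas once the DataFrame reaches the console. A minimal sketch on a tiny invented stand-in for the compensation data — only the column names follow the real data set, the rows are made up:

```python
import pandas as pd

# Tiny invented stand-in for the Employee Compensation DataFrame; only the
# column names follow the real data set.
df = pd.DataFrame({
    "Job Family": ["Administrative & Mgmt (Unrep)", "Deputy Sheriff", "Nursing"],
    "Job": ["Chief Investment Officer", "Deputy Sheriff", "Registered Nurse"],
    "Salaries": [339653.70, 108379.00, 144487.40],
    "Overtime": [0.00, 89062.50, 21422.30],
})

# Sorting descending on "Salaries" is the console equivalent of clicking the
# column header in the GUI.
by_salary = df.sort_values("Salaries", ascending=False)
print(by_salary.iloc[0]["Job Family"])   # top-paid Job Family in this toy data

by_overtime = df.sort_values("Overtime", ascending=False)
print(by_overtime.iloc[0]["Job Family"])
```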
Sort the Total Salary and Total Compensation columns to find out which Job and Job Family had the highest salary and highest overall compensation.

[Note: While sorting the data set, you may have noticed that there are negative values in the Salaries column. Yup. And hey! Don't ask us. We don't know why there are negative Salaries values either. If you know why or if you can figure out why, we would love to know! Comment below and tell us!]

Step 3: Simplify and Clean Data

Delete columns by right-clicking on a column name and selecting "Delete" from the menu.

Let's now look at the second question we mentioned earlier: "What is the median income for different Job Families?" But before we get to that, let's first remove a few columns with data not relevant to the questions we're trying to answer (or you may choose to ask different questions and keep the columns instead). Here we delete columns by clicking on the "Delete" menu item after right-clicking on a column name.

When you are satisfied with how the DataFrame looks, click on the "Use DataFrame" button to push the DataFrame to Canopy's IPython Console, where we can further analyze the data set. In Canopy's IPython console, you can see what the final columns in the DataFrame are, which can be accessed using DataFrame.columns:

    [u'Year Type', u'Year', u'Organization Group', u'Department',
     u'Job Family', u'Job', u'Salaries', u'Overtime', u'Other Salaries',
     u'Total Salary', u'Retirement', u'Health/Dental', u'Other Benefits',
     u'Total Benefits', u'Total Compensation']

Let's now use pandas' DataFrame.groupby method to calculate the median salary of different Job Families over the years. Passing both Job Family and Year segments the original DataFrame based on Job Family first and Year next. This way, we will be able to see the difference in median Total Compensation between Job Families and how it changed within a Job Family over the years.
    grouped_df = Employee_Compensation.groupby(['Job Family', 'Year'])

    for name, df in grouped_df:
        print("{} - {}: median={:9.2f}, n={}".format(name[-1], name[0],
              df['Total Compensation'].median(),
              df['Total Compensation'].count()))

    2013 - Administrative & Mgmt (Unrep): median=65154.66, n=9
    2014 - Administrative & Mgmt (Unrep): median=189534.965, n=12
    2015 - Administrative & Mgmt (Unrep): median=352931.01, n=13
    2016 - Administrative & Mgmt (Unrep): median=351961.28, n=9
    2013 - Administrative Secretarial: median=122900.205, n=22
    2014 - Administrative Secretarial: median=130164.525, n=20
    2015 - Administrative Secretarial: median=127206.02, n=19
    2016 - Administrative Secretarial: median=137861.05, n=9
    2013 - Administrative-DPW/PUC: median=164535.52, n=89
    2014 - Administrative-DPW/PUC: median=172906.585, n=82
    2015 - Administrative-DPW/PUC: median=180582.9, n=85
    2016 - Administrative-DPW/PUC: median=180095.54, n=44
    ...

We hope that this gives you a small idea of what can be done using the Data Import Tool and the Python Pandas library. If you analyzed this data set in a different way, comment below and tell us about it. BTW, if you are interested in honing your data analysis skills in Python, check out our Virtual Pandas Crash Course or join the Pandas Mastery Workshop for a more comprehensive introduction to Pandas and data analysis using it. If you have any feedback regarding the Data Import Tool, we'd love to hear from you at canopy.support@enthought.com.
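As a footnote to the groupby loop above: an equivalent and slightly more idiomatic formulation uses groupby with agg, which returns the medians and counts as a single indexed DataFrame instead of printing inside a loop. A sketch on a toy DataFrame (values invented; only the column names follow the post):

```python
import pandas as pd

# Toy stand-in for the Employee Compensation data; the values are invented.
df = pd.DataFrame({
    "Job Family": ["A", "A", "A", "B", "B"],
    "Year": [2013, 2013, 2014, 2013, 2014],
    "Total Compensation": [100.0, 200.0, 300.0, 50.0, 75.0],
})

# One call computes the median and count for every (Job Family, Year) group.
summary = (df.groupby(["Job Family", "Year"])["Total Compensation"]
             .agg(["median", "count"]))
print(summary)
```

The result is indexed by (Job Family, Year), so individual groups can be looked up with `summary.loc[("A", 2013)]` or the whole table sorted and plotted directly.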
Additional resources:

Watch a 2-minute demo video to see how the Canopy Data Import Tool works.

See the Webinar "Fast Forward Through Data Analysis Dirty Work" for examples of how the Canopy Data Import Tool accelerates data munging.

December 06, 2016

Continuum Analytics news

Introducing: fastparquet

Tuesday, December 6, 2016
Martin Durant
Continuum Analytics

A compliant, flexible and speedy interface to Parquet format files for Python, fastparquet provides seamless translation between in-memory pandas DataFrames and on-disc storage. In this post, we will introduce the two functions that will most commonly be used within fastparquet, followed by a discussion of the current Big Data landscape, Python's place within it, and details of how fastparquet fills one of the gaps on the way to building out a full end-to-end Big Data pipeline in Python.

fastparquet Teaser

New users of fastparquet will mainly use the functions write and ParquetFile.to_pandas. Both functions offer good performance with default values, and both have a number of options to improve performance further.

    import fastparquet

    # write data
    fastparquet.write('out.parq', df, compression='SNAPPY')

    # load data
    pfile = fastparquet.ParquetFile('out.parq')
    df2 = pfile.to_pandas()  # all columns
    df3 = pfile.to_pandas(columns=['floats', 'times'])  # pick some columns

Introduction: Python and Big Data

Python was named as a favourite tool for data science by 45% of data scientists in 2016.
Many reasons can be presented for this, and near the top will be:

• Python is very commonly taught at college and university level
• Python and associated numerical libraries are free and open source
• The code tends to be concise, quick to write, and expressive
• An extremely rich ecosystem of libraries exists, not only for numerical processing but also for the other important links in the pipeline, from data ingest to visualization and distribution of results

Big Data, however, has typically been based on traditional databases and, in latter years, the Hadoop ecosystem. Hadoop provides a distributed file-system, cluster resource management (YARN, Mesos) and a set of frameworks for processing data (map-reduce, pig, kafka, and many more). In the past few years, Spark has rapidly increased in usage, becoming a major force, even though 62% of its users use Python to execute Spark jobs (via PySpark).

The Hadoop ecosystem and its tools, including Spark, are heavily based around the Java Virtual Machine (JVM), which creates a gap between the familiar, rich Python data ecosystem and clustered Big Data with Hadoop. One such missing piece is a data format that can efficiently store large amounts of tabular data, in a columnar layout, and split it into blocks on a distributed file-system.

Parquet has become the de-facto standard file format for tabular data in Spark, Impala and other clustered frameworks. Parquet provides several advantages relevant to Big Data processing:

• Columnar storage: only read the data of interest
• Efficient binary packing
• Choice of compression algorithms and encodings
• Splits data into files, allowing for parallel processing
• Range of logical types
• Statistics stored in metadata, allowing unneeded chunks to be skipped
• Data partitioning using the directory structure

fastparquet bridges the gap to provide native Python read/write access without the need to use Java. Until now, Spark's Python interface provided the only way to write Spark files from Python.
Not only does fastparquet provide native access to Parquet files, it in fact makes the transfer of data to Spark much faster.

    # make and save a large-ish DataFrame
    import pandas as pd
    import numpy as np

    N = 10000000
    df = pd.DataFrame({'ints': np.random.randint(0, 1000, size=N),
                       'floats': np.random.randn(N),
                       'times': pd.date_range(start='1980', freq='s', periods=N)})

    import pyspark
    sc = pyspark.SparkContext()
    sql = pyspark.SQLContext(sc)

The default Spark single-machine configuration cannot handle the above DataFrame (out-of-memory error), so we'll perform timing for 1/10 of the data:

    # sending data to spark via pySpark serialization, 1/10 of the data
    %time o = sql.createDataFrame(df[::10]).count()

    CPU times: user 3.45 s, sys: 96.6 ms, total: 3.55 s
    Wall time: 4.14 s

Much of this time is spent deserializing the data in the Java-Python bridge. Also, note that the times column returned is now just integers, rather than the correct datetime type.

    %%time
    # sending data to spark via a file made with fastparquet, all the data
    fastparquet.write('out.parq', df, compression='SNAPPY')
    df4 = sql.read.parquet('out.parq').count()

    CPU times: user 2.75 s, sys: 285 ms, total: 3.04 s
    Wall time: 3.27 s

The fastparquet Library

fastparquet is an open source library providing a Python interface to the Parquet file format. It uses Numba and NumPy to provide speed, and writes data to and from pandas DataFrames, the most typical starting point for Python data science operations.

fastparquet can be installed using conda (currently only available for Python 3):

    conda install -c conda-forge fastparquet

• The code is hosted on GitHub
• The primary documentation is on RTD

Bleeding-edge installation directly from the GitHub repo is also supported, as long as Numba, pandas, pytest and ThriftPy are installed. Reading Parquet files into pandas is simple and, again, much faster than via PySpark serialization.
    import fastparquet
    pfile = fastparquet.ParquetFile('out.parq')
    %time df2 = pfile.to_pandas()

    CPU times: user 812 ms, sys: 291 ms, total: 1.1 s
    Wall time: 1.1 s

The Parquet format is more compact and faster to load than the ubiquitous CSV format.

    df.to_csv('out.csv')
    !du -sh out.csv out.parq

    490M out.csv
    162M out.parq

In this case, the data is 229MB in memory, which translates to 162MB on-disc as Parquet or 490MB as CSV. Loading from CSV takes substantially longer than from Parquet.

    %time df2 = pd.read_csv('out.csv', parse_dates=True)

    CPU times: user 9.85 s, sys: 1 s, total: 10.9 s
    Wall time: 10.9 s

The biggest advantage, however, is the ability to pick only some columns of interest. In CSV, this still means scanning through the whole file (even if not parsing all the values), but the columnar nature of Parquet means only reading the data you need.

    %time df3 = pd.read_csv('out.csv', usecols=['floats'])
    %time df3 = pfile.to_pandas(columns=['floats'])

    CPU times: user 4.04 s, sys: 176 ms, total: 4.22 s
    Wall time: 4.22 s
    CPU times: user 40 ms, sys: 96.9 ms, total: 137 ms
    Wall time: 137 ms

Example

We have taken the airlines dataset and converted it into Parquet format using fastparquet. The original data was in CSV format, one file per year, 1987-2004. The total data size is 11GB as CSV, uncompressed, which becomes about double that in memory as a pandas DataFrame for typical dtypes. This is approaching, if not Big Data, at least Sizable Data, because it cannot fit into my machine's memory. The Parquet data is stored as a multi-file dataset. The total size is 2.5GB, with Snappy compression throughout.
    ls airlines-parq/

    _common_metadata  part.12.parquet  part.18.parquet  part.4.parquet
    _metadata         part.13.parquet  part.19.parquet  part.5.parquet
    part.0.parquet    part.14.parquet  part.2.parquet   part.6.parquet
    part.1.parquet    part.15.parquet  part.20.parquet  part.7.parquet
    part.10.parquet   part.16.parquet  part.21.parquet  part.8.parquet
    part.11.parquet   part.17.parquet  part.3.parquet   part.9.parquet

To load the metadata:

    import fastparquet
    pf = fastparquet.ParquetFile('airlines-parq')

The ParquetFile instance provides various information about the data set in attributes:

    pf.info
    pf.schema
    pf.dtypes
    pf.count

Furthermore, we have information available about the "row-groups" (logical chunks) and the 29 column fragments contained within each. In this case, we have one row-group for each of the original CSV files—that is, one per year.

fastparquet will not generally be as fast as a direct memory dump, such as numpy.save or Feather, nor will it be as fast or compact as custom-tuned formats like bcolz. However, it provides good trade-offs and options which can be tuned to the nature of the data. For example, the column/row-group chunking of the data allows pre-selection of only some portions of the total, which enables not having to scan through the other parts of the disc at all. The load speed will depend on the data type of the column, the efficiency of compression, and whether there are any NULLs.

There is, in general, a trade-off between compression and processing speed; uncompressed will tend to be faster, but larger on disc, and gzip compression will be the most compact, but slowest. Snappy compression, in this example, provides moderate space efficiency, without too much processing cost.
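The compression/speed trade-off is easy to demonstrate outside of Parquet as well. Snappy itself is not in the Python standard library, but zlib (the algorithm behind gzip) is, and its level parameter spans the same spectrum from fast-but-larger to slow-but-compact. A toy illustration on repetitive, tabular-looking bytes (the sample row is invented):

```python
import zlib

# Repetitive CSV-like bytes, the kind of data that compresses well
row = b"1987,January,SFO,cancelled=False\n"
data = row * 100000

fast = zlib.compress(data, level=1)   # fastest setting, less compact
small = zlib.compress(data, level=9)  # slowest setting, most compact

print(len(data), len(fast), len(small))
assert zlib.decompress(small) == data  # lossless round-trip
```

Timing the two calls (e.g. with `%time` in IPython) shows the same pattern the post describes for gzip versus Snappy versus no compression inside Parquet files.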
fastparquet has no problem loading a very large number of rows or columns (memory allowing):

    %%time
    # 124M bool values
    d = pf.to_pandas(columns=['Cancelled'])

    CPU times: user 436 ms, sys: 167 ms, total: 603 ms
    Wall time: 620 ms

    %%time
    d = pf.to_pandas(columns=['Distance'])

    CPU times: user 964 ms, sys: 466 ms, total: 1.43 s
    Wall time: 1.47 s

    %%time
    # just the first portion of the data, 1.3M rows, 29 columns
    d = pf.to_pandas(filters=(('Year', '==', 1987), ))

    CPU times: user 1.37 s, sys: 212 ms, total: 1.58 s
    Wall time: 1.59 s

The following factors are known to reduce performance:

• The existence of NULLs in the data. It is faster to use special values, such as NaN for data types that allow it, or other known sentinel values, such as an empty byte-string.
• Variable-length string encoding is slow on both write and read; fixed-length will be faster, although this is not compatible with all Parquet frameworks (particularly Spark). Converting to categories will be a good option if the cardinality is low.
• Some data types require conversion in order to be stored in Parquet's few primitive types. Conversion may take some time.

The Python Big Data Ecosystem

fastparquet provides one of the necessary links for Python to be a first-class citizen within Big Data processing. Although useful alone, it is intended to work seamlessly with the following libraries:

• Dask, a pure-Python, flexible parallel execution engine, and its distributed scheduler. Each row-group is independent of the others, and Dask can take advantage of this to process parts of a Parquet data-set in parallel. The Dask DataFrame closely mirrors pandas, and methods on it (a subset of all those in pandas) actually call pandas methods on the underlying shards of the logical DataFrame. The Dask Parquet interface is experimental, as it lags slightly behind development in fastparquet.
• hdfs3, s3fs and adlfs provide native Pythonic interfaces to massive file systems.
If the whole purpose of Parquet is to store Big Data, we need somewhere to keep it. fastparquet accepts a function to open a file-like object, given a path, and so can use any of these back-ends for reading and writing, making it easy to use any new file-system back-end in the future. Choosing the back-end is automatic when using Dask and a URL like s3://mybucket/mydata.parq.

With the blossoming of interactive visualization technologies for Python, the prospect of end-to-end Big Data processing projects is now fully realizable.

fastparquet Status and Plans

As of the publication of this article, the fastparquet library can be considered beta—useful to the general public and able to cope with many situations, but with some caveats (see below). Please try your own use case and report issues and comments on the GitHub tracker. The code will continue to develop (contributions welcome), and we will endeavour to keep the documentation in sync and provide regular updates. A number of nice-to-haves are planned, and work to improve the performance should be completed around the new year, 2017.

Further Helpful Information

We don't have the space to talk about it here, but the documentation at RTD gives further details on:

• How to iterate through Parquet-stored data, rather than load the whole data set into memory at once
• Using Parquet with Dask-DataFrames for parallelism and on a distributed cluster
• Getting the most out of performance
• Reading and writing partitioned data
• Data types understood by Parquet and fastparquet

fastparquet Caveats

Aside from the performance pointers above, some specific things do not work in fastparquet, and for some of these, fixes are not planned—unless there is substantial community interest.

• Some encodings are not supported, such as delta encoding, since we have no test data to develop against.
• Nested schemas are not supported at all, and are not currently planned, since they don't fit in well with pandas' tabular layout.
If a column contains Python objects, they can be JSON-encoded and written to Parquet as strings.

• Some output Parquet files will not be compatible with some other Parquet frameworks. For instance, Spark cannot read fixed-length byte arrays.

This work is fully open source (Apache-2.0), and contributions are welcome. Development of the library has been supported by Continuum Analytics.

December 05, 2016

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program and the Data Driven Discovery Initiative from the Moore Foundation.

Dask has been active lately due to a combination of increased adoption and funded feature development by private companies. This increased activity is great; however, an unintended side effect is that I have spent less time writing about development and engaging with the broader community. To address this I hope to write one blogpost a week about general development. These will not be particularly polished, nor will they announce ready-to-use features for users, but they should increase transparency and hopefully better engage the developer community.

Themes of last week:

1. Embedded Bokeh servers for the workers
2. Smarter workers
3. An overhauled scheduler that is slightly simpler overall (thanks to the smarter workers) but with more clever work stealing
4. Fastparquet

Embedded Bokeh Servers in Dask Workers

The distributed scheduler's web diagnostic page is one of Dask's more flashy features. It shows the passage of every computation on the cluster in real time. These diagnostics are invaluable for understanding performance, both for users and for core developers. I intend to focus on worker performance soon, so I decided to attach a Bokeh server to every worker to serve web diagnostics about that worker. To make this easier, I also learned how to embed Bokeh servers inside of other Tornado applications.
This has considerably reduced the effort to create new visuals and expose real-time information; I can now create a full live visualization in around 30 minutes. It is now faster for me to build a new diagnostic than to grep through logs. It's pretty useful.

Here are some screenshots. Nothing too flashy, but this information is highly valuable to me as I measure bandwidths, delays of various parts of the code, how workers send data between each other, etc.

To be clear, these diagnostic pages aren't polished in any way. There's lots missing, it's just what I could get done in a day. Still, everyone running a Tornado application should have an embedded Bokeh server running. They're great for rapidly pushing out visually rich diagnostics.

Smarter Workers and a Simpler Scheduler

Previously the scheduler knew everything and the workers were fairly simple-minded. Now we've moved some of the knowledge and responsibility over to the workers.

Previously the scheduler would give just enough work to the workers to keep them occupied. This allowed the scheduler to make better decisions about the state of the entire cluster: by delaying committing a task to a worker until the last moment, we made sure that we were making the right decision. However, this also meant that the worker sometimes had idle resources, particularly network bandwidth, when it could have been speculatively preparing for future work.

Now we commit all ready-to-run tasks to a worker immediately, and that worker has the ability to pipeline those tasks as it sees fit. This is better locally but slightly worse globally. To counterbalance this, we're now being much more aggressive about work stealing and, because the workers have more information, they can manage some of the administrative costs of work stealing themselves. Because this isn't bound to run on just the scheduler, we can use more expensive algorithms than when we did everything on the scheduler.

There were a few motivations for this change:

1.
Dataframe performance was bound by keeping the worker hardware fully occupied, which we weren't doing. I expect that these changes will eventually yield something like a 30% speedup.
2. Users on traditional job-scheduler machines (SGE, SLURM, TORQUE) and users who like GPUs both wanted the ability to tag tasks with specific resource constraints like "this consumes one GPU" or "this task requires 5GB of RAM while running", and to ensure that workers would respect those constraints when running tasks. The old workers weren't complex enough to reason about these constraints. With the new workers, adding this feature was trivial.
3. By moving logic from the scheduler to the worker we've actually made them both easier to reason about. This should lower barriers for contributors to get into the core project.

Dataframe algorithms

Approximate nunique and multiple-output-partition groupbys landed in master last week. These arose because some power-users had very large dataframes that were running into scalability limits. Thanks to Mike Graham for the approximate nunique algorithm. This has also pushed hashing changes upstream to Pandas.

Fast Parquet

Martin Durant has been working on a Parquet reader/writer for Python using Numba. It's pretty slick. He's been using it on internal Continuum projects for a little while and has seen both good performance and a very Pythonic experience for what was previously a format that was pretty inaccessible. He's planning to write about this in the near future so I won't steal his thunder. Here is a link to the documentation: fastparquet.readthedocs.io

November 30, 2016

Continuum Analytics news

Data Science in the Enterprise: Keys to Success

Wednesday, November 30, 2016
Travis Oliphant
President, Chief Data Scientist & Co-Founder
Continuum Analytics

When examining the success of one of the most influential and iconic rock bands of all time, there's no doubt that talent played a huge role.
However, it would be unrealistic to attribute the phenomenon that was The Beatles to musical talents alone. Much of their success can be credited to the behind-the-scenes work of trusted advisors, managers and producers. There were many layers beneath the surface that contributed to their incredible fame—including implementing the proper team and tools to propel them from obscurity to commercial global success.

Open Source: Where to Start

Similar to the music industry, success in Open Data Science relies heavily on many layers, including motivated data scientists, proper tools and the right vision for how to leverage data and perspective. Open Data Science is not a single technology, but a revolution within the data science community. It is an inclusive movement that connects open source tools for data science—preparation, analytics and visualization—so they can easily work together as a connected ecosystem. The challenge lies in figuring out how to successfully navigate the ecosystem and identifying the right Open Data Science enterprise vendors to partner with for the journey.

Most organizations have come to understand the value of Open Data Science, but they often struggle with how to adopt and implement it. Some select a "DIY" method when addressing open source, choosing one of the languages or tools available at low or no cost. Others augment an open source base and build proprietary technology into existing infrastructures to address data science needs.

Most organizations will engage enterprise-grade products and services when selecting other items, such as unified communication and collaboration tools, instead of opting for short-run cost savings. For example, using consumer-grade instant messaging and mobile phones might save money this quarter, but over time this choice will end up costing an organization much more.
This is due to the costs in labor and other services to make up for the lack of enterprise features, performance for enterprise use-cases, and the support and maintenance that is essential to successful production usage. The same standards apply to Open Data Science and the open source that surrounds this movement.

While it is tempting to go it alone with open source and avoid paying a vendor, there are fundamental problems with that strategy that will result in delayed deliverables, staffing challenges, maintenance headaches for software and frustration when the innovative open source communities move faster than an organization can manage, or in a direction that is unexpected. All of this hurts the bottom line and can be easily avoided by finding an open source vendor that can navigate the complexity and ensure the best use of what is available in Open Data Science. In the next section, we will discuss three specific reasons it is important to choose vendors that can leverage open source effectively in the enterprise.

Finding Success: The Importance of Choosing the Right Vendor/Partner

First, look for a vendor who is contributing significantly to the open source ecosystem. An open source vendor will not only provide enterprise solutions and services on top of existing open source, but will also produce significant open source innovations themselves—building communities like PyData, as well as contributing to open source organizations like The Apache Software Foundation, NumFOCUS or Software Freedom Conservancy. In this way, the software purchase translates directly into sustainability for the entire open source ecosystem. This will also ensure that the open source vendor is plugged into where the impactful open source communities are heading.

Second, raw open source provides a fantastic foundation of innovation, but invariably does not contain all the common features necessary to adapt to an enterprise environment.
Integration with disparate data sources, enterprise databases, single sign-on systems, scale-out management tools, tools for governance and control, as well as time-saving user interfaces, are all examples of capabilities that typically do not exist in open source, or exist in a very early form that lags behind proprietary offerings. Using internal resources to build these common, enterprise-grade additions costs more money in the long run than purchasing these features from an open source vendor. The figure on the left below shows the kinds of ad-hoc layers that a company must typically create to adapt its applications, processes and workflows to what is available in open source. These ad-hoc layers are not unique to any one business, are hard to maintain and end up costing far more than a software subscription from an open source vendor whose enterprise offerings would cover these capabilities.

[Figure: side-by-side diagrams contrasting ad-hoc integration layers (left) with a vendor-provided enterprise layer (right).]

The figure on the right above shows the addition of an enterprise layer that should be provided by an open source vendor. This layer can be proprietary, which will enable the vendor to build a sustainable software business that attracts investment, while also solving the fundamental adaptation problem. As long as the vendor is deeply connected to open source ecosystems and is constantly aware of which parts of the stack are better maintained as open source, businesses receive the best of supported enterprise software without the painful lock-in and innovation gaps of traditional proprietary-only software. Maintaining ad-hoc interfaces to open source becomes very expensive, very quickly. Each interface is typically understood by only a few people in an organization, and if they leave or move to different roles, their ability to make changes evaporates.
In addition, rather than amortizing the cost of these interfaces over thousands of companies as a software vendor can, the business pays the entire cost on its own. This does not yet include the opportunity cost of tying up internal resources building and maintaining these common enterprise features instead of having them work on the software that is unique to the business. The best return from scarce software development talent is on software critical to a business that gives it a unique edge. We have also not discussed the time-to-market gaps that occur when organizations try to go it alone rather than selecting an open source vendor as a strategic partner. Engaging an open source vendor who has in-depth knowledge of the technology, is committed to growing the open source ecosystem and has the ability to make the Open Data Science ecosystem work for enterprises saves organizations significant time and money.

Finally, working with an open source vendor provides a much-needed avenue for the integration services, training and long-term support that are necessary when adapting an open source ecosystem to the enterprise. Open source communities develop for many reasons, but they are typically united in a passion for rapid innovation and continual progress. Adapting the rapid pace of this innovation to the more methodical gear of enterprise value creation requires a trusted open source vendor. Long-term support of older software releases, bug fixes that are less interesting to the community but essential to enterprises, and industry-specific training for data science teams are all needed to fully leverage Open Data Science in the enterprise. The right enterprise vendor will help an enterprise obtain all of this seamlessly.

The New World Order: Adopting Open Data Science in the Enterprise

The journey to executing successful data science in the enterprise lies in combining the proper resources and tools.
In general, in-house IT does not have the expertise needed to exploit the immense possibilities inherent in Open Data Science. Open Data Science platforms, like Anaconda, are a key mechanism for adopting Open Data Science across an organization. These platforms offer differing levels of empowerment for everyone from the citizen data scientist to the global enterprise data science team. Open Data Science in the enterprise has different needs than it does for an individual or a small business. While the free foundational core of Anaconda may be enough for the individual data explorer, or for the small business looking to use marketing data to target market segments, a large enterprise will typically need much more support and many more enterprise features to successfully implement open source, and therefore Open Data Science, across the organization. Because of this, it is critical that larger organizations identify an enterprise open source vendor to provide support and guidance as they implement Open Data Science. This vendor should also be able to provide that enterprise layer between the applications, processes and workflows that the data science team produces and the diverse open source ecosystem. The complexity inherent in maximizing insights from data demands proficiency from both the team and its vendors in order to harness the power of the data to transform the business into one that is first data-aware and then data-driven. Anaconda allows enterprises to innovate faster. It exposes previously unknown insights and improves the relationship between all members of the data science team. As a platform that embraces and deeply supports open source, it helps businesses take full advantage of both the innovation at the core of the Open Data Science movement and the enterprise adaptation that is essential to leveraging the full power of open source effectively in the business.
It’s time to remove the chaos from open source and use Open Data Science platforms to simplify things, so that enterprises can fully realize their own superpowers to change the world.

November 26, 2016

Titus Brown

Quickly searching all the microbial genomes, mark 2 - now with archaea, phage, fungi, and protists!

This is an update to last week's blog post, "Efficiently searching MinHash Sketch collections". Last week, Thanksgiving travel and post-turkey somnolence gave me some time to work more with our combined MinHash/SBT implementation. One of the main things the last post contained was a collection of MinHash signatures of all of the bacterial genomes, together with a Sequence Bloom Tree index of them that enabled fast searching. Working with the index from last week, a few problems emerged:

• In my initial index calculation, I'd ignored non-bacterial microbes. Conveniently, my colleague Dr. Jiarong (Jaron) Guo had already downloaded the viral, archaeal, and protist genomes from NCBI for me.

• The MinHashes I'd calculated contained only the filenames of the genome assemblies, and didn't contain the names or accession numbers of the microbes. This made them really annoying to use. (See the new --name-from-first argument to sourmash compute.)

• We guessed that we wanted more sensitive MinHash sketches for all the things, which would involve re-calculating the sketches with more hashes. (The default is 500, which gives you one hash per 10,000 k-mers for a 5 Mbp genome.)

• We also decided that we wanted more k-mer sizes; the sourmash default is 31, which is pretty specific and could limit the sensitivity of genome search. k=21 would enable more sensitivity, k=51 would enable more stringency.

• I also came up with some simple ideas for using MinHash for taxonomy breakdown of metagenome samples, but I needed the number of k-mers in each hashed genome to do a good job of this. (More on this later.) (See the new --with-cardinality argument to sourmash compute.)
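The sketch-size arithmetic above can be made concrete. The following is a back-of-the-envelope sketch, not sourmash's actual implementation; kmers_per_hash is a made-up helper name:

```python
def kmers_per_hash(genome_bp, ksize=31, num_hashes=500):
    """Roughly how many distinct k-mers each retained hash 'represents'
    in a bottom-N MinHash sketch of a genome of the given size."""
    n_kmers = genome_bp - ksize + 1  # number of k-mers in one linear sequence
    return n_kmers / num_hashes

# A 5 Mbp genome with the sourmash defaults (500 hashes, k=31):
print(round(kmers_per_hash(5_000_000)))  # ~10,000 k-mers per retained hash

# Recomputing with more hashes makes the sketch more sensitive:
print(round(kmers_per_hash(5_000_000, num_hashes=3000)))
```

This is why re-calculating with more hashes was worth 6x the compute: each retained hash then stands for proportionally fewer k-mers, so smaller overlaps between genomes become detectable.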
Unfortunately this meant I had to recalculate MinHashes for 52,000 genomes, and calculate them for 8,000 new genomes. And it wasn't going to take only 36 hours this time, because I was calculating approximately 6 times as much stuff... Fortunately, 6 x 36 hrs still isn't very long, especially when you're dealing with pleasantly parallel low-memory computations. So I set it up to run on Friday, ran six processes at the same time, and it finished in about 36 hours.

Indexing the MinHash signatures also took much longer than the first batch, probably because the signature files were much larger and hence took longer to load. For k=21, it took about 5 1/2 hours, and 6.5 GB of RAM, to index the 60,000 signatures. The end index -- which includes the signatures themselves -- is around 3.2 GB for each k-mer size. (Clearly if we're going to do this for the entire SRA we'll have to optimize things a bit.)

On the search side, though, searching takes roughly the same amount of time as before, because the indexed part of the signatures isn't much larger, and the Bloom filter internal nodes are the same size as before. But we can now search at k=21, and get better-named results than before, too. For example, go grab the Shewanella MR-1 genome:

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/165/GCF_000146165.2_ASM14616v2/GCF_000146165.2_ASM14616v2_genomic.fna.gz > shewanella.fna.gz

Next, convert it into a signature:

sourmash compute -k 21,31 -f --name-from-first shewanella.fna.gz

and search!

sourmash sbt_search -k 21 microbes shewanella.fna.gz.sig

This yields:

# running sourmash subcommand: sbt_search
1.00 NC_004347.2 Shewanella oneidensis MR-1 chromosome, complete genome
0.16 NZ_JGVI01000001.1 Shewanella xiamenensis strain BC01 contig1, whole genome shotgun sequence
0.16 NZ_LGYY01000235.1 Shewanella sp. Sh95 contig_1, whole genome shotgun sequence
0.15 NZ_AKZL01000001.1 Shewanella sp.
POL2 contig00001, whole genome shotgun sequence
0.15 NZ_JTLE01000001.1 Shewanella sp. ZOR0012 L976_1, whole genome shotgun sequence
0.09 NZ_AXZL01000001.1 Shewanella decolorationis S12 Contig1, whole genome shotgun sequence
0.09 NC_008577.1 Shewanella sp. ANA-3 chromosome 1, complete sequence
0.08 NC_008322.1 Shewanella sp. MR-7, complete genome

The updated MinHash signatures & indices are available! Our MinHash signature collection now contains:

1. 53865 bacteria genomes
2. 5463 viral genomes
3. 475 archaeal genomes
4. 177 fungal genomes
5. 72 protist genomes

for a total of 60,052 genomes. You can download the various file collections here:

Hope these are useful! If there are features you want, please go ahead and file an issue; or, post a comment below.

--titus

Index building cost for k=21:

Command being timed: "/home/ubuntu/sourmash/sourmash sbt_index microbes -k 21 --traverse-directory microbe-sigs-2016-11-27/"
User time (seconds): 18815.48
System time (seconds): 80.81
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:15:09
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 6484264
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 7
Minor (reclaiming a frame) page faults: 94887308
Voluntary context switches: 5650
Involuntary context switches: 27059
Swaps: 0
File system inputs: 150624
File system outputs: 10366408
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

November 21, 2016

Paul Ivanov

November 9th, 2016

Two weeks ago, I went down to San Luis Obispo, California for a five-day Jupyter team meeting with about twenty-five others. This was the first such meeting since my return after being away for two years, and I enjoyed meeting some of the "newer" faces, as well as catching up with old friends.
It was both a productive and an emotionally challenging week, as the project proceeds along at breakneck pace on some fronts yet continues to face growing pains which come from having to scale in the human dimension. On Wednesday, November 9th, 2016, we spent a good chunk of the day at a nearby beach: chatting, decompressing, and luckily I brought my journal with me and was able to capture the poem you will find below. I intended to read it at a local open mic the same evening, but by the time I got there with a handful of fellow Jovyans for support, all of the slots were taken. On Friday, the last day of our meeting, I got the opportunity to read it to most of the larger group. Here's a recording of that reading, courtesy of Matthias Bussonnier (thanks, Matthias!). November 9th, 2016 The lovely thing about the ocean is that it is tireless It never stops incessant pendulum of salty foamy slush Periodic and chaotic raw, serene Marine grandmother clock crashing against both pier and rock Statuesque encampment of abandonment recoiling with force and blasting forth again No end in sight a train forever riding forth and back along a line refined yet undefined the spirit with which it keeps time in timeless unity of the moon's alignment I. walk. forth. Forth forward by the force of obsolete contrition the vision of a life forgotten Excuses not made real with sand, wet and compressed beneath my heel and toes, yet reeling from the blinding glimmer of our Sol reflected by the glaze of distant hazy surf upon whose shoulders foam amoebas roam It's gone. Tone deaf and muted by anticipation each coming wave breaks up the pregnant pause And here I am, barefoot in slacks and tie experiencing sensations of loss, rebirth and seldom kelp bulbs popping in my soul.  
November 18, 2016

Titus Brown

Efficiently searching MinHash Sketch collections

There is an update to this blog post: please see "Quickly searching all the microbial genomes, mark 2 - now with archaea, phage, fungi, and protists!"

Note: This blog post is based largely on work done by Luiz Irber. Camille Scott, Luiz Irber, Lisa Cohen, and Russell Neches all collaborated on the SBT software implementation!

Note 2: Adam Phillippy points out in the comments below that they suggested using SBTs in the mash paper, which I reviewed. Well, they were right :)

---

We've been pretty enthusiastic about MinHash Sketches over here in Davis (read here and here for background, or go look at mash directly), and I've been working a lot on applying them to metagenomes. Meanwhile, Luiz Irber has been thinking about how to build MinHash signatures for all the data. A problem that Luiz and I both needed to solve is the question of how you efficiently search hundreds, thousands, or even millions of MinHash Sketches. I thought about this on and off for a few months but didn't come up with an obvious solution. Luckily, Luiz is way smarter than me and quickly figured out that Sequence Bloom Trees were the right answer. Conveniently, as part of my review of Solomon and Kingsford (2015) I had put together a BSD-compatible SBT implementation in Python. Even more conveniently, my students and colleagues at UC Davis fixed my somewhat broken implementation, so we had something ready to use. It apparently took Luiz around a nanosecond to write up a Sequence Bloom Tree implementation that indexed, saved, loaded, and searched MinHash sketches. (I don't want to minimize his work - that was a nanosecond on top of an awful lot of training and experience.
:)

Sequence Bloom Trees can be used to search many MinHash sketches

Briefly, an SBT is a binary tree where the leaves are collections of k-mers (here, MinHash sketches) and the internal nodes are Bloom filters containing all of the k-mers in the leaves underneath them. Here's a nice image from Luiz's notebook: here, the leaf nodes are MinHash signatures from our sea urchin RNAseq collection, and the internal nodes are khmer Nodegraph objects containing all the k-mers in the MinHashes beneath them. These images can be very pretty for larger collections!

The basic idea is that you build the tree once, and then to search it you prune your search by skipping over internal nodes that DON'T contain k-mers of interest. As usual for this kind of search, if you search for something that is only in a few leaves, it's super efficient; if you search for something in a lot of leaves, you have to walk over lots of the tree.

This idea was so obviously good that I jumped on it and integrated Luiz's SBT functionality into sourmash, our Python library for calculating and searching MinHash sketches. The pull request is still open -- more on that below -- but the PR currently adds two new functions, sbt_index and sbt_search, to index and search collections of sketches.

Using sourmash to build and search MinHash collections

This is already usable! Starting from a blank Ubuntu 15.10 install, run:

sudo apt-get update && sudo apt-get -y install python3.5-dev \
    python3-virtualenv python3-matplotlib python3-numpy g++ make

then create and activate a new virtualenv:

cd
python3.5 -m virtualenv env -p python3.5 --system-site-packages
. env/bin/activate

You'll need to install a few things, including a recent version of khmer:

pip install screed pytest PyYAML
pip install git+https://github.com/dib-lab/khmer.git

Next, grab the sbt_search branch of sourmash:

cd
git clone https://github.com/dib-lab/sourmash.git -b sbt_search

and then build & install sourmash:

cd sourmash && make install

Once it's installed, you can index any collection of signatures like so:

cd ~/sourmash
sourmash sbt_index urchin demo/urchin/{var,purp}*.sig

It takes me about 4 seconds to load 70-odd sketches into an SBT index named 'urchin'. Now, search! This sig is in the index and takes about 1.6 seconds to find:

sourmash sbt_search urchin demo/urchin/variegatus-SRR1661406.sig

Note you can adjust the search threshold, in which case the search truncates appropriately and takes about 1 second:

sourmash sbt_search urchin demo/urchin/variegatus-SRR1661406.sig --threshold=0.3

This next sig is not in the index, and the search takes about 0.2 seconds (which is basically how long it takes to load the tree structure and search the tree root):

sourmash sbt_search urchin demo/urchin/leucospilota-DRR023762.sig

How well does this scale? Suppose, just hypothetically, that you had, oh, say, a thousand bacterial genome signatures lying around and you wanted to index and search them?

# download
mkdir bac
cd bac
curl -O http://teckla.idyll.org/~t/transfer/sigs1k.tar.gz
tar xzf sigs1k.tar.gz

# index
time sourmash sbt_index 1k *.sig
time sourmash sbt_search 1k GCF_001445095.1_ASM144509v1_genomic.fna.gz.sig

Here, the indexing takes about a minute, and the search takes about 5 seconds (mainly because there are a lot of closely related samples). The data set sizes are nice and small -- the 1,000 signatures are 4 MB compressed and 12 MB uncompressed, the SBT index is about 64 MB, and this is all representing about 5 Gbp of genomic sequence. (We haven't put any time or effort into optimizing the index so things will only get smaller and faster.)
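The pruned search described above is easy to sketch in a few lines of Python. This is a toy illustration of the SBT search strategy, not sourmash's implementation: plain hash sets stand in for Bloom filters, and SBTNode / search_sbt are invented names:

```python
# Toy Sequence Bloom Tree: internal nodes hold the union of all hashes
# below them; leaves hold individual MinHash-style hash sets.

class SBTNode:
    def __init__(self, hashes, name=None, children=()):
        self.hashes = set(hashes)   # Bloom filter stand-in (exact, no false positives)
        self.name = name            # set only on leaves
        self.children = list(children)

def make_leaf(name, hashes):
    return SBTNode(hashes, name=name)

def make_internal(left, right):
    return SBTNode(left.hashes | right.hashes, children=[left, right])

def search_sbt(node, query, threshold):
    """Return leaf names sharing >= threshold * len(query) hashes with query."""
    matches = len(node.hashes & query)
    if matches < threshold * len(query):
        return []                   # prune: no leaf below this node can do better
    if not node.children:
        return [node.name]
    results = []
    for child in node.children:
        results.extend(search_sbt(child, query, threshold))
    return results

# Tiny example: two leaves sharing a couple of hash values.
a = make_leaf("genomeA", {1, 2, 3, 4, 5})
b = make_leaf("genomeB", {4, 5, 6, 7, 8})
root = make_internal(a, b)

print(search_sbt(root, {1, 2, 3, 4}, threshold=0.5))  # only genomeA passes
```

Because each internal node holds the union of everything below it, the match count at an internal node upper-bounds the count at any leaf beneath it, which is what makes the pruning safe. A real Bloom filter preserves this property, since it can only over-report membership, never under-report it.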
How far can we push it? There's lots of bacterial genomes out there, eh? Be an AWFUL SHAME if someone INDEXED them all for search, wouldn't it?

Jiarong Guo, a postdoc split between my lab and Jim Tiedje's lab at MSU, helpfully downloaded 52,000 bacterial genomes from NCBI for another project. So I indexed them with sourmash. Indexing 52,000 bacterial genomes took about 36 hours on the MSU HPC, or about 2.5 seconds per genome. This produced about 1 GB of uncompressed signature files, which in tar.gz form ends up being about 208 MB. I loaded them into an SBT like so:

curl -O http://spacegraphcats.ucdavis.edu.s3.amazonaws.com/bacteria-sourmash-signatures-2016-11-19.tar.gz
tar xzf bacteria-sourmash-signatures-2016-11-19.tar.gz
/usr/bin/time -s sourmash sbt_index bacteria --traverse-directory bacteria-sourmash-signatures-2016-11-19

The indexing step took about 53 minutes on an m4.xlarge EC2 instance, and required 4.2 GB of memory. The resulting tree was about 4 GB in size. (Download the 800 MB tar.gz here; just untar it somewhere.) Searching all of the bacterial genomes for matches to one genome in particular took about 3 seconds (and found 31 matches). It requires only 100 MB of RAM, because it uses on-demand loading of the tree. To try it out yourself, run:

sourmash sbt_search bacteria bacteria-sourmash-signatures-2016-11-19/GCF_000006965.1_ASM696v1_genomic.fna.gz.sig

I'm sure we can speed this all up, but I have to say that's already pretty workable :). Again, you can download the 800 MB .tar.gz containing the SBT for all bacterial genomes here: bacteria-sourmash-sbt-2016-11-19.tar.gz.

Example use case: finding genomes close to Shewanella oneidensis MR-1

What would you use this for? Here's an example use case. Suppose you were interested in genomes with similarity to Shewanella oneidensis MR-1. First, go to the S. oneidensis MR-1 assembly page, click on the "Assembly:" link, and find the genome assembly .fna.gz file.
Now, go download it:

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/165/GCF_000146165.2_ASM14616v2/GCF_000146165.2_ASM14616v2_genomic.fna.gz > shewanella.fna.gz

Next, convert it into a signature:

sourmash compute -f shewanella.fna.gz

(which takes 2-3 seconds to produce shewanella.fna.gz.sig). And, now, search with your new signature:

sourmash sbt_search bacteria shewanella.fna.gz.sig

which produces this output:

# running sourmash subcommand: sbt_search
1.00 ../GCF_000146165.2_ASM14616v2_genomic.fna.gz
0.09 ../GCF_000712635.2_SXM1.0_for_version_1_of_the_Shewanella_xiamenensis_genome_genomic.fna.gz
0.09 ../GCF_001308045.1_ASM130804v1_genomic.fna.gz
0.08 ../GCF_000282755.1_ASM28275v1_genomic.fna.gz
0.08 ../GCF_000798835.1_ZOR0012.1_genomic.fna.gz

telling us that not only is the original genome in the bacterial collection (the one with a similarity of 1!) but there are four other genomes with about 9% similarity. These are other (distant) strains of Shewanella. The reason the similarity is so small is that sourmash is by default looking at k-mer sizes of 31, so we're asking how many k-mers of length 31 are in common between the two genomes. With little modification (k-mer error trimming), this same pipeline can be used on unassembled FASTQ sequence; streaming classification of FASTQ reads and metagenome taxonomy breakdown are simple extensions and are left as exercises for the reader.

What's next? What's missing?

This is all still early days; the code's not terribly well tested and a lot of polishing needs to happen. But it looks promising! I still don't have a good sense for exactly how people are going to use MinHashes. A command line implementation is all well and good but some questions come to mind:

• what's the right output format? Clearly a CSV output format for the searching is in order. Do people want a scripting interface, or a command line interface, or what?
• related - what kind of structured metadata should we support in the signature files? Right now it's pretty thin, but if we do things like sketch all of the bacterial genomes and all of the SRA, we should probably make sure we put in some of the metadata :).

• what about a tagging interface, so that you can subselect types of nodes to return?

If you are a potential user, what do you want to do with large collections of MinHash sketches? On the developer side, we need to:

• test, refactor, and polish the SBT stuff;
• think about how best to pick Bloom filter sizes automatically;
• benchmark and optimize the indexing;
• make sure that we interoperate with mash;
• evaluate the SBT approach on 100s of thousands of signatures, instead of just 50,000.

and probably lots of things I'm forgetting...

--titus

p.s. Output of /usr/bin/time -v on indexing 52,000 bacterial genome signatures:

Command being timed: "sourmash sbt_index bacteria --traverse-directory bacteria-sourmash-signatures-2016-11-19"
User time (seconds): 3192.58
System time (seconds): 14.66
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 53:35.72
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4279056
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 8014404
Voluntary context switches: 972
Involuntary context switches: 5742
Swaps: 0
File system inputs: 0
File system outputs: 6576144
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Continuum Analytics news

We Are Thankful

Friday, November 18, 2016

Michele Chambers, EVP Anaconda Business Unit & CMO, Continuum Analytics

It’s hard to believe, but it’s almost time to baste the turkey, mash the potatoes and take a moment to reflect on what we are thankful for this year amongst our family
and friends. Good health? A job we actually enjoy? Our supportive family? While our personal reflections are of foremost importance, as a proud leader in the Open Data Science community, we’re thankful for advancements and innovations that contribute to the betterment of the world. This Thanksgiving, we give thanks to...

1. Data. Though Big Data gave us the meat with which to collect critical information, until recently, the technology needed to make sense of the huge amount of data was either disparate or accessible only to the most technologically advanced companies in the world (translation: barely anyone). Today, we have the ability to extract actionable insights from the infinite amounts of data that literally drive the way people and businesses make decisions.

2. Our data science teams. We’re thankful there is no “i” in team. While we may have all the data in the world available to us, without the element of intelligent human intuition, it would be devoid of the endless value it provides. Our strong, versatile team members––including data scientists, business analysts, data engineers, devops and developers––are what get us up in the morning and out the door to work. Being part of this tight-knit community that offers immense support makes us grateful for the opportunity to do what we do.

3. New, innovative ideas. We keep our fingers on the pulse of enterprise happenings. Our customers afford us the opportunity to contribute to incredible, previously impossible tech breakthroughs. We’re thankful for the ability to exchange ideas with colleagues and constantly stand on the edge of change.

4. The opportunity to help others change the world. From combatting rare genetic diseases and eradicating human trafficking to predicting the effects of public policy, we’re thankful for the opportunity to work with companies who are using Anaconda to bring to life amazing new solutions that truly make a difference in the world.
They keep us inspired and help to fuel the seemingly endless innovation made possible by the Open Data Science community.

5. The Anaconda community. Last but not least, we are thankful for the robust, rapidly growing Anaconda community that keeps us connected with other data science teams around the globe. Collaboration is key. Helping others discover, analyze and learn by connecting curiosity and experience is one of our main passions. We are grateful for the wonderment of innovation we see passing through on a daily basis.

As the late, great Arthur C. Nielsen once said, “the price of light is less than the cost of darkness.” We agree. Happy Thanksgiving!

November 17, 2016

William Stein

RethinkDB, SageMath, Andreessen-Horowitz, Basecamp and Open Source Software

RethinkDB and sustainable business models

Three weeks ago, I spent the evening of Sept 12, 2016 with Daniel Mewes, the lead engineer of RethinkDB (an open source database). I was also supposed to meet with the co-founders, Slava and Michael, but they were too busy fundraising and couldn't join us. I pestered Daniel the whole evening about what RethinkDB's business model actually was. Yesterday, on October 6, 2016, RethinkDB shut down.

I met with some RethinkDB devs because an investor who runs a fund at the VC firm Andreessen-Horowitz (A16Z) had kindly invited me there to explain my commercialization plans for SageMath, Inc., and RethinkDB is one of the companies that A16Z has invested in. At first, I wasn't going to take the meeting with A16Z, since I have never met with venture capitalists before, and do not intend to raise VC money. However, some of my advisors convinced me that VCs can be very helpful even if you never intend to take their investment, so I accepted the meeting. In the first draft of my slides for my presentation to A16Z, I had a slide with the question: "Why do you fund open source companies like RethinkDB and CoreOS, which have no clear (to me) business model?
Is it out of some sense of charity to support the open source software ecosystem?" After talking with people at Google and the RethinkDB devs, I removed that slide, since charity is clearly not the answer (I don't know if there is a better answer than "by accident").

I have used RethinkDB intensely for nearly two years, and I might be their biggest user in some sense. My product SageMathCloud, which provides web-based course management, Python, R, LaTeX, etc., uses RethinkDB for everything. For example, every single time you enter some text in a realtime synchronized document, a RethinkDB table gets an entry inserted in it. I have RethinkDB tables with nearly 100 million records. I gave a talk at a RethinkDB meetup, filed numerous bug reports, and have been described by them as "their most unlucky user". In short, in 2015 I bet big on RethinkDB, just like I bet big on Python back in 2004 when starting SageMath. And when visiting the RethinkDB devs in San Francisco (this year and also last year), I said to them many times, "I have a very strong vested interest in you guys not failing." My company, SageMath, Inc., also pays RethinkDB for a support contract.

Sustainable business models were very much on my mind, because of my upcoming meeting at A16Z and the upcoming board meeting for my company. SageMath, Inc.'s business model involves making money from subscriptions to SageMathCloud (which is hosted on Google Cloud Platform); of course, there are tons of details about exactly how our business works, which we've been refining based on customer feedback. Though absolutely all of our software is open source, what we sell is convenience, ease of access and use; we provide value by hosting hundreds of courses on shared infrastructure, so it is much cheaper and easier for universities to pay us rather than hosting our software themselves (which is also fairly easy).
So that's our business model, and I would argue that it is working; at least our MRR is steadily increasing and is more than twice our hosting costs (we are not cash flow positive yet due to developer costs). So far as I can determine, the business model of RethinkDB was to make money in the following ways:

1. Sell support contracts to companies (I bought one).

2. Sell a closed-source proprietary version of RethinkDB with extra features that were of interest to enterprise customers (they had a handful of such features, e.g., audit logs for queries).

3. Make Horizon a cloud-hosted competitor to Firebase, with the unique advantages that users would have the option to migrate from the cloud to their own private data center, and more customizability. This strategy depends on a trend of users migrating away from the cloud, rather than to it, which some people at RethinkDB thought was a real trend (I disagree).

I don't know of anything else they were seriously trying. The closed-source proprietary version of RethinkDB also seemed like a very recent last-ditch effort that had only just begun; perhaps it directly contradicted a desire to be a 100% open source company? With enough users, it's easier to make certain business models work. I suspect RethinkDB did not have a lot of real users. The number of users tends to be roughly linearly related to mailing list traffic, and the RethinkDB mailing list has an order of magnitude less traffic than the SageMath mailing lists, and SageMath has around 50,000 users. RethinkDB wasn't even advertised as production ready until just over a year ago, so even they were telling people not to use it seriously until relatively recently. The adoption cycle for database technology is slow -- people wisely wait for Aphyr's tests, benchmarks comparing it with similar technology, etc. I was unusual in that I chose RethinkDB much earlier than most people would, since I love the design of RethinkDB so much.
It's the first database I loved, having seen a lot over many decades.

Conclusion: RethinkDB wasn't a real business, and wouldn't have become one without year(s) more runway.

I'm also very worried about the future of RethinkDB as an open source project. I don't know if the developers have experience growing an open source community of volunteers; it's incredibly hard, and it's unclear they are even going to be involved. At a bare minimum, I think they must switch to a very liberal license (Apache instead of AGPL) and make everything (e.g., automated testing code, documentation, etc.) open source. It's insanely hard getting any support for open source infrastructure work -- support mostly comes from small government grants (for research software) or contributions from employees at companies (that use the software). Relicensing in a company-friendly way is thus critical.

Company Incentives

Companies can be incentivized in various ways, including:

• to get to the next round of VC funding,
• to be a sustainable, profitable business by making more money from customers than they spend, or
• to grow to have a very large number of users and somehow pivot to making money later.

When founding a company, you have a chance to choose how your company will be incentivized based on how much risk you are willing to take, the resources you have, the sort of business you are building, the current state of the market, and your model of what will happen in the future.

For me, SageMath is an open source project I started in 2004, and I'm in it for the long haul. I will make the business I'm building around SageMathCloud succeed, or I will die trying -- therefore I have very, very little tolerance for risk. Failure is not an option, and I am not looking for an exit.
For me, the strategy that best matches my values is to incentivize my company to build a profitable business, since that is most likely to survive, and also to give us the freedom to maintain our long-term support for open source and pure mathematics software. Thus for my company, neither optimizing for raising the next round of VC nor growing at all costs makes sense. You would be surprised how many people think I'm completely wrong for concluding this.

Andreessen-Horowitz

I spent the evening with RethinkDB developers, which scared the hell out of me regarding their business prospects. They are probably the most open source friendly VC-funded company I know of, and they had given me hope that it is possible to build a successful VC-funded tech startup around open source. I prepared for my meeting at A16Z, and deleted my slide about RethinkDB.

I arrived at A16Z and was greeted by incredibly friendly people. I was a little shocked when I saw their nuclear bomb art in the entry room, then went to a nice little office to wait. The meeting time arrived, we went over my slides, and I explained my business model, goals, etc. They said there was no place for A16Z to invest directly in what I was planning to do, since I was very explicit that I'm not looking for an exit, and my plan for how big I wanted the company to grow in the next 5 years wasn't sufficiently ambitious. They were also worried about how small the total market cap of Mathematica and Matlab is (only a few hundred million?!). However, they generously and repeatedly offered to introduce me to more potential angel investors. We argued about the value of outside investment to the company I am trying to build. I had hoped to get some insight or introductions related to their portfolio companies that are of interest to my company (e.g., Udacity, GitHub), but they deflected all such questions.
There was also some confusion, since I showed them slides about what I'm doing but was quite clear that I was not asking for money, which is not what they are used to. In any case, I greatly appreciated the meeting, and it really made me think. They were crystal clear that they believed I was completely wrong not to be trying to do everything possible to raise investor money.

Basecamp

During the first year of SageMath, Inc., I was planning to raise a round of VC, and was doing everything to prepare for that. I then read some of DHH's books about Basecamp, realized many of those arguments applied to my situation given my values, and -- after a lot of reflection -- I changed my mind. I think Basecamp itself is mostly closed source, so they may have an advantage in building a business. SageMathCloud (and SageMath) really are 100% open source, and building a completely open source business might be harder. Our open source IP is considered worthless by investors. Witness: RethinkDB just shut down and Stripe hired just the engineers -- all the IP, customers, etc., of RethinkDB were evidently considered worthless by investors.

The day after the A16Z meeting, I met with my board, which went well (we discussed a huge range of topics over several hours). Some of the board members also tried hard to convince me that I should raise a lot more investor money.

Will Poole: you're doomed

Two weeks ago I met with Will Poole, who is a friend of a friend, and we talked about my company and plans. I described what I was doing: that everything was open source, and that I was incentivizing the company around building a business rather than raising investor money. He listened and asked a lot of follow-up questions, making it very clear he understands building a company very, very well. His feedback was discouraging -- I said, "So, you're saying that I'm basically doomed."
He responded that I wasn't doomed, but might be able to run a small "lifestyle business" at best via my approach; there was absolutely no way that what I was doing would have any impact or pay for my kids' college tuition. If this were feedback from some random person, it might not have been so disturbing, but Will Poole joined Microsoft in 1996, where he went on to run Microsoft's multibillion-dollar Windows business. Will Poole is like a retired four-star general who executed a successful campaign to conquer the world; he's been around the block a few times. He tried pretty hard to convince me to make as much of SageMathCloud closed source as possible, and to try to convince my users to make content they create in SMC something that I can reuse however I want. I felt pretty shaken and convinced that I needed to close parts of SMC, e.g., the new Kubernetes-based backend that we spent all summer implementing. (Will: if you read this, though our discussion was really disturbing to me, I really appreciate it and respect you.)

My friend, who introduced me to Will Poole, introduced me to some other people and described me as that really frustrating sort of entrepreneur who doesn't want investor money. He then remarked that one of the things he learned in business school, which really surprised him, was that it is good for a company to have a lot of debt. I gave him a funny look, and he added, "of course, I've never run a company."

I left that meeting with Will convinced that I would close source parts of SageMathCloud to make things much more defensible. However, after thinking things through for several days, and talking this over with other people involved in the company, I have chosen not to close anything. This just makes our job harder. Way harder. But I'm not going to make any decisions based purely on fear.
I don't care what anybody says: I do not think it is impossible to build an open source business (I think WordPress is an example), and I do not need to raise VC.

Hacker News Discussion: https://news.ycombinator.com/item?id=12663599
Chinese version: http://www.infoq.com/cn/news/2016/10/Reflection-sustainable-profit-co

Continuum Analytics news

DataCamp’s Online Platform Fuels the Future of Data Science, Powered By Anaconda

Thursday, November 17, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

There’s no doubt that the role of ‘data scientist’ is nearing a fever pitch as companies become increasingly data-driven. In fact, the position ranked number one on Glassdoor’s top jobs in 2016, and in 2012, HBR dubbed it “The Sexiest Job of the 21st Century.” Yet, while more organizations are adopting data science, there exists a shortage of people with the right training and skills to fill the role. This challenge is being met by our newest partner, DataCamp, a data science learning platform focused on cultivating the next generation of data scientists. DataCamp’s interactive learning environment today launched the first of four Anaconda-based courses taught by Anaconda experts—Interactive Visualization with Bokeh. Our experts—both in academia and in the data science industry—provide users with maximum insight. While we’re proud to partner with companies representing various verticals, it is especially thrilling to contribute toward the creation of new data scientists, including citizen data scientists, both of which are extremely valued in the business community. Research finds that 88 percent of professionals say online learning is more helpful than in-person training; DataCamp has already trained over 620,000 aspiring data scientists. Of the four new Anaconda-based courses, two are interactive trainings.
This allows DataCamp to offer students the opportunity to benefit from unprecedented breadth and depth of online learning, leading to highly skilled, next-gen data scientists. The data science revolution is growing by the day, and DataCamp is poised to meet the challenge of scarcity in the market. By offering courses tailored to an individual’s unique pace, needs and expertise, DataCamp’s courses are generating more individuals with the skills to boast ‘the sexiest job of the 21st century.’ Interested in learning more or signing up for a course? Check out DataCamp’s blog.

November 15, 2016

Titus Brown

You can make GitHub repositories archival by using Zenodo or Figshare!

Update: Zenodo will remove content upon request by the owner, and hence is not suitable for long-term archiving of published code and data. Please see my comment at the bottom (which is just a quote from an e-mail from a journal editor), and especially see "Ownership" and "Withdrawal" under Zenodo policies. I agree with the journal's interpretation of these policies.

Bioinformatics researchers are increasingly pointing reviewers and readers at their GitHub repositories in the Methods sections of their papers. Great! Making the scripts and source code for methods available via a public version control system is a vast improvement over the methods of yore ("e-mail me for the scripts" or "here's a tarball that will go away in 6 months").

A common point of concern, however, is that GitHub repositories are not archival. That is, you can modify, rewrite, delete, or otherwise irreversibly mess with the contents of a git repository. And, of course, GitHub could go the way of SourceForge and Google Code at any point. So GitHub is not a solution to the problem of making scripts and software available as part of the permanent record of a publication.

But! Never fear! The folks at Zenodo and Mozilla Science Lab (in collaboration with Figshare) have solutions for you!
I'll tell you about the Zenodo solution, because that's the one we use, but the Figshare approach should work as well.

How Zenodo works

Briefly, at Zenodo you can set up a connection between Zenodo and GitHub where Zenodo watches your repository and produces a tarball and a DOI every time you cut a release. For example, see https://zenodo.org/record/31258, which archives https://github.com/dib-lab/khmer/releases/tag/v2.0 and has the DOI http://doi.org/10.5281/zenodo.31258. When we release khmer 2.1 (soon!), Zenodo will automatically detect the release, pull down the tar file of the repo at that version, and produce a new DOI. The DOI and tarball will then be independent of GitHub, and I cannot edit, modify or delete the contents of the Zenodo-produced archive from that point forward.

Yes, automatically. All of this will be done automatically. We just have to make a release.

Yes, the DOI is permanent and Zenodo is archival! Zenodo is an open-access archive that is recommended by Peter Suber (as is Figshare). While I cannot quickly find a good high-level summary of how DOIs and archiving and LOCKSS/CLOCKSS all work together, here is what I understand to be the case:

• Digital object identifiers are permanent and persistent. (See Wikipedia on DOIs.)
• Zenodo policies say: "Retention period: Items will be retained for the lifetime of the repository. This is currently the lifetime of the host laboratory CERN, which currently has an experimental programme defined for the next 20 years at least."

So I think this is at least as good as any other archival solution I've found.

Why is this better than journal-specific archives and supplemental data?

Some journals request or require that you upload code and data to their own internal archive. This is often done in painful formats like PDF or XLSX, which may guarantee that a human can look at the files but does little to encourage reuse.
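(As an aside, the records that the GitHub-Zenodo integration produces can also be inspected programmatically. Here is a minimal sketch; it assumes Zenodo's public REST API serves record metadata as JSON under /api/records/<id> with a top-level "doi" field, so check the current API documentation before relying on it.)

```python
import json
from urllib.request import urlopen

def zenodo_record_url(record_id):
    # Assumed layout of Zenodo's REST API: JSON metadata per record ID.
    return "https://zenodo.org/api/records/%d" % record_id

def fetch_doi(record_id):
    """Fetch a record's metadata from Zenodo and return its DOI string."""
    with urlopen(zenodo_record_url(record_id)) as resp:
        metadata = json.load(resp)
    return metadata["doi"]

# For the khmer v2.0 record mentioned above, fetch_doi(31258) should
# return the DOI "10.5281/zenodo.31258" (assuming the API shape holds).
```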
At least for source code and smallish data sets, having the code and data available in a version-controlled repository is far superior. This is (hopefully :) the place where the code and data are actually being used by the original researchers, so keeping them in that format can only lower barriers to reuse. And, just as importantly, getting a DOI for code and data means that people can be more granular in their citation and reference sections - they can cite the specific software they're referencing, they can point at specific versions, and they can indicate exactly which data set they're working with. This prevents readers from going down the citation-network rabbit hole, where they have to read the cited paper in order to figure out what data set or code is being reused and how it differs from the remixed version.

Bonus: Why is the combination of GitHub/Zenodo/DOI better than an institutional repository?

I've had a few discussions with librarians who seem inclined to point researchers at their own institutional repositories for archiving code and data. Personally, I think having GitHub and Zenodo do all of this automatically for me is the perfect solution:

• quick and easy to configure (it takes about 3 minutes);
• polished and easy user interface;
• integrated with our daily workflow (GitHub);
• completely automatic;
• independent of whatever institution happens to be employing me today;

so I see no reason to switch to using anything else unless it solves even more problems for me :). I'd love to hear contrasting viewpoints, though!

thanks!
--titus

November 14, 2016

Continuum Analytics news

Can New Technologies Help Stop Crime In Its Tracks?

Tuesday, November 15, 2016
Peter Wang
Chief Technology Officer & Co-Founder
Continuum Analytics

Earlier this week, I shared my thoughts on crime prevention through technology with IDG Connect reporter Bianca Wright.
Take a look and feel free to share your opinions in the comment section below (edited for length and clarity)!

Researchers from the University of Cardiff have been awarded more than $800,000 by the U.S. Department of Justice to develop a pre-crime detection system that uses social media. How would such technology work? Are there other examples of technologies being used in this way?

The particular award to the University of Cardiff was to fight hate crime, and this is an important distinction. Taking a data-driven "predictive policing" approach to fighting general crime is very difficult because crime itself is so varied, and the dimensions of each type of crime are so complex. However, for hate crimes in particular, social media could be a particularly useful data stream, because it yields insight into a variable that is otherwise extremely difficult to assess: human sentiment. The general mechanism of the system would be to look for patterns and correlations between all the dimensions of social media: text in posts and tweets, captions on images, the images themselves, even which people, topics, and organizations someone subscribes to. Metadata on these would also feed into the data modeling; the timestamps and locations of their posts and social media activity can be used to infer where they live, their income, level of education, etc.

Social media is most powerful when the additional information streams it generates are paired up with existing information about someone. Sometimes unexpected correlations emerge. For instance, could it be the case that among those expressing hate speech in their social media feeds, the people with a criminal background are actually less likely to engage in hate crime, because they already have a rap sheet and know that law enforcement is aware of them, and, instead, most hate crimes are committed by first-time offenders? Ultimately, the hope of social media data science is to be able to get real insight into questions like these, which then can suggest effective remediations and preventative measures.
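A toy calculation shows the shape of the comparison behind that hypothetical. The records below are entirely fabricated for illustration; the point is only the structure of the analysis: conditioning on criminal background among people who post hate speech and comparing offense rates.

```python
# Fabricated toy records: (posted_hate_speech, prior_record, committed_hate_crime)
people = [
    (True, True,  False), (True, True,  False), (True, True,  True),
    (True, False, True),  (True, False, True),  (True, False, False),
    (True, False, True),  (False, True, False), (False, False, False),
]

def p_crime_given(prior_record):
    """P(committed hate crime | posted hate speech, given prior-record status)."""
    group = [crime for hate, prior, crime in people
             if hate and prior == prior_record]
    return sum(group) / len(group)

# In this made-up data, the hypothetical holds: among hate-speech posters,
# first-time offenders (no prior record) show the higher offense rate.
rate_with_record = p_crime_given(True)       # 1/3 in the toy data
rate_without_record = p_crime_given(False)   # 3/4 in the toy data
```

A real analysis would of course need far more careful sampling, confounder control, and ethical review than this sketch suggests.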

How viable is such technology in predicting and preventing crime? Is the amount of big data available to law enforcement enough to help them predict a crime before it happens?

It's hard to say in general. It seems like most common sorts of physical crime are deeply correlated to socioeconomic, geographic and demographic factors. These are things on which we have reasonably large amounts of data. A challenge there is that many of those datasets are government data, stored in arcane locations and formats across a variety of different bureaucracies, and difficult to synthesize. However, past evidence shows that if you simply integrate the data that governments already possess, you can get some incredible insights. For instance, Jeff Chen's work with the Fire Department of New York shows that they can predict which areas have buildings that are more likely to catch on fire, and take preventative actions.

Ironically, hate crimes may be particularly difficult to actually tackle with data science, because they are a form of domestic terrorism, with highly asymmetric dynamics between perpetrator and potential victims. One possible result of the University of Cardiff grant is that we discover that data science and social media can reveal elevated risk of hate crimes in certain areas, but offer insufficient information for taking any kind of preventative or remediative measures.

What are the challenges to such technologies? How do you see this developing in the future?

I think that the breakthroughs in the field of machine learning can lead to better and smarter policy across the board: from crime prevention to international trade to reducing terrorism and extremism. The biggest challenge it faces is that its real technological breakthroughs are mostly mathematical in nature, and not something "concrete" that regular people can readily understand. Some technology breakthroughs are extremely visceral: electrical cars that go from 0-60 in 3 seconds, spacecraft that beam down breathtaking images, and the like. We even have computerized devices that talk to us in natural language. The average person can "get" that these are advances.

Advances in machine learning and data science can deeply improve human civilization, by helping us make better policy, allocate resources better, reduce waste and improve quality of life.

November 11, 2016

Continuum Analytics news

Monday, November 14, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

At Continuum Analytics, we believe our most valuable asset is our community. From data scientists to industry experts, the collective power held by our community to change the world is limitless. This is why we are thrilled to introduce AnacondaCON, our inaugural user conference, and to announce that registration is now open.

The data scientist community is growing (the role topped the list of best job titles in the U.S. in 2016), and Open Data Science is affecting every industry at every level. As our ranks expand, now is the perfect time to bring together the brightest minds to exchange ideas, teach strategies and discuss what’s next in this industry that knows no bounds. AnacondaCON will offer an ideal forum to gather the superheroes of the data science world to discuss what works, what doesn’t, and where Anaconda is going next. Our sponsors, DataCamp, Intel, AMD and Gurobi, will be available to share with you their latest innovations and how we collaborate together. Attendees will walk away with the knowledge and connections needed to take their Open Data Science projects to the next level.

Who: The brightest minds in Open Data Science
What: AnacondaCON 2017
When: February 7-9, 2017
Where: JW Marriott Austin, Austin, TX
Register here: https://anacondacon17.io/register/

Join the conversation: #AnacondaCON #OpenDataScienceMeans

November 10, 2016

Gaël Varoquaux

Data science instrumenting social media for advertising is responsible for today's politics

To my friends developing data science for the social media, marketing, and advertising industries,

It is time to accept that we have our share of responsibility in the outcome of the US elections and the vote on Brexit. We are not creating the society that we would like. Facebook, Twitter, targeted advertising, and customer profiling are harmful to truth and have helped bring about Brexit and the election of Trump. Journalism has been replaced by social media and commercial content tailored to influence the reader: your own personal distorted reality.

There are many deep reasons why Trump won the election. Here, as a data scientist, I want to talk about the factors created by data science.

Rumor replaces truth: the way we, data-miners, aggregate and recommend content is based on its popularity, on readership statistics. In no way is it based on the truthfulness of the content. As a result, Facebook, Twitter, Medium, and the like amplify rumors and sensational news, with no reality check [1].
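A few lines of code make this mechanism concrete. In the deliberately simplified simulation below (all numbers invented), the ranking function scores stories only by clicks; a truthfulness flag exists but never enters the score, so an early popularity lead compounds and the sensational story dominates the feed:

```python
# Two stories: one false but sensational, one true but dry.
stories = {
    "sensational rumor": {"truthful": False, "clicks": 10},
    "careful report":    {"truthful": True,  "clicks": 9},
}

def rank(stories):
    # Popularity is the only signal; truthfulness never enters the score.
    return sorted(stories, key=lambda name: stories[name]["clicks"], reverse=True)

# Each round, the top-ranked story gets most of the new exposure (and hence
# most of the new clicks), so its lead compounds round after round.
for _ in range(100):
    top = rank(stories)[0]
    stories[top]["clicks"] += 10      # amplified by the recommender
    for name in stories:
        stories[name]["clicks"] += 1  # baseline trickle for every story
```

After the loop, the untruthful story sits at the top of the ranking with roughly ten times the clicks of the careful report, purely because nothing in the objective ever penalized falsehood.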

This is nothing new: clickbait and tabloids build upon it. However, social networking and active recommendation make things significantly worse. Indeed, birds of a feather flock together, reinforcing their own biases. We receive filtered information: have you noticed that every single argument you heard was overwhelmingly against (or in favor of) Brexit? To make matters even worse, our brain loves it: to resolve cognitive dissonance we avoid information that contradicts our biases [2].

Note

Gossiping, rumors, and propaganda have always made sane decisions difficult. The filter bubble, the algorithmically-tuned rose-colored glasses of Facebook, escalates this problem into a major dysfunction of our society. It amplifies messy and false information better than anything before. Soviet-style propaganda built on carefully-crafted lies; post-truth politics builds on a flood of information that does not even pretend to be credible in the long run.

Active distortion of reality: amplifying biases to the point that they drown truth is bad. Social networks actually do worse: they give tools for active manipulation of our perception of the world. Indeed, the revenue of today’s Internet information engines comes from advertising. For this purpose they are designed to learn as much as possible about the reader. Then they sell this information bundled with a slot where the buyer can insert the optimal message to influence the reader.

The Trump campaign used targeted Facebook ads to present unenthusiastic Democrats with information about Clinton tuned to discourage them from voting: for instance, portraying her as racist to black voters.

Information manipulation works. The Trump campaign was a smear campaign aimed at suppressing votes for his opponent. The release of negative information on Clinton did affect her supporters' allegiance.

Tech created the perfect mind-control tool, with an eye on sales revenue. Someone used it for politics.

The tech industry is mostly socially liberal and highly educated, wishing the best for society. But it must accept its share of the blame. My friends improving machine learning for customer profiling and ad placement: you are helping to shape a world of lies and deception. I will not blame you for accepting this money: if it were not for you, others would do it. But we should all be thinking about how to improve this system. How do we use data science to build a world based on objectivity, transparency, and truth, rather than Internet-based marketing?

Digression: other social issues of data science

• The tech industry is increasing inequality, making the rich richer and leaving the poor behind. Data science, with its ability to automate actions and wield large sources of information, is a major contributor to this inequality.
• Internet-based marketing is building a huge spying machine that infers as much as possible about the user. The Trump campaign was able to target a specific population: black voters leaning towards the Democrats. What if this data were used for direct executive action? This could come quicker than we think, given how intelligence agencies tap into social media.

I preferred to focus this post on how data science can help distort truth. Indeed, it is a problem too often ignored by data scientists, who like to think that they are empowering users.

In memory of Aaron Swartz, who fought centralized power on the Internet.

 [1] Facebook was until recently using human curators, but fired them, leading to a loss of control over veracity.
 [2] It is a well-known and well-studied cognitive bias that individuals strive to reduce cognitive dissonance and actively avoid situations and information likely to increase it.

November 08, 2016

Continuum Analytics news

AnacondaCON 2017: Continuum Analytics Opens Registration for First User Conference

Monday, November 14, 2016

Two-day event will bring together thought leaders in the Open Data Science community to learn, debate and socialize in an atmosphere of collaboration and innovation

AUSTIN, TX—November 9, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that registration is open for AnacondaCON 2017, taking place February 7-9, 2017 in Austin, Texas. The inaugural Anaconda user conference is a two-day event that will bring together innovative enterprises on the journey to Open Data Science. These companies have recognized the value of capitalizing on their growing treasure trove of data assets to create compelling business value for their enterprise. Register here.

In addition to enterprise users, AnacondaCON will offer the Open Data Science community––from foundational contributors to thought leaders––an opportunity to engage in breakout sessions, hear from industry experts, learn about case studies from subject matter experts and choose from specialized and focused sessions based on topic areas of interest. Sessions will prove educational, informative and thought-provoking—attendees will walk away with the knowledge and connections needed to move their Open Data Science initiatives forward.

Come hear keynote speakers, including Continuum Analytics CEO & Co-Founder Travis Oliphant and Co-Founder & CTO Peter Wang. Guest keynotes will be announced shortly and additional speakers are being added to the agenda regularly; check here for updates.

WHO: Continuum Analytics

WHAT: Registration for AnacondaCON 2017. Early bird registration prices until December 31, 2016.

All tickets are 3-day passes and include access to all of AnacondaCON, including sessions, tutorials, keynotes, the opening reception and the off-site party.

WHEN: February 7-9, 2017

WHERE: JW Marriott Austin, 110 E. 2nd St. Austin, Texas, 78701

Continuum Analytics has secured a special room rate for AnacondaCON attendees. If you are interested in attending and booking a room at the special conference rate available until January 17, 2017, click here or call the JW Marriott Austin at (844) 473-3959 and reference the room block “AnacondaCON.”

REGISTER: HERE

###

Continuum Analytics’ Anaconda is the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world. Anaconda is trusted by leading businesses worldwide and across industries––financial services, government, health and life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda helps data science teams discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage Open Data Science environments and harness the power of the latest open source analytic and technology innovations. Visit http://www.continuum.io.

###

Media Contact:

Jill Rosenthal

InkHouse

anaconda@inkhouse.com

Chicxulub Impact Crater Expedition Recovers Core to Further Discovery on the Impact on Life and the Historical Dinosaur Extinction

From April to May 2016, a team of international scientists drilled into the site of an asteroid impact, known as the Chicxulub Impact Crater, which occurred 66 million years ago. The crater is buried several hundred meters below the surface in the Yucatán region of Mexico. Until that time, dinosaurs and marine reptiles dominated the world, but the series of catastrophic events that followed the impact caused the extinction of all large animals, leading to the rise of mammals and evolution of mankind. This joint expedition, organized by the International Ocean Discovery Program (IODP) and International Continental Scientific Drilling Program (ICDP) recovered a nearly complete set of rock cores from 506 to 1335 meters below the modern day seafloor.  These cores are now being studied in detail by an international team of scientists to understand the effects of the impact on life and as a case study of how impacts affect planets.

CT Scans of Cores Provide Deeper Insight Into Core Description and Analysis

Before being shipped to Germany (where the onshore science party took place from September to October 2016), the cores were sent to Houston, TX for CT scanning and imaging. The scanning was done at Weatherford Labs, who performed a high resolution dual energy scan on the entire core.  Dual energy scanning utilizes x-rays at two different energy levels. This provides the information necessary to calculate the bulk density and effective atomic numbers of the core. Enthought processed the raw CT data, and provided cleaned CT data along with density and effective atomic number images.  The expedition scientists were able to use these images to assist with core description and analysis.

Digital images of the CT scans of the recovered core are displayed side by side with the physical cores for analysis

Information not evident in physical observation (bottom, core photograph) can be observed in CT scans (top)

These images are helping scientists understand the processes that occurred during the impact, how the rock was damaged, and how the properties of the rock were affected.  From analysis of images, well log data and laboratory tests it appears that the impact had a permanent effect on rock properties such as density, and the shattered granite in the core is yielding new insights into the mechanics of large impacts.
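The idea behind the dual-energy calculation described above can be sketched with a toy model. The coefficients below are invented and real calibration is far more involved, but the structure is the same: attenuation measured at two x-ray energies gives two equations in the two unknowns, bulk density and effective atomic number. A common simplified form of the attenuation model is mu(E) = rho * (a_E + b_E * Z**n), which the sketch inverts algebraically:

```python
def solve_dual_energy(mu_lo, mu_hi, a_lo, b_lo, a_hi, b_hi, n=3.8):
    """Invert the simplified model mu(E) = rho * (a_E + b_E * Z**n),
    measured at a low and a high energy, for density rho and effective Z.
    The per-energy coefficients a_E, b_E come from calibration
    (the values used below are invented for illustration)."""
    # Treat Z**n as a single unknown x; the two measurements give a
    # linear system in rho and rho*x, solved here in closed form.
    x = (a_lo * mu_hi - a_hi * mu_lo) / (b_hi * mu_lo - b_lo * mu_hi)
    rho = mu_lo / (a_lo + b_lo * x)
    return rho, x ** (1.0 / n)

# Round trip with made-up calibration numbers:
a_lo, b_lo, a_hi, b_hi = 0.20, 1e-3, 0.18, 2e-4
rho_true, z_true = 2.6, 14.0  # a granite-like density, illustrative Z_eff
mu_lo = rho_true * (a_lo + b_lo * z_true**3.8)
mu_hi = rho_true * (a_hi + b_hi * z_true**3.8)
# solve_dual_energy(mu_lo, mu_hi, a_lo, b_lo, a_hi, b_hi) recovers
# rho = 2.6 and Z = 14.0 up to floating-point error.
```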

Virtual Core Provides Co-Visualization of CT Data with Well Log Data, Borehole Images, and Line Scan Photographs for Detailed Interrogation

Enthought’s Virtual Core software was used by the expedition scientists to integrate the CT data along with well log data, borehole images and line scan photographs.  This gave the scientists access to high resolution 2D and 3D images of the core, and allowed them to quickly interrogate regions in more detail when questions arose. Virtual Core also provides machine learning feature detection intelligence and visualization capabilities for detailed insight into the composition and structure of the core, which has proved to be a valuable tool both during the onshore science party and ongoing studies of the Chicxulub core.


Related Articles

Drilling to Doomsday
Discover Magazine, October 27, 2016

Chicxulub ‘dinosaur crater’ investigation begins in earnest
BBC News, October 11, 2016

How CT scans help Chicxulub Crater scientists
Integrated Ocean Drilling Program (IODP) Chicxulub Impact Crater Expedition Blog, October 3, 2016

Chicxulub ‘dinosaur’ crater drill project declared a success
BBC Science, May 25, 2016

Scientists hit pay dirt in drilling of dinosaur-killing impact crater
Science Magazine, May 3, 2016

Scientists gear up to drill into ‘ground zero’ of the impact that killed the dinosaurs
Science Magazine, March 3, 2016

Texas scientists probe crater they think led to dinosaur doomsday
Austin American-Statesman, June 2, 2016

The post Scientists Use Enthought’s Virtual Core Software to Study Asteroid Impact appeared first on Enthought Blog.

October 31, 2016

Continuum Analytics news

Another Great Year At Strata + Hadoop 2016

Tuesday, November 1, 2016
Peter Wang
Co-Founder & Chief Technology Officer
Continuum Analytics

The Anaconda team had a blast at this year’s Strata + Hadoop World in NYC. We’re really excited about the interest and discussions around Open Data Science! For those of you who weren’t able to attend, here’s a quick recap of what we presented.


Three Anaconda team members - including myself - took the stage at Strata to chat all things Python, Anaconda and Open Data Science on Hadoop. For my presentation, I wanted to hit home the idea that Open Data Science is the foundation of modernization for the enterprise, and that open source communities can create powerful technologies for data science. I also touched upon the core challenge of open data science in the enterprise. Many people think data science is the same thing as software development, but that’s a common misconception. Businesses tend to misinterpret the Python language and pigeonhole it, saying it competes with Java, C#, Ruby, R, SAS, Matlab, SPSS, or BI systems - which is not true. Done right, Python can be an extremely powerful force across any given business.


I then jumped into an overview of Anaconda for Data Science in Hadoop, highlighting how modern data science teams use Anaconda to drive more intelligent decisions - from the business analyst to the data scientist to the developer to the data engineer to DevOps. Anaconda truly powers these teams and gives businesses the superpowers required to change the world. To date, Anaconda has seen nearly 8 million downloads, and that number is growing every day. You can see my slides from this talk, ‘Open Data Science on Hadoop in the Enterprise,’ here.


We also ran some really awesome demos at the Anaconda booth, including examples of Anaconda Enterprise, Dask, datashader, and more. One of our most popular demos was Ian Stokes-Rees’ demonstration of SAS and Open Data Science using Jupyter and Anaconda. For many enterprises that currently use SAS, there is not a clear path to Open Data Science. To embark on the journey to Open Data Science, enterprises need an easy on-boarding path for their team to use SAS in combination with Open Data Science. Ian showcased why Anaconda is an ideal platform that embraces both open source and legacy languages, including Base SAS, so that enterprise teams can bridge the gap by leveraging their current SAS expertise while ramping up on Open Data Science.

You can see his notebook from the demo here, and you can download his newest episode of Fresh Data Science discussing the use of Anaconda and Jupyter notebooks with SAS and Python here.

In addition to my talk and our awesome in-booth demos, two of our software engineers, Bryan Van de Ven and Sarah Bird, demonstrated how to build intelligent apps in a week with Bokeh, Python and optimization. Attendees of the sessions learned how to create standard and custom visualizations using Bokeh, how to make them interactive, how to connect that interactivity with their Python stacks, and how to share their new interactive data applications. Congratulations to both on a job well done.

All that being said, we can’t wait to see you at next year’s Strata + Hadoop! Check back here for details on more upcoming conferences we’re attending.

October 28, 2016

Continuum Analytics news

Self-Service Open Data Science: Custom Anaconda Parcels for Cloudera CDH

Monday, October 31, 2016
Kristopher Overholt
Continuum Analytics

Earlier this year, as part of our partnership with Cloudera, we announced a freely available Anaconda parcel for Cloudera CDH based on Python 2.7 and the Anaconda Distribution. The Anaconda parcel has been very well received by both Anaconda and Cloudera users, as it makes it easier for data scientists and analysts to use the Anaconda libraries they know and love with Hadoop and Spark on Cloudera CDH.


Since then, we’ve had significant interest from Anaconda Enterprise users asking how they can create and use custom Anaconda parcels with Cloudera CDH. Our users want to deploy Anaconda with different versions of Python and custom conda packages that are not included in the freely available Anaconda parcel. Using parcels to manage multiple Anaconda installations across a Cloudera CDH cluster is convenient, because it works natively with Cloudera Manager without the need to install additional software or services on the cluster nodes.

We’re excited to announce a new self-service feature of the Anaconda platform that can be used to generate custom Anaconda parcels and installers. This functionality is now available in the Anaconda platform as part of the Anaconda Scale and Anaconda Repository platform components.

Deploying multiple custom versions of Anaconda on a Cloudera CDH cluster with Hadoop and Spark has never been easier! Let’s take a closer look at how we can create and install a custom Anaconda parcel using Anaconda Repository and Cloudera Manager.

Generating Custom Anaconda Parcels and Installers

For this example, we’ve installed Anaconda Repository (which is part of the Anaconda Enterprise subscription) and created an on-premises mirror of more than 600 conda packages that are available in the Anaconda distribution. We’ve also installed Cloudera CDH 5.8.2 with Spark on a cluster.

In Anaconda Repository, we can see a new feature for Installers, which can be used to generate custom Anaconda parcels for Cloudera CDH or standalone Anaconda installers.


The Installers page gives an overview of how to get started with custom Anaconda installers and parcels, and it describes how custom Anaconda parcels can be served directly from Anaconda Repository via a Remote Parcel Repository URL.


After choosing Create new installer, we can then specify packages to include in our custom Anaconda parcel, which we’ve named anaconda_plus.

First, we specify the latest version of Anaconda (4.2.0) and Python 2.7. We’ve added the anaconda package to include all of the conda packages that are included by default in the Anaconda installer. Specifying the anaconda package is optional, but it’s a great way to supercharge your custom Anaconda parcel with more than 200 of the most popular Open Data Science packages, including NumPy, Pandas, SciPy, matplotlib, scikit-learn and more.

We also specified additional conda packages to include in the custom Anaconda parcel, including libraries for natural language processing, visualization, data I/O and other data analytics libraries: azure, bcolz, boto3, datashader, distributed, gensim, hdfs3, holoviews, impyla, seaborn, spacy, tensorflow and xarray.
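As a hypothetical illustration (this command is not part of the original workflow, which uses the Installers page in Anaconda Repository, and some of these packages may require additional channels), a roughly equivalent environment could be assembled locally with conda:

```shell
# Hypothetical sketch: build a local conda environment mirroring the
# custom parcel's contents, using the package names listed above.
conda create -n anaconda_plus python=2.7 anaconda=4.2.0 \
    azure bcolz boto3 datashader distributed gensim hdfs3 \
    holoviews impyla seaborn spacy tensorflow xarray
```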


We also could have included conda packages from other channels in our on-premises installation of Anaconda Repository, including community-built packages from conda-forge or other custom-built conda packages from different users within our organization.

After creating a custom Anaconda parcel, we see a list of parcel files that were generated for all of the Linux distributions supported by Cloudera Manager.


Additionally, Anaconda Repository has already updated the manifest file used by Cloudera Manager with the new parcel information at the existing Remote Parcel Repository URL. Now, we’re ready to install the newly created custom Anaconda parcel using Cloudera Manager.

Installing Custom Anaconda Parcels Using Cloudera Manager

Now that we’ve generated a custom Anaconda parcel, we can install it on our Cloudera CDH cluster and make it available to all of the cluster users for PySpark and SparkR jobs.

From the Cloudera Manager Admin Console, click the Parcels indicator in the top navigation bar.


Click the Configuration button on the top right of the Parcels page.


Click the plus symbol in the Remote Parcel Repository URLs section, and add the repository URL that was provided from Anaconda Repository.


Next, we download, distribute and activate the new parcel from the Parcels page - and we’re done! The custom-generated Anaconda parcel is now activated and ready to use with Spark or other distributed frameworks on our Cloudera CDH cluster.


Using the Custom Anaconda Parcel

Now that we’ve generated, installed and activated a custom Anaconda parcel, we can use libraries from our custom Anaconda parcel with PySpark.

You can use spark-submit along with the PYSPARK_PYTHON environment variable to run Spark jobs that use libraries from the Anaconda parcel, for example:
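The example itself did not survive in this copy of the post; here is a minimal sketch, assuming the parcel was activated under Cloudera Manager's default parcel directory and that my_analysis.py is a placeholder for your own PySpark script:

```shell
# Point the PySpark driver and workers at the Python interpreter
# shipped in the custom parcel (path assumes the default parcel
# directory and our parcel name, anaconda_plus).
export PYSPARK_PYTHON=/opt/cloudera/parcels/anaconda_plus/bin/python

# Submit a PySpark job that can import libraries from the parcel,
# e.g. pandas, seaborn or tensorflow.
spark-submit --master yarn --deploy-mode client my_analysis.py
```

Because the interpreter lives at the same parcel path on every node, each executor resolves the same set of conda packages without any per-node setup.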

September 27, 2016

Continuum Analytics news

Continuum Analytics Joins Forces with IBM to Bring Open Data Science to the Enterprise

Tuesday, September 27, 2016

Optimized Python experience empowers data scientists to develop advanced open source analytics on Spark

AUSTIN, TEXAS—September 27, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced an alliance with IBM to advance open source analytics for the enterprise. Data scientists and data engineers in open source communities can now embrace Python and R to develop analytic and machine learning models in the Spark environment through its integration with IBM's Project DataWorks.

Combining the power of IBM's Project DataWorks with Anaconda enables organizations to build high-performance Python and R data science models and visualization applications required to compete in today’s data-driven economy. The companies will collaborate on several open source initiatives including enhancements to Apache Spark that fully leverage Jupyter Notebooks with Apache Spark – benefiting the entire data science community.

“Our strategic relationship with Continuum Analytics empowers Project DataWorks users with full access to the Anaconda platform to streamline and help accelerate the development of advanced machine learning models and next-generation analytics apps,” said Ritika Gunnar, vice president, IBM Analytics. “This allows data science professionals to utilize the tools they are most comfortable with in an environment that reinforces collaboration with colleagues of different skillsets.”

By collaborating to bring about the best Spark experience for Open Data Science in IBM's Project DataWorks, enterprises are able to easily connect their data, analytics and compute with innovative machine learning to accelerate and deploy their data science solutions.

“We welcome IBM to the growing family of industry titans that recognize Anaconda as the de facto Open Data Science platform for enterprises,” said Michele Chambers, EVP of Anaconda Business & CMO at Continuum Analytics. “As the next generation moves from machine learning to artificial intelligence, cloud-based solutions are key to help companies adopt and develop agile solutions––IBM recognizes that. We’re thrilled to be one of the driving forces powering the future of machine learning and artificial intelligence in the Spark environment.”

IBM's Project DataWorks is the industry’s first cloud-based data and analytics platform that integrates all types of data to enable AI-powered decision making. With it, companies are able to realize the full promise of data by enabling data professionals to collaborate and build cognitive solutions by combining IBM data and analytics services and a growing ecosystem of data and analytics partners - all delivered on Apache Spark. Project DataWorks is designed to allow for faster development and deployment of data and analytics solutions with self-service user experiences to help accelerate business value.

To learn more, join Bob Picciano, SVP of IBM Analytics, and Travis Oliphant, CEO of Continuum Analytics, at the IBM DataFirst Launch Event on Sept 27, 2016, at the Hudson Mercantile Building in NYC. The event is also available on livestream.

Continuum Analytics is the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team––data scientists, developers, DevOps, architects and business analysts––to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created and contributed to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, Pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.