January 18, 2017

Titus Brown

Computational postdoc opening at UC Davis!

We are currently soliciting applications for computational postdoctoral fellows to undertake exciting projects in computational biology/bioinformatics jointly supervised by Dr. Titus Brown (http://ivory.idyll.org/lab/) and Dr. Fereydoun Hormozdiari (http://www.hormozdiarilab.org/) at UC Davis.

UC Davis is a world-class research institution with a strong genomics faculty. In addition to being part of Dr. Brown's and Dr. Hormozdiari's labs, the postdoc will be able to participate in Genome Center activities. Potential collaborators include Megan Dennis, Alex Norde, and Paul and Randi Hagerman. UC Davis is close to the Bay Area, and there will be opportunities to connect and collaborate with researchers at Berkeley, Stanford, and UCSF as well.

Davis, CA is an excellent place to live with good food, great schools, nice weather, non-Bay-Area housing prices, and a bike-friendly culture.

---

The successful candidate will undertake computational method and tool development for better understanding the contribution of genetic variation (especially structural variation) to changes in genome structure. In collaboration with members of both labs, the postdoctoral candidate will also build models for predicting changes in gene expression based on variants (especially CNVs) and perform a comparative study of genome structures in multiple tissues/samples using HiC data.

This opportunity requires developing novel computational algorithms and machine learning methods to solve emerging biological problems. The technical expertise needed includes a strong computational background for developing novel combinatorial, machine learning (ML), or statistical inference algorithms, strong programming skills, and a general understanding of concepts in genomics and genetics.

Candidates are guaranteed funding for two years and will be strongly encouraged to apply for external funding in the second year of their postdoc to support a successful transition to independent investigator.

Some of the projects to work on include but are not limited to:

  • Computational methods to discover and predict the structural variations (SVs) that result in significant modification of genome structure. It has been shown recently that structural variations which modify TADs (topologically associating domains) can result in genetic disease. As part of this project we are trying to develop methods that predict which SVs will cause such significant modifications, and potentially to build a method for ranking/scoring SVs based on their pathogenicity in diseases such as autism and cancer.
  • Study the effects of SVs/CNVs that significantly change genome structure during (great ape) evolution, and how such variants are associated with changes in gene expression in each of these species.
  • Develop computational tools for finding conserved and significantly differentiated TADs in two or more samples (from different cell types or species) using HiC data, with application to data from different tissues and/or species.

The start date for this position is flexible, although we hope the successful candidate can start before Sep 1, 2017.

Suggested candidate background:

  • Ph.D. in computer science, computational biology or related fields
  • Excellent programming skills in at least one language (C/C++, Java or Python)
  • Strong written/oral presentation skills
  • Enthusiasm for genomics-related problems
  • Knowledge of next-generation sequencing technologies and HiC data is a plus.

Interested candidates should send their CV and a research statement to Fereydoun Hormozdiari (email: fhormozd[at]ucdavis.edu) and Titus Brown (email: ctbrown[at]ucdavis.edu).

We will begin review of applications on Feb 1, 2017.

by C. Titus Brown at January 18, 2017 11:00 PM

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2017-01-01 and 2017-01-17. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

  1. Stability enhancements for the distributed scheduler and micro-release
  2. NASA Grant writing
  3. Dask-EC2 script
  4. Dataframe categorical flexibility (work in progress)
  5. Communication refactor (work in progress)

Stability enhancements and micro-release

We’ve released dask.distributed version 1.15.1, which includes important bugfixes after the recent 1.15.0 release. There were a number of small issues that conspired to remove tasks erroneously. This was generally OK because the Dask scheduler was able to heal the missing pieces (using the same machinery that makes Dask resilient), so we didn’t notice the flaw until the system was deployed in some of the more serious Dask deployments in the wild. PR dask/distributed #804 contains a full writeup in case anyone is interested. The writeup ends with the following line:

This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.

This release also includes other fixes, such as compatibility with the new Bokeh 0.12.4 release.

NASA Grant Writing

I’ve been writing a proposal to NASA to help fund distributed Dask+XArray work for atmospheric and oceanographic science at the 100TB scale. Many thanks to our scientific collaborators who are offering support here.

Dask-EC2 startup

The Dask-EC2 project deploys Anaconda, a Dask cluster, and Jupyter notebooks on Amazon’s Elastic Compute Cloud (EC2) with a small command line interface:

pip install dask-ec2 --upgrade
dask-ec2 up --keyname KEYNAME \
            --keypair /path/to/ssh-key \
            --type m4.2xlarge \
            --count 8

This project can be either very useful for people just getting started and for Dask developers when we run benchmarks, or it can be horribly broken if AWS or Dask interfaces change and we don’t keep this project maintained. Thanks to a great effort from Ben Zaitlen, dask-ec2 is again in the very useful state, where I’m hoping it will stay for some time.

If you’ve always wanted to try Dask on a real cluster and if you already have AWS credentials then this is probably the easiest way.

This already seems to be paying dividends. There have been a few unrelated pull requests from new developers this week.

Dataframe Categorical Flexibility

Categoricals can significantly improve performance on text-based data. Currently Dask’s dataframes support categoricals, but they expect to know all of the categories up-front. This is easy if this set is small, like the ["Healthy", "Sick"] categories that might arise in medical research, but requires a full dataset read if the categories are not known ahead of time, like the names of all of the patients.
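As a rough illustration of why this matters, here is a plain pandas sketch (not Dask; the values are made up):

import pandas as pd

# A long, repetitive text column, similar in spirit to the example above
s = pd.Series(["Healthy", "Sick"] * 500000)

# Categoricals store each distinct label once, plus compact integer codes
cat = s.astype("category")

print(s.memory_usage(deep=True))    # object dtype: tens of megabytes
print(cat.memory_usage(deep=True))  # categorical: about a megabyte
print(list(cat.cat.categories))     # ['Healthy', 'Sick'] -- the known categories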

Jim Crist is changing this so that Dask can operate on categorical columns with unknown categories at dask/dask #1877. The constituent pandas dataframes all have possibly different categories that are merged as necessary. This distinction may seem small, but it limits performance in a surprising number of real-world use cases.

Communication Refactor

Since the recent worker refactor and optimizations it has become clear that inter-worker communication has become a dominant bottleneck in some intensive applications. Antoine Pitrou is currently refactoring Dask’s network communication layer, making room for more communication options in the future. This is an ambitious project. I for one am very happy to have someone like Antoine looking into this.

January 18, 2017 12:00 AM

January 17, 2017

Titus Brown

Categorizing 400,000 microbial genome shotgun data sets from the SRA

This is another blog post on MinHash sketches; see also:



A few months ago I was at the Woods Hole MBL Microbial Diversity course, and I ran across Mihai Pop, who was teaching at the STAMPS Microbial Ecology course. Mihai is an old friend who shares my interest in microbial genomes and assembly and other such stuff, and during our conversation he pointed out that there were many unassembled microbial genomes sitting in the Sequence Read Archive.

The NCBI Sequence Read Archive is one of the primary archives for biological sequencing data, but it generally holds only the raw sequencing data; assemblies and analysis products go elsewhere. It's also largely unsearchable by sequence: you can search an individual data set with BLAST, I think, but you can't search multiple data sets (because each data set is large, and the search functionality to handle it doesn't really exist). There have been some attempts to make it searchable, including most notably Brad Solomon and Carl Kingsford's Sequence Bloom Tree paper (also on biorxiv, and see my review), but it's still not straightforward.

Back to Mihai - Mihai told me that there were several hundred thousand microbial WGS samples in the SRA for which assemblies were not readily available. That got me kind of interested, and -- combined with my interest in indexing all the microbial genomes for MinHash searching -- led to... well, read on!

How do you dance lightly across the surface of 400,000 data sets?

tl;dr? To avoid downloading all the things, we're sipping from the beginning of each SRA data set only.

The main problem we faced in looking at the SRA is that whole genome shotgun data sets are individually rather large (typically at least 500 MB to 1 GB), and we have no special access to the SRA, so we were looking at a 200-400 TB download. Luiz Irber found that NCBI seems to throttle downloads to about 100 Mbps, so we calculated that grabbing the 400k samples would significantly extend his PhD.

But not only is the data volume quite large; the samples themselves are also mildly problematic: they're not assembled or error trimmed, so we had to develop a way to error trim them in order to minimize spurious k-mer presence.
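The intuition is that k-mers caused by sequencing errors are mostly rare, while k-mers from the real genome show up many times in a high-coverage sample. A toy two-pass version of abundance trimming looks roughly like this (khmer's trim-low-abund.py, mentioned in a moment, does the same job in a single streaming, low-memory pass):

from collections import Counter

def kmers(seq, k=21):
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

def trim_low_abundance(reads, k=21, cutoff=3):
    # Truncate each read at its first low-abundance k-mer (toy illustration only)
    counts = Counter(km for read in reads for km in kmers(read, k))
    trimmed = []
    for read in reads:
        keep = len(read)
        for i, km in enumerate(kmers(read, k)):
            if counts[km] < cutoff:
                keep = i   # cut just before the first k-mer seen fewer than `cutoff` times
                break
        trimmed.append(read[:keep])
    return trimmed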

We tackled these problems in several ways:

  • Luiz implemented a distributed system to grab SRA samples and compute MinHash sketch signatures on them with sourmash; he then ran this 50x across Rackspace, Google Compute Engine, and the MSU High Performance Compute cluster (see blog post);

    To quote, "Just to keep track: we are posting Celery tasks from a Rackspace server to Amazon SQS, running workers inside Docker managed by Kubernetes on GCP, putting results on Amazon S3 and finally reading the results on Rackspace and then posting it to IPFS."

    This meant we were no longer dependent on a single node, or even on a single compute solution. w00t!

  • We needed a way to quickly and efficiently error trim the WGS samples. In MinHash land, this means walking through reads and finding "true" k-mers based on their abundance in the read data set.

    Thanks to khmer, we already have ways of doing this on a low-memory streaming basis, so we started with that (using trim-low-abund.py).

  • Because whole-genome shotgun data is generally pretty high coverage, we guessed that we could get away with computing signatures on only a small subset of the data. After all, if you have a 100x coverage sample and you only need 5x coverage to build a MinHash signature, then you only need to look at 5% of the data!

    The fastq-dump program has a streaming output mode, and both khmer and sourmash support streaming I/O, so we could do all this computing progressively. The question was, how do we know when to stop?

    Our first attempt was to grab the first million reads from each sample, and then abundance-trim them, and MinHash them. Luiz calculated that (with about 50 workers going over the holiday break) this would take about 3 weeks to run on the 400,000 samples.

    Fortunately, due to a bug in my categorization code, we thought that this approach wasn't working. I say "fortunately" because in attempting to fix the wrong problem, we came across a much better solution :).

    For mark 2 of streaming, some basic experimentation suggested that we could get a decent match when searching a sample against known microbial genomes with only about 20% of the genome. For E. coli, this is about 1m bases, which is about 1m k-mers.

    So I whipped together a program called syrah that reads FASTA/FASTQ sequences and outputs high-abundance regions of the sequences until it has seen 1m k-mers. Then it exits, terminating the stream.

    This is nice and simple to use with fastq-dump and sourmash --

    fastq-dump -A {sra_id} -Z | syrah | \
       sourmash compute -k 21 --dna - -o {output} --name {sra_id}
    

    and when Luiz tested it out we found that it was 3-4x faster than our previous approach, because it tended to terminate much earlier in the stream and hence downloaded less data. (See the final command here.)

At this point we were down to an estimated 5 days for computing about 400,000 sourmash signatures on the microbial genomes section of the SRA. That was fast enough even for grad students in a hurry :).
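For readers new to MinHash: the signatures above are "bottom sketches" (the smallest N hash values over all k-mers in a sample), and similarity between two samples is estimated from the overlap of their sketches. Here is a minimal pure-Python illustration of the idea, not sourmash's actual implementation; the sequences are randomly generated:

import hashlib
import random

def kmers(seq, k=21):
    return (seq[i:i+k] for i in range(len(seq) - k + 1))

def bottom_sketch(seq, k=21, size=500):
    # Keep the `size` smallest hash values over all k-mers: a bottom-k MinHash sketch
    hashes = {int(hashlib.md5(km.encode()).hexdigest(), 16) for km in kmers(seq, k)}
    return set(sorted(hashes)[:size])

def jaccard_estimate(sketch_a, sketch_b, size=500):
    # Estimate Jaccard similarity from the bottom `size` hashes of the union
    union_bottom = set(sorted(sketch_a | sketch_b)[:size])
    return len(union_bottom & sketch_a & sketch_b) / len(union_bottom)

random.seed(1)
genome  = "".join(random.choice("ACGT") for _ in range(20000))
related = genome[:10000] + "".join(random.choice("ACGT") for _ in range(10000))

print(jaccard_estimate(bottom_sketch(genome), bottom_sketch(related)))
# prints roughly 0.33: about a third of the k-mers in the union are shared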

Categorizing 400,000 sourmash signatures... quickly!

tl;dr? We sped up the sourmash Sequence Bloom Tree search functionality, like, a lot.

Now we had the signatures! Done, right? We just categorize 'em all! How long can that take!?

Well, no. It turns out when operating at this scale even the small things take too much time!

We knew from browsing the SRA metadata that most of the samples were likely to be strain variants of human pathogens, which are very well represented in the microbial RefSeq. Conveniently, we already had prepared those for search. So my initial approach to looking at the signatures was to compare them to the 52,000 microbial RefSeq genomes, and screen out those that could be identified at k=21 as something known. This would leave us with the cool and interesting unknown/unidentifiable SRA samples.

I implemented a new sourmash subcommand, categorize, that took in a list (or a directory) full of sourmash signatures and searched them individually against a Sequence Bloom Tree of signatures. The output was a CSV file of categorized signatures, with each entry containing the best match to a given signature against the entire SBT.

The command looks like this:

sourmash categorize --csv categories.csv \
   -k 21 --dna --traverse-directory syrah microbes.sbt.json

and the default threshold is 8%, which is just above random background.

This worked great! It took about 1-3 seconds per genome. For 400,000 signatures that would take... 14 days. Sigh. Even if we parallelized that it was annoyingly slow.

So I dug into the source code and found that the problem was our YAML signature format, which was slow as a dog. When searching the SBT, each leaf node was stored in YAML and loading this was consuming something like 80% of the time.

My first solution was to cache all the signatures, which worked great but consumed about a GB of RAM. Now we could search each signature in about half a second.

In the meantime, Laurent Gautier had discovered the same problem in his work and he came along and reimplemented signature storage in JSON, which was 10-20x faster and was a way better permanent solution. So now we have JSON as the default sourmash signature format, huzzah!
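The size of the speedup is easy to get a feel for with a toy document shaped vaguely like a signature (a name plus a few thousand hash values; this is not the real sourmash schema):

import json
import timeit
import yaml  # PyYAML

doc = {"name": "example", "ksize": 21, "mins": list(range(5000))}
as_json = json.dumps(doc)
as_yaml = yaml.safe_dump(doc)

print("json:", timeit.timeit(lambda: json.loads(as_json), number=20))
print("yaml:", timeit.timeit(lambda: yaml.safe_load(as_yaml), number=20))

json.loads typically comes out well over an order of magnitude faster on documents like this, consistent with the gap described above.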

At this point I could categorize about 200,000 signatures in 1 day on an AWS m4.xlarge, when running 8 categorize tasks in parallel (on a single machine). That was fast enough for me.

It's worth noting that we explicitly opted for separating the signature creation from the categorization, because (a) the signatures themselves are valuable, and (b) we were sure the signature generation code was reasonably bug free but we didn't know how much iteration we would have to do on the categorization code. If you're interested in calculating and categorizing signatures directly from streaming FASTQ, see sourmash watch. But Buyer Beware ;).

Results! What are the results?!

For 361,077 SRA samples, we cannot identify 8707 against the 52,000 RefSeq microbial genomes. That's about 2.4%.

Most of the 340,000+ samples are human pathogens. I can do a breakdown later, but it's all E. coli, staph, tuberculosis, etc.

From the 8707 unidentified, I randomly chose and downloaded 34 entire samples. I ran them all through the MEGAHIT assembler, and 27 of them assembled (the rest looked like PacBio, which MEGAHIT doesn't assemble). Of the 27, 20 could not be identified against the RefSeq genomes. This suggests that about 60% of the 8707 samples (5200 or so) are samples that are (a) Illumina sequence, (b) assemble-able, and (c) not identifiable.

You can download the signatures here - the .tar.gz file is about 1 GB in size.

You can get the CSV of categorized samples here (it's about 5 MB, .csv.gz).

What next?

Well, there are a few directions --

  • we have about 350,000 SRA samples identified based on sequence content now. We should cross-check that against the SRA metadata to see where the metadata is wrong or incomplete.
  • we could do bulk strain analyses of a variety of human pathogens at this point, if we wanted.
  • we can pursue the uncategorized/uncategorizable samples too, of course! There are a few strategies we can try here but I think the best strategy boils down to assembling them, annotating them, and then using protein-based comparisons to identify nearest known microbes. I'm thinking of trying phylosift. (See Twitter conversation 1 and Twitter conversation 2.)
  • we should cross-compare uncategorized samples!

At this point I'm not 100% sure what we'll do next - we have some other fish to fry in the sourmash project first, I think - but we'll see. Suggestions welcome!

A few points based partly on reactions to the Twitter conversations (1) and (2) about what to do --

  • mash/MinHash comparisons aren't going to give us anything interesting, most likely; that's what's leading to our list of uncategorizables, after all.
  • I'm skeptical that nucleotide level comparisons of any kind (except perhaps of SSU/16s genes) will get us anywhere.
  • functional analysis seems secondary to figuring out what branch of bacteria they are, but maybe I'm just guilty of name-ism here. Regardless, if we were to do any functional analysis for e.g. metabolism, I'd want to do it on all of 'em, not just the identified ones.

Backing up -- why would you want to do any of this?

No, I'm not into doing this just for the sake of doing it ;). Here's some of my (our) motivations:

  • It would be nice to make the entire SRA content searchable. This is particularly important for non-model genomic/transcriptomic/metagenomic folk who are looking for resources.
  • I think a bunch of the tooling we're building around sourmash is going to be broadly useful for lots of people who are sequencing lots of microbes.
  • Being able to scale sourmash to hundreds of thousands (and millions and eventually billions) of samples is going to be, like, super useful.
  • More generally, this is infrastructure to support data-intensive biology, and I think this is important. Conveniently the Moore Foundation has funded me to develop stuff like this.
  • I'm hoping I can tempt the grey (access restricted, etc.) databases into indexing their (meta)genomes and transcriptomes and making the signatures available for search. See e.g. "MinHash signatures as ways to find samples, and collaborators?".

Also, I'm starting to talk to some databases about getting local access to do this to their data. If you are at, or know of, a public database that would like to cooperate with this kind of activity, let's chat -- titus@idyll.org.

--titus

by C. Titus Brown at January 17, 2017 11:00 PM

Continuum Analytics news

Announcing General Availability of conda 4.3

Wednesday, January 18, 2017
Kale Franz
Continuum Analytics

We're excited to announce that conda 4.3 has been released for general availability. The 4.3 release series has several new features and substantial improvements. Below is a summary. 

To get the latest, just run conda update conda.

New Features

  • Unlink and Link Packages in a Single Transaction: In the past, conda hasn't always been safe and defensive with its disk-mutating actions. It has gleefully clobbered existing files, and mid-operation failures have left environments completely broken. In some of the most severe examples, conda can appear to "uninstall itself." With this release, the unlinking and linking of packages for an executed command is done in a single transaction. If a failure occurs for any reason while conda is mutating files on disk, the environment will be returned to its previous state. While we've implemented some pre-transaction checks (verifying package integrity, for example), it's impossible to anticipate every failure mechanism. In some circumstances, OS file permissions cannot be fully known until an operation is attempted and fails, and conda itself is not without bugs. Moving forward, unforeseeable failures won't be catastrophic.

  • Progressive Fetch and Extract Transactions: Like package unlinking and linking, the download and extract phases of package handling have also been given transaction-like behavior. The distinction is that the rollback on error is limited to a single package: rather than rolling back the download and extract operation for all packages, the single-package rollback avoids having to re-download every package if an error is encountered.

  • Generic- and Python-Type Noarch/Universal Packages: Along with conda-build 2.1, a noarch/universal type for Python packages is officially supported. These are much like universal Python wheels. Files in a Python noarch package are linked into a prefix just like any other conda package, with the following additional features:

    1. conda maps the site-packages directory to the correct location for the Python version in the environment,
    2. conda maps the python-scripts directory to either $PREFIX/bin or $PREFIX/Scripts depending on platform,
    3. conda creates the Python entry points specified in the conda-build recipe, and
    4. conda compiles pyc files at install time when prefix write permissions are guaranteed.

    Python noarch packages must be "fully universal." They cannot have OS- or Python version-specific dependencies. They cannot have OS- or Python version-specific "scripts" files. If these features are needed, traditional conda packages must be used.

  • Multi-User Package Caches: While the on-disk package cache structure has been preserved, the core logic implementing package cache handling has had a complete overhaul. Writable and read-only package caches are fully supported.

  • Python API Module: An oft requested feature is the ability to use conda as a Python library, obviating the need to "shell out" to another Python process. Conda 4.3 includes a conda.cli.python_api module that facilitates this use case (a short usage sketch follows this list). While we maintain the user-facing command-line interface, conda commands can be executed in-process. There is also a conda.exports module to facilitate longer-term usage of conda as a library across conda releases. However, conda's Python code is considered internal and private, subject to change at any time across releases. At the moment, conda will not install itself into environments other than its original install environment.

  • Remove All Locks: Locking has never been fully effective in conda, and it often created a false sense of security. In this release, multi-user package cache support has been implemented for improved safety by hard-linking packages in read-only caches to the user's primary user package cache. Still, users are cautioned that undefined behavior can result when conda is running in multiple processes and operating on the same package caches and/or environments.
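As a rough sketch of the in-process usage the new Python API module is aimed at (names follow the conda 4.3 documentation; treat exact signatures as subject to change, since conda's Python code is private):

from conda.cli.python_api import Commands, run_command

# Run the equivalent of `conda list` in-process instead of shelling out
stdout, stderr, return_code = run_command(Commands.LIST)
print(return_code)
print(stdout.splitlines()[:5])  # first few lines of the package listing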

Deprecations/Breaking Changes

  • Conda now has the ability to refuse to clobber existing files that are not within the unlink instructions of the transaction. This behavior is configurable via the path_conflict configuration option, which has three possible values: clobber, warn, and prevent. In 4.3, the default value is clobber. This preserves existing behaviors, and it gives package maintainers time to correct current incompatibilities within their package ecosystem. In 4.4, the default will switch to warn, which means these operations continue to clobber, but the warning messages are displayed. In 4.5, the default value will switch to prevent. As we tighten up the path_conflict constraint, a new command line flag --clobber will loosen it back up on an ad hoc basis. Using --clobber overrides the setting for path_conflict to effectively be clobber for that operation.

  • Conda signed packages have been removed in 4.3. Vulnerabilities existed, and an illusion of security is worse than not having the feature at all. We will be incorporating The Update Framework (TUF) into conda in a future feature release.

  • Conda 4.4 will drop support for older versions of conda-build.

Other Notable Improvements

  • A new "trace" log level is added, with output that is extremely verbose. To enable it, use -v -v -v or -vvv as a command-line flag, set a verbose: 3 configuration parameter, or set a CONDA_VERBOSE=3 environment variable.

  • The r channel is now part of the default channels.

  • Package resolution/solver hints have been improved with better messaging.

by swebster at January 17, 2017 07:22 PM

Matthew Rocklin

Distributed NumPy on a Cluster with Dask Arrays

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

This page includes embedded large profiles. It may look better on the actual site than through syndicated pages like planet.python, and it may take a while to load on non-broadband connections (total size is around 20MB).

Summary

We analyze a stack of images in parallel with NumPy arrays distributed across a cluster of machines on Amazon’s EC2 with Dask array. This is a model application shared among many image analysis groups ranging from satellite imagery to bio-medical applications. We go through a series of common operations:

  1. Inspect a sample of images locally with Scikit Image
  2. Construct a distributed Dask.array around all of our images
  3. Process and re-center images with Numba
  4. Transpose data to get a time-series for every pixel, compute FFTs

This last step is quite fun. Even if you skim through the rest of this article I recommend checking out the last section.

Inspect Dataset

I asked a colleague at the US National Institutes of Health (NIH) for a biggish imaging dataset. He came back with the following message:

Electron microscopy may be generating the biggest ndarray datasets in the field - terabytes regularly. Neuroscience needs EM to see connections between neurons, because the critical features of neural synapses (connections) are below the diffraction limit of light microscopes. This type of research has been called “connectomics”. Many groups are looking at machine vision approaches to follow small neuron parts from one slice to the next.

This data is from drosophila: http://emdata.janelia.org/. Here is an example 2d slice of the data http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_5000.

import skimage.io
import matplotlib.pyplot as plt

sample = skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_5000')
skimage.io.imshow(sample)

Sample electron microscopy image from stack

The last number in the URL is an index into a large stack of about 10000 images. We can change that number to get different slices through our 3D dataset.

samples = [skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % i)
    for i in [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]]

fig, axarr = plt.subplots(1, 9, sharex=True, sharey=True, figsize=(24, 2.5))
for i, sample in enumerate(samples):
    axarr[i].imshow(sample, cmap='gray')

Sample electron microscopy images over time

We see that our field of interest wanders across the frame over time and drops off in the beginning and at the end.

Create a Distributed Array

Even though our data is spread across many files, we still want to think of it as a single logical 3D array. We know how to get any particular 2D slice of that array using Scikit-image. Now we’re going to use Dask.array to stitch all of those Scikit-image calls into a single distributed array.

import dask.array as da
from dask import delayed

imread = delayed(skimage.io.imread, pure=True)  # Lazy version of imread

urls = ['http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % i
        for i in range(10000)]  # A list of our URLs

lazy_values = [imread(url) for url in urls]     # Lazily evaluate imread on each url

arrays = [da.from_delayed(lazy_value,           # Construct a small Dask array
                          dtype=sample.dtype,   # for every lazy value
                          shape=sample.shape)
          for lazy_value in lazy_values]

stack = da.stack(arrays, axis=0)                # Stack all small Dask arrays into one
>>> stack
dask.array<shape=(10000, 2000, 2000), dtype=uint8, chunksize=(1, 2000, 2000)>
>>> stack = stack.rechunk((20, 2000, 2000))     # combine chunks to reduce overhead
>>> stack
dask.array<shape=(10000, 2000, 2000), dtype=uint8, chunksize=(20, 2000, 2000)>

So here we’ve constructed a lazy Dask.array from 10 000 delayed calls to skimage.io.imread. We haven’t done any actual work yet, we’ve just constructed a parallel array that knows how to get any particular slice of data by downloading the right image if necessary. This gives us a full NumPy-like abstraction on top of all of these remote images. For example we can now download a particular image just by slicing our Dask array.

>>> stack[5000, :, :].compute()
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

>>> stack[5000, :, :].mean().compute()
11.49902425

However we probably don’t want to operate too much further without connecting to a cluster. That way we can just download all of the images once into distributed RAM and start doing some real computations. I happen to have ten m4.2xlarges on Amazon’s EC2 (8 cores, 30GB RAM each) running Dask workers. So we’ll connect to those.

from dask.distributed import Client, progress
client = Client('scheduler-address:8786')

>>> client
<Client: scheduler="scheduler-address:8786" processes=10 cores=80>

I’ve replaced the actual address of my scheduler (something like 54.183.180.153) with scheduler-address. Let’s go ahead and bring in all of our images, persisting the array into concrete data in memory.

stack = client.persist(stack)

This starts downloads of our 10 000 images across our 10 workers. When this completes we have 10 000 NumPy arrays spread around on our cluster, coordinated by our single logical Dask array. This takes a while, about five minutes. We’re mostly network bound here (Janelia’s servers are not co-located with our compute nodes). Here is a parallel profile of the computation as an interactive Bokeh plot.

There will be a few of these profile plots throughout the blogpost, so you might want to familiarize yourself with them now. Every horizontal rectangle in this plot corresponds to a single Python function running somewhere in our cluster over time. Because we called skimage.io.imread 10 000 times there are 10 000 purple rectangles. Their position along the y-axis denotes which of the 80 cores in our cluster they ran on and their position along the x-axis denotes their start and stop times. You can hover over each rectangle (function) for more information on what kind of task it was, how long it took, etc. In the image below, purple rectangles are skimage.io.imread calls and red rectangles are data transfer between workers in our cluster. Click the magnifying glass icons in the upper right of the image to enable zooming tools.

Now that we have persisted our Dask array in memory our data is based on hundreds of concrete in-memory NumPy arrays across the cluster, rather than based on hundreds of lazy scikit-image calls. Now we can do all sorts of fun distributed array computations more quickly.

For example we can easily see our field of interest move across the frame by averaging across time:

skimage.io.imshow(stack.mean(axis=0).compute())

Average image over time

Or we can see when the field of interest is actually present within the frame by averaging across x and y:

plt.plot(stack.mean(axis=[1, 2]).compute())

Image brightness over time

By looking at the profile plots for each case we can see that averaging over time involves much more inter-node communication, which can be quite expensive in this case.

Recenter Images with Numba

In order to remove the spatial offset across time we’re going to compute a centroid for each slice and then crop the image around that center. I looked up centroids in the Scikit-Image docs and came across a function that did way more than what I was looking for, so I just quickly coded up a solution in Pure Python and then JIT-ed it with Numba (which makes this run at C-speeds).

from numba import jit

@jit(nogil=True)
def centroid(im):
    n, m = im.shape
    total_x = 0
    total_y = 0
    total = 0
    for i in range(n):
        for j in range(m):
            total += im[i, j]
            total_x += i * im[i, j]
            total_y += j * im[i, j]

    if total > 0:
        total_x /= total
        total_y /= total
    return total_x, total_y

>>> centroid(sample)  # this takes around 9ms
(748.7325324581344, 802.4893005160851)

def recenter(im):
    x, y = centroid(im.squeeze())
    x, y = int(x), int(y)
    if x < 500:
        x = 500
    if y < 500:
        y = 500
    if x > 1500:
        x = 1500
    if y > 1500:
        y = 1500

    return im[..., x-500:x+500, y-500:y+500]

plt.figure(figsize=(8, 8))
skimage.io.imshow(recenter(sample))

Recentered image

Now we map this function across our distributed array.

import numpy as np
def recenter_block(block):
    """ Recenter a short stack of images """
    return np.stack([recenter(block[i]) for i in range(block.shape[0])])

recentered = stack.map_blocks(recenter_block,
                              chunks=(20, 1000, 1000),  # chunk size changes
                              dtype=stack.dtype)
recentered = client.persist(recentered)

This profile provides a good opportunity to talk about a scheduling failure; things went a bit wrong here. Towards the beginning we quickly recenter several images (Numba is fast), taking around 300-400ms for each block of twenty images. However as some workers finish all of their allotted tasks, the scheduler erroneously starts to load balance, moving images from busy workers to idle workers. Unfortunately the network at this time appeared to be much slower than expected and so the move + compute elsewhere strategy ended up being much slower than just letting the busy workers finish their work. The scheduler keeps track of expected compute times and transfer times precisely to avoid mistakes like this one. These sorts of issues are rare, but do occur on occasion.

We check our work by averaging our re-centered images across time and displaying that to the screen. We see that our images are better centered with each other as expected.

skimage.io.imshow(recentered.mean(axis=0).compute())

Recentered time average

This shows how easy it is to create fast in-memory code with Numba and then scale it out with Dask.array. The two projects complement each other nicely, giving us near-optimal performance with intuitive code across a cluster.

Rechunk to Time Series by Pixel

We’re now going to rearrange our data from being partitioned by time slice, to being partitioned by pixel. This will allow us to run computations like Fast Fourier Transforms (FFTs) on each time series efficiently. Switching the chunk pattern back and forth like this is generally a very difficult operation for distributed arrays because every slice of the array contributes to every time-series. We have N-squared communication.

This analysis may not be appropriate for this data (we won’t learn any useful science from doing this), but it represents a very frequently asked question, so I wanted to include it.

Currently our Dask array has chunkshape (20, 1000, 1000), meaning that our data is collected into 500 NumPy arrays across the cluster, each of size (20, 1000, 1000).

>>> recentered
dask.array<shape=(10000, 1000, 1000), dtype=uint8, chunksize=(20, 1000, 1000)>

But we want to change this shape so that the chunks cover the entire first axis. We want all data for any particular pixel to be in the same NumPy array, not spread across hundreds of different NumPy arrays. We could solve this by rechunking so that each pixel is its own block like the following:

>>> rechunked = recentered.rechunk((10000, 1, 1))

However this would result in one million chunks (there are one million pixels) which will result in a bit of scheduling overhead. Instead we’ll collect our time-series into 10 x 10 groups of one hundred pixels. This will help us to reduce overhead.

>>> # rechunked = recentered.rechunk((10000, 1, 1))  # Too many chunks
>>> rechunked = recentered.rechunk((10000, 10, 10))  # Use larger chunks

Now we compute the FFT of each pixel, take the absolute value and square to get the power spectrum. Finally to conserve space we’ll down-grade the dtype to float32 (our original data is only 8-bit anyway).

x = da.fft.fft(rechunked, axis=0)
power = abs(x ** 2).astype('float32')

power = client.persist(power, optimize_graph=False)

This is a fun profile to inspect; it includes both the rechunking and the subsequent FFTs. We’ve included a real-time trace during execution, the full profile, as well as some diagnostics plots from a single worker. These plots total up to around 20MB. I sincerely apologize to those without broadband access.

Here is a real time plot of the computation finishing over time:

Dask task stream of rechunk + fft

And here is a single interactive plot of the entire computation after it completes. Zoom with the tools in the upper right. Hover over rectangles to get more information. Remember that red is communication.

Screenshots of the diagnostic dashboard of a single worker during this computation.

Worker communications during FFT

This computation starts with a lot of communication while we rechunk and realign our data (recent optimizations here by Antoine Pitrou in dask #417). Then we transition into doing thousands of small FFTs and other arithmetic operations. All of the plots above show a nice transition from heavy communication to heavy processing with some overlap each way (once some complex blocks are available we get to start overlapping communication and computation). Inter-worker communication was around 100-300 MB/s (typical for Amazon’s EC2) and CPU load remained high. We’re using our hardware.

Finally we can inspect the results. We see that the power spectrum is very boring in the corner, and has typical activity towards the center of the image.

plt.semilogy(1 + power[:, 0, 0].compute())

Power spectrum near edge

plt.semilogy(1 + power[:, 500, 500].compute())

Power spectrum at center

Final Thoughts

This blogpost showed a non-trivial image processing workflow, emphasizing the following points:

  1. Construct a Dask array from lazy SKImage calls.
  2. Use NumPy syntax with Dask.array to aggregate distributed data across a cluster.
  3. Build a centroid function with Numba. Use Numba and Dask together to clean up an image stack.
  4. Rechunk to facilitate time-series operations. Perform FFTs.

Hopefully this example has components that look similar to what you want to do with your data on your hardware. We would love to see more applications like this out there in the wild.

What we could have done better

As always with all computationally focused blogposts we’ll include a section on what went wrong and what we could have done better with more time.

  1. Communication is too expensive: Inter-worker communications that should be taking 200ms are taking up to 10 or 20 seconds. We need to take a closer look at our communications pipeline (which normally performs just fine on other computations) to see if something is acting up. Discussion here: dask/distributed #776, and early work here: dask/distributed #810.
  2. Faulty Load balancing: We discovered a case where our load-balancing heuristics misbehaved, incorrectly moving data between workers when it would have been better to let everything alone. This is likely due to the oddly low bandwidth issues observed above.
  3. Loading from disk blocks network I/O: While doing this we discovered an issue where loading large amounts of data from disk can block workers from responding to network requests (dask/distributed #774).
  4. Larger datasets: It would be fun to try this on a much larger dataset to see how the solutions here scale.

January 17, 2017 12:00 AM

January 12, 2017

Enthought

Webinar: An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop


Enthought’s Pandas Mastery Workshop is designed to accelerate the development of skill and confidence with Python’s Pandas data analysis package — in just three days, you’ll look like an old pro! This course was created from the ground up by our training experts, based on insights from the science of human learning as well as what we’ve learned from over a decade of extensive practical experience teaching thousands of scientists, engineers, and analysts to use Python effectively in their everyday work.

In this webinar, we’ll give you the key information and insight you need to evaluate whether the Pandas Mastery Workshop is the right solution to advance your data analysis skills in Python, including:

  • Who will benefit most from the course
  • A guided tour through the course topics
  • What skills you’ll take away from the course, and how the instructional design supports that
  • What the experience is like, and why it is different from other training alternatives (with a sneak peek at actual course materials)
  • What previous workshop attendees say about the course

Date and Registration Info:
January 26, 2017, 11-11:45 AM CT
Register (if you can’t attend, register and we’ll be happy to send you a recording of the session)



Presenter: Dr. Michael Connell, VP, Enthought Training Solutions

Ed.D, Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT


Why Focus on Pandas:

Python has been identified as the most popular coding language for five years in a row. One reason for its popularity, especially among data analysts, data scientists, engineers, and scientists across diverse industries, is its extensive library of powerful tools for data manipulation, analysis, and modeling. For anyone working with tabular data (perhaps currently using a tool like Excel, R, or SAS), Pandas is the go-to tool in Python that not only makes the bulk of your work easier and more intuitive, but also provides seamless access to more specialized packages like statsmodels (statistics), scikit-learn (machine learning), and matplotlib (data visualization). Anyone looking for an entry point into the general scientific and analytic Python ecosystem should start with Pandas!

Who Should Attend: 

Whether you’re a training or learning development coordinator who wants to learn more about our training options and approach, a member of a technical team considering group training, or an individual looking for engaging and effective Pandas training, this webinar will help you quickly evaluate how the Pandas Mastery Workshop can meet your needs.


Additional Resources

Upcoming Open Pandas Mastery Workshop Sessions:

London, UK, Feb 22-24
Chicago, IL, Mar 8-10
Albuquerque, NM, Apr 3-5
Washington, DC, May 10-12
Los Alamos, NM, May 22-24
New York City, NY, Jun 7-9

Learn More

Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.

Download Enthought’s Pandas Cheat Sheets

The post Webinar: An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop appeared first on Enthought Blog.

by admin at January 12, 2017 10:31 PM

Continuum Analytics news

Continuum Analytics Appoints Scott Collison as Chief Executive Officer

Tuesday, January 17, 2017

Continuum Analytics Appoints Scott Collison as Chief Executive Officer

AUSTIN, TEXAS—January 17, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced Scott Collison as the company’s new chief executive officer (CEO). Collison, a successful entrepreneur and former executive at VMware and Salesforce.com, also joins the Board of Directors to drive the strategy and operations of the company. Collison succeeds co-founder and fellow Board member Travis Oliphant, who will shift his focus to accelerating innovation within the dynamic Open Data Science community and to managing customer solutions as chief data scientist.
 
“During the past year, our company’s Anaconda product and services grew more than 100 percent. As the company’s co-founders, Peter Wang and I started to look for an executive to help us strategically guide our growth. We are delighted to welcome Scott; his entrepreneurial experience and open source background make him the perfect fit for our company’s mission,” said Oliphant. “My new role as chief data scientist frees me up to further our investment in open source technologies to advance the Open Data Science market and ensure customer success with Anaconda.”
 
Anaconda downloads from inception through the end of 2016 totaled more than 11 million, an increase of more than eight million from the previous year. The Python community is estimated at more than 30 million members and according to the most recent O’Reilly Data Science Survey, among data scientists, 72 percent prefer Python as their main tool.

“Continuum Analytics has experienced great success as evidenced by the millions of downloads, extraordinary product and services growth in 2016 and Anaconda becoming the de facto Open Data Science platform for tech giants including, Intel, IBM, Cloudera and Microsoft,” said Collison. “The data science market opportunity is pushing the boundaries at $140 billion and I’m excited to join the company and capitalize on my previous experience to manage this explosive growth and support its continued momentum.”
 
Collison previously held the position of vice president of Hybrid Platform at VMware and lifted the company’s high-growth cloud services business. Prior to that, he was vice president of Platform Go to Market at Salesforce.com. He was also instrumental in the sale of Signio (now a division of PayPal) to Verisign for $1.3 billion in 1999 and has held a variety of executive positions at both large software companies and startups, including Microsoft, SourceForge and Geeknet.

Scott Collison is a former Fulbright scholar and holds a Ph.D. from the University of California, Berkeley, a Master of Arts from the University of Freiburg (Germany) and a Bachelor of Arts from the University of Texas, Austin.
 
Join CEO Scott Collison and the Anaconda team at AnacondaCON 2017, Feb. 7-9th in Austin, Texas––the two-day event will bring together innovative enterprises on the journey to Open Data Science. Please register here to take advantage of our current two-for-one promotion and discounted hotel room block.
 
About Anaconda Powered by Continuum Analytics
Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 11 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

by swebster at January 12, 2017 06:22 PM

Matthew Rocklin

Distributed Pandas on a Cluster with Dask DataFrames

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

Dask Dataframe extends the popular Pandas library to operate on big data-sets on a distributed cluster. We show its capabilities by running through common dataframe operations on a common dataset. We break up these computations into the following sections:

  1. Introduction: Pandas is intuitive and fast, but needs Dask to scale
  2. Read CSV and Basic operations
    1. Read CSV
    2. Basic Aggregations and Groupbys
    3. Joins and Correlations
  3. Shuffles and Time Series
  4. Parquet I/O
  5. Final thoughts
  6. What we could have done better

Accompanying Plots

Throughout this post we accompany computational examples with profiles of exactly what task ran where on our cluster and when. These profiles are interactive Bokeh plots that include every task that every worker in our cluster runs over time. For example, the following read_csv computation produces the following profile:

>>> df = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv')

If you are reading this through a syndicated website like planet.python.org or through an RSS reader then these plots will not show up. You may want to visit http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes directly.

Dask.dataframe breaks up reading this data into many small tasks of different types. For example, reading bytes and parsing those bytes into pandas dataframes. Each rectangle corresponds to one task. The y-axis enumerates each of the worker processes. We have 64 processes spread over 8 machines so there are 64 rows. You can hover over any rectangle to get more information about that task. You can also use the tools in the upper right to zoom around and focus on different regions in the computation. In this computation we can see that workers interleave reading bytes from S3 (light green) and parsing bytes to dataframes (dark green). The entire computation took about a minute and most of the workers were busy the entire time (little white space). Inter-worker communication is always depicted in red (which is absent in this relatively straightforward computation).

Introduction

Pandas provides an intuitive, powerful, and fast data analysis experience on tabular data. However, because Pandas uses only one thread of execution and requires all data to be in memory at once, it doesn’t scale well to datasets much beyond the gigabyte scale; that scale-out component is missing. Generally people move to Spark DataFrames on HDFS or a proper relational database to resolve this scaling issue. Dask is a Python library for parallel and distributed computing that aims to fill this need for parallelism among the PyData projects (NumPy, Pandas, Scikit-Learn, etc.). Dask dataframes combine Dask and Pandas to deliver a faithful “big data” version of Pandas operating in parallel over a cluster.

I’ve written about this topic before. This blogpost is newer and will focus on performance and newer features like fast shuffles and the Parquet format.

CSV Data and Basic Operations

I have an eight node cluster on EC2 of m4.2xlarges (eight cores, 30GB RAM each). Dask is running on each node with one process per core.

We have the 2015 Yellow Cab NYC Taxi data as 12 CSV files on S3. We look at that data briefly with s3fs:

>>> import s3fs
>>> s3 = s3fs.S3FileSystem(anon=True)
>>> s3.ls('dask-data/nyc-taxi/2015/')
['dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-02.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-03.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-04.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-05.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-06.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-07.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-08.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-09.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-10.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-11.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-12.csv']

This data is too large to fit into Pandas on a single computer. However, it can fit in memory if we break it up into many small pieces and load these pieces onto different computers across a cluster.

We connect a client to our Dask cluster, composed of one centralized dask-scheduler process and several dask-worker processes running on each of the machines in our cluster.

from dask.distributed import Client
client = Client('scheduler-address:8786')

And we load our CSV data using dask.dataframe which looks and feels just like Pandas, even though it’s actually coordinating hundreds of small Pandas dataframes. This takes about a minute to load and parse.

import dask.dataframe as dd

df = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv',
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
                 storage_options={'anon': True})
df = client.persist(df)

This cuts up our 12 CSV files on S3 into a few hundred blocks of bytes, each 64MB large. On each of these 64MB blocks we then call pandas.read_csv to create a few hundred Pandas dataframes across our cluster, one for each block of bytes. Our single Dask Dataframe object, df, coordinates all of those Pandas dataframes. Because we’re just using Pandas calls it’s very easy for Dask dataframes to use all of the tricks from Pandas. For example we can use most of the keyword arguments from pd.read_csv in dd.read_csv without having to relearn anything.

This data is about 20GB on disk or 60GB in RAM. It’s not huge, but is also larger than we’d like to manage on a laptop, especially if we value interactivity. The interactive image above is a trace over time of what each of our 64 cores was doing at any given moment. By hovering your mouse over the rectangles you can see that cores switched between downloading byte ranges from S3 and parsing those bytes with pandas.read_csv.

Our dataset includes every cab ride in the city of New York in the year of 2015, including when and where it started and stopped, a breakdown of the fare, etc.

>>> df.head()
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2015-01-15 19:05:39 2015-01-15 19:23:42 1 1.59 -73.993896 40.750111 1 N -73.974785 40.750618 1 12.0 1.0 0.5 3.25 0.0 0.3 17.05
1 1 2015-01-10 20:33:38 2015-01-10 20:53:28 1 3.30 -74.001648 40.724243 1 N -73.994415 40.759109 1 14.5 0.5 0.5 2.00 0.0 0.3 17.80
2 1 2015-01-10 20:33:38 2015-01-10 20:43:41 1 1.80 -73.963341 40.802788 1 N -73.951820 40.824413 2 9.5 0.5 0.5 0.00 0.0 0.3 10.80
3 1 2015-01-10 20:33:39 2015-01-10 20:35:31 1 0.50 -74.009087 40.713818 1 N -74.004326 40.719986 2 3.5 0.5 0.5 0.00 0.0 0.3 4.80
4 1 2015-01-10 20:33:39 2015-01-10 20:52:58 1 3.00 -73.971176 40.762428 1 N -74.004181 40.742653 2 15.0 0.5 0.5 0.00 0.0 0.3 16.30

Basic Aggregations and Groupbys

As a quick exercise, we compute the length of the dataframe. When we call len(df) Dask.dataframe translates this into many len calls on each of the constituent Pandas dataframes, followed by communication of the intermediate results to one node, followed by a sum of all of the intermediate lengths.

>>> len(df)
146112989

This takes around 400-500ms. You can see that a few hundred length computations happened quickly on the left, followed by some delay, then a bit of data transfer (the red bar in the plot), and a final summation call.
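Conceptually this is the same pattern you could write by hand with dask.delayed; here is a rough sketch (not what dask.dataframe literally generates, and the partitions here are plain lists rather than Pandas dataframes):

from dask import delayed

# Pretend each of these lists is one in-memory partition of the larger dataset
parts = [list(range(1000)) for _ in range(300)]

partial_lengths = [delayed(len)(p) for p in parts]  # one len() call per partition
total = delayed(sum)(partial_lengths)               # combine the intermediate results
print(total.compute())                              # 300000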

More complex operations like simple groupbys look similar, although sometimes with more communications. Throughout this post we’re going to do more and more complex computations and our profiles will similarly become more and more rich with information. Here we compute the average trip distance, grouped by number of passengers. We find that single and double person rides go far longer distances on average. We achieve this one big-data-groupby by performing many small Pandas groupbys and then cleverly combining their results.

>>> df.groupby(df.passenger_count).trip_distance.mean().compute()
passenger_count
0     2.279183
1    15.541413
2    11.815871
3     1.620052
4     7.481066
5     3.066019
6     2.977158
9     5.459763
7     3.303054
8     3.866298
Name: trip_distance, dtype: float64
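
To make that "cleverly combining" step concrete, here is a toy, pure-Pandas sketch of the idea (the two small dataframes below are made-up stand-ins for partitions): each piece contributes per-group sums and counts, and the division happens only once at the end.

import pandas as pd

# Two made-up stand-ins for the many Pandas dataframes on the cluster.
part1 = pd.DataFrame({'passenger_count': [1, 1, 2], 'trip_distance': [1.0, 3.0, 2.0]})
part2 = pd.DataFrame({'passenger_count': [1, 2, 2], 'trip_distance': [5.0, 4.0, 6.0]})

# Each partition reports partial results: per-group sums and counts.
partials = [p.groupby('passenger_count').trip_distance.agg(['sum', 'count'])
            for p in (part1, part2)]

# Combine the partials, then divide once to get the global means.
combined = pd.concat(partials).groupby(level=0).sum()
global_mean = combined['sum'] / combined['count']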

As a more complex operation we see how well New Yorkers tip by hour of day and by day of week.

df2 = df[(df.tip_amount > 0) & (df.fare_amount > 0)]    # filter out bad rows
df2['tip_fraction'] = df2.tip_amount / df2.fare_amount  # make new column

dayofweek = (df2.groupby(df2.tpep_pickup_datetime.dt.dayofweek)
                .tip_fraction
                .mean())
hour      = (df2.groupby(df2.tpep_pickup_datetime.dt.hour)
                .tip_fraction
                .mean())

tip fraction by hour

We see that New Yorkers are generally pretty generous, tipping around 20%-25% on average. We also notice that they become very generous at 4am, tipping an average of 38%.

This more complex operation uses more of the Dask dataframe API (which mimics the Pandas API). Pandas users should find the code above fairly familiar. We remove rows with zero fare or zero tip (not every tip gets recorded), make a new column which is the ratio of the tip amount to the fare amount, and then groupby the day of week and hour of day, computing the average tip fraction for each hour/day.

Dask evaluates this computation with thousands of small Pandas calls across the cluster (try clicking the wheel zoom icon in the upper right of the image above and zooming in). The answer comes back in about 3 seconds.
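
Materializing and plotting those lazy results might look like the sketch below; this assumes matplotlib is available and isn't necessarily how the figure above was produced.

tips_by_hour = hour.compute()       # now an ordinary Pandas series
tips_by_day = dayofweek.compute()
tips_by_hour.plot()                 # something like the "tip fraction by hour" figure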

Joins and Correlations

To show off more basic functionality we’ll join this Dask dataframe against a smaller Pandas dataframe that includes names of some of the more cryptic columns. Then we’ll correlate two derived columns to determine if there is a relationship between paying Cash and the recorded tip.

>>> payments = pd.Series({1: 'Credit Card',
                          2: 'Cash',
                          3: 'No Charge',
                          4: 'Dispute',
                          5: 'Unknown',
                          6: 'Voided trip'})

>>> df2 = df.merge(payments.to_frame(name='payment_name'),
...                left_on='payment_type', right_index=True)
>>> df2.groupby(df2.payment_name).tip_amount.mean().compute()
payment_name
Cash           0.000217
Credit Card    2.757708
Dispute       -0.011553
No charge      0.003902
Unknown        0.428571
Name: tip_amount, dtype: float64

We see that while the average tip for a credit card transaction is $2.75, the average tip for a cash transaction is very close to zero. At first glance it seems like cash tips aren’t being reported. To investigate this a bit further let’s compute the Pearson correlation between paying cash and having zero tip. Again, this code should look very familiar to Pandas users.

zero_tip = df2.tip_amount == 0
cash     = df2.payment_name == 'Cash'

dd.concat([zero_tip, cash], axis=1).corr().compute()
tip_amount payment_name
tip_amount 1.000000 0.943123
payment_name 0.943123 1.000000

So we see that standard operations like row filtering, column selection, groupby-aggregations, joining with a Pandas dataframe, correlations, etc. all look and feel like the Pandas interface. Additionally, we’ve seen through profile plots that most of the time is spent just running Pandas functions on our workers, so Dask.dataframe is, in most cases, adding relatively little overhead. The little functions represented by the rectangles in these plots are just Pandas functions. For example, the plot above has many rectangles labeled merge if you hover over them. This is just the standard pandas.merge function that we know and love, and that is very fast in memory.

Shuffles and Time Series

Distributed dataframe experts will know that none of the operations above require a shuffle. That is, we can do most of our work with relatively little inter-node communication. However, not all operations can avoid communication like this, and sometimes we need to exchange most of the data between different workers.

For example, if our dataset is sorted by customer ID but we want to sort it by time, then we need to collect all the rows for January onto one Pandas dataframe, all the rows for February onto another, and so on. This operation is called a shuffle and is the basis of computations like groupby-apply, distributed joins on columns that are not the index, etc.

You can do a lot with dask.dataframe without performing shuffles, but sometimes it’s necessary. In the following example we sort our data by pickup datetime. This will allow fast lookups, fast joins, and fast time series operations, all common cases. We do one shuffle ahead of time to make all future computations fast.

We set the index as the pickup datetime column. This takes anywhere from 25-40s and is largely network bound (60GB, some text, eight machines with eight cores each on AWS non-enhanced network). This also requires running something like 16000 tiny tasks on the cluster. It’s worth zooming in on the plot below.

>>> df = client.persist(df.set_index('tpep_pickup_datetime'))

This operation is expensive, far more expensive than it was with Pandas when all of the data was in the same memory space on the same computer. This is a good time to point out that you should only use distributed tools like Dask.dataframe and Spark after tools like Pandas break down; we should only move to distributed systems when absolutely necessary. However, when it does become necessary, it’s nice knowing that Dask.dataframe can faithfully execute Pandas operations, even if some of them take a bit longer.

As a result of this shuffle our data is now nicely sorted by time, which will keep future operations close to optimal. We can see how the dataset is sorted by pickup time by quickly looking at the first entries, last entries, and entries for a particular day.

>>> df.head()  # has the first entries of 2015
VendorID tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
tpep_pickup_datetime
2015-01-01 00:00:00 2 2015-01-01 00:00:00 3 1.56 -74.001320 40.729057 1 N -74.010208 40.719662 1 7.5 0.5 0.5 0.0 0.0 0.3 8.8
2015-01-01 00:00:00 2 2015-01-01 00:00:00 1 1.68 -73.991547 40.750069 1 N 0.000000 0.000000 2 10.0 0.0 0.5 0.0 0.0 0.3 10.8
2015-01-01 00:00:00 1 2015-01-01 00:11:26 5 4.00 -73.971436 40.760201 1 N -73.921181 40.768269 2 13.5 0.5 0.5 0.0 0.0 0.0 14.5
>>> df.tail()  # has the last entries of 2015
VendorID tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
tpep_pickup_datetime
2015-12-31 23:59:56 1 2016-01-01 00:09:25 1 1.00 -73.973900 40.742893 1 N -73.989571 40.750549 1 8.0 0.5 0.5 1.85 0.0 0.3 11.15
2015-12-31 23:59:58 1 2016-01-01 00:05:19 2 2.00 -73.965271 40.760281 1 N -73.939514 40.752388 2 7.5 0.5 0.5 0.00 0.0 0.3 8.80
2015-12-31 23:59:59 2 2016-01-01 00:10:26 1 1.96 -73.997559 40.725693 1 N -74.017120 40.705322 2 8.5 0.5 0.5 0.00 0.0 0.3 9.80
>>> df.loc['2015-05-05'].head()  # has the entries for just May 5th
VendorID tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
tpep_pickup_datetime
2015-05-05 2 2015-05-05 00:00:00 1 1.20 -73.981941 40.766460 1 N -73.972771 40.758007 2 6.5 1.0 0.5 0.00 0.00 0.3 8.30
2015-05-05 1 2015-05-05 00:10:12 1 1.70 -73.994675 40.750507 1 N -73.980247 40.738560 1 9.0 0.5 0.5 2.57 0.00 0.3 12.87
2015-05-05 1 2015-05-05 00:07:50 1 2.50 -74.002930 40.733681 1 N -74.013603 40.702362 2 9.5 0.5 0.5 0.00 0.00 0.3 10.80

Because we know exactly which Pandas dataframe holds which data we can execute row-local queries like this very quickly. The total round trip from pressing enter in the interpreter or notebook is about 40ms. For reference, 40ms is the delay between two frames in a movie running at 25 Hz. This means that it’s fast enough that human users perceive this query to be entirely fluid.

Time Series

Additionally, once we have a nice datetime index all of Pandas’ time series functionality becomes available to us.

For example we can resample by day:

>>> (df.passenger_count
       .resample('1d')
       .mean()
       .compute()
       .plot())

resample by day

We observe a strong periodic signal here. The number of passengers is reliably higher on the weekends.

We can perform a rolling aggregation in about a second:

>>> s = client.persist(df.passenger_count.rolling(10).mean())

Because Dask.dataframe inherits the Pandas index all of these operations become very fast and intuitive.

Parquet

Pandas’ standard “fast” recommended storage solution has generally been the HDF5 data format. Unfortunately the HDF5 file format is not ideal for distributed computing, so most Dask dataframe users have historically had to fall back to CSV. This is unfortunate because CSV is slow, doesn’t support partial queries (you can’t read in just one column), and also isn’t supported well by the other standard distributed Dataframe solution, Spark. This makes it hard to move data back and forth.

Fortunately there are now two decent Python readers for Parquet, a fast columnar binary store that shards nicely on distributed data stores like the Hadoop File System (HDFS, not to be confused with HDF5) and Amazon’s S3. The already fast Parquet-cpp project has been growing Python and Pandas support through Arrow, and the Fastparquet project, an offshoot of the pure-Python parquet library, has been growing speed through use of NumPy and Numba.

Using Fastparquet under the hood, Dask.dataframe users can now happily read and write to Parquet files. This increases speed, decreases storage costs, and provides a shared format that both Dask dataframes and Spark dataframes can understand, improving the ability to use both computational systems in the same workflow.

Writing our Dask dataframe to S3 can be as simple as the following:

df.to_parquet('s3://dask-data/nyc-taxi/tmp/parquet')

However there are also a variety of options we can use to store our data more compactly through compression, encodings, etc. Expert users will probably recognize some of the terms below.

df = df.astype({'VendorID': 'uint8',
                'passenger_count': 'uint8',
                'RateCodeID': 'uint8',
                'payment_type': 'uint8'})

df.to_parquet('s3://dask-data/nyc-taxi/tmp/parquet',
              compression='snappy',
              has_nulls=False,
              object_encoding='utf8',
              fixed_text={'store_and_fwd_flag': 1})

We can then read our nicely indexed dataframe back with the dd.read_parquet function:

>>> df2 = dd.read_parquet('s3://dask-data/nyc-taxi/tmp/parquet')

The main benefit here is that we can quickly compute on single columns. The following computation runs in around 6 seconds, even though we don’t have any data in memory to start (recall that we started this blogpost with a minute-long call to read_csv and client.persist).

>>> df2.passenger_count.value_counts().compute()
1    102991045
2     20901372
5      7939001
3      6135107
6      5123951
4      2981071
0        40853
7          239
8          181
9          169
Name: passenger_count, dtype: int64
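
Because Parquet is columnar, we can also be explicit about it and ask for only the columns we need up front; the columns= keyword below is part of dd.read_parquet (a sketch continuing the example above):

# Read just one column of the Parquet dataset rather than the whole table.
pc = dd.read_parquet('s3://dask-data/nyc-taxi/tmp/parquet',
                     columns=['passenger_count'])
pc.passenger_count.value_counts().compute()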

Final Thoughts

With the recent addition of faster shuffles and Parquet support, Dask dataframes become significantly more attractive. This blogpost gave a few categories of common computations, along with precise profiles of their execution on a small cluster. Hopefully people find this combination of Pandas syntax and scalable computing useful.

Now would also be a good time to remind people that Dask dataframe is only one module among many within the Dask project. Dataframes are nice, certainly, but Dask’s main strength is its flexibility to move beyond just plain dataframe computations to handle even more complex problems.

Learn More

If you’d like to learn more about Dask dataframe, the Dask distributed system, or other components you should look at the following documentation:

  1. http://dask.pydata.org/en/latest/
  2. http://distributed.readthedocs.io/en/latest/

The workflows presented here are captured in the following notebooks (among other examples):

  1. NYC Taxi example, shuffling, others
  2. Parquet

What we could have done better

As always with computational posts we include a section on what went wrong, or what could have gone better.

  1. The 400ms computation of len(df) is a regression from previous versions where this was closer to 100ms. We’re getting bogged down somewhere in many small inter-worker communications.
  2. It would be nice to repeat this computation at a larger scale. Dask deployments in the wild are often closer to 1000 cores rather than the 64 core cluster we have here, and datasets are often in the terabyte scale rather than our 60 GB NYC Taxi dataset. Unfortunately representative large open datasets are hard to find.
  3. The Parquet timings are nice, but there is still room for improvement. We seem to be making many small expensive queries of S3 when reading Thrift headers.
  4. It would be nice to support both Python Parquet readers: the Numba solution fastparquet and the C++ solution parquet-cpp.

January 12, 2017 12:00 AM

January 10, 2017

Fabian Pedregosa

Optimization inequalities cheatsheet

Most proofs in optimization consist of using inequalities for a particular function class in some creative way. This is a cheatsheet with the inequalities that I use most often. It considers the classes of functions that are convex, strongly convex and $L$-smooth.

Setting. $f$ is a function $\mathbb{R}^p \to \mathbb{R}$. Below are a set of inequalities that are verified when $f$ belongs to a particular class of functions and $x, y \in \mathbb{R}^p$ are arbitrary elements in its domain.

$f$ is $L$-smooth. This is the class of differentiable functions whose gradient is Lipschitz continuous with constant $L$.

  • $\|\nabla f(y) - \nabla f(x) \| \leq L\|x - y\|$
  • $|f(x) - f(y) - \langle \nabla f(x), y - x\rangle| \leq \frac{L}{2}\|y - x\|^2$
  • $\|\nabla^2 f(x)\| \leq L\qquad \text{ (assuming $f$ is twice differentiable)} $

$f$ is convex.

  • $f(x) \leq f(y) + \langle \nabla f(x), x - y\rangle$
  • $0 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$
  • $f(\mathbb{E}X) \leq \mathbb{E}[f(X)]$ where $X$ is a random variable (Jensen's inequality).

$f$ is both $L$-smooth and convex:

  • $\frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$
  • $0 \leq f(y) - f(x) - \langle \nabla f(x), y - x\rangle \leq \frac{L}{2}\|x - y\|^2$
  • $f(x) \leq f(y) + \langle \nabla f(x), x - y\rangle - \frac{1}{2 L}\|\nabla f(x) - \nabla f(y)\|^2$

$f$ is $\mu$-strongly convex. This is the set of functions $f$ such that $f - \frac{\mu}{2}\|\cdot\|^2$ is convex. For $\mu=0$ it reduces to the set of convex functions.

  • $f(x) \leq f(y) + \langle \nabla f(x), x - y \rangle - \frac{\mu}{2}\|x - y\|^2$
  • $\mu\|x - y\|^2 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$
  • $\frac{\mu}{2}\|x-x^*\|^2\leq f(x) - f(x^*)$, where $x^*$ is the minimizer of $f$

$f$ is both $L$-smooth and $\mu$-strongly convex.
  • $\frac{\mu L}{\mu + L}\|x - y\|^2 + \frac{1}{\mu + L}\|\nabla f(x) - \nabla f(y)\|^2 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$
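
Not part of the original cheatsheet, but these bounds are easy to sanity-check numerically. A minimal NumPy sketch for the convex quadratic $f(x) = \frac{1}{2}x^\top A x$, where $L$ and $\mu$ are the largest and smallest eigenvalues of $A$:

import numpy as np

rng = np.random.RandomState(0)
p = 5
B = rng.randn(p, p)
A = B @ B.T + np.eye(p)                  # symmetric positive definite
eigs = np.linalg.eigvalsh(A)
mu, L = eigs.min(), eigs.max()           # strong convexity / smoothness constants

grad = lambda x: A @ x                   # gradient of f(x) = 0.5 * x.T @ A @ x
x, y = rng.randn(p), rng.randn(p)

# L-smoothness: the gradient is L-Lipschitz.
assert np.linalg.norm(grad(y) - grad(x)) <= L * np.linalg.norm(x - y) + 1e-12

# mu-strong convexity: the gradient is strongly monotone.
assert (grad(x) - grad(y)) @ (x - y) >= mu * np.linalg.norm(x - y) ** 2 - 1e-12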

References

Most of these inequalities appear in the book "Introductory Lectures on Convex Optimization: A Basic Course" by Yurii Nesterov (2013, Springer Science & Business Media). Another good source (and freely available for download) is the book "Convex Optimization" by Stephen Boyd and Lieven Vandenberghe.

by Fabian Pedregosa at January 10, 2017 11:00 PM

Titus Brown

How I learned to stop worrying and love the coming archivability crisis in scientific software

Note: This is the fifth post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future.

This post was put together after the event and benefited greatly from conversations with Victoria Stodden, Yolanda Gil, Monya Baker, Gail Peretsman-Clement, and Kristin Antelman!


Archivability is a disaster in the software world

In The talk I didn't give at Caltech, I pointed out that our current software stack is connected, brittle, and non-repeatable. This affects our ability to store and recover science from archives.

Basically, in our lab, we find that our executable papers routinely break over time because of minor changes to dependent packages or libraries.

Yes, the software stack is constantly changing! Why?!

Let me back up --

Our analysis routines usually depend on an extensive hierarchy of packages. We may be writing bespoke scripts on top of our own library, but those scripts and that library sit on top of other libraries, which in turn use the Python language, the GNU ecosystem, Linux, and a bunch of firmware. All of this rests on a not-always-that-sane hardware implementation that occasionally throws up errors because x was compiled on y processor but is running on z processor.

We've had every part of this stack cause problems for us.

Three examples:

  • many current repeatability stacks are starting to rely on Docker. But Docker changes routinely, and it's not at all clear that the images you save today will work tomorrow. Dockerfiles (which provide the instructions for building images) should be more robust, but there is a tendency to have Dockerfiles run complex shell scripts that may themselves break due to software stack changes.

    But the bigger problem is that Docker just isn't that robust.

    Don't believe me? For more, read this and weep: Docker in Production: A history of Failure.

  • software stacks are astoundingly complex in ways that are mostly hidden when things are working (i.e. in the moment) but that block any kind of longitudinal robustness. Perhaps the best illustration of this in recent time is the JavaScript debacle where the author of "left-pad" pulled it from the packaging system, throwing the JavaScript world into temporary insanity.

  • practically, we can already see the problem - go sample from A gallery of interesting Jupyter Notebooks. Pick five. Try to run them. Try to install the stuff needed to run them. Weep in despair.

    (This is also true of mybinder repos, just in case you're wondering; many of my older ones simply don't work, for a myriad of reasons.)

These are big, real issues that affect anyone whose scientific software relies on code written outside their own project (which is everyone - see "Linux operating system" and/or "firmware" above.)

My conclusion is that, on a decadal time scale, we cannot rely on software to run repeatably.

This connects to two other important issues.

First, since data implies software, we're rapidly moving into a space where the long tail of data is going to become useless because the software needed to interpret it is vanishing. (We're already seeing this with 454 sequence data, which is less than 10 years old; very few modern bioinformatics tools will ingest it, but we have an awful lot of it in the archives.)

Second, it's not clear to me that we'll actually know if the software is running robustly, which is far worse than simply having it break. (The situation above with Jupyter Notebooks is hence less problematic than the subtle changes in behavior that will come from e.g. Python 5.0 fixing behavioral bugs that our code relied on in Python 3.)

I expect that in situations where language specs have changed, or library bugs have been fixed, there will simply be silent changes in output. Detecting this behavior is hard. (In our own khmer project, we've started including tests that compare the md5sum of running the software on small data sets to stored md5sums, which gets us part of the way there, but is by no means sufficient.)
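
A minimal sketch of that kind of guard, with made-up command and file names and a placeholder digest: run the tool on a tiny fixed input, hash the output, and compare against a value recorded from a known-good run.

import hashlib
import subprocess

def md5sum(path):
    # md5 hex digest of a file's contents
    h = hashlib.md5()
    with open(path, 'rb') as fp:
        for chunk in iter(lambda: fp.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

def test_small_dataset_output_unchanged():
    # Hypothetical command and files; the point is the pattern, not the tool.
    subprocess.check_call(['mytool', 'small-input.fa', '-o', 'small-output.txt'])
    # Placeholder: substitute the digest recorded from a known-good run.
    assert md5sum('small-output.txt') == '<recorded digest>'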

If archivability is a problem, what's the solution?

So I think we're heading towards a future where even perfectly repeatable research will not have any particular longevity, unless it's constantly maintained and used (which is unrealistic for most research software - heck, we can't even maintain the stuff we're using right now this very instant.)

Are there any solutions?

First, some things that definitely aren't solutions:

  • Saving ALL THE SOFTWARE is not a solution; you simply can't, because of the reliance on software/firmware/hardware interactions.

  • Blobbing it all up in a gigantic virtual machine image simply pushes the turtle one stack frame down: now you've got to worry about keeping VM images running consistently. I suppose it's possible but I don't expect to see people converge on this solution anytime soon.

    More importantly, VMs and docker images may let you reach bitwise reproducibility, but they're not scientifically useful because they're big black boxes that don't really let you reuse or remix the contents; see Virtual machines considered harmful for reproducibility and The post-apocalyptic world of binary containers.

  • Not using or relying on other software isn't a practical solution: first, good luck with that ;). Second, see "firmware", above.

    And, third, while there is definitely a crowd of folk who like to reimplement everything themselves, there is every likelihood that their code is wronger and/or buggier than widely used community software; Gael Varoquaux makes this point very well in his blog post, Software for reproducible science.

    I don't think trading archivability for incorrectness is a good trade :).

The two solutions that I do see are these:

  • run everything all the time.

    This is essentially what the software world does with continuous integration. They run all their tests and pipelines all the time, just to check that it's all working. (See "Continuous integration at Google Scale".)

    Recently, my #MooreData colleagues Brett Beaulieau and Casey Greene proposed exactly this for scientific papers, in their preprint "Reproducible Computational Workflows with Continuous Analysis".

    While this is a potential solution, it's rather heavyweight to set up, and (more importantly) it gets kind of expensive -- Google runs many compute-years of code each day -- and I worry that the cost to utility ratio is not in science's favor. This is especially true when you consider that most research ends up being a dead end - unread, uncited, and unimportant - but of course you don't know which until much later...

  • acknowledge that exact repeatability has a half life of utility, and that this is OK.

    I've only just started thinking about this in detail, but it is at least plausible to argue that we don't really care about our ability to exactly re-run a decade old computational analysis. What we do care about is our ability to figure out what was run and what the important decisions were -- something that Yolanda Gil refers to as "inspectability." But exact repeatability has a short shelf-life.

    This has a couple of interesting implications that I'm just starting to unpack mentally:

    • maybe repeatability for science's sake can be thought of as a short-term aid in peer review, to make sure that the methods are suitably explicit and not obviously incorrect. (Another use for exact repeatability is enabling reuse and remixing, of course, which is important for scientific progress.)
    • as we already knew, closed source software is useless crap because it satisfies neither repeatability nor inspectability. But maybe it's not that important (for inspectability) to allow software reuse with a F/OSS license? (That license is critical for reuse and remixing, though.)
    • maybe we could and should think of articulating "half lives" for research products, and acknowledge explicitly that most research won't pass the test of time.
    • but perhaps this last point is a disaster for the kind of serendipitous reuse of old data that Christie Bahlai and Amanda Whitmire have convinced me is important.

    Huge (sincere) thanks to Gail for arguing both sides of this, including saying that (a) archive longevity is super important because everything has to be saved or else it's a disaster for humanity, and (b) maybe we don't care about saving everything because after all we can still read Latin even if we don't actually get the full cultural context and don't know how to pronounce the words, and (c) hey maybe the full cultural context is important and we should endeavor to save it all after all. <exasperation>Librarians!</exasperation>

Lots for me to think on.

--titus

by C. Titus Brown at January 10, 2017 11:00 PM

The talk I didn't give at Caltech (Paper of the Future)

Note: This is the fourth post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future.

This is an outline of the talk I didn't give at Caltech, because I decided that Victoria Stodden and Yolanda Gil were going to cover most of it and I would rather talk about a random collection of things that they might not talk about. (I think I was 7 for 10 on that. ;)

This is in outline-y form, but I think it's fairly understandable. Ask questions in the comments if not!


What will the paper of the future look like?

A few assertions about the scientific paper of the future:

  • The paper of the future will be open - open access, open data, and open source.
  • The paper of the future will be highly repeatable.
  • The paper of the future will be linked.
  • The paper of the future will not depend on expensive infrastructure.
  • The paper of the future will be commonplace.
  • The paper of the future will be archivable (or will it? Read on.)

What's our experience with the paper of the future been?

My lab (and many, many others) have been doing things like:

  • Automating the entire analysis from raw data to conclusion.
  • Publishing data narratives and notebooks.
  • Using version control for paper and data notebook and source code.
  • Anointing data sets with DOIs.
  • Posting virtual environments & execution specifications for papers.

We've been doing parts of this for many years, and while we're not always that systematic about certain parts, I can say that everything works fairly smoothly. The biggest issues we have often seem to be about the small details, such as choice of workflow engine, whether we're using AWS or an HPC as our "reference location" to run stuff, etc.

From this experience, I see two problems:

The two big problems I see

  • Adoption!

    We need community use & experience & training; we also need funder and journal buy-in.

    The training aspect is what Software Carpentry and Data Carpentry focus on, and it's one of the reasons I'm involved with them.

  • Archivability!

    Our software stack is anything but robust, static, or archivable.

    This is a huge problem that I don't think is accorded enough attention.

This last issue, archivability, is both somewhat technical and important - so I decided to move that to a new blog post, "How I learned to stop worrying and love the coming archivability crisis in scientific software".

Concluding thoughts

In which I summarize the above :)

--titus

by C. Titus Brown at January 10, 2017 11:00 PM

Enthought

Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

Data exploration, manipulation, and visualization start with loading data, be it from files or from a URL. Pandas has become the go-to library for all things data analysis in Python, but if your intention is to jump straight into data exploration and manipulation, the Canopy Data Import Tool can help, instead of having to learn the details of programming with the Pandas library.

The Data Import Tool leverages the power of Pandas while providing an interactive UI, allowing you to visually explore and experiment with the DataFrame (the Pandas equivalent of a spreadsheet or a SQL table), without having to know the details of the Pandas-specific function calls and arguments. The Data Import Tool keeps track of all of the changes you make (in the form of Python code). That way, when you are done finding the right workflow for your data set, the Tool has a record of the series of actions you performed on the DataFrame, and you can apply them to future data sets for even faster data wrangling in the future.

At the same time, the Tool can help you pick up how to use the Pandas library, while still getting work done. For every action you perform in the graphical interface, the Tool generates the appropriate Pandas/Python code, allowing you to see and relate the tasks to the corresponding Pandas code.

With the Data Import Tool, loading data is as simple as choosing a file or pasting a URL. If a file is chosen, it automatically determines the format of the file, whether or not the file is compressed, and intelligently loads the contents of the file into a Pandas DataFrame. It does so while taking into account various possibilities that often throw a monkey wrench into initial data loading: that the file might contain lines that are comments, it might contain a header row, the values in different columns could be of different types e.g. DateTime or Boolean, and many more possibilities as well.

Importing files or data into Pandas with the Canopy Data Import Tool

The Data Import Tool makes loading data into a Pandas DataFrame as simple as choosing a file or pasting a URL.

A Glimpse into Loading Data into Pandas DataFrames (The Hard Way)

The following 4 “inconvenience” examples show typical problems (and the manual solutions) that might arise if you are writing Pandas code to load data, which are automatically solved by the Data Import Tool, saving you time and frustration, and allowing you to get to the important work of data analysis more quickly.

Let’s say you were to load data from a file yourself. After searching the Pandas documentation a bit, you will come across the pandas.read_table function, which loads the contents of a file into a Pandas DataFrame. But it’s never so easy in practice: pandas.read_table and other functions you might find assume certain defaults, which might be at odds with the data in your file.


Inconvenience #1: Data in the first row will automatically be used as a header.  Let’s say that your file (like this one: [wind.data]) uses whitespace as the separator between columns and doesn’t have a row containing column names. pandas.read_table assumes by default that your file contains a header row and uses tabs for delimiters. If you don’t tell it otherwise, Pandas will use the data from the first row in your file as column names, which is clearly wrong in this case.

From the docs, you can discover that this behavior can be turned off by passing header=None and sep='\s+' to pandas.read_table, which uses varying whitespace as the separator and tells Pandas that a header row doesn’t exist. First, here is what happens when we only specify the separator:

In [1]: df = pandas.read_table('wind.data', sep='\s+')
In [2]: df.head()
Out[2]:
61  1  1.1  15.04  14.96  13.17   9.29  13.96  9.87  13.67  10.25  10.83  \
0  61  1    2  14.71  16.88  10.83   6.50  12.62  7.67  11.50  10.04   9.79
1  61  1    3  18.50  16.88  12.33  10.13  11.17  6.17  11.25   8.04   8.50
12.58  18.50  15.04.1
0   9.67  17.54    13.83
1   7.67  12.75    12.71

Without the header=None kwarg, you can see that the first row of data is being used as column names. Adding header=None fixes this:

In [3]: df = pandas.read_table('wind.data', header=None, sep='\s+')
In [4]: df.head()
Out[4]:
0   1   2      3      4      5      6      7     8      9      10     11  \
0  61   1   1  15.04  14.96  13.17   9.29  13.96  9.87  13.67  10.25  10.83
1  61   1   2  14.71  16.88  10.83   6.50  12.62  7.67  11.50  10.04   9.79
12     13     14
0  12.58  18.50  15.04
1   9.67  17.54  13.83

This is the behavior we expected, now that we have told Pandas that the file does not contain a row of column names (header=None) and specified the separator.

[File : test_data_comments.txt]


Inconvenience #2: Commented lines cause the data load to fail.  Next let’s say that your file contains commented lines which start with a #. Pandas doesn’t understand this by default, and trying to load the data into a DataFrame will either fail with an error or, worse, succeed without notifying you that some rows in the DataFrame might contain erroneous data from the commented lines.  (This might also prevent correct inference of column types.)

Again, you can tell pandas.read_table that commented lines exist in your file and to skip them using comment='#'. Without that argument, the load fails:

In [1]: df = pandas.read_table('test_data_comments.txt', sep=',', header=None)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-10-b5cd8eee4851> in <module>()
----> 1 df = pandas.read_table('catalyst/tests/data/test_data_comments.txt', sep=',', header=None)
(traceback)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 5

As mentioned earlier, if you are lucky, Pandas will fail with a CParserError, complaining that each row contains a different number of columns in the data file. Needless to say, it’s not obvious from the error that the culprit is an unidentified comment line. Passing comment='#' fixes it:

In [2]: df = pandas.read_table('test_data_comments.txt', sep=',', comment='#', header=None)
In [3]: df
Out[3]:
0   1    2      3            4
0  1  False  1.0    one   2015-01-01
1  2   True  2.0    two   2015-01-02
2  3  False  3.0  three  2015-01-03
3  4   True  4.0  four  2015-01-04

And we can read the file contents correctly once we tell Pandas that ‘#’ is the character that commented lines in the file start with.

[File : ncaa_basketball_2016.txt]


Inconvenience #3: Fixed-width formatted data will cause data load to fail.  Now let’s say that your file contains data in a fixed-width format. Trying to load this data using pandas.read_table will fail.

Dig around a little and you will come across the function pandas.read_fwf, which is the suggested way to load data from fixed-width files, not pandas.read_table.

In [1]: df = pandas.read_table('ncaa_basketball_2016.txt', header=None)
In [2]: df.head()
Out[2]:
0
0  2016-02-25 @Ark Little Rock          72  UT Ar...
1  2016-02-25  ULM                      66 @South...

Those of you familiar with Pandas will recognize that the above DataFrame, created from the file, contains only one column, labelled 0. That is clearly wrong, because there are 4 distinct columns in the file.

In [3]: df = pandas.read_table('ncaa_basketball_2016.txt', header=None, sep='\s+')
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-28-db4f2f128b37> in <module>()
----> 1 df = pandas.read_table('functional_tests/data/ncaa_basketball_2016.txt', header=None, sep='\s+')
(Traceback)
CParserError: Error tokenizing data. C error: Expected 8 fields in line 55, saw 9

If we didn’t know better, we would’ve assumed that the delimiter/separator character used in the file was whitespace. We can tell Pandas to load the file again, assuming that the separator is whitespace, represented using \s+. But, as you can clearly see above, that raises a CParserError, complaining that it noticed more columns of data in one row than in the rows before it.

In [4]: df = pandas.read_fwf('ncaa_basketball_2016.txt', header=None)
In [5]: df.head()
Out[5]:
0                 1   2               3   4    5
0  2016-02-25  @Ark Little Rock  72    UT Arlington  60  NaN
1  2016-02-25               ULM  66  @South Alabama  59  NaN

And finally, using pandas.read_fwf instead of pandas.read_table gives us a DataFrame that is close to what we expected, given the data in the file.


Inconvenience #4: NA is not recognized as text; automatically converted to NaN:  Finally, let’s assume that you have raw data containing the string NA, which in this specific case is used to represent North America. By default pandas.read_csv interprets these string values as missing data and automatically converts them to NaN. And Pandas does all of this underneath the hood, without informing the user. One of the things that the Zen of Python says is that Explicit is better than implicit. In that spirit, the Tool explicitly lists the values which will be interpreted as None/NaN.

The user can remove NA (or any of the other values) from this list, to prevent it from being interpreted as None, as shown in the following file:

[File : test_data_na_values.csv]

In [2]: df = pandas.read_table('test_data_na_values.csv', sep=',', header=None)
In [3]: df
Out[3]:
0  1       2
0 NaN  1    True
1 NaN  2   False
2 NaN  3   False
3 NaN  4    True
In [4]: df = pandas.read_table('test_data_na_values.csv', sep=',', header=None, keep_default_na=False, na_values=[])
In [5]: df
Out[5]:
0  1       2
0  NA  1    True
1  NA  2   False
2  NA  3   False
3  NA  4    True

If your intention was to jump straight into data exploration and manipulation, then the above points are some of the inconveniences that you will have to deal with, requiring you to learn the various arguments that need to be passed to pandas.read_table before you can load your data correctly and get to your analysis.
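
To recap the manual route in one place, here is a sketch with a made-up file name, combining the keyword arguments discussed above:

import pandas as pd

df = pd.read_table('my_messy_file.txt',
                   header=None,            # no row of column names
                   sep='\s+',              # varying whitespace as the delimiter
                   comment='#',            # skip commented lines
                   keep_default_na=False,  # don't treat 'NA' etc. as missing...
                   na_values=[])           # ...unless explicitly listed here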


Loading Data with the Data Import Tool (The Easy Way)

Use the Data Import Tool to automatically set up the correct file assumptions

The Canopy Data Import Tool automatically circumvents several common data loading inconveniences and errors by simply setting up the correct file assumptions in the Edit Command dialog box.

The Data Import Tool takes care of all of these problems for you, allowing you to fast forward to the important work of data exploration and manipulation. It automatically:

  1. Infers if your file contains a row of column names or not;
  2. Intelligently infers if your file contains any commented lines and what the comment character is;
  3. Infers what delimiter is used in the file or if the file contains data in a fixed-width format.

Download Canopy (free) and start a free trial of the Data Import Tool to see just how much time and frustration you can save!


The Data Import Tool as a Learning Resource: Using Auto-Generated Python/Pandas code

So far, we talked about how the Tool can help you get started with data exploration, without the need for you to understand the Pandas library and its intricacies. But, what if you were also interested in learning about the Pandas library? That’s where the Python Code pane in the Data Import Tool can help.

As you can see from the screenshot below, the Data Import Tool generates Pandas/Python code for every command you execute. This way, you can explore and learn about the Pandas library using the Tool.

See generated Python / Pandas code to help learn underlying code for data wrangling tasks.

View the underlying Python / Pandas code in the Data Import Tool to help learn Pandas code, without slowing down your work.

Finally, once you are done loading data from the file and manipulating the DataFrame, you can export the DataFrame to Canopy’s IPython console for further analysis and visualization. Simply click Use DataFrame at the bottom-right corner and the Tool will export the DataFrame to Canopy’s IPython pane, as you can see below.


Import the cleaned data into the Canopy IPython console for further data analysis and visualization.

 Ready to try the Canopy Data Import Tool?

Download Canopy (free) and click on the icon to start a free trial of the Data Import Tool today


Additional resources:

Watch a 2-minute demo video to see how the Canopy Data Import Tool works:

See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging:

The post Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way appeared first on Enthought Blog.

by admin at January 10, 2017 06:44 PM

January 08, 2017

Titus Brown

Topics and concepts I'm excited about (Paper of the Future)

Note: This is the third post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future.


I've been struggling to put together an interesting talk for the workshop, and last night Gail Clement (our host, @Repositorian) and Justin Bois helped me convince myself (using red wine) that I should do something other than present my vision for #futurepaper.

So, instead, here is a set of things that I'm pretty excited about in the world of scholarly communication!

I've definitely left off a few, and I'd welcome pointers and commentary to things I've missed; please comment!


1. The wonderful ongoing discussion around significance and reproducibility.

In addition to Victoria Stodden, Brian Nosek and John Ioannidis have been leaders in banging various drums (and executing various research agenda) that are showing us that we're not thinking very clearly about issues fundamental to science.

For me, the blog post that really blew my mind was Dorothy Bishop's summary of the situation in psychology. To quote:

Nosek et al have demonstrated that much work in psychology is not reproducible in the everyday sense that if I try to repeat your experiment I can be confident of getting the same effect. Implicit in the critique by Gilbert et al is the notion that many studies are focused on effects that are both small and fragile, and so it is to be expected they will be hard to reproduce. They may well be right, but if so, the solution is not to deny we have a problem, but to recognize that under those circumstances there is an urgent need for our field to tackle the methodological issues of inadequate power and p-hacking, so we can distinguish genuine effects from false positives.

Read the whole thing. It's worth it.

Relevance to #futurepaper: detailed methods and exact repeatability is a prerequisite for conversations about what we really care about, which is: "is this effect real?"

2. Blogging as a way to explore issues without prior approval from Top People.

I've always hated authority ever since I noticed the propensity for authorities to protect their position over seeking truth. This manifests in many ways, one of which through control over peer review and scientific publishing processes.

With that in mind, it's worth reading up on Susan Fiske's "methodological terrorism" draft, in which Fiske, a professor at Princeton and an editor at PNAS, "publicly compares some of her colleagues to terrorists" (ref). Fiske is basically upset that people are daring to critique published papers via social media.

There are a bunch of great responses; I'll highlight just one, by Chris Chambers:

So what’s really going on here? The truth is that we are in the midst of a power struggle, and it’s not between Fiske’s “destructo-critics” and their victims, but between reformers who are trying desperately to improve science and a vanguard of traditionalists who, every so often, look down from their thrones to throw a log in the road. As the body count of failed replications continues to climb, a new generation want a different kind of science and they want it now. They're demanding greater transparency in research practices. They want to be rewarded for doing high quality research regardless of the results. They want researchers to be accountable for the quality of the work they produce and for the choices they make as public professionals. It's all very sensible, constructive, and in the public interest.

Yeah!

The long and short of it is that I'm really excited about how science and the scientific process are being discussed openly via blogs and Twitter. (You can also read my "Top 10 reasons why blog posts are better than scientific papers".)

Relevance to #futurepaper: there are many alternate "publishing" models that offer advantages over our current publishing and dissemination system. They also offer potential disadvantages, of course.

3. Open source as a model for open science.

Two or three times every year I come back to this wonderful chapter by K. Jarrod Millman and Fernando Perez entitled "Developing open source scientific practice." It breaks down all the ways in which current computational science practice falls short of the mark and could learn from standard open source software development practices.

For one quote (although the chapter offers far more!),

Asking about reproducibility by the time a manuscript is ready for submission to a journal is too late: this problem must be tackled from the start, not as an afterthought tacked-on at publication time. We must therefore look for approaches that allow researchers to fluidly move back and forth between the above stages and that integrate naturally into their everyday practices of research, collaboration, and publishing, so that we can simultaneously address the technical and social aspects of this issue.

Please, go read it!

Relevance to #futurepaper: tools and processes prior to publication matter!

4. Computational narratives as the engine of collaborative data science.

That's the title of the most recent Project Jupyter grant application, authored by Fernando Perez and Brian Granger (and funded!).

It's hard to explain to people who haven't seen it, but the Jupyter Notebook is the single most impactful thing to happen to the science side of the data science ecosystem in the last decade. Not content with that, Fernando and Brian lay out a stunning vision for the future of Jupyter Notebook and the software ecosystem around it.

Quote:

the core problem we are trying to solve is the collaborative creation of reproducible computational narratives that can be used across a wide range of audiences and contexts.

The bit that grabs me the most in this grant proposal is the bit on collaboration, but your mileage may vary - the whole thing is a great read!

Relevance to #futurepaper: hopefully obvious.

5. mybinder: deploy running Jupyter Notebooks from GitHub repos in Docker containers

Another thing that I'm super excited about are the opportunities provided by lightweight composition of many different services. If you haven't seen binder (mybinder.org), you should go play with it!

What binder does is let you spin up running Jupyter Notebooks based on the contents of GitHub repositories. Even cooler, you can install and configure the execution environment however you want using Dockerfiles.

If this all sounds like gobbledygook to you, check out this link to a binder for exploring the LIGO data. Set up by Min Ragan-Kelly, the link spools up an executable environment (in a Jupyter Notebook) around the LIGO data. Single click, no fuss, no muss.

I find this exciting because binder is one example (of several!) where people are building a new publication service by composing several well-supported software packages.

Relevance to #futurepaper: ever wanted to give people a chance to play with your publication's analysis pipeline as well as your data? Here you go.

6. Overlay journals.

As preprints grow, the question of "why do we have journals anyway?" looms. The trend of overlay journals provides a natural mixing point between preprints and more traditional (and expensive) publishing.

An overlay journal is a journal that sits on top of a preprint server. To quote,

“The only objection to just putting things on arXiv is that it’s not peer reviewed, so why not have a community-based effort that provides a peer-review service for the arXiv?" [Peter Coles] says — pointing out that academics already carry out peer review for scientific publishers, usually at no cost.

Relevance to #futurepaper: many publishers offer very little in the way of services beyond this, so why pay them for it when the preprint server already exists?

7. Bjorn Brembs.

Bjorn is one of these people that, if he were less nice, I'd find irritating in his awesomeness. He researches flies or something, and he consistently pushes the boundaries of process in his publications.

Two examples -- living figures that integrate data from outside scientists, and systematic openness - to quote from Lenny Teytelman,

The paper was posted as a preprint prior to submission and all previous versions of the article are available as biorxiv preprints. The published research paper is open access. The raw data are available at figshare. All authors were listed with their ORCID IDs and all materials referenced with RRIDs. All methods are detailed with DOIs on protocols.io. The blog post gives the history and context of the work. It's a fascinating and accessible read for non-fly scientists and non-scientists alike. Beautiful!

Bjorn also has a wonderful paper on just how bad the Impact Factor and journal status-seeking system is, and his blog post on what a modern scholarly infrastructure should look like is worth reading.

Relevance to #futurepaper: hopefully obvious.

8. Idea futures or prediction markets.

There are other ways of reaching consensus than peer review, and idea futures are one of the most fascinating. To quote,

Our policy-makers and media rely too much on the "expert" advice of a self-interested insider's club of pundits and big-shot academics. These pundits are rewarded too much for telling good stories, and for supporting each other, rather than for being "right". Instead, let us create betting markets on most controversial questions, and treat the current market odds as our best expert consensus. The real experts (maybe you), would then be rewarded for their contributions, while clueless pundits would learn to stay away. You should have a free-speech right to bet on political questions in policy markets, and we could even base a new form of government on idea futures.

Balaji Srinivasan points out that the bitcoin blockchain is another way of reaching consensus, and I think that's worth reading, too.

Relevance to #futurepaper: there are other ways of doing peer review and reaching consensus than blocking publication until you agree with the paper.

9. Open peer review by a selected papers network.

This proposal by Chris Lee, a friend and colleague at UCLA, outlines how to do peer review via (essentially) a blog chain. To quote,

A selected-papers (SP) network is a network in which researchers who read, write, and review articles subscribe to each other based on common interests. Instead of reviewing a manuscript in secret for the Editor of a journal, each reviewer simply publishes his review (typically of a paper he wishes to recommend) to his SP network subscribers. Once the SP network reviewers complete their review decisions, the authors can invite any journal editor they want to consider these reviews and initial audience size, and make a publication decision. Since all impact assessment, reviews, and revisions are complete, this decision process should be short. I show how the SP network can provide a new way of measuring impact, catalyze the emergence of new subfields, and accelerate discovery in existing fields, by providing each reader a fine-grained filter for high-impact.

I think this is a nice concrete example of an alternate way to do peer review that should actually work.

There's a lot of things that could tie into this, including trust metrics; cryptographic signing of papers, reviews, and decisions so that they are verifiable; verifiable computing a la worldmake; etc.

Relevance to #futurepaper? Whether or not you believe this could work, figuring out why you think what you think is a good way to explore what the publishing landscape could look like.

10. A call to arms: make outbreak research open access.

What would you call a publishing ecosystem that actively encourages withholding of information that could save lives, all in the name of reputation building and job security?

Inhumane? Unethical? Just plain wrong?

All of that.

Read Yowziak, Schaffner, and Sabeti's article, "Data sharing: make outbreak research open access."

There are horror stories galore about what bad data sharing does, but one of the most affecting is in this New Yorker article by Seth Mnookin, in which he quotes Daniel MacArthur:

The current academic publication system does patients an enormous disservice.

The larger context is that our failure to have and use good mechanisms of data publication is killing people. Maybe we should fix that?

Relevance to #futurepaper: open access to papers and data and software is critical to society.


Anyway, so that's what's on the tip of my brain this fine morning.

--titus

by C. Titus Brown at January 08, 2017 11:00 PM

January 07, 2017

Titus Brown

Data implies software.

Note: This is the second post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future.

An important yet rarely articulated assumption of a lot of my work in biological data analysis is that data implies software: it's not much good gathering data if you don't have the ability to analyze it.

For some data, spreadsheet software is good enough. This was the situation molecular biology was in up until the early 2000s - sure, we'd get numbers from instruments and sequences from sequencers, but they'd all fit pretty handily in whatever software we had lying around.

Once numerical data sets get big enough -- e.g. I did approximately 50,000 qPCRs in my last two years of grad school, which was unpleasant to handle in Excel -- we need to invest in software like R or Python, which can do bulk and batch processing of the data. Software like OpenRefine can also help with "manual" cleaning and rationalization of the data. But this requires skills that are still relatively specialized.

For other data, we need custom software built specifically for that data type. This is true of sequence analysis, where most of my work is focused: when you get 200m DNA sequences, each of length 150 bp, there's no simple, effective way to query or summarize that using general computational tools. We need specialized code to parse, summarize, explore, and investigate these data sets. Using this code doesn't necessarily require serious programming knowledge, but data analysts may need fortitude in dealing with potentially immature software, as well as a duct-tape mentality in terms of tying together software that wasn't designed to integrate, or repurposing software that was meant for a different purpose.

There is at least one other category of data analysis software that I can think of but haven't personally experienced - that's the kind of stuff that CERN and Facebook and Google have to deal with, where the data sets are so overwhelmingly large that you need to build deep software and hardware infrastructure to handle them. This becomes (I think) more a matter of systems engineering than anything else, but I bet there is a really strong domain knowledge component that is required of at least some of the systems engineers here. I think some of the cancer sequencing folk are close to this stage, judging from a talk I heard from Lincoln Stein two years ago.

Data-intensive research increasingly lives beyond the "spreadsheet" level

As data set sizes increase across the board, researchers are increasingly finding that spreadsheets are insufficient. This is for all the reasons articulated in the Data Carpentry spreadsheet lesson, so I won't belabor the point any more, but what does this mean for us?

So increasingly our analysis results don't depend on spreadsheets; they depend on custom data processing scripts (in R, MATLAB, Python, etc.) and other people's programs (e.g. in bioinformatics, mappers and assemblers) and on multiple steps of data handling, cleaning, summation, integration, analysis and summarization.

And, as is nicely laid out in Stodden et al. (2016), all of these steps are critical components of the data interpretation and belong in the Methods section of any publication!

What's your point, Dr. Brown?

When we talk about "the scientific paper of the future", one of the facets that people are most excited about - and I think this Caltech panel will probably focus on this facet - is that we now possess the technology to readily and easily communicate the details of this data analysis. Not only that, we can communicate it in such a way that it becomes repeatable and explorable and remixable, using virtual environments and data analysis notebooks.

I want to highlight something else, though.

When I read excellent papers on research data management like "10 aspects of highly effective research data" (or is this a blog post? I can't always tell any more), I falter at section headings that say data should be "comprehensible" and "reviewed" and especially "reusable". This is not because they are incorrect, but rather because these are so dependent on having methods (i.e. software) to actually analyze the data. And that software seems to be secondary for many data-focused folk.

For me, however, they are one and the same.

If I don't have access to software customized to deal with the data-type specific nuances of this data set (e.g. batch effects of RNAseq data), the data set is much less useful.

If I don't know exactly what statistical cutoffs were used to extract information from this data set by others, then the data set is much less useful. (I can make my own determination as to whether those cutoffs were good cutoffs, but if I don't know what they were, I'm stuck.)

If I don't have access to the custom software that was used to remove noise, generate the interim results, and do the large-scale data processing, I may not even be able to approximate the same final results.

Where does this leave me? I think:

Archived data has diminished utility if we do not have the ability to analyze it; for most data, this means we need software.

For each data set, we should aim to have at least one fully articulated data processing pipeline (that takes us from data to results). Preferably, this would be linked to the data somehow.

What I'm most excited about when it comes to the scientific paper of the future is that most visions for it offer an opportunity to do exactly this! In the future, we will increasingly attach detailed (and automated and executable) methods to new data sets. And along with driving better (more repeatable) science, this will drive better data reuse and better methods development, and thereby accelerate science overall.

Fini.

--titus

p.s. For a recent bioinformatics effort in enabling large-scale data reuse, see "The Lair" from the Pachter Lab.

p.p.s. No, I did not attempt to analyze 50,000 qPCRs in a spreadsheet. ...but others in the lab did.

by C. Titus Brown at January 07, 2017 11:00 PM

January 06, 2017

Titus Brown

The top 10 reasons why blog posts are better than scientific papers

Note: This is the first post in what I hope to be a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future.

1. Blog posts are like preprints, but faster.

Even preprints go through some review before they're posted, just to make sure they're not obviously crank papers. Blog posts don't suffer from any prior restraint other than the need to take the time to write them.

2. Blog posts don't end up in PDFs.

...and you don't have to write them in nasty complex formats like Word or LaTeX.

Reference: why PDFs suck.

3. Blog posts are like papers, but better written.

Blog posts can be colloquial, funny, and sarcastic - unlike scientific papers. Blog posts can also contain narrative in a way that scientific papers simply don't.

4. Blog posts are often opinionated.

Papers go through multiple rounds of review and revision, in which the naturally irregular and uneven surface of reality is sanded down and/or bludgeoned into a cuboid that looks and sounds objective and impartial. Blog posts suffer from no such fiction of objectivity and impartiality.

(Self-referential case in point.)

5. Blog posts inspire feedback.

Perhaps in part because blog posts convey personal opinion, blog posts are inherently more social, more interactive, and more open to commentary.

(Presumably this will also be a self-referential case in point. Or not, which would be awesomely ironic!)

6. Blog posts are free, open access, and indexed by search engines.

Kind of like preprints, but not in a PDF. Very much not like many scientific papers.

7. Blog posts can be versioned.

You can have multiple versions of blog posts -- kind of like preprints, but very much unlike papers.

Unlike either preprints or papers, blog posts can take advantage of real version control systems like git. (Self-referential case in point.) This also further enables collaboration.

8. Blogs don't have impact factors.

Instead of a nonsensical and unrigorous statistic that signals to other scientists how important an editor thinks your paper will eventually be, blog posts are shared freely among an ad hoc self-assembled network of enemies on Twitter and Facebook.

9. Blog posts can be pseudonymous.

There are many science blogs that are pseudonymous, and no one cares. (This is actually really important.) This (along with the general lack of prior restraint, above) allows unpleasant truths to be shared.

10. Blog posts are probably more reliable than scientific papers.

Because blog posts don't matter for academic reputation, there is little reason to game the blog post system. Therefore, blog posts are inherently more likely to be reliable than scientific papers.


I encourage people who disagree with this post to submit a commentary to a respectable high retraction index journal like Science or Nature.

--titus

by C. Titus Brown at January 06, 2017 11:00 PM

Continuum Analytics news

Using Anaconda and H2O to Supercharge your Machine Learning and Predictive Analytics

Monday, January 9, 2017
Kristopher Overholt
Continuum Analytics

Anaconda integrates with many different providers and platforms to give you access to the data science libraries you love with the tools you use, including Amazon Web Services, Docker, and Cloudera CDH. Today we’re excited to announce our new partnership with H2O and the availability of H2O machine learning packages for Anaconda on Windows, Mac and Linux.

H2O is an open source, in-memory, distributed, fast and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data. Using in-memory compression, H2O handles billions of data rows in-memory, even with a small cluster. H2O is used by over 60,000 data scientists and more than 7,000 organizations around the world.

H2O includes a wide range of data science algorithms and estimators for supervised and unsupervised machine learning such as generalized linear modeling, gradient boosting, deep learning, random forest, naive bayes, ensemble learning, generalized low rank models, k-means clustering, principal component analysis, and others. H2O provides interfaces for Python, R, Java and Scala, and can be run in standalone mode or on a Hadoop/Spark cluster via Sparkling Water or sparklyr.

In this blog post, we’ll demonstrate how you can install and use H2O with Python alongside the 720+ packages in Anaconda to perform interactive machine learning workflows with notebooks and visualizations as part of Anaconda’s Open Data Science platform.

Installing and Using H2O with Anaconda

You can install H2O with Anaconda on Windows, Mac or Linux. The following conda command will install the H2O core library and engine, the H2O Python client library and the required Java dependencies (OpenJDK):

$ conda install h2o h2o-py

That’s it! After installing H2O with Anaconda, you’re now ready to get started with a wide range of machine learning algorithms and data science modeling techniques.

In the following sections, we’ll demonstrate how to use H2O with Anaconda based on examples from the H2O documentation, including a k-means clustering example, a deep learning example and a gradient boosting example.

K-means Clustering with Anaconda and H2O

K-means clustering is a machine learning technique that can be used to group the values in a data set into clusters.

In this example, we’ll use the k-means clustering algorithm in H2O on the Iris flower data set to classify the measurements into clusters.

First, we’ll start a Jupyter notebook server where we can run the H2O machine learning examples in an interactive notebook environment with access to all of the libraries from Anaconda.

$ jupyter notebook

In the notebook, we can import the H2O client library and initialize an H2O cluster, which will be started on our local machine:

>>> import h2o
>>> h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.

Attempting to start a local H2O server... 
 Java Version: openjdk version "1.8.0_102"; OpenJDK Runtime Environment (Zulu 8.17.0.3-macosx) (build 1.8.0_102-b14); OpenJDK 64-Bit Server VM (Zulu 8.17.0.3-macosx) (build 25.102-b14, mixed mode)
 Starting server from /Users/koverholt/anaconda3/h2o_jar/h2o.jar
 Ice root: /var/folders/5b/1vh3qn2x7_s7mj88zc3nms0m0000gp/T/tmpj9mo8ims
 JVM stdout: /var/folders/5b/1vh3qn2x7_s7mj88zc3nms0m0000gp/T/tmpj9mo8ims/h2o_koverholt_started_from_python.out
 JVM stderr: /var/folders/5b/1vh3qn2x7_s7mj88zc3nms0m0000gp/T/tmpj9mo8ims/h2o_koverholt_started_from_python.err
 Server is running at http://127.0.0.1:54321

Connecting to H2O server at http://127.0.0.1:54321... successful.

After we’ve started the H2O cluster, we can download the Iris data set from the H2O repository on Github and view a summary of the data:

>>> iris = h2o.import_file(path="https://github.com/h2oai/h2o-3/raw/master/h2o-r/h2o-package/inst/extdata/iris_wheader.csv")
>>> iris.describe()

Now that we’ve loaded the data set, we can import and run the k-means estimator from H2O:

>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>> results = [H2OKMeansEstimator(k=clusters, init="Random", seed=2, standardize=True) for clusters in range(2,13)]
>>> for estimator in results:
    estimator.train(x=iris.col_names[0:-1], training_frame = iris)

kmeans Model Build progress: |████████████████████████████████████████████| 100%

We can specify the number of clusters and iteratively compute the cluster locations and data points that are contained within the clusters:

>>> clusters = 4
>>> predicted = results[clusters-2].predict(iris)
>>> iris["Predicted"] = predicted["predict"].asfactor()

kmeans prediction progress: |█████████████████████████████████████████████| 100%

Once we’ve generated the predictions, we can visualize the classified data and clusters. Because we have access to all of the libraries in Anaconda in the same notebook as H2O, we can use matplotlib and seaborn to visualize the results:

>>> import seaborn as sns
>>> %matplotlib inline
>>> sns.set()
>>> sns.pairplot(iris.as_data_frame(True), vars=["sepal_len", "sepal_wid", "petal_len", "petal_wid"],  hue="Predicted");

Deep Learning with Anaconda and H2O

We can also perform deep learning with H2O and Anaconda. Deep learning is a class of machine learning algorithms that use multi-layer neural networks and can be used to perform regression and classification tasks on a data set.

In this example, we’ll use the supervised deep learning algorithm in H2O on the Prostate Cancer data set stored on Amazon S3.

We’ll use the same H2O cluster that we created using h2o.init() in the previous example. First, we’ll download the Prostate Cancer data set from a publicly available Amazon S3 bucket and view a summary of the data:

>>> prostate = h2o.import_file(path="s3://h2o-public-test-data/smalldata/logreg/prostate.csv")
>>> prostate.describe()

Rows: 380
Cols: 9

We can then import and run the deep learning estimator from H2O on the Prostate Cancer data:

>>> from h2o.estimators.deeplearning import H2ODeepLearningEstimator
>>> prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
>>> model = H2ODeepLearningEstimator(activation = "Tanh", hidden = [10, 10, 10], epochs = 10000)
>>> model.train(x = list(set(prostate.columns) - set(["ID","CAPSULE"])), y ="CAPSULE", training_frame = prostate)
>>> model.show()

deeplearning Model Build progress: |██████████████████████████████████████| 100%
Model Details
=============
H2ODeepLearningEstimator :  Deep Learning
Model Key:  DeepLearning_model_python_1483417629507_19
Status of Neuron Layers: predicting CAPSULE, 2-class classification, bernoulli distribution, CrossEntropy loss, 322 weights/biases, 8.5 KB, 3,800,000 training samples, mini-batch size 1

After we’ve trained the deep learning model, we can generate predictions and view the results, including the model scoring history and performance metrics:

>>> predictions = model.predict(prostate)
>>> predictions.show()

deeplearning prediction progress: |███████████████████████████████████████| 100%

Gradient Boosting with H2O and Anaconda

We can also perform gradient boosting with H2O and Anaconda. Gradient boosting is an ensemble machine learning technique (commonly used in conjunction with decision trees) that can perform regression and classification tasks on a data set.

In this example, we’ll use the supervised gradient boosting algorithm in H2O on a cleaned version of the Prostate Cancer data from the previous deep learning example.

First, we’ll import and run the gradient boosting estimator from H2O on the Prostate Cancer data:

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> train = prostate  # reuse the prostate frame prepared in the previous example as the training frame
>>> my_gbm = H2OGradientBoostingEstimator(distribution = "bernoulli", ntrees=50, learn_rate=0.1)
>>> my_gbm.train(x=list(range(1,train.ncol)), y="CAPSULE", training_frame=train, validation_frame=train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%

After we’ve trained the gradient boosting model, we can view the resulting model performance metrics:

>>> my_gbm_metrics = my_gbm.model_performance(train)
>>> my_gbm_metrics.show()

ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.07338612348053128
RMSE: 0.2708987328883826
LogLoss: 0.26757238912319825
Mean Per-Class Error: 0.07431401341740806
AUC: 0.9801618150931445
Gini: 0.960323630186289
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4772353333869793:

Additional Resources for Machine Learning with Anaconda and H2O

Refer to the H2O documentation for more information about the full set of machine learning algorithms, libraries and examples that are available in H2O, including generalized linear modeling, random forest, naive bayes, ensemble learning, generalized low rank models, principal component analysis and others.

Interested in using Anaconda and H2O in your enterprise organization for machine learning, model deployment workflows and scalable analysis with Hadoop and Spark? Get in touch with us if you’d like to learn more about how Anaconda can empower your enterprise with Open Data Science, including an on-premise package repository, collaborative notebooks, cluster deployments and custom consulting/training solutions.

The complete notebooks for the k-means clustering, deep learning, and gradient boosting examples shown in this blog post can be viewed and downloaded from Anaconda Cloud:

https://anaconda.org/koverholt/h2o-kmeans-clustering/notebook
https://anaconda.org/koverholt/h2o-deep-learning/notebook
https://anaconda.org/koverholt/h2o-gradient-boosting/notebook

by swebster at January 06, 2017 05:50 PM

January 03, 2017

Matthew Rocklin

Dask Release 0.13.0

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

Summary

Dask just grew to version 0.13.0. This is a significant release for arrays, dataframes, and the distributed scheduler. This blogpost outlines some of the major changes since the last release on November 4th.

  1. Python 3.6 support
  2. Algorithmic and API improvements for DataFrames
  3. Dataframe to Array conversions for Machine Learning
  4. Parquet support
  5. Scheduling Performance and Worker Rewrite
  6. Pervasive Visual Diagnostics with Embedded Bokeh Servers
  7. Windows continuous integration
  8. Custom serialization

You can install new versions using Conda or Pip

conda install -c conda-forge dask distributed

or

pip install dask[complete] distributed --upgrade

Python 3.6 Support

Dask and all necessary dependencies are now available on Conda Forge for Python 3.6.

Algorithmic and API Improvements for DataFrames

Thousand-core Dask deployments have become significantly more common in the last few months. This has highlighted scaling issues in some of the Dask.array and Dask.dataframe algorithms, which were originally designed for single workstations. Algorithmic and API changes can be grouped into the following two categories:

  1. Filling out the Pandas API
  2. Algorithms that needed to be changed or added due to scaling issues

Dask Dataframes now include a fuller set of the Pandas API, including the following (a few of these are sketched briefly after the list):

  1. Inplace operations like df['x'] = df.y + df.z
  2. The full Groupby-aggregate syntax like df.groupby(...).aggregate({'x': 'sum', 'y': ['min', 'max']})
  3. Resample on dataframes as well as series
  4. Pandas’ new rolling syntax df.x.rolling(10).mean()
  5. And much more
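
As a rough sketch (not taken from the release notes), a few of these operations look like this on a small made-up dataframe:

import pandas as pd
import dask.dataframe as dd

# Small made-up dataframe, split into four partitions.
pdf = pd.DataFrame({'x': range(100), 'y': range(100), 'z': range(100)})
df = dd.from_pandas(pdf, npartitions=4)

df['w'] = df.y + df.z                                                # inplace-style column assignment
agg = df.groupby('x').aggregate({'y': 'sum', 'z': ['min', 'max']})   # full groupby-aggregate syntax
roll = df.y.rolling(10).mean()                                       # Pandas' rolling syntax

print(agg.compute().head())
print(roll.compute().tail())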

Additionally, collaboration with some of the larger Dask deployments has highlighted scaling issues in some algorithms, resulting in the following improvements:

  1. Tree reductions for groupbys, aggregations, etc.
  2. Multi-output-partition aggregations for groupby-aggregations with millions of groups, drop_duplicates, etc..
  3. Approximate algorithms for nunique
  4. etc..

These same collaborations have also yielded better handling of open file descriptors, changes upstream to Tornado, and upstream changes to the conda-forge CPython recipe itself to increase the default file descriptor limit on Windows up from 512.

Dataframe to Array Conversions

You can now convert Dask dataframes into Dask arrays. This is mostly to support efforts of groups building statistics and machine learning applications, where this conversion is common. For example you can load a terabyte of CSV or Parquet data, do some basic filtering and manipulation, and then convert to a Dask array to do more numeric work like SVDs, regressions, etc..

import dask.dataframe as dd
import dask.array as da

df = dd.read_csv('s3://...')  # Read raw data

x = df.values                 # Convert to dask.array

u, s, v = da.linalg.svd(x)    # Perform serious numerics

This should help machine learning and statistics developers generally, as many of the more sophisticated algorithms can be more easily implemented with the Dask array model than can be done with distributed dataframes. This change was done specifically to support the nascent third-party dask-glm project by Chris White at Capital One.

Previously this was hard because Dask.array wanted to know the size of every chunk of data, which Dask dataframes can’t provide (because, for example, it is impossible to lazily tell how many rows are in a CSV file without actually looking through it). Now that Dask.arrays have relaxed this requirement they can also support other unknown shape operations, like indexing an array with another array.

y = x[x > 0]

Parquet Support

Dask.dataframe now supports Parquet, a columnar binary store for tabular data commonly used in distributed clusters and the Hadoop ecosystem.

import dask.dataframe as dd

df = dd.read_parquet('myfile.parquet')                 # Read from Parquet

df.to_parquet('myfile.parquet', compression='snappy')  # Write to Parquet

This is done through the new fastparquet library, a Numba-accelerated version of the Pure Python parquet-python. Fastparquet was built and is maintained by Martin Durant. It’s also exciting to see the Parquet-cpp project gain Python support through Arrow and work by Wes McKinney and Uwe Korn. Parquet has gone from inaccessible in Python to having multiple competing implementations, which is a wonderful and exciting change for the “Big Data” Python ecosystem.

Scheduling Performance and Worker Rewrite

The internals of the distributed scheduler and workers are significantly modified. Users shouldn’t experience much change here except for general performance enhancement, more upcoming features, and much deeper visual diagnostics through Bokeh servers.

We’ve pushed some of the scheduling logic from the scheduler onto the workers. This lets us do two things:

  1. We keep a much larger backlog of tasks on the workers. This allows workers to optimize and saturate their hardware more effectively. As a result, complex computations end up being significantly faster.
  2. We can more easily deliver on a rising number of requests for complex scheduling features. For example, GPU users will be happy to learn that you can now specify abstract resource constraints like "this task requires a GPU" and "this worker has four GPUs" and the scheduler and workers will allocate tasks accordingly. This is just one example of a feature that was easy to implement after the scheduler/worker redesign and is now available. A minimal sketch of how this looks in practice follows this list.
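
Here is a minimal sketch of how the resource constraints fit together; the scheduler address and the function are placeholders, not a real deployment:

# Workers declare the resources they offer when they start, e.g.:
#   $ dask-worker scheduler-address:8786 --resources "GPU=4"

from distributed import Client

def train_model(x):
    # stand-in for a GPU-bound computation
    return x + 1

client = Client('scheduler-address:8786')                      # placeholder scheduler address
future = client.submit(train_model, 41, resources={'GPU': 1})  # only runs where a GPU is free
print(future.result())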

Pervasive Visual Diagnostics with Embedded Bokeh Servers

While optimizing scheduler performance we built several new visual diagnostics using Bokeh. There is now a Bokeh Server running within the scheduler and within every worker.

Current Dask.distributed users will be familiar with the current diagnostic dashboards:

Dask Bokeh Plots

These plots provide intuition about the state of the cluster and the computations currently in flight. These dashboards are generally well loved.

There are now many more of these, though they are more focused on internal state and timings that will be of interest to developers and power users than to typical users. Here are a couple of the new pages (of which there are seven) that show timings and counters for various parts of the worker and scheduler internals.

Dask Bokeh counters page

The previous Bokeh dashboards were served from a separate process that queried the scheduler periodically (every 100ms). Now there are new Bokeh servers within every worker and a new Bokeh server within the scheduler process itself rather than in a separate process. Because these servers are embedded they have direct access to the state of the scheduler and workers which significantly reduces barriers for us to build out new visuals. However, this also adds some load to the scheduler, which can often be compute bound. These pages are available at new ports, 8788 for the scheduler and 8789 for the worker by default.

Custom Serialization

This is actually a change that occurred in the last release, but I haven’t written about it and it’s important, so I’m including it here.

Previously inter-worker communication of data was accomplished with Pickle/Cloudpickle and optional generic compression like LZ4/Snappy. This was robust and worked mostly fine, but left out some exotic data types and did not provide optimal performance.

Now we can serialize different types with special consideration. This allows special types, like NumPy arrays, to pass through without unnecessary memory copies and also allows us to use more exotic data-type specific compression techniques like Blosc.

It also allows Dask to serialize some previously unserializable types. In particular this was intended to solve the Dask.array climate science community’s concern about HDF5 and NetCDF files which (correctly) are unpicklable and so restricted to single-machine use.

This is also the first step towards two frequently requested features (neither of these exist yet):

  1. Better support for GPU-GPU specific serialization options. We are now a large step closer to generalizing away our assumption of TCP Sockets as the universal communication mechanism.
  2. Passing data between workers of different runtime languages. By embracing other protocols than Pickle we begin to allow for the communication of data between workers of different software environments.

What’s Next

So what should we expect to see in the future for Dask?

  • Communication: Now that workers are more fully saturated we’ve found that communication issues are arising more frequently as bottlenecks. This might be because everything else is nearing optimal or it might be because of the increased contention in the workers now that they are idle less often. Many of our new diagnostics are intended to measure components of the communication pipeline.
  • Third Party Tools: We’re seeing a nice growth of utilities like dask-drmaa for launching clusters on DRMAA job schedulers (SGE, SLURM, LSF) and dask-glm for solvers for GLM-like machine-learning algorithms. I hope that external projects like these become the main focus of Dask development going forward as Dask penetrates new domains.
  • Blogging: I’ll be launching a few fun blog posts throughout the next couple of weeks. Stay tuned.

Learn More

You can install or upgrade using Conda or Pip

conda install -c conda-forge dask distributed

or

pip install dask[complete] distributed --upgrade

You can learn more about Dask and its distributed scheduler at these websites:

Acknowledgements

Since the last main release the following developers have contributed to the core Dask repository (parallel algorithms, arrays, dataframes, etc..)

  • Alexander C. Booth
  • Antoine Pitrou
  • Christopher Prohm
  • Frederic Laliberte
  • Jim Crist
  • Martin Durant
  • Matthew Rocklin
  • Mike Graham
  • Rolando (Max) Espinoza
  • Sinhrks
  • Stuart Archibald

And the following developers have contributed to the Dask/distributed repository (distributed scheduling, network communication, etc..)

  • Antoine Pitrou
  • jakirkham
  • Jeff Reback
  • Jim Crist
  • Martin Durant
  • Matthew Rocklin
  • rbubley
  • Stephan Hoyer
  • strets123
  • Travis E. Oliphant

January 03, 2017 12:00 AM

December 27, 2016

Continuum Analytics news

A Look Back and a Peek Ahead: A Year in Review at Anaconda

Tuesday, December 27, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

2016 has been quite the year for all of us at Anaconda! From expanding our strong team to growing our customer and partner rosters and continuing our spirit of innovation, it seems like a perfect time to reflect on the year that was as we gear up to hit the ground running in 2017. 

January—March

We started the year off with a bang by expanding our executive leadership team, adding two new members to our rapidly growing company. In February, we welcomed Jon Shepherd, our new senior vice president of sales, and Matt Roberts joined as vice president of product engineering.  On the product side, we announced Anaconda 2.5 (coupled with Intel Math Kernel Library), Anaconda Enterprise Notebooks, Anaconda for Cloudera and Anaconda advancements that bring high performance advanced analytics to Hadoop. Last but certainly not least, our fearless leaders Travis Oliphant, Peter Wang, and Michele Chambers had a busy March—they  presented at the Gartner Business Intelligence & Analytics Summit and Strata + Hadoop World in San Jose. It was a great start to the year! 

April—June

Here at Anaconda, April showers don’t just bring May flowers—they also bring new products, cool collaborations and awesome events. Kicking off the quarter in April, we announced an exciting partnership with American Enterprise Institute’s Open Source Policy Center TaxBrain initiative. By leveraging the power of open source, TaxBrain can provide policy makers, journalists and the general public with the information they need to impact and change policy for the better. In May, Intel adopted Anaconda as the basis for their Python distribution. Lastly, Spark Summit 2016 was a real hit with two of our team members presenting on  “Connecting Python To The Spark Ecosystem” and “GPU Computing With Apache Spark And Python.”

July—September

While the dog days of summer were slightly quieter at Anaconda HQ, our team was still going strong under the hot Austin sun. Partnerships with big players such as Intel and IBM helped propel the quarter forward, and our popular data science capstone—the Anaconda Skills Accelerator Program—launched with Galvanize earlier in 2016 for prospective data scientists. In July, we announced our substantial grant from the Gordon and Betty Moore Foundation to help fund Numba and Dask. Rounding off the quarter, our CTO & cofounder Peter Wang took the stage again at Strata + Hadoop World in NYC to discuss Open Data Science on Hadoop, and we introduced the Journey to Open Data Science with a new fun video that kept SAS on their toes and gave Strata attendees a good chuckle. Bokeh developers Bryan Van de Ven and Sarah Bird also presented their Interactive Data Applications tutorial at Strata + Hadoop World in NYC. It was a successful summer.

October—December 

With the end of the year in sight,  we launched the AnacondaCrew Partner Program to empower data scientists with superpowers (no, we’re not kidding!). We’re thrilled to announce that in the last year, we’ve quickly grown this program to include a dozen of the best known modern data science partners in the ecosystems, including Cloudera, Intel, Microsoft, IBM, NVIDIA, Docker, DataCamp and many others. We rounded this out with a new partnership with Esri to help enhance GIS applications with Open Data Science. Add to that our new relationship with Recursion Pharmaceuticals—they adopted Bokeh on Anaconda to make it easy for biologists to identify genetic disease markers and assess drug efficacy when visualizing cell data. We feel great about helping to contribute to the medical community and making real change in people's lives. Finally, AnacondaCON 2017 registration opened in November and we couldn’t be more excited for the first event—our cherry on top of a wonderful year!

Wishing a happy and healthy holiday season and New Year to everyone in the Anaconda community; none of this would have been possible without you. 

As our holiday gift to you,  please download our Anaconda Holiday Wallpaper on Dropbox for a festive, Python Desktop Background fit for the season. 

Cheers to an even more successful 2017! 

by swebster at December 27, 2016 03:28 PM

December 24, 2016

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-11 and 2016-12-18. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

  1. Cleanup of load balancing
  2. Found cause of worker lag
  3. Initial Spark/Dask Dataframe comparisons
  4. Benchmarks with asv

Load Balancing Cleanup

The last two weeks saw several disruptive changes to the scheduler and workers. This resulted in an overall performance degradation on messy workloads when compared to the most recent release, which stopped bleeding-edge users from using recent dev builds. This has been resolved, and bleeding-edge git-master is back up to the old speed and then some.

As a visual aid, this is what bad (or in this case random) load balancing looks like:

bad work stealing

Identified and removed worker lag

For a while there have been significant gaps of 100ms or more between successive tasks in workers, especially when using Pandas. This was particularly odd because the workers had lots of backed up work to keep them busy (thanks to the nice load balancing from before). The culprit here was the calculation of the size of intermediate results on object dtype dataframes.

lag between tasks

Explaining this in greater depth, recall that to schedule intelligently, the workers calculate the size in bytes of every intermediate result they produce. Often this is quite fast, for example for numpy arrays we can just multiply the number of elements by the dtype itemsize. However for object dtype arrays or dataframes (which are commonly used for text) it can take a long while to calculate an accurate result here. Now we no longer calculate an accurate result, but instead take a fairly pessimistic guess. The gaps between tasks shrink considerably.
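
As an illustrative sketch (not Dask's actual sizeof code), the difference between the cheap exact case and the pessimistic guess looks roughly like this:

import numpy as np
import pandas as pd

def numpy_nbytes(arr):
    # Exact and cheap: number of elements times the dtype itemsize.
    return arr.size * arr.dtype.itemsize

def pessimistic_object_size(df, per_cell_guess=100):
    # Measuring every Python object in an object dtype frame is slow,
    # so assume a generous fixed cost per cell instead.
    return df.shape[0] * df.shape[1] * per_cell_guess

print(numpy_nbytes(np.ones((1000, 1000))))                            # 8000000, computed instantly
print(pessimistic_object_size(pd.DataFrame({'s': ['text'] * 1000})))  # quick, rough over-estimate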

no lag between tasks / no lag between tasks (zoomed)

There is still a significant bit of lag, around 10ms, between tasks on these workloads (see the zoomed version on the right). On other workloads we’re able to get inter-task lag down to the tens-of-microseconds scale. While 10ms may not sound like a long time, when we perform very many very short tasks this can quickly become a bottleneck.

Anyway, this change reduced shuffle overhead by a factor of two. Things are starting to look pretty snappy for many-small-task workloads.

Initial Spark/Dask Dataframe Comparisons

I would like to run a small benchmark comparing Dask and Spark DataFrames. I spent a bit of the last couple of days using Spark locally on the NYC Taxi data and futzing with cluster deployment tools to set up Spark clusters on EC2 for basic benchmarking. I ran across flintrock, which has been highly recommended to me a few times.

I’ve been thinking about how to do benchmarks in an unbiased way. Comparative benchmarks are useful to have around to motivate projects to grow and learn from each other. However in today’s climate where open source software developers have a vested interest, benchmarks often focus on a project’s strengths and hide its deficiencies. Even with the best of intentions and practices, a developer is likely to correct for deficiencies on the fly. They’re much more able to do this for their own project than for others’. Benchmarks end up looking more like sales documents than trustworthy research.

My tentative plan is to reach out to a few Spark devs and see if we can collaborate on a problem set and hardware before running computations and comparing results.

Benchmarks with airspeed velocity

Rich Postelnik is building on work from Tom Augspurger to build out benchmarks for Dask using airspeed velocity at dask-benchmarks. Building out benchmarks is a great way to get involved if anyone is interested.

Pre-pre-release

I intend to publish a pre-release for a 0.X.0 version bump of dask/dask and dask/distributed sometime next week.

December 24, 2016 12:00 AM

December 18, 2016

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-11 and 2016-12-18. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

  1. Benchmarking new scheduler and worker on larger systems
  2. Kubernetes and Google Container Engine
  3. Fastparquet on S3

Rewriting Load Balancing

In the last two weeks we rewrote a significant fraction of the worker and scheduler. This enables future growth, but also resulted in a loss of our load balancing and work stealing algorithms (the old one no longer made sense in the context of the new system.) Careful dynamic load balancing is essential to running atypical workloads (which are surprisingly typical among Dask users) so rebuilding this has been all-consuming this week for me personally.

Briefly, Dask initially assigns tasks to workers taking into account the expected runtime of the task, the size and location of the data that the task needs, the duration of other tasks on every worker, and where each piece of data sits on all of the workers. Because the number of tasks can grow into the millions and the number of workers can grow into the thousands, Dask needs to figure out a near-optimal placement in near-constant time, which is hard. Furthermore, after the system runs for a while, uncertainties in our estimates build, and we need to rebalance work from saturated workers to idle workers relatively frequently. Load balancing intelligently and responsively is essential to a satisfying user experience.

We have a decently strong test suite around these behaviors, but it’s hard to be comprehensive on performance-based metrics like this, so there has also been a lot of benchmarking against real systems to identify new failure modes. We’re doing what we can to create isolated tests for every failure mode that we find to make future rewrites retain good behavior.

Generally working on the Dask distributed scheduler has taught me the brittleness of unit tests. As we have repeatedly rewritten internals while maintaining the same external API our testing strategy has evolved considerably away from fine-grained unit tests to a mixture of behavioral integration tests and a very strict runtime validation system.

Rebuilding the load balancing algorithms has been high priority for me personally because these performance issues inhibit current power-users from using the development version on their problems as effectively as with the latest release. I’m looking forward to seeing load-balancing humming nicely again so that users can return to git-master and so that I can return to handling a broader base of issues. (Sorry to everyone I’ve been ignoring the last couple of weeks).

Test deployments on Google Container Engine

I’ve personally started switching over my development cluster from Amazon’s EC2 to Google’s Container Engine. Here are some pros and cons from my particular perspective. Many of these probably have more to do with how I use each particular tool than with intrinsic limitations of the service itself.

In Google’s Favor

  1. Native and immediate support for Kubernetes and Docker, the combination of which allows me to more quickly and dynamically create and scale clusters for different experiments.
  2. Dynamic scaling from a single node to a hundred nodes and back ten minutes later allows me to more easily run a much larger range of scales.
  3. I like being charged by the minute rather than by the hour, especially given the ability to dynamically scale up
  4. Authentication and billing feel simpler

In Amazon’s Favor

  1. I already have tools to launch Dask on EC2
  2. All of my data is on Amazon’s S3
  3. I have nice data acquisition tools, s3fs, for S3 based on boto3. Google doesn’t seem to have a nice Python 3 library for accessing Google Cloud Storage :(

I’m working from Olivier Grisel’s repository docker-distributed although updating to newer versions and trying to use as few modifications from naive deployment as possible. My current branch is here. I hope to have something more stable for next week.

Fastparquet on S3

We gave fastparquet and Dask.dataframe a spin on some distributed S3 data on Friday. I was surprised that everything seemed to work out of the box. Martin Durant, who built both fastparquet and s3fs, has done some nice work to make sure that all of the pieces play nicely together. We ran into some performance issues pulling bytes from S3 itself. I expect that there will be some tweaking over the next few weeks.
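
For reference, the combination is driven from the Dask side roughly like this (the bucket and path here are made up):

import dask.dataframe as dd

# s3fs provides the S3 file system access; fastparquet decodes the Parquet data.
df = dd.read_parquet('s3://example-bucket/some-dataset/')   # made-up bucket and path
print(df.head())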

December 18, 2016 12:00 AM

December 14, 2016

Titus Brown

Notes on our lab Code of Conduct

I'm writing this up for the rOpenSci call on Codes of Conduct that I'm participating in today.


My lab has a lab Code of Conduct.

We adapted it from https://github.com/confcodeofconduct/confcodeofconduct.com. So the "how" was easy enough :).

Key points I want to make:

  • develop & post a code of conduct whether or not you know of any problems;
  • the code of conduct has to set expectations for everyone, including the boss;
  • I provide a specific contact person outside the university hierarchy for complaints about me;

A few notes on why a CoC and what use it's been:

Adoption was not motivated by any one particular incident, although there have been a few incidents of problematic behavior over the years. It was more motivated by our adoption of a Code of Conduct for the khmer software project, which is one of our major projects, and also by the Software Carpentry Code of Conduct. (Note that Michael Crusoe was the originator of the CoC on the khmer project and has been both a strong proponent and an excellent resource for creating friendly workspaces.) The 2013 PyCon incidents helped convince me of CoCs in general. We also attended an excellent Ada Initiative ally workshop at PyCon in 2015 that convinced me of the utility of a CoC for the lab specifically.

Another motivation to adopt a lab code of conduct came from our training efforts, where it is clear that impostor syndrome rules and it takes quite a bit of overt friendliness for people to ask questions. Providing ground rules for interaction helps there tremendously.

I will note that there are a few unfriendly and/or obnoxious people in bioinformatics, and that at least two of these individuals have targeted students in my lab or collaborators. Not much to be done about that, although Twitter's "block" functionality works extremely well for me, or at least so I assume ;). I won't stand for that sort of behavior in the lab, though.

A lab code differs from a workshop or community code of conduct in a few ways. The primary difference I see is at the intersection of authority and longevity - unlike an online community, there is a de facto authority (the head of the lab), who within some limits can make decisions for the lab; this is like a workshop where someone can be asked to leave by the workshop organizer. But unlike a workshop, labs exist for a long time, and so there are longer term relationships to consider.

My goal in adopting a code was to make it clear that everyone could speak comfortably, without fear of being targeted for who they were or what they believed. I've come through both very macho and argumentative labs as well as super friendly and uncritical labs, and neither seemed right -- I wanted to maintain both the ability to give and take criticism within the lab together with having a friendly and productive lab atmosphere. I believe this is important for the development of brainstorming and creativity, as well as for simply making the lab a nice(r) place to be.

There have definitely been benefits in recruiting: having a code (and following it!) means that people know you are aware of many issues that all too many faculty seem unaware of... this encourages a more diverse applicant pool. This may be one reason the lab is fairly diverse in practice, as well.

A key aspect of our code is that it places expectations on the boss as well (that's me). Part of this is having someone to complain to about me; this has only been used once, and it was super important because I simply hadn't realized what I had said & done, and (long term) it led to me modifying a particular behavior of mine.

It also enables labbies to take the initiative when something comes up, which has happened a few times. This doesn't need to involve me; lab members have felt free to speak up and remind others that what they are saying is inappropriate or hurtful, in part because they know that we have set expectations and I will back them up. This seems to work well, the few times it has been used. (Although I've never had to back anyone up.)

Fundamentally a code of conduct defines a social contract and sets expectations for everyone in their basic set of interactions. I've found it to be a net positive with no downsides so far.

Final note: I have no idea if it's legally enforceable but I don't actually think that matters that much, as the university channels for handling harassment are largely useless in practice; this is a social problem too, not "just" a legal problem.

--titus

by C. Titus Brown at December 14, 2016 11:00 PM

December 13, 2016

Continuum Analytics news

Counting Down to AnacondaCON 2017

Tuesday, December 13, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

The entire #AnacondaCREW is busy gearing up for our inaugural user conference, AnacondaCON—and it’s less than two months away! To continue the hype and excitement (and to honor the fact that the conference is in just eight weeks), we’re sharing eight things to expect at AnacondaCON 2017. Check ‘em out! 

  1. Awesome attendees. AnacondaCON will be filled with Anaconda Enterprise users and the brightest minds in the Open Data Science movement that are harnessing the power and innovation of the Anaconda community. 

  2. Noteworthy speakers. Our speakers will open your eyes to a whole new world of Open Data Science: Blake Borgeson, co-founder and CTO of Recursion Pharmaceuticals, Eric Jonas, Postdoctoral Researcher at UC Berkeley, Travis Oliphant, Continuum Analytics CEO and co-founder, just to name a few. And we’re still updating the agenda!

  3. Amazing schedule. We’ve got an agenda that’s packed with mind-blowing sessions on Open Data Science and cutting-edge insider information (not to mention delicious food!). Keep checking back as we’re updating the agenda every day.

  4. Captivating keynote. Our very own CTO and co-founder, Peter Wang, is kicking off the conference with a scintillating presentation starting at 9AM on Wednesday, February 8. We won’t spoil it, but if you’ve ever heard Peter speak, you know you won’t want to miss this.

  5. Oversized rooms. We’re not kidding when we tell you the JW Marriott Austin has elegant, oversized rooms and extremely plush pillows (we tested them out ourselves). Be sure to book your stay now by calling (844) 473-3959 and mentioning "AnacondaCON,” or booking online at our discounted link. 

  6. Networking Offsite. Who doesn’t love authentic Texas BBQ? Did we mention unbelievable tacos? Don’t miss the AnacondaCON off-site party for networking, authentic Texas BBQ (yes, we said it twice, it’s just that good) and much more at Fair Market, an Austin Eastside venue only a short ride away from the JW Marriott. Check out the party venue here.

  7. Dynamic teams. Bring everyone to AnacondaCON! At the conference, you’ll learn how your entire team—business analysts, data scientists, developers, DevOps, data engineers, anyone—can share, engage and collaborate, from the prototype all the way through production deployment.

  8. Awe-inspiring event. Overall, AnacondaCON will be the event of a lifetime—you won’t want to miss it. See you there!

Register now for AnacondaCON to take advantage of our Early Bird prices. 

by swebster at December 13, 2016 05:09 PM

December 12, 2016

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-05 and 2016-12-12. Nothing here is stable or ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

  1. Dask.array without known chunk sizes
  2. Import time
  3. Fastparquet blogpost and feedback
  4. Scheduler improvements for 1000+ worker clusters
  5. Channels and inter-client communication
  6. New dependencies?

Dask array without known chunk sizes

Dask arrays can now work even in situations where we don’t know the exact chunk size. This is particularly important because it allows us to convert dask.dataframes to dask.arrays in a standard analysis cycle that includes both data preparation and statistical or machine learning algorithms.

x = df.values

x = df.to_records()

This work was motivated by the work of Christopher White on building scalable solvers for problems like logistic regression and generalized linear models over at dask-glm.

As a pleasant side effect we can now also index dask.arrays with dask.arrays (a previous limitation)

x[x > 0]

and mutate dask.arrays in certain cases with setitem

x[x > 0] = 0

Both of these are frequently requested.

However, there are still holes in this implementation and many operations (like slicing) generally don’t work on arrays without known chunk sizes. We’re increasing capability here but blurring the lines of what is possible and what is not possible, which used to be very clear.

Import time

Import times had been steadily climbing for a while, rising above one second at times. These were reduced by Antoine Pitrou down to a more reasonable 300ms.

FastParquet blogpost and feedback

Martin Durant has built a nice Python Parquet library here: http://fastparquet.readthedocs.io/en/latest/ and released a blogpost about it last week here: https://www.continuum.io/blog/developer-blog/introducing-fastparquet

Since then we’ve gotten some good feedback and error reports (non-string column names, etc.). Martin has been optimizing performance and recently added append support.

Scheduler optimizations for 1000+ worker clusters

The recent refactoring of the scheduler and worker exposed new opportunities for performance and for measurement. One of the 1000+ worker deployments here in NYC was kind enough to volunteer some compute time to run some experiments. It was very fun having all of the Dask/Bokeh dashboards up at once (there are now half a dozen of these things) giving live monitoring information on a thousand-worker deployment. It’s stunning how clearly performance issues present themselves when you have the right monitoring system.

Anyway, this led to better sequentialization when handling messages, greatly reduced open file handle requirements, and the use of cytoolz over toolz in a few critical areas.

I intend to try this experiment again this week, now with new diagnostics. To aid in that we’ve made it very easy to turn timings and counters automatically into live Bokeh plots. It now takes literally one line of code to add a new plot to these pages (left: scheduler right: worker)

Dask Bokeh counters page

Already we can see that the time it takes to connect between workers is absurdly high in the 10ms to 100ms range, highlighting an important performance flaw.

This depends on an experimental project, crick, by Jim Crist that provides a fast T-Digest implemented in C (see also Ted Dunning’s implementation).

Channels and inter-client communication

I’m starting to experiment with mechanisms for inter-client communication of futures. This enables both collaborative workflows (two researchers sharing the same cluster) and also complex workflows in which tasks start other tasks in a more streaming setting.

We added a simple mechanism to share a rolling buffer of futures between clients:

# Client 1
c = Client('scheduler:8786')
x = c.channel('x')

future = c.submit(inc, 1)
x.put(future)
# Client 2
c = Client('scheduler:8786')
x = c.channel('x')

future = next(iter(x))

Additionally, this relatively simple mechanism was built external to the scheduler and client, establishing a pattern we can repeat in the future for more complex inter-client communication systems. Generally I’m on the lookout for other ways to make the system more extensible. The range of extension requests for the scheduler is somewhat large these days, and we’d like to find ways to keep these expansions maintainable going forward.

New dependency: Sorted collections

The scheduler is now using the sortedcollections module, which is built on sortedcontainers, a pure-Python library; together they offer sorted containers like SortedList, SortedSet, and ValueSortedDict at C-extension speeds.
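
For anyone unfamiliar with these libraries, here is a quick illustration of the style of API (unrelated to the scheduler internals):

from sortedcontainers import SortedList

priorities = SortedList([5.0, 1.0, 3.0])
priorities.add(2.0)                         # stays sorted on insert
print(priorities[0], priorities[-1])        # cheap access to smallest and largest
print(list(priorities.irange(1.5, 4.0)))    # fast range queries: [2.0, 3.0]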

So far I’m pretty sold on these libraries. I encourage other library maintainers to consider them.

December 12, 2016 12:00 AM

December 11, 2016

Titus Brown

What metadata should we put in MinHash Sketch signatures?

One of the uses that we are most interested in MinHash sketches for is the indexing and search of large public, semi-public, and private databases. There are many specific use cases for this, but the basic goal is to be able to find data sets by content queries, using sequence as the "bait". Think "find me data sets that overlap with my metagenome", or "what should I co-assemble with?" One particularly interesting feature of MinHash sketches for this purpose is that you can provide indices on closed or private data sets without revealing the actual data - while I'd prefer that most data be open to all, I figure "findable" is at least an advantage over the current situation.

As we start to plan the indexing of larger databases, a couple of other features of MinHash sketches also start to become important. One feature is that they are very small, and they are also very quick to search. For 60,000 microbial genomes the compressed data set of sourmash sketches is under a few GB, and that's with an overly verbose and unoptimized storage format. These 60,000 genomes can be searched in under a few seconds and in less than a GB of RAM; because of the wonder of n-ary trees, it is unlikely that search of much larger databases will be significantly slower. A third feature (well explored in the mash paper) is that MinHash sketches with large k are both very specific and very sensitive to single genomes, in that you usually recover the right match, and it is rare to recover irrelevant matches.

One consequence of the speed and small footprint of MinHash sketches is that we can easily provide the individual sketches as well as the aggregated Sequence Bloom Tree databases for download and use. Another consequence is that people can search and filter on these databases quite quickly and without a lot of hardware - pretty much everything can be done on laptop-scale hardware. Moreover the sketches (once calculated) don't really need to be updated - the sketch will change very little even if an assembly is updated. So while people might be interested in building custom MinHash databases for searching subsets of archives, it seems reasonable to maintain a single database of all the sketches that can be downloaded and searched by anyone.

This opinion informed my response to Michael Barton, who is interested in building custom databases for several reasons - my guess is that this will be a somewhat specialized (though perhaps reasonably frequent) use case, compared to simply downloading and using a pre-constructed database. More important to me is the interoperability of different tools, which basically boils down to choosing the same hash functions and (eventually) figuring out what k-mer size and number of MinHash values to store per data set.

Something that I'm more focused on at the moment is another question that Michael asked, which is about metadata. Right now our individual signature files can contain multiple sketches per sample, with different k-mer sizes and molecule types (DNA/protein). These are kept in YAML. Because of this, the format is easily extensible to include a variety of metadata, but I have put very little thought into what metadata to store.

Thinking out loud,

  • there will be a few pieces of metadata that every sketch should have; for public data, for example, the URL and an unambiguous database specific identifier should be there.

  • each source database will have its own metadata records; if we index data sets from the Assembly database at NCBI, there will be different fields available than from the SRA database at NCBI, vs the MG-RAST metagenome collection, vs the IMG/M database. I'm not aware of any metadata standards here (but I wouldn't know, either).

    This means that trying to come up with a single standard is an idea that is doomed to fail.

  • we should try to include enough information that there is something human readable and useful, if possible;

  • I'm not sure how much information we need to include beyond database identity and database record ID; it seems like dipping our toes into (e.g.) taxonomy and phylogeny would be a dangerous game, and that information could be pulled out of the databases for whatever specific use case.

  • I'm comfortable with the idea of developing out the details over time as we add new data sets, and perhaps updating old records with more complete metadata as we develop new use cases and more robust handling code.

Some examples

For example, looking at Shewanella oneidensis MR-1, the assembly record has the following info:

ASM14616v2
Organism name: Shewanella oneidensis MR-1 (g-proteobacteria)
Infraspecific name: Strain: MR-1
BioSample: SAMN02604014
Submitter: TIGR
Date: 2012/11/02
Assembly level: Complete Genome
Genome representation: full
RefSeq category: reference genome
Relation to type material: assembly from type material
GenBank assembly accession: GCA_000146165.2 (latest)
RefSeq assembly accession: GCF_000146165.2 (latest)
RefSeq assembly and GenBank assembly identical: yes

Clearly we want to store 'organism name' and probably the strain, and the accession information; and we probably want to include assembly level and genome representation. I'd probably also add the URL to download the .fna.gz file. But I don't think we want statistics (included at the bottom of the page), or any of the other information on the Genome page, because we'd end up having to update that regularly for many samples.
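
Sketching that out (as a Python dict for concreteness, since the YAML serialization would look much the same), a record for this assembly might carry fields like these -- every field name here is hypothetical and subject to change:

metadata = {
    'db': 'NCBI Assembly',
    'accession': 'GCA_000146165.2',
    'name': 'Shewanella oneidensis MR-1',
    'strain': 'MR-1',
    'assembly_level': 'Complete Genome',
    'genome_representation': 'full',
    # plus the URL to download the .fna.gz file
}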


Looking at the SRA record for a metagenome from Hu et al., 2016, I'd probably want to include:

  • the fact that it is metagenomic FASTQ;
  • the description at the top "Illumina MiSeq paired end sequencing; metagenome SB1 from not soured petroleum reservoir, Schrader bluffer formation, Alaska North Slope"
  • whatever error trimming/correction commands I used before minhashing it;
  • a link to the ENA FASTQ files for download;

and that's about it.
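
To make this concrete, here's a rough sketch of the kind of per-sketch metadata block I have in mind, built in Python and dumped to the YAML that the signature files already use. All of the field names here are hypothetical placeholders, not a proposed standard:

import yaml

# hypothetical metadata for one sketch; field names are placeholders, not a standard
metadata = {
    'db': 'NCBI assembly',                 # source database
    'db_id': 'GCF_000146165.2',            # unambiguous database-specific identifier
    'name': 'Shewanella oneidensis MR-1',  # human-readable organism name
    'strain': 'MR-1',
    'assembly_level': 'Complete Genome',
    'genome_representation': 'full',
    'download_url': 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/165/'
                    'GCF_000146165.2_ASM14616v2/GCF_000146165.2_ASM14616v2_genomic.fna.gz',
}

print(yaml.dump(metadata, default_flow_style=False))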


Other records would presumably vary in similar ways, ranging from really minimal information ("this kind of sample, this kind of sequencing, have fun") to much more fleshed out metadata.

Your thoughts on how to go about this?

--titus

by C. Titus Brown at December 11, 2016 11:00 PM

December 08, 2016

Enthought

Webinar: Solving Enterprise Python Deployment Headaches with the New Enthought Deployment Server

See a recording of the webinar:

Built on 15 years of experience of Python packaging and deployment for Fortune 500 companies, the NEW Enthought Deployment Server provides the enterprise-grade tools that groups and organizations using Python need, including:

  1. Secure, onsite access to a private copy of the proven 450+ package Enthought Python Distribution
  2. Centralized management and control of packages and Python installations
  3. Private repositories for sharing and deployment of proprietary Python packages
  4. Support for the software development workflow with Continuous Integration and development, testing, and production repositories

In this webinar, Enthought’s product team demonstrates the key features of the Enthought Deployment Server and how it can take the pain out of Python deployment and management at your organization.

Who Should Watch this Webinar:

If you answer “yes” to any of the questions below, then you (or someone at your organization) should watch this webinar:

  1. Are you using Python in a high-security environment (firewalled or air gapped)?
  2. Are you concerned about how to manage open source software licenses or compliance management?
  3. Do you need multiple Python environment configurations or do you need to have consistent standardized environments across a group of users?
  4. Are you producing or sharing internal Python packages and spending a lot of effort on distribution?
  5. Do you have a “guru” (or are you the guru?) who spends a lot of time managing Python package builds and / or distribution?

In this webinar, we demonstrate how the Enthought Deployment Server can help your organization address these situations and more.

The post Webinar: Solving Enterprise Python Deployment Headaches with the New Enthought Deployment Server appeared first on Enthought Blog.

by admin at December 08, 2016 03:39 PM

December 07, 2016

Enthought

Using the Canopy Data Import Tool to Speed Cleaning and Transformation of Data & New Release Features

Enthought Canopy Data Import Tool

Download Canopy to try the Data Import Tool

In November 2016, we released Version 1.0.6 of the Data Import Tool (DIT), an addition to the Canopy data analysis environment. With the Data Import Tool, you can quickly import structured data files as Pandas DataFrames, clean and manipulate the data using a graphical interface, and create reusable Python scripts to speed future data wrangling.

For example, the Data Import Tool lets you delete rows and columns containing Null values or replace the Null values in the DataFrame with a specific value. It also allows you to create new columns from existing ones. All operations are logged and reversible in the Data Import Tool, so you can experiment with various workflows with safeguards against errors or forgotten steps.
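
For readers who prefer to see what that looks like in code, here is a minimal pandas sketch of the equivalent operations; the file and column names are made up for illustration, and this is not the script the Tool generates:

import pandas as pd

df = pd.read_csv('my_data.csv')             # load a structured data file (hypothetical file)

df = df.dropna(how='all')                   # delete rows in which every value is Null
df = df.dropna(axis=1, how='any')           # delete columns containing any Null values
df = df.fillna({'score': 0})                # or replace Nulls in a column with a specific value
df['total'] = df['price'] * df['quantity']  # create a new column from existing ones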


What’s New in the Data Import Tool November 2016 Release

Pandas 0.19 support, re-usable templates for data munging, and more.

Over the last couple of releases, we added a number of new features and enhanced a number of existing ones. A few notable changes are:

  1. The Data Import Tool now supports the recently released Pandas version 0.19.0. With this update, the Tool now supports Pandas versions 0.16 through 0.19.
  2. The Data Import Tool now allows you to delete empty columns in the DataFrame, similar to the existing option to delete empty rows.
  3. The Data Import Tool allows you to choose how to delete rows or columns containing Null values: “Any” or “All” methods are available.
  4. The Data Import Tool automatically generates a corresponding Python script for the data manipulations performed in the GUI and saves it in your home directory for re-use in future data wrangling.

    Every time you successfully import a DataFrame, the Data Import Tool automatically saves a generated Python script in your home directory. This way, you can easily review and reproduce your earlier work.

  5. The Data Import Tool generates a Template with every successful import. A Template is a file that contains all of the commands or actions you performed on the DataFrame and a unique Template file is generated for every unique data file. With this feature, when you load a data file, if a Template file exists corresponding to the data file, the Data Import Tool will automatically perform the operations you performed the last time. This way, you can save progress on a data file and resume your work.

Along with the feature additions discussed above, based on continued user feedback, we implemented a number of UI/UX improvements and bug fixes in this release. For a complete list of changes introduced in Version 1.0.6 of the Data Import Tool, please refer to the Release Notes page in the Tool’s documentation.



Example Use Case: Using the Data Import Tool to Speed Data Cleaning and Transformation

Now let’s take a look at how the Data Import Tool can be used to speed up the process of cleaning up and transforming data sets. As an example data set, let’s take a look at the Employee Compensation data from the city of San Francisco.

NOTE: You can follow the example step-by-step by downloading Canopy and starting a free 7-day trial of the Data Import Tool.

Step 1: Load data into the Data Import Tool

First, we’ll download the data as a .csv file from the San Francisco Government data website, then open it from the File -> Import Data -> From File… menu item in the Canopy Editor (see screenshot at right).

After loading the file, you should see the DataFrame in the Data Import Tool’s DataFrame view, alongside the Edit Command window.

The Data Import Tool automatically detects and converts column types (in this case to an integer type).

As you can see at the right, the Data Import Tool automatically detected and converted the columns “Job Code”, “Job Family Code” and “Union Code” to an Integer column type. But if the Tool inferred a type incorrectly, you can simply remove a specific column conversion by deleting it from the Edit Command window, or remove all conversions by deleting the command via the “X” in the Command History window.
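
A rough pandas equivalent of that conversion, assuming the DataFrame has been pushed to the console as Employee_Compensation (as happens later in this example) and that the columns hold clean numeric values:

import pandas as pd

# convert the code columns to numeric types, as the Tool does automatically
for col in ['Job Code', 'Job Family Code', 'Union Code']:
    Employee_Compensation[col] = pd.to_numeric(Employee_Compensation[col])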

Step 2: Use the Data Import Tool to quickly assess data by sorting in the GUI

Using the Employee Compensation data set, let’s answer a few questions. For example, let’s see which Job Families get the highest Salary, the highest Overtime, the highest Total Salary and the highest Compensation. Further, let’s also determine what the highest and mean Total Compensation for a Job Family is.

Let’s start with the question “Which Job Family contains the highest Salary?” We can get this information easily by clicking on the right end of the “Salaries” column to sort the column in ascending or descending order. Doing so, we can see that the highest paid Job Family is “Administrative & Mgmt (Unrep)” and, specifically, the Job is Chief Investment Officer. In fact, 4 of the top 5 Salaries are paid to Chief Investment Officers.

Similarly, we can sort the “Overtime” column (see screenshot at right) to see which Job Family gets paid the most Overtime (it turns out to be the “Deputy Sheriff” job family).

Sort the Total Salary and Total Compensation columns to find out which Job and Job Family had the highest salary and highest overall compensation.
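
The same questions can also be answered directly with sort_values once the DataFrame reaches the IPython console; a minimal sketch, using the column names shown later in this post:

# top five rows by Salaries, Overtime and Total Compensation
top_salaries = Employee_Compensation.sort_values('Salaries', ascending=False).head(5)
top_overtime = Employee_Compensation.sort_values('Overtime', ascending=False).head(5)
top_compensation = Employee_Compensation.sort_values('Total Compensation', ascending=False).head(5)

print(top_salaries[['Job Family', 'Job', 'Salaries']])
print(top_overtime[['Job Family', 'Job', 'Overtime']])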

[Note: While sorting the data set, you may have noticed that there are negative values in the Salaries column. Yup. And hey! Don’t ask us. We don’t know why there are negative Salaries values either. If you know why, or if you can figure out why, we would love to know! Comment below and tell us!]

Step 3: Simplify and Clean Data

Delete columns by right-clicking on a column name and selecting “Delete” from the menu.

Let’s now look at the second question we mentioned earlier: “What is the median income for different Job Families?” But before we get to that, let’s first remove a few columns with data not relevant to the questions we’re trying to answer (or you may choose to ask different questions and keep the columns instead). Here we delete columns by clicking on the “Delete” menu item after right-clicking on a column name.

When you are satisfied with how the DataFrame looks, click on the “Use DataFrame” button to push the DataFrame to Canopy’s IPython Console, where we can further analyze the data set. In Canopy’s IPython console, you can see what the final columns in the DataFrame are, which can be accessed using DataFrame.columns.

[u'Year Type', u'Year', u'Organization Group', u'Department', u'Job Family', u'Job', u'Salaries', u'Overtime', u'Other Salaries', u'Total Salary', u'Retirement', u'Health/Dental', u'Other Benefits', u'Total Benefits', u'Total Compensation']

Let’s now use the pandas DataFrame.groupby method to calculate the median salary of different Job Families over the years. Passing both Job Family and Year segments the original DataFrame based on Job Family first and Year next. This way, we will be able to see differences in median Total Compensation between Job Families and how it changed within a Job Family over the years.

grouped_df = Employee_Compensation.groupby(['Job Family', 'Year'])
for name, df in grouped_df:
    print("{} - {}: median={:9.2f}, n={}".format(name[-1], name[0],
                                                 df['Total Compensation'].median(),
                                                 df['Total Compensation'].count()))


2013 - Administrative & Mgmt (Unrep): median=65154.66, n=9
2014 - Administrative & Mgmt (Unrep): median=189534.965, n=12
2015 - Administrative & Mgmt (Unrep): median=352931.01, n=13
2016 - Administrative & Mgmt (Unrep): median=351961.28, n=9
2013 - Administrative Secretarial: median=122900.205, n=22
2014 - Administrative Secretarial: median=130164.525, n=20
2015 - Administrative Secretarial: median=127206.02, n=19
2016 - Administrative Secretarial: median=137861.05, n=9
2013 - Administrative-DPW/PUC: median=164535.52, n=89
2014 - Administrative-DPW/PUC: median=172906.585, n=82
2015 - Administrative-DPW/PUC: median=180582.9, n=85
2016 - Administrative-DPW/PUC: median=180095.54, n=44
. . .

We hope that this gives you a small idea of what can be done using the Data Import Tool and the Python Pandas library. If you analyzed this data set in a different way, comment below and tell us about it.

BTW, if you are interested in honing your data analysis skills in Python, check out our Virtual Pandas Crash Course or join the Pandas Mastery Workshop for a more comprehensive introduction to Pandas and data analysis using it.

If you have any feedback regarding the Data Import Tool, we’d love to hear from you at canopy.support@enthought.com.

Additional resources:

Watch a 2-minute demo video to see how the Canopy Data Import Tool works:

See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging:

The post Using the Canopy Data Import Tool to Speed Cleaning and Transformation of Data & New Release Features appeared first on Enthought Blog.

by cgodshall at December 07, 2016 07:59 PM

December 06, 2016

Continuum Analytics news

Introducing: fastparquet

Tuesday, December 6, 2016
Martin Durant
Continuum Analytics

A compliant, flexible and speedy interface to Parquet format files for Python, fastparquet provides seamless translation between in-memory pandas DataFrames and on-disc storage.

In this post, we will introduce the two functions that will most commonly be used within fastparquet, followed by a discussion of the current Big Data landscape, Python's place within it and details of how fastparquet fills one of the gaps on the way to building out a full end-to-end Big Data pipeline in Python.

fastparquet Teaser

New users of fastparquet will mainly use the functions write and ParquetFile.to_pandas. Both functions offer good performance with default values, and both have a number of options to improve performance further.

import fastparquet

# write data
fastparquet.write('out.parq', df, compression='SNAPPY')

# load data
pfile = fastparquet.ParquetFile('out.parq') 
df2 = pfile.to_pandas() # all columns 
df3 = pfile.to_pandas(columns=['floats', 'times']) # pick some columns

Introduction: Python and Big Data

Python was named as a favourite tool for data science by 45% of data scientists in 2016. Many reasons can be presented for this, and near the top will be:

  • Python is very commonly taught at college and university level

  • Python and associated numerical libraries are free and open source

  • The code tends to be concise, quick to write, and expressive

  • An extremely rich ecosystem of libraries exists for not only numerical processing but also other important links in the pipeline from data ingest to visualization and distribution of results

Big Data, however, has typically been based on traditional databases and, more recently, the Hadoop ecosystem. Hadoop provides a distributed file-system, cluster resource management (YARN, Mesos) and a set of frameworks for processing data (map-reduce, Pig, Kafka, and many more). In the past few years, Spark has rapidly increased in usage, becoming a major force, even though 62% of its users execute Spark jobs from Python (via PySpark).

The Hadoop ecosystem and its tools, including Spark, are heavily based around the Java Virtual Machine (JVM), which creates a gap between the familiar, rich Python data ecosystem and clustered Big Data with Hadoop. One such missing piece is a data format that can efficiently store large amounts of tabular data, in a columnar layout, and split it into blocks on a distributed file-system.

Parquet has become the de-facto standard file format for tabular data in Spark, Impala and other clustered frameworks. Parquet provides several advantages relevant to Big Data processing:

  • Columnar storage, only read the data of interest

  • Efficient binary packing

  • Choice of compression algorithms and encoding

  • Splits data into files, allowing for parallel processing

  • Range of logical types

  • Statistics stored in metadata to allow for skipping unneeded chunks

  • Data partitioning using the directory structure

fastparquet bridges the gap to provide native Python read/write access without the need to use Java.

Until now, Spark's Python interface provided the main way to write Parquet files from Python, and much of that time is spent deserializing the data in the Java-Python bridge; the times column also comes back as plain integers rather than the correct datetime type. Not only does fastparquet provide native access to Parquet files, it in fact makes the transfer of data to Spark much faster.

# to make and save a large-ish DataFrame
import pandas as pd 
import numpy as np 
N = 10000000

df = pd.DataFrame({'ints': np.random.randint(0, 1000, size=N),
                   'floats': np.random.randn(N),
                   'times': pd.DatetimeIndex(start='1980', freq='s', periods=N)})
import pyspark
sc = pyspark.SparkContext()
sql = pyspark.SQLContext(sc) 

The default Spark single-machine configuration cannot handle the above DataFrame (out-of-memory error), so we'll perform timing for 1/10 of the data:

# sending data to spark via pySpark serialization, 1/10 of the data
%time o = sql.createDataFrame(df[::10]).count()
CPU times: user 3.45 s, sys: 96.6 ms, total: 3.55 s
Wall time: 4.14 s
%%time
# sending data to spark via a file made with fastparquet, all the data 
fastparquet.write('out.parq', df, compression='SNAPPY')
df4 = sql.read.parquet('out.parq').count()
CPU times: user 2.75 s, sys: 285 ms, total: 3.04 s
Wall time: 3.27 s


The fastparquet Library

fastparquet is an open source library providing a Python interface to the Parquet file format. It uses Numba and NumPy to provide speed, and writes data to and from pandas DataFrames, the most typical starting point for Python data science operations.

fastparquet can be installed using conda:

conda install -c conda-forge fastparquet

(currently only available for Python 3)

  • The code is hosted on GitHub
  • The primary documentation is on RTD

Bleeding edge installation directly from the GitHub repo is also supported, as long as Numba, pandas, pytest and ThriftPy are installed.

Reading Parquet files into pandas is simple and, again, much faster than via PySpark serialization.

import fastparquet 
pfile = fastparquet.ParquetFile('out.parq')
%time df2 = pfile.to_pandas()
CPU times: user 812 ms, sys: 291 ms, total: 1.1 s
Wall time: 1.1 s


The Parquet format is more compact and faster to load than the ubiquitous CSV format.

df.to_csv('out.csv')
!du -sh out.csv out.parq
490M    out.csv
162M    out.parq


In this case, the data is 229MB in memory, which translates to 162MB on-disc as Parquet or 490MB as CSV. Loading from CSV takes substantially longer than from Parquet.

%time df2 = pd.read_csv('out.csv', parse_dates=True)
CPU times: user 9.85 s, sys: 1 s, total: 10.9 s
Wall time: 10.9 s


The biggest advantage, however, is the ability to pick only some columns of interest. In CSV, this still means scanning through the whole file (if not parsing all the values), but the columnar nature of Parquet means only reading the data you need.

%time df3 = pd.read_csv('out.csv', usecols=['floats'])
%time df3 = pfile.to_pandas(columns=['floats'])
CPU times: user 4.04 s, sys: 176 ms, total: 4.22 s
Wall time: 4.22 s
CPU times: user 40 ms, sys: 96.9 ms, total: 137 ms
Wall time: 137 ms


Example

We have taken the airlines dataset and converted it into Parquet format using fastparquet. The original data was in CSV format, one file per year, 1987-2004. The total data size is 11GB as CSV, uncompressed, which becomes about double that in memory as a pandas DataFrame for typical dtypes. This is approaching, if not Big Data, Sizable Data, because it cannot fit into my machine's memory.

The Parquet data is stored as a multi-file dataset. The total size is 2.5GB, with Snappy compression throughout.

ls airlines-parq/
_common_metadata  part.12.parquet   part.18.parquet   part.4.parquet
_metadata         part.13.parquet   part.19.parquet   part.5.parquet
part.0.parquet    part.14.parquet   part.2.parquet    part.6.parquet
part.1.parquet    part.15.parquet   part.20.parquet   part.7.parquet
part.10.parquet   part.16.parquet   part.21.parquet   part.8.parquet
part.11.parquet   part.17.parquet   part.3.parquet    part.9.parquet


To load the metadata:

import fastparquet
pf = fastparquet.ParquetFile('airlines-parq')

The ParquetFile instance provides various information about the data set in attributes:

pf.info
pf.schema
pf.dtypes
pf.count

Furthermore, we have information available about the "row-groups" (logical chunks) and the 29 column fragments contained within each. In this case, we have one row-group for each of the original CSV files—that is, one per year.

fastparquet will not generally be as fast as a direct memory dump, such as numpy.save or Feather, nor will it be as fast or compact as custom-tuned formats like bcolz. However, it provides good trade-offs and options that can be tuned to the nature of the data. For example, the column/row-group chunking of the data allows pre-selection of only some portions of the total, which means the rest of the data on disc need not be scanned at all. The load speed will depend on the data type of the column, the efficiency of compression, and whether there are any NULLs.

There is, in general, a trade-off between compression and processing speed; uncompressed will tend to be faster, but larger on disc, and gzip compression will be the most compact, but slowest. Snappy compression, in this example, provides moderate space efficiency, without too much processing cost.
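
As a rough way to see that trade-off on your own data, you can write the same DataFrame with different compression settings and compare sizes and load times yourself (re-using the df built earlier in this post; the output file names are arbitrary):

import fastparquet

# no compression: fastest, but largest on disc
fastparquet.write('out_uncompressed.parq', df, compression=None)

# gzip: most compact, but slowest to write and read
fastparquet.write('out_gzip.parq', df, compression='GZIP')

# snappy: a middle ground between size and speed
fastparquet.write('out_snappy.parq', df, compression='SNAPPY')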

fastparquet has no problem loading a very large number of rows or columns (memory allowing):

%%time
# 124M bool values
d = pf.to_pandas(columns=['Cancelled'])
CPU times: user 436 ms, sys: 167 ms, total: 603 ms
Wall time: 620 ms
%%time
d = pf.to_pandas(columns=['Distance'])
CPU times: user 964 ms, sys: 466 ms, total: 1.43 s
Wall time: 1.47 s
%%time
# just the first portion of the data, 1.3M rows, 29 columns 
d = pf.to_pandas(filters=(('Year', '==', 1987), ))
CPU times: user 1.37 s, sys: 212 ms, total: 1.58 s
Wall time: 1.59 s

 

The following factors are known to reduce performance:

  • The existence of NULLs in the data. It is faster to use special values, such as NaN for data types that allow it, or other known sentinel values, such as an empty byte-string.

  • Variable-length string encoding is slow on both write and read, and fixed-length will be faster, although this is not compatible with all Parquet frameworks (particularly Spark). Converting to categories will be a good option if the cardinality is low.

  • Some data types require conversion in order to be stored in Parquet's few primitive types. Conversion may take some time.

The Python Big Data Ecosystem

fastparquet provides one of the necessary links for Python to be a first-class citizen within Big Data processing. Although useful alone, it is intended to work seamlessly with the following libraries:

  • Dask, a pure-Python, flexible parallel execution engine, and its distributed scheduler. Each row-group is independent of the others, and Dask can take advantage of this to process parts of a Parquet data-set in parallel. The Dask DataFrame closely mirrors pandas, and methods on it (a subset of all those in pandas) actually call pandas methods on the underlying shards of the logical DataFrame. The Dask Parquet interface is experimental, as it lags slightly behind development in fastparquet.

  • hdfs3, s3fs and adlfs provide native Pythonic interfaces to massive file systems. If the whole purpose of Parquet is to store Big Data, we need somewhere to keep it. fastparquet accepts a function to open a file-like object, given a path, and so can use any of these back-ends for reading and writing; this also makes it easy to add new file-system back-ends in the future. Choosing the back-end is automatic when using Dask and a URL like s3://mybucket/mydata.parq (see the short sketch after this list).
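
A short sketch of how these pieces fit together; the S3 bucket name is made up, and the Dask Parquet interface was still experimental when this was written:

import dask.dataframe as dd

# read the multi-file Parquet data set into a Dask DataFrame; each row-group
# becomes an independent partition that can be processed in parallel
df = dd.read_parquet('airlines-parq/')

# the same call against a remote store; the s3fs back-end is chosen
# automatically from the URL scheme (bucket name is hypothetical)
df_remote = dd.read_parquet('s3://mybucket/mydata.parq')

# Dask DataFrame methods mirror pandas and execute lazily across partitions
print(df['Distance'].mean().compute())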

With the blossoming of interactive visualization technologies for Python, the prospect of end-to-end Big Data processing projects is now fully realizable.

fastparquet Status and Plans

As of the publication of this article, the fastparquet library can be considered beta—useful to the general public and able to cope with many situations, but with some caveats (see below). Please try your own use case and report issues and comments on the GitHub tracker. The code will continue to develop (contributions welcome), and we will endeavour to keep the documentation in sync and provide regular updates.

A number of nice-to-haves are planned, and work to improve the performance should be completed around the new year, 2017.

Further Helpful Information

We don't have the space to talk about it here, but documentation at RTD gives further details on:

  • How to iterate through Parquet-stored data, rather than load the whole data set into memory at once

  • Using Parquet with Dask-DataFrames for parallelism and on a distributed cluster

  • Getting the most out of performance

  • Reading and writing partitioned data

  • Data types understood by Parquet and fastparquet

fastparquet Caveats

Aside from the performance pointers, above, some specific things do not work in fastparquet, and for some of these, fixes are not planned—unless there is substantial community interest.

  • Some encodings are not supported, such as delta encoding, since we have no test data to develop against.

  • Nested schemas are not supported at all, and are not currently planned, since they don't fit in well with pandas' tabular layout. If a column contains Python objects, they can be JSON-encoded and written to Parquet as strings (a minimal sketch follows after this list).

  • Some output Parquet files will not be compatible with some other Parquet frameworks. For instance, Spark cannot read fixed-length byte arrays.
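
For the object-column case mentioned above, a minimal sketch of the JSON workaround; the column name is made up:

import json
import pandas as pd
import fastparquet

df = pd.DataFrame({'payload': [{'a': 1}, {'b': [2, 3]}]})   # a column of Python objects

# JSON-encode the objects so they can be stored as Parquet strings
df['payload'] = df['payload'].map(json.dumps)
fastparquet.write('objects.parq', df)

# decode them again after reading back
df2 = fastparquet.ParquetFile('objects.parq').to_pandas()
df2['payload'] = df2['payload'].map(json.loads)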

This work is fully open source (Apache-2.0), and contributions are welcome.

Development of the library has been supported by Continuum Analytics.

by swebster at December 06, 2016 05:48 PM

December 05, 2016

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

Dask has been active lately due to a combination of increased adoption and funded feature development by private companies. This increased activity is great; however, an unintended side effect is that I have spent less time writing about development and engaging with the broader community. To address this I hope to write one blogpost a week about general development. These will not be particularly polished, nor will they announce ready-to-use features for users, but they should increase transparency and hopefully better engage the developer community.

So, the themes of last week:

  1. Embedded Bokeh servers for the Workers
  2. Smarter workers
  3. An overhauled scheduler that is slightly simpler overall (thanks to the smarter workers) but with more clever work stealing
  4. Fastparquet

Embedded Bokeh Servers in Dask Workers

The distributed scheduler’s web diagnostic page is one of Dask’s more flashy features. It shows the passage of every computation on the cluster in real time. These diagnostics are invaluable for understanding performance both for users and for core developers.

I intend to focus on worker performance soon, so I decided to attach a Bokeh server to every worker to serve web diagnostics about that worker. To make this easier, I also learned how to embed Bokeh servers inside of other Tornado applications. This has considerably reduced the effort needed to create new visuals and expose real-time information, and I can now create a full live visualization in around 30 minutes. It is now faster for me to build a new diagnostic than to grep through logs. It’s pretty useful.
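
This isn’t the Dask worker code itself, but the general pattern of running a Bokeh server on a Tornado IOLoop looks roughly like the following; the plotting callback and port are stand-ins:

from bokeh.application import Application
from bokeh.application.handlers import FunctionHandler
from bokeh.plotting import figure
from bokeh.server.server import Server
from tornado.ioloop import IOLoop

def make_doc(doc):
    # stand-in for a real diagnostic; a worker would add periodically
    # updated figures (bandwidth, task durations, ...) here instead
    fig = figure(title='worker diagnostics (placeholder)')
    fig.line([0, 1, 2, 3], [0, 1, 4, 9])
    doc.add_root(fig)

loop = IOLoop.current()
server = Server({'/diagnostics': Application(FunctionHandler(make_doc))},
                io_loop=loop, port=8789)
server.start()
loop.start()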

Here are some screenshots. Nothing too flashy, but this information is highly valuable to me as I measure bandwidths, delays in various parts of the code, how workers send data between each other, etc.

Dask Bokeh Worker system pages (screenshots)

To be clear, these diagnostic pages aren’t polished in any way. There’s lots missing, it’s just what I could get done in a day. Still, everyone running a Tornado application should have an embedded Bokeh server running. They’re great for rapidly pushing out visually rich diagnostics.

Smarter Workers and a Simpler Scheduler

Previously the scheduler knew everything and the workers were fairly simple-minded. Now we’ve moved some of the knowledge and responsibility over to the workers. Previously the scheduler would give just enough work to the workers to keep them occupied. This allowed the scheduler to make better decisions about the state of the entire cluster. By delaying committing a task to a worker until the last moment we made sure that we were making the right decision. However, this also means that the worker sometimes has idle resources, particularly network bandwidth, when it could be speculatively preparing for future work.

Now we commit all ready-to-run tasks to a worker immediately, and that worker has the ability to pipeline those tasks as it sees fit. This is better locally but slightly worse globally. To counterbalance this, we’re now being much more aggressive about work stealing and, because the workers have more information, they can manage some of the administrative costs of work stealing themselves. Because this isn’t bound to run on just the scheduler, we can use more expensive algorithms than when we did everything on the scheduler.

There were a few motivations for this change:

  1. Dataframe performance was limited by how fully we kept the worker hardware occupied, which we weren’t doing well. I expect that these changes will eventually yield something like a 30% speedup.
  2. Users on traditional job scheduler machines (SGE, SLURM, TORQUE) and users with GPUs both wanted the ability to tag tasks with specific resource constraints like “This consumes one GPU” or “This task requires 5GB of RAM while running” and to ensure that workers would respect those constraints when running tasks. The old workers weren’t complex enough to reason about these constraints. With the new workers, adding this feature was trivial (a short sketch follows after this list).
  3. By moving logic from the scheduler to the worker we’ve actually made them both easier to reason about. This should lower barriers for contributors to get into the core project.
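
A rough sketch of what that resource tagging looks like from the user's side; the scheduler address is a placeholder, and the exact flag and keyword names reflect my reading of the interface that shipped in distributed, so check the docs:

# workers advertise resources when they start, e.g.:
#   dask-worker scheduler-address:8786 --resources "GPU=2 MEMORY=10e9"

from distributed import Client

def train_model(x):
    return x * 2            # stand-in for real GPU-bound work

client = Client('scheduler-address:8786')   # placeholder scheduler address

# this task will only be scheduled on a worker advertising a free GPU
future = client.submit(train_model, 21, resources={'GPU': 1})
print(future.result())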

Dataframe algorithms

Approximate nunique and multiple-output-partition groupbys landed in master last week. These arose because some power-users had very large dataframes that were running into scalability limits. Thanks to Mike Graham for the approximate nunique algorithm. This has also pushed hashing changes upstream to Pandas.

Fast Parquet

Martin Durant has been working on a Parquet reader/writer for Python using Numba. It’s pretty slick. He’s been using it on internal Continuum projects for a little while and has seen both good performance and a very Pythonic experience for what was previously a format that was pretty inaccessible.

He’s planning to write about this in the near future so I won’t steal his thunder. Here is a link to the documentation: fastparquet.readthedocs.io

December 05, 2016 12:00 AM

November 30, 2016

Continuum Analytics news

Data Science in the Enterprise: Keys to Success

Wednesday, November 30, 2016
Travis Oliphant
President, Chief Data Scientist & Co-Founder
Continuum Analytics


When examining the success of one of the most influential and iconic rock bands of all time, there’s no doubt that talent played a huge role. However, it would be unrealistic to attribute the phenomenon that was The Beatles to musical talents alone. Much of their success can be credited to the behind-the-scenes work of trusted advisors, managers and producers. There were many layers beneath the surface that contributed to their incredible fame—including implementing the proper team and tools to propel them from obscurity to commercial global success.  


Open Source: Where to Start

Similar to the music industry, success in Open Data Science relies heavily on many layers, including motivated data scientists, proper tools and the right vision for how to leverage data and perspective. Open Data Science is not a single technology, but a revolution within the data science community. It is an inclusive movement that connects open source tools for data science—preparation, analytics and visualization—so they can easily work together as a connected ecosystem. The challenge lies in figuring out how to successfully navigate the ecosystem and identifying the right Open Data Science enterprise vendors to partner with for the journey. 

Most organizations have come to understand the value of Open Data Science, but they often struggle with how to adopt and implement it. Some select a “DIY” method when addressing open source, choosing one of the languages or tools available at low or no cost. Others augment an open source base and build proprietary technology into existing infrastructures to address data science needs. 

Most organizations will engage enterprise-grade products and services when selecting other items, such as unified communication and collaboration tools, instead of opting for short-run cost-savings. For example, using consumer-grade instant messaging and mobile phones might save money this quarter, but over time this choice will end up costing an organization much more. This is due to the costs in labor and other services to make up for the lack of enterprise features, performance for enterprise use-cases and support and maintenance that is essential to successful production usage.  

The same standards apply for Open Data Science and the open source that surrounds this movement. While it is tempting to try and go at it alone with open source and avoid paying a vendor, there are fundamental problems with that strategy that will result in delayed deliverables, staffing challenges, maintenance headaches for software and frustration when the innovative open source communities move faster than an organization can manage or in a direction that is unexpected. All of this hurts the bottom line and can be easily avoided by finding an open source vendor that can navigate the complexity and ensure the best use of what is available in Open Data Science. In the next section, we will discuss three specific reasons it is important to choose vendors that can leverage open source effectively in the enterprise. 

Finding Success: The Importance of Choosing the Right Vendor/Partner

First, look for a vendor who is contributing significantly to the open source ecosystem. An open source vendor will not only provide enterprise solutions and services on top of existing open source, but will also produce significant open source innovations themselves—building communities like PyData, as well as contributing to open source organizations like The Apache Software Foundation, NumFOCUS or Software Freedom Conservancy. In this way, the software purchase translates directly into sustainability for the entire open source ecosystem. This will also ensure that the open source vendor is plugged into where the impactful open source communities are heading. 

Second, raw open source provides a fantastic foundation of innovation, but invariably does not contain all the common features necessary to adapt to an enterprise environment. Integration with disparate data sources, enterprise databases, single sign-on systems, scale-out management tools, tools for governance and control, as well as time-saving user interfaces, are all examples of things that typically do not exist in open source or exist in a very early form that lags behind proprietary offerings. Using internal resources to provide these common, enterprise-grade additions costs more money in the long run than purchasing these features from an open source vendor. 

The figure on the left below shows the kinds of ad-hoc layers that a company must typically create to adapt their applications, processes and workflows to what is available in open source. These ad-hoc layers are not unique to any one business, are hard to maintain and end up costing a lot more money than a software subscription from an open source vendor that would cover these capabilities with some of their enterprise offerings. 

The figure on the right above shows the addition of an enterprise layer that should be provided by an open source vendor. This layer can be proprietary, which will enable the vendor to build a sustainable software business that attracts investment, while it solves the fundamental adaptation problem as well. As long as the vendor is deeply connected to open source ecosystems and is constantly aware of what part of the stack is better maintained as open source, businesses receive the best of supported enterprise software without the painful lock-in and innovation gaps of traditional proprietary-only software. 

Maintaining ad-hoc interfaces to open source becomes very expensive, very quickly.   Each interface is typically understood by only a few people in an organization and if they leave or move to different roles, their ability to make changes evaporates. In addition, rather than amortizing the cost of these interfaces over thousands of companies like a software vendor can do, the business pays the entire cost on their own. This discussion does not yet include the opportunity cost of tying up internal resources building and maintaining these common enterprise features instead of having those internal resources work on the software that is unique to a business. The best return from scarce software development talent is on software critical to a business that gives them a unique edge. We have also not discussed the time-to-market gaps that occur when organizations try to go at it alone, rather than selecting an open source vendor who becomes a strategic partner. Engaging an open source vendor who has in-depth knowledge of the technology, is committed to growing the open source ecosystem and has the ability to make the Open Data Science ecosystem work for enterprises, saves organizations significant time and money. 

Finally, working with an open source vendor provides a much needed avenue for the integration services, training and long-term support that is necessary when adapting an open source ecosystem to the enterprise. Open source communities develop for many reasons, but they are typically united in a passion for rapid innovation and continual progress. Adapting the rapid pace of this innovation to the more methodical gear of enterprise value creation requires a trusted open source vendor. Long-term support of older software releases, bug fixes that are less interesting to the community but essential to enterprises and industry-specific training for data science teams are all needed to fully leverage Open Data Science in the enterprise. The right enterprise vendor will help an enterprise obtain all of this seamlessly. 

The New World Order: Adopting Open Data Science in the Enterprise

The journey to executing successful data science in the enterprise lies in the combination of the proper resources and tools. In general, in-house IT does not typically have the expertise needed to exploit the immense possibilities inherent to Open Data Science.  

Open Data Science platforms, like Anaconda, are a key mechanism to adopting Open Data Science across an organization. These platforms offer differing levels of empowerment for everyone from the citizen data scientist to the global enterprise data science team. Open Data Science in the enterprise has different needs from an individual or a small business. While the free foundational core of Anaconda may be enough for the individual data explorer or the small business looking to use marketing data to target market segments, a large enterprise will typically need much more support and enterprise features in order to successfully implement open source and therefore Open Data Science across their organization. Because of this, it is critical that larger organizations identify an enterprise open source vendor to both provide support and guidance as they implement Open Data Science.  This vendor should also be able to provide that enterprise layer between the applications, processes and workflows that the data science team produces and the diverse open source ecosystem. The complexity inherent to this process of maximizing insights from data will demand proficiency from both the team and vendors, in order to harness the power of the data to transform the business to one that is first data-aware and then data-driven. 

Anaconda allows enterprises to innovate faster. It exposes previously unknown insights and improves the relationship between all members of the data science team. As a platform that embraces and deeply supports open source, it helps businesses to take full advantage of both the innovation at the core of the Open Data Science movement, as well as the enterprise adaptation that is essential to leveraging the full power of open source effectively in the business. It’s time to remove the chaos from open source and use Open Data Science platforms to simplify things, so that enterprises can fully realize their own superpowers to change the world. 

by swebster at November 30, 2016 03:34 PM

November 26, 2016

Titus Brown

Quickly searching all the microbial genomes, mark 2 - now with archaea, phage, fungi, and protists!

This is an update to last week's blog post, "Efficiently searching MinHash Sketch collections".


Last week, Thanksgiving travel and post-turkey somnolence gave me some time to work more with our combined MinHash/SBT implementation. One of the main things the last post contained was a collection of MinHash signatures of all of the bacterial genomes, together with a Sequence Bloom Tree index of them that enabled fast searching.

Working with the index from last week, a few problems emerged:

  • In my initial index calculation, I'd ignored non-bacterial microbes. Conveniently my colleague Dr. Jiarong (Jaron) Guo had already downloaded the viral, archaeal, and protist genomes from NCBI for me.

  • The MinHashes I'd calculated contained only the filenames of the genome assemblies, and didn't contain the name or accession numbers of the microbes. This made them really annoying to use.

    (See the new --name-from-first argument to sourmash compute.)

  • We guessed that we wanted more sensitive MinHash sketches for all the things, which would involve re-calculating the sketches with more hashes. (The default is 500, which gives you one hash per 10,000 k-mers for a 5 Mbp genome.)

  • We also decided that we wanted more k-mer sizes; the sourmash default is 31, which is pretty specific and could limit the sensitivity of genome search. k=21 would enable more sensitivity, k=51 would enable more stringency.

  • I also came up with some simple ideas for using MinHash for taxonomy breakdown of metagenome samples, but I needed the number of k-mers in each hashed genome to do a good job of this. (More on this later.)

    (See the new --with-cardinality argument to sourmash compute.)

Unfortunately this meant I had to recalculate MinHashes for 52,000 genomes, and calculate them for 8,000 new genomes. And it wasn't going to take only 36 hours this time, because I was calculating approximately 6 times as much stuff...

Fortunately, 6 x 36 hrs still isn't very long, especially when you're dealing with pleasantly parallel low-memory computations. So I set it up to run on Friday, and ran six processes at the same time, and it finished in about 36 hours.

Indexing the MinHash signatures also took much longer than the first batch, probably because the signature files were much larger and hence took longer to load. For k=21, it took about 5 1/2 hours, and 6.5 GB of RAM, to index the 60,000 signatures. The end index -- which includes the signatures themselves -- is around 3.2 GB for each k-mer size. (Clearly if we're going to do this for the entire SRA we'll have to optimize things a bit.)

On the search side, though, searching takes roughly the same amount of time as before, because the indexed part of the signatures isn't much larger, and the Bloom filter internal nodes are the same size as before. But we can now search at k=21, and get better-named results than before, too.

For example, go grab the Shewanella MR-1 genome:

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/165/GCF_000146165.2_ASM14616v2/GCF_000146165.2_ASM14616v2_genomic.fna.gz > shewanella.fna.gz

Next, convert it into a signature:

sourmash compute -k 21,31 -f --name-from-first shewanella.fna.gz

and search!

sourmash sbt_search -k 21 microbes shewanella.fna.gz.sig

This yields:

# running sourmash subcommand: sbt_search
1.00 NC_004347.2 Shewanella oneidensis MR-1 chromosome, complete genome
0.16 NZ_JGVI01000001.1 Shewanella xiamenensis strain BC01 contig1, whole genome shotgun sequence
0.16 NZ_LGYY01000235.1 Shewanella sp. Sh95 contig_1, whole genome shotgun sequence
0.15 NZ_AKZL01000001.1 Shewanella sp. POL2 contig00001, whole genome shotgun sequence
0.15 NZ_JTLE01000001.1 Shewanella sp. ZOR0012 L976_1, whole genome shotgun sequence
0.09 NZ_AXZL01000001.1 Shewanella decolorationis S12 Contig1, whole genome shotgun sequence
0.09 NC_008577.1 Shewanella sp. ANA-3 chromosome 1, complete sequence
0.08 NC_008322.1 Shewanella sp. MR-7, complete genome

The updated MinHash signatures & indices are available!

Our MinHash signature collection now contains:

  1. 53865 bacteria genomes
  2. 5463 viral genomes
  3. 475 archaeal genomes
  4. 177 fungal genomes
  5. 72 protist genomes

for a total of 60,052 genomes.

You can download the various file collections here:

Hope these are useful! If there are features you want, please go ahead and file an issue; or, post a comment below.

--titus


Index building cost for k=21:

Command being timed: "/home/ubuntu/sourmash/sourmash sbt_index microbes -k 21 --traverse-directory microbe-sigs-2016-11-27/"
     User time (seconds): 18815.48
     System time (seconds): 80.81
     Percent of CPU this job got: 99%
     Elapsed (wall clock) time (h:mm:ss or m:ss): 5:15:09
     Average shared text size (kbytes): 0
     Average unshared data size (kbytes): 0
     Average stack size (kbytes): 0
     Average total size (kbytes): 0
     Maximum resident set size (kbytes): 6484264
     Average resident set size (kbytes): 0
     Major (requiring I/O) page faults: 7
     Minor (reclaiming a frame) page faults: 94887308
     Voluntary context switches: 5650
     Involuntary context switches: 27059
     Swaps: 0
     File system inputs: 150624
     File system outputs: 10366408
     Socket messages sent: 0
     Socket messages received: 0
     Signals delivered: 0
     Page size (bytes): 4096
     Exit status: 0

by C. Titus Brown at November 26, 2016 11:00 PM

November 21, 2016

Paul Ivanov

November 9th, 2016

Two weeks ago, I went down to San Luis Obispo, California for a five day Jupyter team meeting with about twenty five others. This was the first such meeting since my return after being away for two years, and I enjoyed meeting some of the "newer" faces, as well as catching up with old friends.

It was both a productive and an emotionally challenging week, as the project proceeds along at breakneck pace on some fronts yet continues to face growing pains which come from having to scale in the human dimension.

On Wednesday, November 9th, 2016, we spent a good chunk of the day at a nearby beach: chatting, decompressing, and luckily I brought my journal with me and was able to capture the poem you will find below. I intended to read it at a local open mic the same evening, but by the time I got there with a handful of fellow Jovyans for support, all of the slots were taken. On Friday, the last day of our meeting, I got the opportunity to read it to most of the larger group. Here's a recording of that reading, courtesy of Matthias Bussonnier (thanks, Matthias!).

November 9th, 2016

The lovely thing about the ocean is
that it
is
tireless 
It never stops
incessant pendulum of salty foamy slush
Periodic and chaotic
raw, serene 
Marine grandmother clock  
crashing against both pier
and rock

Statuesque encampment of abandonment
recoiling with force
and blasting forth again
No end in sight
a train forever riding forth
and back
along a line
refined yet undefined
the spirit with
which it keeps time 
in timeless unity of the moon's alignment

I. walk. forth.

Forth forward by the force
of obsolete contrition
the vision of a life forgotten
Excuses not
made real with sand, wet and compressed
beneath my heel and toes, yet reeling from
the blinding glimmer of our Sol
reflected by the glaze of distant hazy surf
upon whose shoulders foam amoebas roam

It's gone.
Tone deaf and muted by

anticipation
each coming wave
breaks up the pregnant pause
And here I am, barefoot in slacks and tie
experiencing sensations
of loss, rebirth and seldom 
kelp bulbs popping in my soul.

by Paul Ivanov at November 21, 2016 08:00 AM

November 18, 2016

Titus Brown

Efficiently searching MinHash Sketch collections

There is an update to this blog post: please see "Quickly searching all the microbial genomes, mark 2 - now with archaea, phage, fungi, and protists!"


Note: This blog post is based largely on work done by Luiz Irber. Camille Scott, Luiz Irber, Lisa Cohen, and Russell Neches all collaborated on the SBT software implementation!

Note 2: Adam Phillippy points out in the comments below that they suggested using SBTs in the mash paper, which I reviewed. Well, they were right :)

---

We've been pretty enthusiastic about MinHash Sketches over here in Davis (read here and here for background, or go look at mash directly), and I've been working a lot on applying them to metagenomes. Meanwhile, Luiz Irber has been thinking about how to build MinHash signatures for all the data.

A problem that Luiz and I both needed to solve is the question of how you efficiently search hundreds, thousands, or even millions of MinHash Sketches. I thought about this on and off for a few months but didn't come up with an obvious solution.

Luckily, Luiz is way smarter than me and quickly figured out that Sequence Bloom Trees were the right answer. Conveniently as part of my review of Solomon and Kingsford (2015) I had put together a BSD-compatible SBT implementation in Python. Even more conveniently, my students and colleagues at UC Davis fixed my somewhat broken implementation, so we had something ready to use. It apparently took Luiz around a nanosecond to write up a Sequence Bloom Tree implementation that indexed, saved, loaded, and searched MinHash sketches. (I don't want to minimize his work - that was a nanosecond on top of an awful lot of training and experience. :)

Sequence Bloom Trees can be used to search many MinHash sketches

Briefly, an SBT is a binary tree where the leaves are collections of k-mers (here, MinHash sketches) and the internal nodes are Bloom filters containing all of the k-mers in the leaves underneath them.

Here's a nice image from Luiz's notebook: here, the leaf nodes are MinHash signatures from our sea urchin RNAseq collection, and the internal nodes are khmer Nodegraph objects containing all the k-mers in the MinHashes beneath them.

These images can be very pretty for larger collections!

The basic idea is that you build the tree once, and then to search it you prune your search by skipping over internal nodes that DON'T contain k-mers of interest. As usual for this kind of search, if you search for something that is only in a few leaves, it's super efficient; if you search for something in a lot of leaves, you have to walk over lots of the tree.
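
This isn't the sourmash implementation, but the pruning logic can be sketched in a few lines of Python. Here each node just holds a plain set of hashes; in the real thing the internal nodes are Bloom filters, so membership checks can give false positives (extra tree-walking) but never false negatives (missed leaves):

from collections import namedtuple

Node = namedtuple('Node', ['name', 'hashes', 'children'])   # toy stand-in for SBT nodes

def search_sbt(node, query, threshold, results):
    # fraction of the query's hashes that could still be found below this node
    score = len(query & node.hashes) / len(query)
    if score < threshold:
        return                              # prune this whole subtree
    if not node.children:                   # leaf: an actual MinHash sketch
        results.append((score, node.name))
    else:
        for child in node.children:
            search_sbt(child, query, threshold, results)

# toy tree: two leaves under one root whose "filter" is the union of the leaves
leaf_a = Node('genome A', {1, 2, 3, 4}, [])
leaf_b = Node('genome B', {5, 6, 7, 8}, [])
root = Node('root', leaf_a.hashes | leaf_b.hashes, [leaf_a, leaf_b])

hits = []
search_sbt(root, {1, 2, 3, 9}, threshold=0.5, results=hits)
print(hits)      # [(0.75, 'genome A')]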

This idea was so obviously good that I jumped on it and integrated Luiz's SBT functionality into sourmash, our Python library for calculating and searching MinHash sketches. The pull request is still open -- more on that below -- but the PR currently adds two new functions, sbt_index and sbt_search, to index and search collections of sketches.

Using sourmash to build and search MinHash collections

This is already usable!

Starting from a blank Ubuntu 15.10 install, run:

sudo apt-get update && sudo apt-get -y install python3.5-dev \
     python3-virtualenv python3-matplotlib python3-numpy g++ make

then create a new virtualenv,

cd
python3.5 -m virtualenv env -p python3.5 --system-site-packages
. env/bin/activate

You'll need to install a few things, including a recent version of khmer:

pip install screed pytest PyYAML
pip install git+https://github.com/dib-lab/khmer.git

Next, grab the sbt_search branch of sourmash:

cd
git clone https://github.com/dib-lab/sourmash.git -b sbt_search

and then build & install sourmash:

cd sourmash && make install

Once it's installed, you can index any collection of signatures like so:

cd ~/sourmash
sourmash sbt_index urchin demo/urchin/{var,purp}*.sig

It takes me about 4 seconds to load 70-odd sketches into an SBT index named 'urchin'.

Now, search!

This sig is in the index and takes about 1.6 seconds to find:

sourmash sbt_search urchin demo/urchin/variegatus-SRR1661406.sig

Note you can adjust the search threshold, in which case the search truncates appropriately and takes about 1 second:

sourmash sbt_search urchin demo/urchin/variegatus-SRR1661406.sig --threshold=0.3

This next sig is not in the index and the search takes about 0.2 seconds (which is basically how long it takes to load the tree structure and search the tree root).

sourmash sbt_search urchin demo/urchin/leucospilota-DRR023762.sig

How well does this scale? Suppose, just hypothetically, that you had, oh, say, a thousand bacterial genome signatures lying around and you wanted to index and search them?

# download
mkdir bac
cd bac
curl -O http://teckla.idyll.org/~t/transfer/sigs1k.tar.gz
tar xzf sigs1k.tar.gz

# index
time sourmash sbt_index 1k *.sig
time sourmash sbt_search 1k GCF_001445095.1_ASM144509v1_genomic.fna.gz.sig

Here, the indexing takes about a minute, and the search takes about 5 seconds (mainly because there are a lot of closely related samples).

The data set sizes are nice and small -- the 1,000 signatures are 4 MB compressed and 12 MB uncompressed, the SBT index is about 64 MB, and this is all representing about 5 Gbp of genomic sequence. (We haven't put any time or effort into optimizing the index so things will only get smaller and faster.)

How far can we push it?

There's lots of bacterial genomes out there, eh? Be an AWFUL SHAME if someone INDEXED them all for search, wouldn't it?

Jiarong Guo, a postdoc split between my lab and Jim Tiedje's lab at MSU, helpfully downloaded 52,000 bacterial genomes from NCBI for another project. So I indexed them with sourmash.

Indexing 52,000 bacterial genomes took about 36 hours on the MSU HPC, or about 2.5 seconds per genome. This produced about 1 GB of uncompressed signature files, which in tar.gz form ends up being about 208 MB.

I loaded them into an SBT like so:

curl -O http://spacegraphcats.ucdavis.edu.s3.amazonaws.com/bacteria-sourmash-signatures-2016-11-19.tar.gz
tar xzf bacteria-sourmash-signatures-2016-11-19.tar.gz
/usr/bin/time -v sourmash sbt_index bacteria --traverse-directory bacteria-sourmash-signatures-2016-11-19

The indexing step took about 53 minutes on an m4.xlarge EC2 instance, and required 4.2 GB of memory. The resulting tree was about 4 GB in size. (Download the 800 MB tar.gz here; just untar it somewhere.)

Searching all of the bacterial genomes for matches to one genome in particular took about 3 seconds (and found 31 matches). It requires only 100 MB of RAM, because it uses on-demand loading of the tree. To try it out yourself, run:

sourmash sbt_search bacteria bacteria-sourmash-signatures-2016-11-19/GCF_000006965.1_ASM696v1_genomic.fna.gz.sig

I'm sure we can speed this all up, but I have to say that's already pretty workable :).

Again, you can download the 800 MB .tar.gz containing the SBT for all bacterial genomes here: bacteria-sourmash-sbt-2016-11-19.tar.gz.

Example use case: finding genomes close to Shewanella oneidensis MR-1

What would you use this for? Here's an example use case.

Suppose you were interested in genomes with similarity to Shewanella oneidensis MR-1.

First, go to the S. oneidensis MR-1 assembly page, click on the "Assembly:" link, and find the genome assembly .fna.gz file.

Now, go download it:

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/165/GCF_000146165.2_ASM14616v2/GCF_000146165.2_ASM14616v2_genomic.fna.gz > shewanella.fna.gz

Next, convert it into a signature:

sourmash compute -f shewanella.fna.gz

(this takes 2-3 seconds and produces shewanella.fna.gz.sig).

And, now, search with your new signature:

sourmash sbt_search bacteria shewanella.fna.gz.sig

which produces this output:

# running sourmash subcommand: sbt_search
1.00 ../GCF_000146165.2_ASM14616v2_genomic.fna.gz
0.09 ../GCF_000712635.2_SXM1.0_for_version_1_of_the_Shewanella_xiamenensis_genome_genomic.fna.gz
0.09 ../GCF_001308045.1_ASM130804v1_genomic.fna.gz
0.08 ../GCF_000282755.1_ASM28275v1_genomic.fna.gz
0.08 ../GCF_000798835.1_ZOR0012.1_genomic.fna.gz

telling us that not only is the original genome in the bacterial collection (the one with a similarity of 1!), but there are four other genomes in the collection with about 9% similarity. These are other (distant) strains of Shewanella. The reason the similarity is so small is that sourmash is by default looking at a k-mer size of 31, so we're asking how many k-mers of length 31 are in common between the two genomes.
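
For intuition, the number reported is essentially the Jaccard similarity of the two genomes' 31-mer sets, which sourmash estimates from the MinHash sketches rather than computing exactly; a toy illustration with made-up sequences:

def kmers(seq, k=31):
    # the set of all k-length substrings of a sequence
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# two made-up "genomes" that share some sequence but not all of it
genome_a = 'ACGTTGCA' * 30
genome_b = 'ACGTTGCA' * 10 + 'TTGACCGA' * 20

a, b = kmers(genome_a), kmers(genome_b)
print('shared 31-mers: {}, similarity: {:.2f}'.format(len(a & b),
                                                      len(a & b) / len(a | b)))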

With a little modification (k-mer error trimming), this same pipeline can be used on unassembled FASTQ sequence; streaming classification of FASTQ reads and metagenome taxonomy breakdown are simple extensions and are left as exercises for the reader.

What's next? What's missing?

This is all still early days; the code's not terribly well tested and a lot of polishing needs to happen. But it looks promising!

I still don't have a good sense for exactly how people are going to use MinHashes. A command line implementation is all well and good but some questions come to mind:

  • what's the right output format? Clearly a CSV output format for the searching is in order. Do people want a scripting interface, or a command line interface, or what?
  • related - what kind of structured metadata should we support in the signature files? Right now it's pretty thin, but if we do things like sketch all of the bacterial genomes and all of the SRA, we should probably make sure we put in some of the metadata :).
  • what about a tagging interface so that you can subselect types of nodes to return?

If you are a potential user, what do you want to do with large collections of MinHash sketches?


On the developer side, we need to:

  • test, refactor, and polish the SBT stuff;
  • think about how best to pick Bloom filter sizes automatically;
  • benchmark and optimize the indexing;
  • make sure that we interoperate with mash;
  • evaluate the SBT approach on 100s of thousands of signatures, instead of just 50,000.

and probably lots of things I'm forgetting...

--titus

p.s. Output of /usr/bin/time -v on indexing 52,000 bacterial genome signatures:

Command being timed: "sourmash sbt_index bacteria --traverse-directory bacteria-sourmash-signatures-2016-11-19"
User time (seconds): 3192.58
System time (seconds): 14.66
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 53:35.72
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4279056
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 8014404
Voluntary context switches: 972
Involuntary context switches: 5742
Swaps: 0
File system inputs: 0
File system outputs: 6576144
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

by C. Titus Brown at November 18, 2016 11:00 PM

Continuum Analytics news

We Are Thankful

Friday, November 18, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

It’s hard to believe it but it’s almost time to baste the turkey, mash the potatoes and take a moment to reflect on what we are thankful for this year amongst our family and friends. Good health? A job we actually enjoy? Our supportive family? While our personal reflections are of foremost importance, as a proud leader in the Open Data Science community, we’re thankful for advancements and innovations that contribute to the betterment of the world. This Thanksgiving, we give thanks to...

  1. Data. Though Big Data gave us the meat with which to collect critical information, until recently, the technology needed to make sense of the huge amount of data was either disparate or accessible only to the most technologically advanced companies in the world (translation: barely anyone). Today, we have the ability to extract actionable insights from the infinite amounts of data that literally drive the way people and businesses make decisions.

  2. Our data science teams. We’re thankful there is no “i” in team. While we may have all the data in the world available to us, without adding the element of intelligent human intuition, it would be devoid of the endless value it provides. Our strong, versatile team members––including data scientists, business analysts, data engineers, devops and developers––are what gets us up in the morning and out the door to work. Being a part of this tight-knit community that offers immense support makes us grateful for the opportunity to do what we do.

  3. New, innovative ideas. We keep our fingers on the pulse of enterprise happenings. Our customers afford us the opportunity to contribute to incredible, previously impossible tech breakthroughs. We’re thankful for the ability to exchange ideas with colleagues and constantly stand on the edge of change. 

  4. The opportunity to help others change the world. From combatting rare genetic diseases and eradicating human trafficking to predicting the effects of public policy, we’re thankful for the opportunity to work with companies who are using Anaconda to bring to life amazing new solutions that truly make a difference in the world. They keep us inspired and help to fuel the seemingly endless innovation made possible by the Open Data Science community. 

  5. The Anaconda community. Last but not least, we are thankful for the robust, rapidly growing Anaconda community that keeps us connected with other data science teams around the globe. Collaboration is key. Helping others discover, analyze and learn by connecting curiosity and experience is one of our main passions. We are grateful for the wonderment of innovation we see passing through on a daily basis. 

As the late, great Arthur C. Nielsen once said, “the price of light is less than the cost of darkness.” We agree.

Happy Thanksgiving!

by swebster at November 18, 2016 03:25 PM

November 17, 2016

William Stein

RethinkDB, SageMath, Andreessen-Horowitz, Basecamp and Open Source Software

RethinkDB and sustainable business models

Three weeks ago, I spent the evening of Sept 12, 2016 with Daniel Mewes, who is the lead engineer of RethinkDB (an open source database). I was also supposed to meet with the co-founders, Slava and Michael, but they were too busy fundraising and couldn't join us. I pestered Daniel the whole evening about what RethinkDB's business model actually was. Yesterday, on October 6, 2016, RethinkDB shut down.

I met with some RethinkDB devs because an investor who runs a fund at the VC firm Andreessen-Horowitz (A16Z) had kindly invited me there to explain my commercialization plans for SageMath, Inc., and RethinkDB is one of the companies that A16Z has invested in. At first, I wasn't going to take the meeting with A16Z, since I have never met with Venture Capitalists before, and do not intend to raise VC. However, some of my advisors convinced me that VC's can be very helpful even if you never intend to take their investment, so I accepted the meeting.

In the first draft of my slides for my presentation to A16Z, I had a slide with the question: "Why do you fund open source companies like RethinkDB and CoreOS, which have no clear (to me) business model? Is it out of some sense of charity to support the open source software ecosystem?" After talking with people at Google and the RethinkDB devs, I removed that slide, since charity is clearly not the answer (I don't know if there is a better answer than "by accident").

I have used RethinkDB intensely for nearly two years, and I might be their biggest user in some sense. My product SageMathCloud, which provides web-based course management, Python, R, LaTeX, etc., uses RethinkDB for everything. For example, every single time you enter some text in a real-time synchronized document, a RethinkDB table gets an entry inserted in it. I have RethinkDB tables with nearly 100 million records. I gave a talk at a RethinkDB meetup, filed numerous bug reports, and have been described by them as "their most unlucky user". In short, in 2015 I bet big on RethinkDB, just like I bet big on Python back in 2004 when starting SageMath. And when visiting the RethinkDB devs in San Francisco (this year and also last year), I have said to them many times "I have a very strong vested interest in you guys not failing." My company SageMath, Inc. also pays RethinkDB for a support contract.

Sustainable business models were very much on my mind, because of my upcoming meeting at A16Z and the upcoming board meeting for my company. SageMath, Inc.'s business model involves making money from subscriptions to SageMathCloud (which is hosted on Google Cloud Platform); of course, there are tons of details about exactly how our business works, which we've been refining based on customer feedback. Though absolutely all of our software is open source, what we sell is convenience, ease of access and use, and we provide value by hosting hundreds of courses on shared infrastructure, so it is much cheaper and easier for universities to pay us rather than hosting our software themselves (which is also fairly easy). So that's our business model, and I would argue that it is working; at least our MRR is steadily increasing and is more than twice our hosting costs (we are not cash flow positive yet due to developer costs).

So far as I can determine, the business model of RethinkDB was to make money in the following ways: 1. Sell support contracts to companies (I bought one). 2. Sell a closed-source proprietary version of RethinkDB with extra features that were of interest to enterprise customers (they had a handful of such features, e.g., audit logs for queries). 3. Horizon would become a cloud-hosted competitor to Firebase, with the unique advantages that users would have the option to migrate from the cloud to their own private data center, plus more customizability. This strategy depends on a trend for users to migrate away from the cloud, rather than to it, which some people at RethinkDB thought was a real trend (I disagree).

I don't know of anything else they were seriously trying. The closed-source proprietary version of RethinkDB also seemed like a very recent, last-ditch effort that had only just begun; perhaps it directly contradicted a desire to be a 100% open source company?

With enough users, it's easier to make certain business models work. I suspect RethinkDB does not have a lot of real users. Number of users tends to be roughly linearly related to mailing list traffic, and the RethinkDB mailing list has an order of magnitude less traffic compared to the SageMath mailing lists, and SageMath has around 50,000 users. RethinkDB wasn't even advertised to be production ready until just over a year ago, so even they were telling people not to use it seriously until relatively recently. The adoption cycle for database technology is slow -- people wisely wait for Aphyr's tests, benchmarks comparing with similar technology, etc. I was unusual in that I chose RethinkDB much earlier than most people would, since I love the design of RethinkDB so much. It's the first database I loved, having seen a lot over many decades.

Conclusion: RethinkDB wasn't a real business, and wouldn't become one without year(s) more runway.

I'm also very worried about the future of RethinkDB as an open source project. I don't know if the developers have experience growing an open source community of volunteers; it's incredibly hard, and it's unclear they are even going to be involved. At a bare minimum, I think they must switch to a very liberal license (Apache instead of AGPL), and make everything (e.g., automated testing code, documentation, etc.) open source. It's insanely hard getting any support for open source infrastructure work -- support mostly comes from small government grants (for research software) or contributions from employees at companies (that use the software). Relicensing in a company-friendly way is thus critical.

Company Incentives

Companies can be incentivized in various ways, including:
  • to get to the next round of VC funding
  • to be a sustainable profitable business by making more money from customers than they spend, or
  • to grow to have a very large number of users and somehow pivot to making money later.
When founding a company, you have a chance to choose how your company will be incentivized based on how much risk you are willing to take, the resources you have, the sort of business you are building, the current state of the market, and your model of what will happen in the future.

For me, SageMath is an open source project I started in 2004, and I'm in it for the long haul. I will make the business I'm building around SageMathCloud succeed, or I will die trying -- therefore I have very, very little tolerance for risk. Failure is not an option, and I am not looking for an exit. For me, the strategy that best matches my values is to incentivize my company to build a profitable business, since that is most likely to survive, and also to give us the freedom to maintain our long-term support for open source and pure mathematics software.

Thus for my company, neither optimizing for raising the next round of VC nor growing at all costs makes sense. You would be surprised how many people think I'm completely wrong for concluding this.

Andreessen-Horowitz

I spent the evening with RethinkDB developers, which scared the hell out of me regarding their business prospects. They are probably the most open source friendly VC-funded company I know of, and they had given me hope that it is possible to build a successful VC-funded tech startup around open source. I prepared for my meeting at A16Z, and deleted my slide about RethinkDB.

I arrived at A16Z, and was greeted by incredibly friendly people. I was a little shocked when I saw their nuclear bomb art in the entry room, then went to a nice little office to wait. The meeting time arrived, and we went over my slides, and I explained my business model, goals, etc. They said there was no place for A16Z to invest directly in what I was planning to do, since I was very explicit that I'm not looking for an exit, and my plan about how big I wanted the company to grow in the next 5 years wasn't sufficiently ambitious. They were also worried about how small the total market cap of Mathematica and Matlab is (only a few hundred million?!). However, they generously and repeatedly offered to introduce me to more potential angel investors.

We argued about the value of outside investment to the company I am trying to build. I had hoped to get some insight or introductions related to their portfolio companies that are of interest to my company (e.g., Udacity, GitHub), but they deflected all such questions. There was also some confusion, since I showed them slides about what I'm doing, but was quite clear that I was not asking for money, which is not what they are used to. In any case, I greatly appreciated the meeting, and it really made me think. They were crystal clear that they believed I was completely wrong to not be trying to do everything possible to raise investor money.

Basecamp

During the first year of SageMath, Inc., I was planning to raise a round of VC, and was doing everything to prepare for that. I then read some of DHH's books about Basecamp, and realized many of those arguments applied to my situation, given my values, and -- after a lot of reflection -- I changed my mind. I think Basecamp itself is mostly closed source, so they may have an advantage in building a business. SageMathCloud (and SageMath) really are 100% open source, and building a completely open source business might be harder. Our open source IP is considered worthless by investors. Witness: RethinkDB just shut down and Stripe hired just the engineers -- all the IP, customers, etc., of RethinkDB were evidently considered worthless by investors.

The day after the A16Z meeting, I met with my board, which went well (we discussed a huge range of topics over several hours). Some of the board members also tried hard to convince me that I should raise a lot more investor money.

Will Poole: you're doomed

Two weeks ago I met with Will Poole, who is a friend of a friend, and we talked about my company and plans. I described what I was doing, that everything was open source, that I was incentivizing the company around building a business rather than raising investor money. He listened and asked a lot of follow up questions, making it very clear he understands building a company very, very well.

His feedback was discouraging -- I said "So, you're saying that I'm basically doomed." He responded that I wasn't doomed, but might be able to run a small "lifestyle business" at best via my approach, and that there was absolutely no way that what I was doing would have any impact or pay for my kids' college tuition. If this was feedback from some random person, it might not have been so disturbing, but Will Poole joined Microsoft in 1996, where he went on to run Microsoft's multibillion-dollar Windows business. Will Poole is like a retired four-star general who executed a successful campaign to conquer the world; he's been around the block a few times. He tried pretty hard to convince me to make as much of SageMathCloud closed source as possible, and to get my users to agree that content they create in SMC is something I can reuse however I want. I felt pretty shaken and convinced that I needed to close parts of SMC, e.g., the new Kubernetes-based backend that we spent all summer implementing. (Will: if you read this, though our discussion was really disturbing to me, I really appreciate it and respect you.)

My friend, who introduced me to Will Poole, introduced me to some other people and described me as that really frustrating sort of entrepreneur who doesn't want investor money. He then remarked that one of the things he learned in business school, which really surprised him, was that it is good for a company to have a lot of debt. I gave him a funny look, and he added "of course, I've never run a company".

I left that meeting with Will convinced that I would close source parts of SageMathCloud, to make things much more defensible. However, after thinking things through for several days, and talking this over with other people involved in the company, I have chosen not to close anything. This just makes our job harder. Way harder. But I'm not going to make any decisions based purely on fear. I don't care what anybody says, I do not think it is impossible to build an open source business (I think Wordpress is an example), and I do not need to raise VC.

Hacker News Discussion: https://news.ycombinator.com/item?id=12663599

Chinese version: http://www.infoq.com/cn/news/2016/10/Reflection-sustainable-profit-co

by William Stein (noreply@blogger.com) at November 17, 2016 03:57 PM

Continuum Analytics news

DataCamp’s Online Platform Fuels the Future of Data Science, Powered By Anaconda

Thursday, November 17, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

There’s no doubt that the role of ‘data scientist’ is nearing a fever pitch as companies become increasingly data-driven. In fact, the position ranked number one on Glassdoor’s top jobs in 2016, and in 2012, HBR dubbed it “The Sexiest Job of the 21st Century.” Yet, while more organizations are adopting data science, there exists a shortage of people with the right training and skills to fill the role. This challenge is being met by our newest partner, DataCamp, a data science learning platform focused on cultivating the next generation of data scientists. 

DataCamp’s interactive learning environment today launched the first of four Anaconda-based courses taught by Anaconda experts—Interactive Visualization with Bokeh. Our experts—both in academia and in the data science industry—provide users with maximum insight. While we’re proud to partner with companies representing various verticals, it is especially thrilling to contribute toward the creation of new data scientists, including citizen data scientists, both of which are extremely valued in the business community. 

Research finds that 88 percent of professionals say online learning is more helpful than in-person training; DataCamp has already trained over 620,000 aspiring data scientists. Of the four new Anaconda-based courses, two are interactive trainings. This allows DataCamp to offer students the opportunity to benefit from unprecedented breadth and depth of online learning, leading to highly skilled, next-gen data scientists.  

The data science revolution is growing by the day and DataCamp is poised to meet the challenge of scarcity in the market. By offering courses tailored to an individual’s unique pace, needs and expertise, DataCamp’s courses are generating more individuals with the skills to boast ‘the sexiest job of the 21st century.’

Interested in learning more or signing up for a course? Check out DataCamp’s blog.

by swebster at November 17, 2016 01:37 AM

November 15, 2016

Titus Brown

You can make GitHub repositories archival by using Zenodo or Figshare!

Update: Zenodo will remove content upon request by the owner, and hence is not suitable for long-term archiving of published code and data. Please see my comment at the bottom (which is just a quote from an e-mail from a journal editor), and especially see "Ownership" and "Withdrawal" under Zenodo policies. I agree with the journal's interpretation of these policies.


Bioinformatics researchers are increasingly pointing reviewers and readers at their GitHub repositories in the Methods sections of their papers. Great! Making the scripts and source code for methods available via a public version control system is a vast improvement over the methods of yore ("e-mail me for the scripts" or "here's a tarball that will go away in 6 months").

A common point of concern, however, is that GitHub repositories are not archival. That is, you can modify, rewrite, delete, or otherwise irreversibly mess with the contents of a git repository. And, of course, GitHub could go the way of Sourceforge and Google Code at any point.

So GitHub is not a solution to the problem of making scripts and software available as part of the permanent record of a publication.

But! Never fear! The folk at Zenodo and Mozilla Science Lab (in collaboration with Figshare) have solutions for you!

I'll tell you about the Zenodo solution, because that's the one we use, but the Figshare approach should work as well.

How Zenodo works

Briefly, at Zenodo you can set up a connection between Zenodo and GitHub where Zenodo watches your repository and produces a tarball and a DOI every time you cut a release.

For example, see https://zenodo.org/record/31258, which archives https://github.com/dib-lab/khmer/releases/tag/v2.0 and has the DOI http://doi.org/10.5281/zenodo.31258.

When we release khmer 2.1 (soon!), Zenodo will automatically detect the release, pull down the tar file of the repo at that version, and produce a new DOI.

The DOI and tarball will then be independent of GitHub and I cannot edit, modify or delete the contents of the Zenodo-produced archive from that point forward.

Yes, automatically. All of this will be done automatically. We just have to make a release.
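
Concretely, "making a release" is just the usual git tag / GitHub release dance; everything after that is Zenodo's job. A minimal sketch, with a made-up version number:

# tag the commit you want archived and push the tag to GitHub;
# then turn the tag into a release on GitHub (web UI or API).
# With the Zenodo-GitHub integration enabled for the repository,
# Zenodo grabs the release tarball and mints a DOI automatically.
git tag -a v2.1 -m "khmer v2.1"
git push origin v2.1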

Yes, the DOI is permanent and Zenodo is archival!

Zenodo is an open-access archive that is recommended by Peter Suber (as is Figshare).

While I cannot quickly find a good high level summary of how DOIs and archiving and LOCKSS/CLOCKSS all work together, here is what I understand to be the case:

  • Digital object identifiers are permanent and persistent. (See Wikipedia on DOIs)

  • Zenodo policies say:

    "Retention period

    Items will be retained for the lifetime of the repository. This is currently the lifetime of the host laboratory CERN, which currently has an experimental programme defined for the next 20 years at least."

So I think this is at least as good as any other archival solution I've found.

Why is this better than journal-specific archives and supplemental data?

Some journals request or require that you upload code and data to their own internal archive. This is often done in painful formats like PDF or XLSX, which may guarantee that a human can look at the files but do little to encourage reuse.

At least for source code and smallish data sets, having the code and data available in a version controlled repository is far superior. This is (hopefully :) the place where the code and data is actually being used by the original researchers, so having it kept in that format can only lower barriers to reuse.

And, just as importantly, getting a DOI for code and data means that people can be more granular in their citation and reference sections - they can cite the specific software they're referencing, they can point at specific versions, and they can indicate exactly which data set they're working with. This prevents readers from going down the citation network rabbit hole where they have to read the cited paper in order to figure out what data set or code is being reused and how it differs from the remixed version.

Bonus: Why is the combination of GitHub/Zenodo/DOI better than an institutional repository?

I've had a few discussions with librarians who seem inclined to point researchers at their own institutional repositories for archiving code and data. Personally, I think having GitHub and Zenodo do all of this automatically for me is the perfect solution:

  • quick and easy to configure (it takes about 3 minutes);
  • polished and easy user interface;
  • integrated with our daily workflow (GitHub);
  • completely automatic;
  • independent of whatever institution happens to be employing me today;

so I see no reason to switch to using anything else unless it solves even more problems for me :). I'd love to hear contrasting viewpoints, though!

thanks!

--titus

by C. Titus Brown at November 15, 2016 11:00 PM

November 14, 2016

Continuum Analytics news

Can New Technologies Help Stop Crime In Its Tracks?

Tuesday, November 15, 2016
Peter Wang
Chief Technology Officer & Co-Founder
Continuum Analytics

Earlier this week, I shared my thoughts on crime prevention through technology with IDG Connect reporter Bianca Wright. Take a look and feel free to share your opinions in the comment section below (edited for length and clarity)!

Researchers from the University of Cardiff have been awarded more than $800,000 by the U.S. Department of Justice to develop a pre-crime detection system that uses social media. How would such technology work? Are there other examples of technologies being used in this way?

The particular award for the University of Cardiff was to fight hate crime, and this is an important distinction. Taking a data-driven "predictive policing" approach to fighting general crime is very difficult because crime itself is so varied, and the dimensions of each type of crime are so complex. However, for hate crimes in particular, social media could be a particularly useful data stream, because it yields insight into a variable that is otherwise extremely difficult to assess: human sentiment. The general mechanism of the system would be to look for patterns and correlations between all the dimensions of social media: text in posts and tweets, captions on images, the images themselves, even which people, topics, organizations someone subscribes to. Metadata on these would also feed into the data modeling; the timestamps and locations of their posts and social media activity can be used to infer where they live, their income, level of education, etc.

Social media is most powerful when the additional information streams it generates are paired up with existing information about someone. Sometimes unexpected correlations emerge. For instance, could it be the case that among those expressing hate speech in their social media feeds, the people with a criminal background are actually less likely to engage in hate crime, because they already have a rap sheet and know that law enforcement is aware of them, and, instead, most hate crimes are committed by first-time offenders? Ultimately, the hope of social media data science is to be able to get real insight into questions like these, which then can suggest effective remediations and preventative measures.

How viable is such technology in predicting and preventing crime? Is the amount of big data available to law enforcement enough to help them predict a crime before it happens?

It's hard to say in general. It seems like most common sorts of physical crime are deeply correlated to socioeconomic, geographic and demographic factors. These are things on which we have reasonably large amounts of data.  A challenge there is that many of those datasets are government data, stored in arcane locations and formats across a variety of different bureaucracies and difficult to synthesize. However, past evidence shows if you simply integrate the data that governments already possess, you can get some incredible insights. For instance, Jeff Chen's work with the Fire Department of New York shows that they can predict which areas have buildings that are more likely to catch on fire, and take preventative actions. 

Ironically, hate crimes may be particularly difficult to actually tackle with data science, because they are a form of domestic terrorism, with highly asymmetric dynamics between perpetrator and potential victims. One possible result of the University of Cardiff grant is that we discover that data science and social media can reveal elevated risk of hate crimes in certain areas, but offer insufficient information for taking any kind of preventative or remediative measures.

What are the challenges to such technologies? How do you see this developing in the future?

I think that the breakthroughs in the field of machine learning can lead to better and smarter policy across the board: from crime prevention to international trade to reducing terrorism and extremism. The biggest challenge it faces is that its real technological breakthroughs are mostly mathematical in nature, and not something "concrete" that regular people can readily understand. Some technology breakthroughs are extremely visceral: electrical cars that go from 0-60 in 3 seconds, spacecraft that beam down breathtaking images, and the like. We even have computerized devices that talk to us in natural language. The average person can "get" that these are advances.

Advances in machine learning and data science can deeply improve human civilization, by helping us make better policy, allocate resources better, reduce waste and improve quality of life. 

 

See the full article in IDG Connect, here

 

by swebster at November 14, 2016 05:35 PM

November 11, 2016

Continuum Analytics news

AnacondaCON 2017: Join us for our Inaugural User Conference

Monday, November 14, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

At Continuum Analytics, we believe our most valuable asset is our community. From data scientists to industry experts, the collective power held by our community to change the world is limitless. This is why we are thrilled to introduce AnacondaCON, our inaugural user conference, and to announce that registration is now open.

The data scientist community is growing—topping the list as the best job title in the U.S. in 2016—and Open Data Science is affecting every industry at every level. As our ranks expand, now is the perfect time to bring together the brightest minds to exchange ideas, teach strategies and discuss what’s next in this industry that knows no bounds. AnacondaCON will offer an ideal forum to gather the superheroes of the data science world to discuss what works, what doesn’t, and where Anaconda is going next. Our sponsors, DataCamp, Intel, AMD and Gurobi, will be available to share with you their latest innovations and how we collaborate together. Attendees will walk away with the knowledge and connections needed to take their Open Data Science projects to the next level.

Registration opens today. We hope you’ll join us.

Who: The brightest minds in Open Data Science
What: AnacondaCON 2017
When: February 7-9, 2017
Where: JW Marriott Austin, Austin, TX 
Register here: https://anacondacon17.io/register/

Book your room now!

Join the conversation: #AnacondaCON #OpenDataScienceMeans

 

by swebster at November 11, 2016 06:32 PM

November 10, 2016

Gaël Varoquaux

Data science instrumenting social media for advertising is responsible for today's politics

To my friends developing data science for the social media, marketing, and advertising industries,

It is time to accept that we have our share of responsibility in the outcome of the US elections and the vote on Brexit. We are not creating the society that we would like. Facebook, Twitter, targeted advertising, and customer profiling are harmful to truth and have helped bring about Brexit and the election of Trump. Journalism has been replaced by social media and commercial content tailored to influence the reader: your own personal distorted reality.

There are many deep reasons why Trump won the election. Here, as a data scientist, I want to talk about the factors created by data science.


Rumor replaces truth: the way we, data-miners, aggregate and recommend content is based on its popularity, on readership statistics. In no way is it based on the truthfulness of the content. As a result, Facebook, Twitter, Medium, and the like amplify rumors and sensational news, with no reality check [1].

This is nothing new: clickbait and tabloids build upon it. However, social networking and active recommendation make things significantly worse. Indeed, birds of a feather flock together, reinforcing their own biases. We receive filtered information: have you noticed that every single argument you heard was overwhelmingly against (or in favor of) Brexit? To make matters even worse, our brain loves it: to resolve cognitive dissonance we avoid information that contradicts our biases [2].

Note

We all believe more information when it confirms our biases

Gossiping, rumors, and propaganda have always made sane decisions difficult. The filter bubble, Facebook's algorithmically tuned rose-colored glasses, escalates this problem into a major dysfunction of our society. It amplifies messy and false information better than anything before. Soviet-style propaganda builds on carefully-crafted lies; post-truth politics builds on a flood of information that does not even pretend to be credible in the long run.


Active distortion of reality: amplifying biases to the point that they drown truth is bad. Social networks actually do worse: they give tools for active manipulation of our perception of the world. Indeed, the revenue of today’s Internet information engines comes from advertising. For this purpose they are designed to learn as much as possible about the reader. Then they sell this information bundled with a slot where the buyer can insert the optimal message to influence the reader.

The Trump campaign used targeted Facebook ads presenting to unenthusiastic democrats information about Clinton tuned to discourage them from voting. For instance, portraying her as racist to black voters.

Information manipulation works. The Trump campaign was a smear campaign aimed at suppressing votes for his opponent. Releasing negative information on Clinton did affect her supporters' allegiance.

Tech created the perfect mind-control tool, with an eye on sales revenue. Someone used it for politics.


The tech industry is mostly socially liberal and highly educated, wishing the best for society. But it must accept its share of the blame. My friends improving machine learning for customer profiling and ad placement, you are helping to shape a world of lies and deception. I will not blame you for accepting this money: if it were not for you, others would do it. But we should all be thinking about how we can improve this system. How do we use data science to build a world based on objectivity, transparency, and truth, rather than Internet-based marketing?


Digression: other social issues of data science

  • The tech industry is increasing inequalities, making the rich richer and leaving the poor behind. Data science, with its ability to automate actions and wield large sources of information, is a major contributor to these inequalities.
  • Internet-based marketing is building a huge spying machine that infers as much as possible about the user. The Trump campaign was able to target a specific population, black voters leaning towards democrats. What if this data was used for direct executive action? This could come quicker than we think, given how intelligence agencies tap into social media.

I preferred to focus this post on how data-science can help distort truth. Indeed, it is a problem too often ignored by data scientists who like to think that they are empowering users.

In memory of Aaron Swartz, who fought centralized power on the Internet.


[1] Facebook was until recently using human curators, but fired them, leading to a loss of control over veracity.
[2] It is a well-known and well-studied cognitive bias that individuals strive to reduce cognitive dissonance and actively avoid situations and information likely to increase it.

by Gaël Varoquaux at November 10, 2016 11:00 PM

November 08, 2016

Continuum Analytics news

AnacondaCON 2017: Continuum Analytics Opens Registration for First User Conference

Monday, November 14, 2016

Two-day event will bring together thought leaders in the Open Data Science community to learn, debate and socialize in an atmosphere of collaboration and innovation

AUSTIN, TX—November 9, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that registration is open for AnacondaCON 2017, taking place February 7-9, 2017 in Austin, Texas. The inaugural Anaconda user conference is a two-day event that will bring together innovative enterprises on the journey to Open Data Science. These companies have recognized the value of capitalizing on their growing treasure trove of data assets to create compelling business value for their enterprise. Register here.

In addition to enterprise users, AnacondaCON will offer the Open Data Science community––from foundational contributors to thought leaders––an opportunity to engage in breakout sessions, hear from industry experts, learn about case studies from subject matter experts and choose from specialized and focused sessions based on topic areas of interest. Sessions will prove educational, informative and thought-provoking—attendees will walk away with the knowledge and connections needed to move their Open Data Science initiatives forward.

Come hear keynote speakers, including Continuum Analytics CEO & Co-Founder Travis Oliphant and Co-Founder & CTO Peter Wang. Guest keynotes will be announced shortly and additional speakers are being added to the agenda regularly; check here for updates.

WHO: Continuum Analytics

WHAT: Registration for AnacondaCON 2017. Early bird registration prices until December 31, 2016.

All ticket prices are 3-day passes and include access to all of AnacondaCON, including sessions, tutorials, keynotes, the opening reception and the off-site party.

WHEN: February 7-9, 2017

WHERE: JW Marriott Austin, 110 E. 2nd St. Austin, Texas, 78701

Continuum Analytics has secured a special room rate for AnacondaCON attendees. If you are interested in attending and booking a room at the special conference rate available until January 17, 2017, click here or call the JW Marriott Austin at (844) 473-3959 and reference the room block “AnacondaCON.”

REGISTER: HERE

###

About Anaconda Powered by Continuum Analytics

Continuum Analytics’ Anaconda is the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world. Anaconda is trusted by leading businesses worldwide and across industries––financial services, government, health and life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda helps data science teams discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage Open Data Science environments and harness the power of the latest open source analytic and technology innovations. Visit http://www.continuum.io.

###

Media Contact:

Jill Rosenthal

InkHouse

anaconda@inkhouse.com

 

by swebster at November 08, 2016 10:39 PM

November 03, 2016

Enthought

Scientists Use Enthought’s Virtual Core Software to Study Asteroid Impact

Chicxulub Impact Crater Expedition Recovers Core to Further Discovery on the Impact on Life and the Historical Dinosaur Extinction

From April to May 2016, a team of international scientists drilled into the site of an asteroid impact, known as the Chicxulub Impact Crater, which occurred 66 million years ago. The crater is buried several hundred meters below the surface in the Yucatán region of Mexico. Until that time, dinosaurs and marine reptiles dominated the world, but the series of catastrophic events that followed the impact caused the extinction of all large animals, leading to the rise of mammals and the evolution of mankind. This joint expedition, organized by the International Ocean Discovery Program (IODP) and the International Continental Scientific Drilling Program (ICDP), recovered a nearly complete set of rock cores from 506 to 1335 meters below the modern-day seafloor. These cores are now being studied in detail by an international team of scientists to understand the effects of the impact on life and as a case study of how impacts affect planets.

CT Scans of Cores Provide Deeper Insight Into Core Description and Analysis

Before being shipped to Germany (where the onshore science party took place from September to October 2016), the cores were sent to Houston, TX for CT scanning and imaging. The scanning was done at Weatherford Labs, who performed a high resolution dual energy scan on the entire core.  Dual energy scanning utilizes x-rays at two different energy levels. This provides the information necessary to calculate the bulk density and effective atomic numbers of the core. Enthought processed the raw CT data, and provided cleaned CT data along with density and effective atomic number images.  The expedition scientists were able to use these images to assist with core description and analysis.

CT Scans of Chicxulub Crater Core Samples

Digital images of the CT scans of the recovered core are displayed side by side with the physical cores for analysis

chicxulub-virtual-core-scan-core-detail

Information not evident in physical observation (bottom, core photograph) can be observed in CT scans (top)

These images are helping scientists understand the processes that occurred during the impact, how the rock was damaged, and how the properties of the rock were affected.  From analysis of images, well log data and laboratory tests it appears that the impact had a permanent effect on rock properties such as density, and the shattered granite in the core is yielding new insights into the mechanics of large impacts.

Virtual Core Provides Co-Visualization of CT Data with Well Log Data, Borehole Images, and Line Scan Photographs for Detailed Interrogation

Enthought’s Virtual Core software was used by the expedition scientists to integrate the CT data along with well log data, borehole images and line scan photographs.  This gave the scientists access to high resolution 2D and 3D images of the core, and allowed them to quickly interrogate regions in more detail when questions arose. Virtual Core also provides machine learning feature detection intelligence and visualization capabilities for detailed insight into the composition and structure of the core, which has proved to be a valuable tool both during the onshore science party and ongoing studies of the Chicxulub core.

chicxulub-virtual-core-digital-co-visualization

Enthought’s Virtual Core software was used by the expedition scientists to visualize the CT data alongside well log data, borehole images and line scan photographs.

Related Articles

Drilling to Doomsday
Discover Magazine, October 27, 2016

Chicxulub ‘dinosaur crater’ investigation begins in earnest
BBC News, October 11, 2016

How CT scans help Chicxulub Crater scientists
Integrated Ocean Drilling Program (IODP) Chicxulub Impact Crater Expedition Blog, October 3, 2016

Chicxulub ‘dinosaur’ crater drill project declared a success
BBC Science, May 25, 2016

Scientists hit pay dirt in drilling of dinosaur-killing impact crater
Science Magazine, May 3, 2016

Scientists gear up to drill into ‘ground zero’ of the impact that killed the dinosaurs
Science Magazine, March 3, 2016

Texas scientists probe crater they think led to dinosaur doomsday
Austin American-Statesman, June 2, 2016

The post Scientists Use Enthought’s Virtual Core Software to Study Asteroid Impact appeared first on Enthought Blog.

by Brendon Hall at November 03, 2016 07:00 PM

October 31, 2016

Continuum Analytics news

Another Great Year At Strata + Hadoop 2016

Tuesday, November 1, 2016
Peter Wang
Co-Founder & Chief Technology Officer
Continuum Analytics

The Anaconda team had a blast at this year’s Strata + Hadoop World in NYC. We’re really excited about the interest and discussions around Open Data Science! For those of you that weren’t able to attend, here’s a quick recap of what we presented. 

Three Anaconda team members - including myself - took the stage at Strata to chat all things Python, Anaconda and Open Data Science on Hadoop. For my presentation, I wanted to hit home the idea that Open Data Science is the foundation of modernization for the enterprise, and that open source communities can create powerful technologies for data science. I also touched upon the core challenge of open data science in the enterprise. Many people think data science is the same thing as software development, but that’s a very common misconception. Businesses tend to misinterpret the Python language and pigeonhole it, saying it competes with Java, C#, Ruby, R, SAS, Matlab, SPSS, or BI systems - which is not true. Done right, Python can be an extremely powerful force across any given business.

I then jumped into an overview of Anaconda for Data Science in Hadoop, highlighting how Modern Data Science teams use Anaconda to drive more intelligent decisions - from the business analyst to the data scientist to the developer to the data engineer to DevOps. Anaconda truly powers these teams and gives businesses the superpowers required to change the world. To date, Anaconda has seen nearly 8 million downloads, and that number is growing every day. You can see my slides from this talk, ‘Open Data Science on Hadoop in the Enterprise,’ here

We also ran some really awesome demos at the Anaconda booth, including examples of Anaconda Enterprise, Dask, datashader, and more. One of our most popular demos was Ian Stokes-Rees’ demonstration of SAS and Open Data Science using Jupyter and Anaconda. For many enterprises that currently use SAS, there is not a clear path to Open Data Science. To embark on the journey to Open Data Science, enterprises need an easy on-boarding path for their team to use SAS in combination with Open Data Science. Ian showcased why Anaconda is an ideal platform that embraces both open source and legacy languages, including Base SAS, so that enterprise teams can bridge the gap by leveraging their current SAS expertise while ramping up on Open Data Science. 

You can see his notebook from the demo here, and you can download his newest episode of Fresh Data Science discussing the use of Anaconda and Jupyter notebooks with SAS and Python here

In addition to my talk and our awesome in-booth demos, two of our software engineers, Bryan Van de Ven and Sarah Bird, demonstrated how to build intelligent apps in a week with Bokeh, Python and optimization. Attendees of the sessions learned how to create standard and custom visualizations using Bokeh, how to make them interactive, how to connect that interactivity with their Python stacks, and how to share their new interactive data applications. Congratulations to both on a job well done. 

All that being said, we can’t wait to see you at next year’s Strata + Hadoop! Check back here for details on more upcoming conferences we’re attending.

 

 

 

 

by swebster at October 31, 2016 07:17 PM

October 28, 2016

Continuum Analytics news

Self-Service Open Data Science: Custom Anaconda Parcels for Cloudera CDH

Monday, October 31, 2016
Kristopher Overholt
Continuum Analytics

Earlier this year, as part of our partnership with Cloudera, we announced a freely available Anaconda parcel for Cloudera CDH based on Python 2.7 and the Anaconda Distribution. The Anaconda parcel has been very well received by both Anaconda and Cloudera users, because it makes it easier for data scientists and analysts to use the Anaconda libraries they know and love with Hadoop and Spark on Cloudera CDH.

Since then, we’ve had significant interest from Anaconda Enterprise users asking how they can create and use custom Anaconda parcels with Cloudera CDH. Our users want to deploy Anaconda with different versions of Python and custom conda packages that are not included in the freely available Anaconda parcel. Using parcels to manage multiple Anaconda installations across a Cloudera CDH cluster is convenient, because it works natively with Cloudera Manager without the need to install additional software or services on the cluster nodes.

We’re excited to announce a new self-service feature of the Anaconda platform that can be used to generate custom Anaconda parcels and installers. This functionality is now available in the Anaconda platform as part of the Anaconda Scale and Anaconda Repository platform components.

Deploying multiple custom versions of Anaconda on a Cloudera CDH cluster with Hadoop and Spark has never been easier! Let’s take a closer look at how we can create and install a custom Anaconda parcel using Anaconda Repository and Cloudera Manager.

Generating Custom Anaconda Parcels and Installers

For this example, we’ve installed Anaconda Repository (which is part of the Anaconda Enterprise subscription) and created an on-premises mirror of more than 600 conda packages that are available in the Anaconda distribution. We’ve also installed Cloudera CDH 5.8.2 with Spark on a cluster.

In Anaconda Repository, we can see a new feature for Installers, which can be used to generate custom Anaconda parcels for Cloudera CDH or standalone Anaconda installers.

The Installers page gives an overview of how to get started with custom Anaconda installers and parcels, and it describes how we can create custom Anaconda parcels that are served directly from Anaconda Repository from a Remote Parcel Repository URL.

After choosing Create new installer, we can then specify packages to include in our custom Anaconda parcel, which we’ve named anaconda_plus.

First, we specify the latest version of Anaconda (4.2.0) and Python 2.7. We’ve added the anaconda package to include all of the conda packages that are included by default in the Anaconda installer. Specifying the anaconda package is optional, but it’s a great way to supercharge your custom Anaconda parcel with more than 200 of the most popular Open Data Science packages, including NumPy, Pandas, SciPy, matplotlib, scikit-learn and more.

We also specified additional conda packages to include in the custom Anaconda parcel, including libraries for natural language processing, visualization, data I/O and other data analytics libraries: azure, bcolz, boto3, datashader, distributed, gensim, hdfs3, holoviews, impyla, seaborn, spacy, tensorflow and xarray.

We also could have included conda packages from other channels in our on-premise installation of Anaconda Repository, including community-built packages from conda-forge or other custom-built conda packages from different users within our organization.

After creating a custom Anaconda parcel, we see a list of parcel files that were generated for all of the Linux distributions supported by Cloudera Manager.

Additionally, Anaconda Repository has already updated the manifest file used by Cloudera Manager with the new parcel information at the existing Remote Parcel Repository URL. Now, we’re ready to install the newly created custom Anaconda parcel using Cloudera Manager.

Installing Custom Anaconda Parcels Using Cloudera Manager

Now that we’ve generated a custom Anaconda parcel, we can install it on our Cloudera CDH cluster and make it available to all of the cluster users for PySpark and SparkR jobs.

From the Cloudera Manager Admin Console, click the Parcels indicator in the top navigation bar.

Click the Configuration button on the top right of the Parcels page.

Click the plus symbol in the Remote Parcel Repository URLs section, and add the repository URL that was provided from Anaconda Repository.

Finally, we can download, distribute and activate the custom Anaconda parcel.

And we’re done! The custom-generated Anaconda parcel is now activated and ready to use with Spark or other distributed frameworks on our Cloudera CDH cluster.

Using the Custom Anaconda Parcel

Now that we’ve generated, installed and activated a custom Anaconda parcel, we can use libraries from our custom Anaconda parcel with PySpark.

You can use spark-submit along with the PYSPARK_PYTHON environment variable to run Spark jobs that use libraries from the Anaconda parcel, for example:

$ PYSPARK_PYTHON=/opt/cloudera/parcels/anaconda_plus/bin/python spark-submit pyspark_script.py
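
Before submitting a job, it can be worth sanity-checking that the activated parcel provides the interpreter and packages you expect. This is only a sketch: the path is the parcel location used in the example above, and seaborn/hdfs3 are two of the packages we added to the parcel earlier.

# confirm the parcel's Python exists and can import packages from our list
$ /opt/cloudera/parcels/anaconda_plus/bin/python --version
$ /opt/cloudera/parcels/anaconda_plus/bin/python -c "import seaborn, hdfs3; print('anaconda_plus parcel OK')"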

Or, to work with Spark interactively on the Cloudera CDH cluster, we can use Jupyter Notebooks via Anaconda Enterprise Notebooks, which is a multi-user notebook server with collaboration and support for enterprise authentication. You can configure Anaconda Enterprise Notebooks to use different Anaconda parcel installations on a per-job basis.

Get Started with Custom Anaconda Parcels in Your Enterprise

If you’re interested in generating custom Anaconda installers and parcels for Cloudera Manager, we can help! Get in touch with us by using our contact us page for more information about this functionality and our enterprise Anaconda platform subscriptions.

If you’d like to test-drive the on-premises, enterprise features of Anaconda on a bare-metal, on-premises or cloud-based cluster, get in touch with us at sales@continuum.io.

The enterprise features of the Anaconda platform, including the distributed functionality in Anaconda Scale and on-premises functionality of Anaconda Repository, are certified by Cloudera for use with Cloudera CDH 5.x.

by swebster at October 28, 2016 06:50 PM

Announcing the Continuum Founders Awards

Friday, October 28, 2016
Travis Oliphant
Chief Executive Officer & Co-Founder
Continuum Analytics

Team Award

This award is presented to a team (either formal or informal) that consistently delivers on the company mission, understands the impact and importance of all aspects of the business, and exemplifies the qualities and output of a high-functioning team. The team members consistently demonstrate being humble, hungry, and smart.

Joel Hull

Erik Welch

Trent Oliphant

Jim Kitchen

This team was selected for at least the following activities that benefit all aspects of the business:

  • Repeated, successful delivery on an important project at a major customer that has led the way for product sales and future contract-based work with the customer

  • Internal work on Enterprise Notebooks to fix specific problems needed by a customer

  • Continued coordination with Repository to enable it to be purchased by a large customer

  • Erik’s work on helping with Dask

  • Trent’s work on getting Anaconda installed at customer sites ensuring successful customer engagement

  • The team’s initial work on productizing a successful project from their consulting project

  • Coordinated the build and delivery of many client specific conda packages

Mission, Values, Purpose (MVP) Award

This award is given to an individual who exemplifies the mission, values, and purpose of Continuum Analytics.

Stan Seibert

When Peter and Travis first organized the company values with other leaders, we each separately envisioned several members of the Continuum team whom we thought exemplified what it meant to be at Continuum. Stan was at the top of all of our lists.

Stan knows what it means to empower people to change the world. As a scientist he worked on improving neutrino detection, contributing to the experiment that was co-awarded the 2015 Nobel Prize in Physics and the 2016 Breakthrough Prize in Fundamental Physics. The Numba project has flourished under his leadership, both in terms of project development and in ensuring that funding for the project continues from government, non-profits, and companies. Stan shows the quality-first, owner mentality of true craftsmanship in all of the projects he leads, which has led his customers to renew again and again. Stan also exemplifies continuous learning. In one example, he learned how to build a cluster from several Raspberry Pi systems --- including taking the initiative to attach an LCD display to the front of each one. His “daskmaster” has been a crowd favorite in the Continuum booth at several events.

Customer-First Award

This award recognizes individuals who consistently demonstrate that customers matter and we will succeed to the degree that we solve problems for and support our customers and future customers.

Derek Orgeron

Atish Singh

Ian Stokes-Rees

Derek, Atish, and Ian put customers first every single day.

Derek and Atish follow up on all of our marketing events and are often the first contact our customers have with Continuum. They are responsible for triaging opportunities and determining which are the most likely to lead to sales in product, training, and/or consulting. They pursue customer contacts at rates far above industry averages, and they do it while remaining positive, enthusiastic, and helpful to the people they reach.

Ian Stokes-Rees has gone above and beyond to serve customers for the past year. He is always willing to do what makes sense, filling in as a sales engineer as well as an implementation engineer as needed. He has worked tirelessly for Citibank to ensure success on that opportunity, and he has single-handedly enabled Anaconda and SAS to work together. At multiple conferences (e.g. Strata) he is a tireless and articulate advocate for Anaconda, explaining in detail how it will help our clients.

by ryanwh at October 28, 2016 03:07 PM

October 26, 2016

Continuum Analytics news

Recursion Pharmaceuticals Wants to Cure Rare Genetic Diseases - and We’re Going to Help

Wednesday, October 26, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

Today we are pleased to announce that Continuum Analytics and Recursion Pharmaceuticals are teaming up to use data science in the quest to find cures for rare genetic diseases. Using Bokeh on Anaconda, Recursion is building its drug discovery assay platform to analyze layered cell images and weigh the effectiveness of different remedies. As we always say, Anaconda arms data scientists with superpowers to change the world. This is especially valuable for Recursion, since success literally means saving lives and changing the world by bringing drug remedies for rare genetic diseases to market faster than ever before.
 
It’s estimated that there are over 6,000 genetic disorders, yet many of these diseases represent a small market. Pharmaceutical companies aren’t usually equipped to pursue the cure for each disease. Anaconda will help Recursion by blending biology, bioinformatics and machine learning, bringing cell data to life. By identifying patterns and assessing drug remedies quickly, Recursion is using data science to discover potential drug remedies for rare genetic diseases. In English - this company is trying to cure big, bad, killer diseases using Open Data Science. 

The ODS community is important to us. Working with a company in the pharmaceutical industry, an industry that is poised to convert ideas into life-saving medications, is humbling. With so many challenges, not the least of which include regulatory roadblocks and lengthy and complex R&D processes, researchers must continually adapt and innovate to speed medical advances. Playing a part in that process? That’s why we do what we do. We’re excited to welcome Recursion to the family and observe as it uses its newfound superpowers to change the world, one remedy at a time.

Want to learn more about this news? Check out the press release here.

by swebster at October 26, 2016 02:17 PM

Recursion Pharmaceuticals Selects Anaconda to Create Innovative Next Generation Drug Discovery Assay Platform to Eradicate Rare Genetic Diseases

Wednesday, October 26, 2016

Open Data Science Platform Accelerates Time-to-Market for Drug Remedies

AUSTIN, TX—October 26, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that Recursion Pharmaceuticals, LLC, a drug discovery company focused on rare genetic diseases, has adopted Bokeh––a Continuum Analytics open source visualization framework that operates on the Anaconda platform. Bokeh on Anaconda makes it easy for biologists to identify genetic disease markers and assess drug efficacy when visualizing cell data, allowing for faster time-to-value for pharmaceutical companies. 

“Bokeh on Anaconda enables us to perform analyses and make informative, actionable decisions that are driving real change in the treatment of rare genetic diseases,” said Blake Borgeson, CTO & co-founder at Recursion Pharmaceuticals. “By layering information and viewing images interactively, we are obtaining insights that were not previously possible and enabling our biologists to more quickly assess the efficacy of drugs. With the power of Open Data Science, we are one step closer to a world where genetic diseases are more effectively managed and more frequently cured, changing patient lives forever.” 

By combining interactive, layered visualizations in Bokeh on Anaconda to show both healthy and diseased cells along with relevant data, biologists can experiment with thousands of potential drug remedies and immediately understand the effectiveness of the drug to remediate the genetic disease. Biologists realize faster insights, speeding up time-to-market for potential drug treatments. 

“Recursion Pharmaceuticals’ data scientists crunch huge amounts of data to lay the foundation for some of the most advanced genetic research in the marketplace. With Anaconda, the Recursion data science team has created a breakthrough solution that allows biologists to quickly and cost effectively identify therapeutic treatments for rare genetic diseases,” said Peter Wang, CTO & co-founder at Continuum Analytics. “We are enabling companies like Recursion to harness the power of data on their terms, building solutions for both customized and universal insights that drive new value in all areas of business and science. Anaconda gives superpowers to people who change the world––and Recursion is a great example of how our Open Data Science vision is being realized and bringing solid, everyday value to critical healthcare processes.”

Data scientists at Recursion evaluate hundreds of genetic diseases, ranging from one evaluation per month to thousands in the same time frame. Bokeh on Anaconda delivers insights derived from heat maps, charts, plots and other scientific visualizations interactively and intuitively, while providing holistic data to enrich the context and allow biologists to discover potential treatments quickly. These visualizations empower the team with new ways to re-evaluate shelved pharmaceutical treatments and identify new potential uses for them. Ultimately, this creates new markets for pharmaceutical investments and helps develop new treatments for people suffering from genetic diseases. 

Bokeh on Anaconda is a framework for creating versatile, interactive and browser-based visualizations of streaming data or Big Data from Python, R or Scala without writing any JavaScript. It allows for exploration, embedded visualization apps and interactive dashboards, so that users can create rich, contextual plots, graphs, charts and more to enable more comprehensive deductions from images. 
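
As a rough illustration of what that looks like in practice (this sketch is ours, not Recursion’s code; the data and file name are made up), a Bokeh plot is built entirely in Python and rendered as an interactive HTML page:

from bokeh.plotting import figure, output_file, show

# Illustrative numbers only; a real assay would have far richer data.
drug_dose = [0.1, 0.5, 1.0, 2.0, 4.0]
cell_response = [1.2, 2.3, 3.1, 4.8, 5.0]

output_file("dose_response.html")  # standalone HTML output, no hand-written JavaScript
p = figure(title="Example dose-response plot",
           x_axis_label="dose", y_axis_label="response")
p.circle(drug_dose, cell_response, size=10)  # interactive pan/zoom/save tools come for free
show(p)  # open the plot in a browser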

For additional information about Continuum Analytics and Anaconda please visit: https://continuum.io. For more information on Bokeh on Anaconda visit https://bokeh.pydata.org.

About Recursion Pharmaceuticals, LLC

Founded in 2013, Salt Lake City, Utah-based Recursion Pharmaceuticals, LLC is a drug discovery company. Recursion uses a novel drug screening platform to efficiently repurpose and reposition drugs to treat rare genetic diseases. Recursion’s novel drug screening platform combines experimental biology and bioinformatics in a massively parallel system to quickly and efficiently identify treatments for multiple rare genetic diseases. The core of the approach revolves around high-throughput automated screening using images of human cells, which allows the near simultaneous modeling of hundreds of genetic diseases. Rich data from these assays is probed using advanced statistical and machine learning approaches, and the effects of thousands of known drugs and shelved drug candidates can be investigated efficiently to identify those holding the most promise for the treatment of any one rare genetic disease.

The company’s lead candidate, a new treatment for Cerebral Cavernous Malformation, is approaching clinical trials, and the company has a rich pipeline of repurposed therapies in its development pipeline for diverse genetic diseases.

About Anaconda Powered by Continuum Analytics

Continuum Analytics is the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world. 

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations. 

Continuum Analytics' founders and developers have created and contributed to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, Pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup. 

To learn more, visit http://www.continuum.io.

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com 

by swebster at October 26, 2016 12:01 PM

October 25, 2016

Filipe Saraiva

My QtCon + Akademy 2016

From August 31st to September 10th I was in Berlin attending two amazing conferences: QtCon and Akademy.

QtCon brought together five communities to host their respective conferences at the same time and place, creating one big and diverse conference. Those communities were Qt, KDAB, KDE (celebrating its 20th birthday), VLC and FSFE (both celebrating their 15th birthdays).


Main conference hall of QtCon at bcc

That diversity of themes was a very interesting characteristic of QtCon. I really appreciated seeing presentations from Qt and KDAB people, and I was surprised by the topics coming out of the VLC community. The strong technical content on trends like Qt on mobile, Qt in IoT (including autonomous cars), the future of Qt, Qt + Python, contributing to Qt, and more caught my attention during the conference.

On the VLC side I was surprised by the size of the community. I never imagined VLC had so many developers. In fact, I never realized that VideoLAN is actually an umbrella for a lot of multimedia-related projects, like codecs, streaming tools, VLC ports to specific devices (including cars through Android Auto), and more. Yes, I really enjoyed meeting these people and watching their presentations.

I was waiting for the VLC 3.0 release during QtCon, but unfortunately it did not happen. Of course, the team is still improving this new release, and when it is finished I will have a VLC to use with my Chromecast, so keep up the great work, coneheads!

The FSFE presentations were interesting as well. In Brazil there are several talks about the political and philosophical aspects of free software at conferences like FISL and Latinoware. At QtCon, FSFE brought this type of presentation in a “European” style: the presentations sometimes seemed more pragmatic in their approach. Other FSFE presentations covered the infrastructure and organizational aspects of the foundation, a nice overview to compare with other groups like ASL.org in Brazil.

Of course, there were a lot of amazing presentations from our gearheads. I would highlight the talks about KDE history, the latest Plasma Desktop news, the status of Plasma Mobile, KF5 on Android, and the experience of Minuet in the mobile world, among others.

The KDE Store announcement was really interesting, and I expect it will bring more attention to the KDE ecosystem once software package bundles (snap/flatpak/etc.) become available in the store.

Another piece of software that caught my attention was Peruse, a comic book reader. I hope the developers can solve the current problems and release a mobile version of Peruse, so that it can reach the broad base of users on those platforms.

After the end of QtCon, Akademy took place at TU Berlin, on a very beautiful and comfortable campus. This phase of the conference was full of technical sessions, discussions, hacking, and fun.

I attended the Flatpak, AppStream, and Snapcraft BoFs. There were a lot of advanced technical discussions on those themes. Every Akademy I am very impressed by the level of the technical discussions carried out by the hackers in the KDE community. Really, guys, you rock!

The Snapcraft BoF was a tutorial on how to use that technology to create cross-distro bundle packages. It was interesting, and I would like to experiment more and also take a look at Flatpak in order to pick something to create packages for Cantor.

Unfortunately I missed the BoF on Kube. I am very interested in an alternative PIM project for KDE, focused on E-Mail/Contacts/Calendar and less demanding of computational resources. I am keeping my eyes and expectations on this project.

The other days I basically spent my time working on Cantor and talking with our worldwide KDE fellows about several topics like KDE Edu, improvements to our Jabber/XMPP infrastructure, KDE's 20 years, Plasma on small-size computers (thanks sebas for the Odroid-C1+ device 😉 ), WikiToLearn (could it be interesting to have a way to import/export Cantor worksheets to/from WikiToLearn?), and of course beers and German food.

And what about Berlin? It was my second time in the city, and like the previous visit I was excited by the multicultural atmosphere, the food (<3 pork <3) and the beers. We stayed in Kreuzberg, a hipster district, so we could visit some bars and expat restaurants there. QtCon+Akademy had interesting events as well, like the FSFE celebration at c-base and the Akademy day trip to Peacock Island.

So, I would like to say thank you to KDE e.V. for funding my attendance at the events, thank you to Petra for helping us with the hostel, and thank you to all the volunteers for working hard and making this Akademy edition a real celebration of the KDE community.


Some Brazilians in QtCon/Akademy 2016: KDHelio, Lamarque, Sandro, João, Aracele, Filipe (me)

by Filipe Saraiva at October 25, 2016 02:57 PM

Matthieu Brucher

Book review: Weapons Of Math Destruction: How Big Data Increases Inequality and Threatens Democracy

Big data is the current hype, the thing you need to do to find the best job in the world. I started using machine learning tools a decade ago, and when I saw this book, it felt like it was answering some concerns I had. Let’s see what’s inside.

Content and opinions

The first two chapters set the stage for the discussions in the book. The first covers the way a model works, why people trust models, and why we want to create new ones; the other explains why we should trust the author. She does indeed have the background to understand models and the job experience to see first-hand how a model can be used. Of course, one thing missing here is that most of the examples in the book take place in the US. Hopefully the EU will be smart and learn from the US's mistakes (at least there are efforts to lower the amount of data Facebook and Google are agglomerating on users).

OK, let’s start with the first weapon, the one targeted at students. I was quite baffled that all this started with a random ranking from a low-profile newspaper. And now every university tries to maximize its reputation based on numbers, not on the actual quality of new students. And universities really are spending that much money on advertising, to the point of driving tuition fees sky-high (which is a future crisis in waiting, by the way!).

The second weapon is one we all know: online ads. Almost all websites survive one way or another on revenue from online advertisement. All websites are connected through links to social networks, ad agencies… and these companies churn out information and deductions based on this gigantic pile of data. If advertisement didn’t have an effect on people, there would be no market for it.

Moving on to something completely different: justice. This is something that also happens in France. We have far-right extremists who want stats (it is forbidden to have racial stats there) to show that some categories of the population need to be checked more often than others. It is indeed completely unfair, and also proof that we are targeting some types of crimes and not others. The way this weapon worked was clearly skewed from the start. How could anyone not see the problem?

Then it gets even worse with getting a job, and in the chapter after that, keeping the job. Both times, the main issue is that the WMD helps companies maximize their profit and minimize their risk. There are two issues there: first, only sure prospects are going to be hired, and this is based on… stats accumulated through the years that are racially biased. And once people have a job, the objective is not to optimize the happiness of the employee, even if doing so would enhance profitability.

The next two are also related: credit and insurance. It is nice to see that credit scores started as a way to remove bias; it is terrible to see that we went backwards and scores are now dictated by obscure tools. And they now even impact insurance, not to optimize one’s cost, but to optimize revenue for the insurance company. I believe in everyone paying the same amount and having the same coverage on things like health (not for driving, because we can all be better drivers, but we cannot optimize our genes!). It all points toward a really individualistic society, and it is scary.

Finally, even elections are rigged. I knew that messages were tailored to appeal to each category of voters, but it is scary to see that this is actually used to lie. We all know that politicians are lying to us, but now they don’t even care about giving us different lies. And social networks and ad companies have even more power to make us do things as they see fit. The fact that Facebook officially publishes some of its tests on users just makes me mad.

Conclusion

OK, the book is full of examples of bad usage of big data. I have seen first-hand in scientific applications that it is easy to make a mistake when creating a model. In my case, it was the optimization of a modeler, and more specifically the delta between each iteration. When trying to minimize the number of non-convergence issues, if we only try to find the same time step as the original app, we are missing the point: we are trying to match a proxy. The real objective is to find a new time step that also keeps the number of convergence issues low, possibly a different one.

Another example is all these WMDs, actually. They are more often than not based on neural networks and deep learning algorithms (which are actually the same thing). We pour lots of effort into making them better, but the issue is that we don’t know what they are doing (in that regard, every horror sci-fi movie with a crazy AI comes to mind, as well as Asimov’s books). This has been the case for decades, and although we know equivalent algorithms that could give us the explanation, we stay with these black boxes because they are cost-effective (we don’t have to choose the proper equivalent algorithm, we just train) and scalable (which may not be the case for the equivalent algorithms, as they don’t have the same priority in research, it would seem!). A nice thing about the book is that it also underlines an issue I hadn’t even thought about: all these algorithms try to reproduce past behavior. But humanity is evolving, and things that were considered true in the past are no longer true (race, anyone?). As such, if we give these WMDs absolute power, we will just rot as a civilization and probably collapse.

I’m not against big data and machine learning. I think the current trend is clearly explained in this book and also corresponds to something I felt before this hype: let’s choose a good algorithm, let’s train the model, and let’s see why it chooses some answers and not others. We may then be onto something, or we may see that it is biased and we need to go back to the drawing board. Considering the state of big data, we definitely need to go back to the drawing board.

by Matt at October 25, 2016 07:25 AM

October 24, 2016

Enthought

Mayavi (Python 3D Data Visualization and Plotting Library) adds major new features in recent release

Key updates include: Jupyter notebook integration, movie recording capabilities, time series animation, updated VTK compatibility, and Python 3 support

by Prabhu Ramachandran, core developer of Mayavi and director, Enthought India

The Mayavi development team is pleased to announce Mayavi 4.5.0, which is an important release both for new features and core functionality updates.

Mayavi is a general purpose, cross-platform Python package for interactive 2-D and 3-D scientific data visualization. Mayavi integrates seamlessly with NumPy (fast numeric computation library for Python) and provides a convenient Pythonic wrapper for the powerful VTK (Visualization Toolkit) library. Mayavi provides a standalone UI to help visualize data, and is easy to extend and embed in your own dialogs and UIs. For full information, please see the Mayavi documentation.

Mayavi is part of the Enthought Tool Suite of open source application development packages and is available to install through Enthought Canopy’s Package Manager (you can download Canopy here).

Mayavi 4.5.0 is an important release which adds the following features:

  1. Jupyter notebook support: Adds basic support for displaying Mayavi images or interactive X3D scenes
  2. Support for recording movies and animating time series
  3. Support for the new matplotlib color schemes
  4. Improvements on the experimental Python 3 support from the previous release
  5. Compatibility with VTK 5.x, 6.x, and 7.x (for more details on the full set of changes, see here)

Let’s take a look at some of these new features in more detail:

Jupyter Notebook Support

This feature is still basic and experimental, but it is convenient. The feature allows one to embed either a static PNG image of the scene or a richer X3D scene into a Jupyter notebook. To use this feature, one should first initialize the notebook with the following:

from mayavi import mlab
# Configure Mayavi to render scenes inline in the notebook (X3D backend by default)
mlab.init_notebook()

Subsequently, one may simply do:

s = mlab.test_plot3d()  # build a demo 3-D plot and return the plot object
s                       # displaying the object embeds the scene in the notebook output

This will embed a 3-D visualization producing something like this:


Embedded 3-D visualization in a Jupyter notebook using Mayavi

When the init_notebook method is called, it configures the Mayavi objects so they can be rendered in the Jupyter notebook. By default, init_notebook selects the X3D backend, which requires a network connection and also reasonable off-screen rendering support. This currently will not work on a remote Linux/OS X server unless VTK has been built with off-screen support via OSMesa, as discussed here.
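
As a small illustration (ours, not from the release notes), any mlab plot built from NumPy arrays can be embedded the same way once init_notebook() has been called:

import numpy as np
from mayavi import mlab

# A simple helix, colored by the parameter value along the curve
t = np.linspace(0, 4 * np.pi, 200)
s = mlab.plot3d(np.sin(t), np.cos(t), 0.2 * t, t, tube_radius=0.05)
s  # displaying the returned object embeds the scene in the notebook output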

For more documentation on the Jupyter support see here.

Animating Time Series

This feature makes it very easy to animate a time series. Let us say one has a set of files that constitute a time series (files of the form some_name[0-9]*.ext). If one were to load any file that is part of this time series like so:

from mayavi import mlab
# Load one file that is part of the time series (some_name[0-9]*.ext pattern)
src = mlab.pipeline.open('data_01.vti')

Animating these is now very easy if one simply does the following:

src.play = True  # step through the files in the time series automatically

This can also be done on the UI. There is also a convenient option to synchronize multiple time series files using the “sync timestep” option on the UI or from Python. The screenshot below highlights the new features in action on the UI:


New time series animation feature in the Python Mayavi 3D visualization library.

Recording Movies

One can also create a movie (really a stack of images) while playing a time series or running any animation. On the UI, one can select a Mayavi scene, navigate to the movie tab and select the “record” checkbox; any animations will then record screenshots of the scene. The same can be done from Python, for example:

from mayavi import mlab
f = mlab.figure()                  # create a new scene
f.scene.movie_maker.record = True  # save a screenshot for each step of any animation
mlab.test_contour3d_anim()         # run a built-in animated demo

This will create a set of images, one for each step of the animation. A gif animation of these is shown below:


Recording movies as gif animations using Mayavi
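
The two features can also be combined; a minimal sketch (our own combination of the calls shown above, assuming data_01.vti is part of a time series as in the earlier example) records a screenshot for every step of the playback:

from mayavi import mlab

src = mlab.pipeline.open('data_01.vti')  # one file of the time series
mlab.pipeline.surface(src)               # add a simple visualization module
f = mlab.gcf()
f.scene.movie_maker.record = True        # save a screenshot per animation step
src.play = True                          # play back the series, recording as it goes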

More than 50 pull requests were merged since the last release. We are thankful to Prabhu Ramachandran, Ioannis Tziakos, Kit Choi, Stefano Borini, Gregory R. Lee, Patrick Snape, Ryan Pepper, SiggyF, and daytonb for their contributions towards this release.

Additional Resources on Mayavi:

The post Mayavi (Python 3D Data Visualization and Plotting Library) adds major new features in recent release appeared first on Enthought Blog.

by Prabhu Ramachandran at October 24, 2016 03:51 PM

October 23, 2016

Titus Brown

What is open science?

Gabriella Coleman asked me for a short, general introduction to open science for a class, and I couldn't find anything that fit her needs. So I wrote up my own perspective. Feedback welcome!

Some background: Science advances because we share ideas and methods

Scientific progress relies on the sharing of both scientific ideas and scientific methodology - “If I have seen further it is by standing on the shoulders of Giants” (https://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants). The natural sciences advance not just when a researcher observes or understands a phenomenon, but also when we develop (and share) a new experimental technique (such as microscopy), a mathematical approach (e.g. calculus), or a new computational framework (such as multi scale modeling of chemical systems). This is most concretely illustrated by the practice of citation - when publishing, we cite the previous ideas we’re building on, the published methods we’re using, and the publicly available materials we relied upon. Science advances because of this sharing of ideas, and scientists are recognized for sharing ideas through citation and reputation.

Despite this, however, there are many barriers that lie in the way of freely sharing ideas and methods - ranging from cultural (e.g. peer review delays before publication) to economic (such as publishing behind a paywall) to methodological (for example, incomplete descriptions of procedures) to systemic (e.g. incentives to hide data and methods). Some of these barriers are well intentioned - peer review is intended to block incorrect work from being shared - while others, like closed access publishing, have simply evolved with science and are now vestigial.

So, what is open science??

Open science is the philosophical perspective that sharing is good and that barriers to sharing should be lowered as much as possible. The practice of open science is concerned with the details of how to lower or erase the technical, social, and cultural barriers to sharing. This includes not only what I think of as “the big three” components of open science -- open access to publications, open publication and dissemination of data, and open development, dissemination, and reuse of source code -- but also practices such as social media, open peer review, posting and publishing grants, open lab notebooks, and any other methods of disseminating ideas and methods quickly.

The potential value of open science should be immediately obvious: easier and faster access to ideas, methods, and data should drive science forward faster! But open science can also aid with reproducibility and replication, decrease the effects of economic inequality in the sciences by liberating ideas from subscription paywalls, and provide reusable materials for teaching and training. And indeed, there is some evidence for many of these benefits of open science even in the short term (see How open science helps researchers succeed, McKiernan et al. 2016). This is why many funding agencies and institutions are pushing for more science to be done more openly and made available sooner - because they want to better leverage their investment in scientific progress.

Some examples of open science

Here are a few examples of open science approaches, taken from my own experiences.

Preprints

In biology (and many other sciences), scientists can only publish papers after they undergo one or more rounds of peer review, in which 2-4 other scientists read through the paper and check it for mistakes or overstatements. Only after a journal editor has received the reviews and decided to accept the paper does it “count". However, in some fields, there are public sites where draft versions of papers can be publicly posted prior to peer review - these “preprint servers” work to disseminate work in advance of any formal review. The first widely used preprint server, arXiv, was created in the 1980s for math and physics, and in those fields preprints now often count towards promotion and grant decisions.

The advantages of preprints are that they get the work out there, typically with a citable identifier (DOI), and allow new methods and discoveries to spread quickly. They also typically count for establishing priority - a discovery announced in a preprint is viewed as a discovery, period, unless it is retracted after peer review. The practical disadvantages are few - the appearance of double-publishing was a concern, but is no longer, as most journals allow authors to preprint their work. In practice, most preprints just act as an extension of the traditional publishing system (but see this interesting post by Matt Stephens on "pre-review" by Biostatistics). What is viewed as the major disadvantage can also be an advantage - the work is published with the names of the authors, so the reputation of the authors can be affected both positively and negatively by their work. This is what some people tell me is the major drawback to preprints for them - that the work is publicly posted without any formal vetting process, which could catch major problems with the work that weren't obvious to the authors.

I have been submitting preprints since my first paper in 1993, which was written with a physicist for whom preprinting was the default (Adami and Brown, 1994). Many of my early papers were preprinted because my collaborators were used to it. While in graduate school, I lapsed in preprinting for many years because my field (developmental biology) didn’t “do” preprints. When I started my own lab, I returned to preprinting, and submitted all of my senior author papers to preprint servers. Far from suffering any harm to my career, I have found that our ideas and our software have spread more quickly because of it - for example, by the time my first senior author paper was reviewed, another group had already built on top of it based on our preprint (see Pell et al., 2014 which was originally posted at arXiv, and Chikhi and Rizk 2013).

Social media

There are increasingly many scientists of all career stages on Twitter and writing blogs, and they discuss their own and others’ work openly and even candidly. This has the effect of letting people restricted in their travel into social circles that would otherwise be closed to them, and accelerates cross-subject pollination of ideas. In bioinformatics, it is typical to hear about new bioinformatics or genomics methods online 6 months to a year before they are even available as a preprint. For those who participate, this results in fast dissemination and evaluation of methods and it can quickly generate a community consensus around new software.

The downsides of social media are the typical social media downsides: social media is its own club with its own cliques, however welcoming some of those cliques can be; identifiable women and people of color operate at a disadvantage here as elsewhere; cultivating a social media profile can require quite a bit of time that could be spent elsewhere; and online discussions about science can be gossipy, negative, and even unpleasant. Nonetheless there is little doubt that social media can be a useful scientific tool (see Bik and Goldstein, 2013), and can foster networking and connections in ways that don’t rely on physical presence - a major advantage to labs without significant travel funds, parents with small children, etc.

In my case, I tend to default to being open about my work on social media. I regularly write blog posts about my research and talk openly about ideas on twitter. This has led to many more international connections than I would have had otherwise, as well as a broad community of scientists that I consider personal friends and colleagues. In my field, this has been particularly important; since many of my bioinformatics colleagues tend to be housed in biology or computer science departments rather than any formal computational biology program, the online world of social media also serves as an excellent way of discovering colleagues and maintaining collegiality in an interdisciplinary world, completely independent of its use for spreading ideas and building reputation.

Posting grants

While reputation is the key currency of advancement in science, good ideas are fodder for this advancement. Ideas are typically written up in the most detail in grant proposals - requests for funding from government agencies or private foundations. The ideas in grant proposals are guarded jealously, with many professors refusing to share grant proposals even within their labs. A few people (myself included) have taken to publicly posting grants when they are submitted, for a variety of reasons (see Ethan White's blog post for details).

In my case, I posted my grants in the hopes of engaging with a broader community to discuss the ideas in my grant proposal; while I haven’t found this engagement, the grants did turn out to be useful for junior faculty who are confused about formatting and tone and are looking for examples of successful (or unsuccessful) grants. More recently, I have found that people are more than happy to skim my grants and tell me about work outside my field or even unpublished work that bears on my proposal. For example, with my most recent proposal, I discovered a number of potential collaborators within 24 hours of posting my draft.

Why not open science?

The open science perspective - "more sharing, more better" - is slowly spreading, but there are many challenges that are delaying its spread.

One challenge of open science is that sharing takes effort, while the immediate benefits of that sharing largely go to people other than the producer of the work being shared. Open data is a perfect example of this: it takes time and effort to clean up and publish data, and the primary benefit of doing so will be realized by other people. The same is true of software. Another challenge is that the positive consequences of sharing, such as serendipitous discoveries and collaboration, cannot be accurately evaluated or pitched to others in the short term - it requires years, and sometimes decades, to make progress on scientific problems, and the benefits of sharing do not necessarily appear on demand or in the short term.

Another block to open science is that many of the mechanisms of sharing are themselves somewhat new, and are rejected in unthinking conservatism of practice. In particular, most senior scientists entered science at a time when the Internet was young and the basic modalities and culture of communicating and sharing over the Internet hadn’t yet been developed. Since the pre-Internet practices work for them, they see no reason to change. Absent a specific reason to adopt new practices, they are unlikely to invest time and energy in adopting new practices. This can be seen in the rapid adoption of e-mail and web sites for peer review (making old practices faster and cheaper) in comparison to the slow and incomplete adoption of social media for communicating about science (which is seen by many scientists as an additional burden on their time, energy, and focus).

Metrics for evaluating products that can be shared are also underdeveloped. For example, it is often hard to track or summarize the contributions that a piece of software or a data set makes to advancing a field, because until recently it was hard to cite software and data. More, there is no good technical way to track software that supports other software, or data sets that are combined in a larger study or meta-study, so many of the indirect products of software and data may go underreported.

Intellectual property law also causes problems. For example, in the US, the Bayh-Dole Act stands in the way of sharing ideas early in their development. Bayh-Dole was intended to spur innovation by granting universities the intellectual property rights to their research discoveries and encouraging them to develop them, but I believe that it has also encouraged people to keep their ideas secret until they know if they are valuable. But in practice most academic research is not directly useful, and moreover it costs a significant amount of money to productize, so most ideas are never developed commercially. In effect this simply discourages early sharing of ideas.

Finally, there are also commercial entities that profit exorbitantly from restricting access to publications. Several academic publishers, including Elsevier and MacMillan, have profit margins of 30-40%! (Here, see Mike Taylor on The obscene profits of commercial scholarly publishers.) (One particularly outrageous common practice is to charge a single lump sum for access to a large number of journals each year, and only provide access to the archives in the journals through that current subscription - in effect making scientists pay annually for access to their own archival literature.) These corporations are invested in the current system and have worked politically to block government efforts towards encouraging open science.

Oddly, non-profit scientific societies have also lobbied to restrict access to scientific literature; here, their argument appears to be that the journal subscription fees support work done by the societies. Of note, this appears to be one of the reasons why an early proposal for an open access system didn't realize its full promise. For more on this, see Kling et al., 2001, who point out that the assumption that the scientific societies accurately represent the interests and goals of their constituents and of science itself is clearly problematic.

The overall effect of the subscription gateways resulting from closed access is to simply make it more difficult for scientists to access literature; in the last year or so, this fueled the rise of Sci-Hub, an illegal open archive of academic papers. This archive is heavily used by academics with subscriptions because it is easier to search and download from Sci-Hub than it is to use publishers' Web sites (see Justin Peters' excellent breakdown in Slate).

A vision for open science

A great irony of science is that a wildly successful model of sharing and innovation — the free and open source software (FOSS) development community— emerged from academic roots, but has largely failed to affect academic practice in return. The FOSS community is an exemplar of what science could be: highly reproducible, very collaborative, and completely open. However, science has gone in a different direction. (These ideas are explored in depth in Millman and Perez 2014.)

It is easy and (I think) correct to argue that science has been corrupted by the reputation game (see e.g. Chris Chambers' blog post on 'researchers seeking to command petty empires and prestigious careers') and that people are often more concerned about job and reputation than in making progress on hard problems. The decline in public funding for science, the decrease in tenured positions (here, see Alice Dreger's article in Aeon), and the increasing corporatization of research all stand in the way of more open and collaborative science. And it can easily be argued that they stand squarely in the way of faster scientific progress.

I remain hopeful, however, because of generational change. The Internet and the rise of free content has made younger generations more aware of the value of frictionless sharing and collaboration. Moreover, as data set sizes become larger and data becomes cheaper to generate, the value of sharing data and methods becomes much more obvious. Young scientists seem much more open to casual sharing and collaboration than older scientists; it’s the job of senior scientists who believe in accelerating science to see that they are rewarded, not punished, for this.


Other resources and links:

by C. Titus Brown at October 23, 2016 10:00 PM

October 18, 2016

Matthieu Brucher

Book review: Why You Love Music: From Mozart to Metallica

I have to say, I was intrigued when I saw the book. Lots of things about music seem intuitive, from movies to how it makes us feel. And the book puts a theoretical aspect on it. So definitely something I HAD to read.

Content and opinions

There are 15 chapters in the book, covering lots of different facets of music. The first chapter tries to associate music genre and psychological profile. It was really interesting to see that the evolution of the music we like is dictated by things we listened in our childhood. And I have to say that my favorite music is indeed tightly correlated to the music style I prefered in my teens! The second chapter is more classic, as it tackles lyrics. Of course, it is easier to dive in a song with lyrics, even when they are misunderstood!

Third chapter is about emotions in music. I think that emotions are definitely the foremost element that composers want to convey. It seems there are basic rules, although it can be different depending on the culture (which was also interesting to know). The chapter also goes on different mechanisms music “uses” to create emotions. Basic conclusion: it is good for you. Fourth tackles the effect of repetition. It seems that it is mandatory to enjoy the music, and I enjoy the repetition of goose bumps moments in the songs I prefer.

The next chapters address the effect of music on our lives, starting with health. The type of music we listen has an impact on our mood, and also indicates in what shape we are. Sad songs, and we may be soothing from something, happy songs, and we may be joyful. Some people also say that music makes people smarter, this is also an element of the book, and the conclusion didn’t surprise me that much 😉

Moving on to using music in movies. I thought a lot about this, and indeed the tone of the music does ‘impact the way we feel about the different scenes in a movie. I likes the different examples that were used here. Chapter 8 was more intriguing, as it is about talent. Are we naturally gifted, or is it work. The majority of the people are not talented, they are just hard-working people (which in itself may also be a talent!). There is hope for everyone!

Let’s move on to more scientific stuff for the next chapters. The explanations on sounds, waves and frequencies were simple but efficient. Of course, as I have a music training and a signal processing background, it may have been easier to figure out where the author wanted to go, but I think that the elements he mentioned and their interactions was simple enough for everyone to understand how music worked. The chapter after deals with the rules of music writing. It was quite nice to see how some rules were analyzed and why they were “created” (like the big jump up, small down).

Going on in the music analysis, the next chapter is about the difference between melody and accompaniment. There are examples in this section to show what happens and explanations as to why the brain can make the difference. The chapter after tackles the strange things that happen in a brain when it creates something that didn’t exist in the first place. The following chapter on dissonance may have been the one I enjoyed the most, as it explains something I felt for a long time: you can’t play ont he bass whatever you want. The notes get murky if there are too many of them, compared to a guitar melody. The physical explanation tied everything nicely together, the jigsaw puzzle is solved!

Then 14th chapter handles the effect of the way a musician plays the notes on the feeling we get. I always think of an ok drummer and a great one, between a jazz drummer and a hitting drummer. The notes may be the same, but the message is completely different and is appreciated differently depending on the song. Finally the conclusion remembers us that we probably used music since the dawn of humanity, and lots of our experience is derived from the usage we made of music since then.

Conclusion

I don’t think I know music better now. But perhaps thanks to this book I can understand how it acts on what I feel. Maybe I over-analyze things too much as well. But I definitely appreciated the analysis of music effect on us!

by Matt at October 18, 2016 07:41 AM

October 13, 2016

Titus Brown

A shotgun metagenome workshop at the Scripps Institute of Oceanography

We just finished teaching a two day workshop at the Scripps Institute of Oceanography down at UC San Diego. Dr. Harriet Alexander, a postdoc in my lab, and I spent two days going through cloud computing, short read quality and k-mer trimming, metagenome assembly, quantification of gene abundance, mapping of reads against the assembly, making CIRCOS plots, and workflow strategies for reproducible and open science. We skipped the slicing and dicing data sets with k-mers, though -- not enough time.

Whew, I'm tired just writing all of that!

The workshop was delivered "Software Carpentry" style - interactive hands-on walk throughs of the tutorials, with plenty of room for questions and discussion and whiteboarding.

Did I mention we recorded? Yep. We recorded it. You can watch it on YouTube, in four acts: day 1, morning, day 1, afternoon, day 2, morning, and day 2, afternoon.

Great thanks to Jessica Blanton and Tessa Pierce for inviting us down and wrangling everything to make it work out!

The bad

A few things didn't work out that well.

The materials weren't great

This was a first run of these materials, most of which were developed the week of the workshop. While most of the materials worked, there were hiccups from the last minute nature of things.

Amazon f-ups

Somewhat more frustrating, Amazon continues to institute checks that prevent new users from spinning up EC2 instances. It used to be that new users could sign up a bit in advance of the class and be able to start EC2 instances. Now, it seems like there's an additional verification that needs to be done AFTER the first phone verification and AFTER the first attempt to start an EC2 instance.

The workshop went something like this:

Me: "OK, now press launch, and we can wait for the machines to start up."

Student 1: "It didn't work for me. It says awaiting verification."

Student 2: "Me neither."

Chorus of students: "Me neither."

So I went and spun up 17 instances on my account and distributed the host names to all of the students via our EtherPad. Equanimity in the face of adversity...?

We didn't get to the really interesting stuff that I wanted to teach

There was a host of stuff - genome binning, taxonomic annotation, functional annotation - that I wanted to teach but that we basically ended up not having time to write up into tutorials (and we wouldn't have had time to present, either).

The good

The audience interaction was great. We got tons of good questions, we explored corners of metagenomics and assembly and sequencing and biology that needed to be explored, and everyone was super nice and friendly!

We wrote up the materials, so now we have them! We'll run more of these and when we do, the current materials will be there and waiting and we can write new and exciting materials!

The location was amazing, too ;). Our second day was in a little classroom overlooking the Pacific Ocean. For the whole second part of the day you could hear the waves crashing against the beach below!

The unknown

One of the reasons that we didn't write up anything on taxonomy, or binning, or functional annotation, was that we don't really run these programs ourselves all that much. We did get some recommendations from the Interwebs, and I need to explore those, but now is the time to tell us --

  • what's your favorite genome binning tool? We've had DESMAN and multi-metagenome recommended to us; any others?
  • functional annotation of assemblies: what do you use? I was hoping to use ShotMap. I had previously balked at using ShotMap on assembled data, for several reasons, including its design for use on raw reads. But, after Harriet pointed out that we could quantify the Prokka-annotated genes from contigs, I may give ShotMap a try with that approach. I still have to figure out how to feed the gene abundance into ShotMap, though.
  • What should I use for taxonomic assignment? Sheila Podell, the creator of DarkHorse, was in the audience and we got to talk a bit, and I was impressed with the approach, so I may give DarkHorse a try. There are also k-mer based approaches like MetaPalette that I want to try, but my experience so far has been that they are extremely database intensive and somewhat fragile. I'd also like to try marker gene approaches like PhyloSift. What tools are people using? Any strong personal recommendations?
  • What tool(s) do people use to do abundance calculations for genes in their metagenome? I can think of a few basic types of approaches --

...but I'm at a loss for specific software to use. Any help appreciated - just leave a comment or e-mail me at titus@idyll.org.

--titus

by C. Titus Brown at October 13, 2016 10:00 PM

October 11, 2016

Fabian Pedregosa

A fast, fully asynchronous variant of the SAGA algorithm

My friend Rémi Leblond has recently uploaded to ArXiv our preprint on an asynchronous version of the SAGA optimization algorithm.

The main contribution is to develop a parallel (fully asynchronous, no locks) variant of the SAGA algorithm. This is a stochastic variance-reduced method for general optimization, specially adapted for problems that arise frequently in machine learning such as (regularized) least squares and logistic regression. Besides the specification of the algorithm, we also provide a convergence proof and convergence rates. Furthermore, we fix some subtle technical issues present in previous literature (proving things in the asynchronous setting is hard!).

The core of the asynchronous algorithm is similar to Hogwild!, a popular asynchronous variant of stochastic gradient descent (SGD). The main difference is that instead of using SGD as a building block, we use SAGA. This has many advantages (and poses some challenges): faster (exponential!) rates of convergence and convergence to arbitrary precision with a fixed step size (hence a clear stopping criterion), to name a few.
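
For readers who have not seen SAGA before, here is a rough sketch of the sequential update that the asynchronous variant parallelizes (a simplified, dense-gradient illustration of the standard algorithm, not the code from the paper):

import numpy as np

def saga(grad_i, x0, n_samples, step_size, n_iter=10000):
    """Minimize (1/n) * sum_i f_i(x) with the plain sequential SAGA update.

    grad_i(x, i) must return the gradient of the i-th term f_i at x.
    """
    x = x0.astype(float)  # work on a copy
    # table of the last gradient seen for each sample, plus their running average
    memory = np.array([grad_i(x, i) for i in range(n_samples)])
    grad_avg = memory.mean(axis=0)
    for _ in range(n_iter):
        i = np.random.randint(n_samples)
        g_new = grad_i(x, i)
        # unbiased, variance-reduced gradient estimate
        x = x - step_size * (g_new - memory[i] + grad_avg)
        # refresh the stored gradient and its running average
        grad_avg = grad_avg + (g_new - memory[i]) / n_samples
        memory[i] = g_new
    return x

For least squares, for example, grad_i(x, i) would return a_i * (a_i.dot(x) - b_i). The asynchronous variant has several cores run this inner loop concurrently on shared iterate and gradient memory without locks, in the spirit of Hogwild!; the preprint supplies the convergence proof and rates for that setting.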

The speedups obtained versus the sequential version are quite impressive. For example, we commonly observe 5x-7x speedups using 10 cores:

I will be presenting this work with Rémi at the NIPS OPT-ML workshop.

by Fabian Pedregosa at October 11, 2016 10:00 PM

Continuum Analytics news

Move Over Data Keepers: Open Source is Here to Stay

Tuesday, October 11, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

If you walked by our booth at Strata + Hadoop 2016 this year, you may have borne witness to a larger-than-life character, the almighty data keeper. He was hard to miss, standing seven feet tall in floor-length robes.

For those who weren’t in attendance, the natural thing to ask is who is this data keeper, and what does he do? Let me explain. 

Before the creation of open source and the Anaconda platform, data was once locked away and hidden from the world - protected by the data keepers. These data keepers were responsible for maintaining data and providing access only to those who could comprehend the complicated languages such as base SAS®. 

For years, this exclusivity kept data from reaching the outside world, where it could be used for good. However, as technology advanced, data became more accessible, taking power away from the data keepers and giving it instead to empowered data scientists and, eventually, citizen data scientists. This technology movement is referred to as the “open data science revolution” - resulting in an open source world that allows anyone to participate and interact with data. 

As the open data science community began to grow, members joined together to solve complex problems by utilizing different tools and languages. This collaboration is what enabled the creation of the Anaconda platform. Anaconda is currently being used by millions of innovators from all over the world in diverse industries (from science to business to healthcare) to come up with solutions to make the world a better place. 

Thanks to open source, data keepers are no longer holding data under lock and key - data is now completely accessible, enabling those in open source communities the opportunity to utilize data for good. 

To learn more, watch our short film below. 

 

by swebster at October 11, 2016 03:34 PM

Continuum Analytics Launches AnacondaCrew Partner Program to Empower Data Scientists with Superpowers

Wednesday, October 12, 2016

Leading Open Data Science platform company is working with its ecosystem to drive transformation in a data-enabled world

AUSTIN, TEXAS—October 12, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced the launch of the AnacondaCrew Partner Program. Following the recent momentum in revenue and Anaconda growth of over three million downloads, this program is focused on enabling partners to leverage the power of Open Data Science technology to accelerate time-to-value for enterprises by empowering modern data science teams. 

The AnacondaCrew program is designed to drive mutual business growth and financial performance for Technology, Service and OEM partners to take advantage of the fast-growing data science market. 

The AnacondaCrew Partner Program includes:

  • Technology Partners offering a hardware platform, cloud service or software that integrates with the Anaconda platform
  • Service Partners delivering Anaconda-based solutions and services to enterprises
  • OEM Partners using Anaconda, an enterprise-grade Python platform, to embed into their application, hardware or appliance

"We are extremely excited about our training partnership with Continuum Analytics,” said Jonathan Cornelissen, CEO at DataCamp. “We can now combine the best instructors in the field with DataCamp’s interactive learning environment to create the new standard for online Python for Data Science learning."

In the last year, Continuum has quickly grown the AnacondaCrew Partner Program to include a dozen of the best-known modern data science partners in the ecosystem, including Cloudera, DataCamp, Intel, Microsoft, NVIDIA, Docker and others.

“As a market leader, Anaconda is uniquely positioned to embrace openness through the open source community and a vast ecosystem of partners focused on helping customers solve problems that change the world,” said Michele Chambers, EVP of Anaconda and CMO at Continuum Analytics. “Our fast growing AnacondaCrew Partner Program delivers an enterprise-ready connected ecosystem that makes it easy for customers to embark on the journey to Open Data Science and realize returns on their Big Data investments.”

Learn more about the Continuum Analytics AnacondaCrew Partner Program here or simply get in touch

About Anaconda Powered by Continuum Analytics

Continuum Analytics is the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world. 

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations. 

Continuum Analytics' founders and developers have created and contributed to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, Pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup. 

To learn more, visit http://www.continuum.io.

###

Media Contact:
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com 

by swebster at October 11, 2016 02:31 PM

Esri Selects Anaconda to Enhance GIS Applications with Open Data Science

Wednesday, October 12, 2016

Streamlined access to Python simplifies and accelerates development of deep location-based analytics for improved operations and intelligent decision-making

AUSTIN, TEXAS—October 12, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced a new partnership with leading geographic information system (GIS) provider, Esri. By embedding Anaconda into Esri's flagship desktop application, ArcGIS Pro, organizations now have increased accessibility to Python for developing more powerful, location-centric analytics applications. 

The integration of Anaconda into ArcGIS Pro 1.3 enables GIS professionals to build detailed maps using the most current data and perform deep analysis to apply geography to problem solving and decision making for customers in dozens of industries. Anaconda provides a strong value-added component––particularly for Esri’s scientific and governmental customers who have to coordinate code across multiple machines or deploy software through centralized IT systems. Now, developers using ArcGIS Pro can easily integrate open source libraries into projects, create projects in multiple versions of Python and accelerate the process of installing nearly all publicly available Python packages.

“Python has a rich ecosystem of pre-existing code packages that users can leverage in their own script tools from within ArcGIS. But, managing packages can prove complex and time-consuming, especially when developing for multiple projects at once or trying to share code with others,” said Debra Parish, manager of global business strategies at Esri. “Anaconda solves these challenges and lets users easily create projects in multiple versions of Python. It really makes lives easier, especially for developers who deal with complex issues and appreciate the ease and agility Anaconda adds to the Python environment.”

ArcGIS for Desktop, which includes ArcGIS Pro, boasts the most powerful mapping software in the world. Used by Fortune 500 companies, national and local governments, public utilities and tech start-ups around the world, ArcGIS Pro’s mapping platform uncovers trends, patterns and spatial connections to provide actionable insights leading to data-informed business decisions. Additionally, ArcGIS Pro is accessible to developers to create and manage geospatial apps, regardless of developer experience.

“At Continuum Analytics, we know that data science is a team sport and collaboration is critical to the success of any analytics project. Anaconda empowers Esri developers with an accelerated path to open source Python projects and deeper analytics,” said Travis Oliphant, CEO and co-founder at Continuum Analytics. “More importantly, we see this as a partnering of two great communities, both offering best-in-class technology and recognizing that Open Data Science is a powerful solution to problem solving and decision making for organizations of all sizes.”

About Esri

Since 1969, Esri has been giving customers around the world the power to think and plan geographically. As the market leader in GIS technology, Esri software is used in more than 350,000 organizations worldwide including each of the 200 largest cities in the United States, most national governments, more than two-thirds of Fortune 500 companies, and more than 7,000 colleges and universities. Esri applications, running on more than one million desktops and thousands of web and enterprise servers, provide the backbone for the world's mapping and spatial analysis. Esri is the only vendor that provides complete technical solutions for desktop, mobile, server, and Internet platforms. Visit us at esri.com/news.

Copyright © 2016 Esri. All rights reserved. Esri, the Esri globe logo, GIS by Esri, Story Map Journal, esri.com, and @esri.com are trademarks, service marks, or registered marks of Esri in the United States, the European Community, or certain other jurisdictions. Other companies and products or services mentioned herein may be trademarks, service marks, or registered marks of their respective mark owners.

About Anaconda Powered by Continuum Analytics

Continuum Analytics is the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world. 

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations. 

Continuum Analytics' founders and developers have created and contributed to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, Pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup. 

To learn more, visit http://www.continuum.io.

###

Media Contacts:
Anaconda--
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

Esri--
Karen Richardson, Public Relations Manager, Esri
Mobile: +1 587-873-0157
Email: krichardson@esri.com

by swebster at October 11, 2016 02:27 PM

October 10, 2016

Enthought

Geophysical Tutorial: Facies Classification using Machine Learning and Python

Published in the October 2016 edition of The Leading Edge magazine by the Society of Exploration Geophysicists. Read the full article here.

By Brendon Hall, Enthought Geosciences Applications Engineer 
Coordinated by Matt Hall, Agile Geoscience

ABSTRACT

There has been much excitement recently about big data and the dire need for data scientists who possess the ability to extract meaning from it. Geoscientists, meanwhile, have been doing science with voluminous data for years, without needing to brag about how big it is. But now that large, complex data sets are widely available, there has been a proliferation of tools and techniques for analyzing them. Many free and open-source packages now exist that provide powerful additions to the geoscientist’s toolbox, much of which used to be only available in proprietary (and expensive) software platforms.

One of the best examples is scikit-learn, a collection of tools for machine learning in Python. What is machine learning? You can think of it as a set of data-analysis methods that includes classification, clustering, and regression. These algorithms can be used to discover features and trends within the data without being explicitly programmed, in essence learning from the data itself.
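
As a quick illustration of what those three method families look like in practice, here is a minimal sketch using scikit-learn on synthetic data. It is not taken from the tutorial; the dataset and model choices are arbitrary placeholders.

    # Minimal sketch of scikit-learn's three main method families on synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression, LinearRegression
    from sklearn.cluster import KMeans

    # 200 samples with 5 numeric features and a binary label.
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # Classification: learn to predict the labels from labelled examples.
    clf = LogisticRegression().fit(X, y)
    print("training accuracy:", clf.score(X, y))

    # Clustering: group the samples without looking at the labels at all.
    clusters = KMeans(n_clusters=2, random_state=0).fit_predict(X)

    # Regression: predict a continuous target (here, simply the first feature).
    reg = LinearRegression().fit(X[:, 1:], X[:, 0])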

Well logs and facies classification results from a single well.

In this tutorial, we will demonstrate how to use a classification algorithm known as a support vector machine to identify lithofacies based on well-log measurements. A support vector machine (or SVM) is a type of supervised-learning algorithm, which needs to be supplied with training data to learn the relationships between the measurements (or features) and the classes to be assigned. In our case, the features will be well-log data from nine gas wells. These wells have already had lithofacies classes assigned based on core descriptions. Once we have trained a classifier, we will use it to assign facies to wells that have not been described.
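
The paragraph above describes the full workflow; the sketch below shows roughly how the supervised step might look in scikit-learn. The file name, log-curve column names and hyperparameter values are placeholders for illustration, not necessarily those used in the tutorial.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Hypothetical training data: well-log measurements plus facies labels
    # assigned from core descriptions (column names are assumed, not the tutorial's).
    logs = pd.read_csv("training_wells.csv")
    feature_cols = ["GR", "ILD_log10", "DeltaPHI", "PHIND", "PE"]
    X = logs[feature_cols].values
    y = logs["Facies"].values

    # Hold out part of the data to estimate how well the classifier generalizes.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(X_train)

    # Train the support vector machine; C and gamma would be tuned in practice.
    clf = SVC(C=10, gamma=1).fit(scaler.transform(X_train), y_train)
    print("held-out accuracy:", clf.score(scaler.transform(X_test), y_test))

    # The trained classifier can then assign facies to wells without core descriptions:
    # blind = pd.read_csv("blind_well.csv")
    # predicted = clf.predict(scaler.transform(blind[feature_cols].values))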

See the tutorial in The Leading Edge here.

The post Geophysical Tutorial: Facies Classification using Machine Learning and Python appeared first on Enthought Blog.

by admin at October 10, 2016 07:46 PM

William Stein

RethinkDB must relicense NOW

What is RethinkDB?

RethinkDB is an INCREDIBLE, high-quality, polished open source realtime database that is easy to deploy, shard, and replicate, and that supports a reactive client programming model, which is useful for collaborative web-based applications. Shockingly, the 7-year-old company that created RethinkDB has just shut down. I am the CEO of a company, SageMath, Inc., that uses RethinkDB very heavily, so I have a strong interest in RethinkDB surviving as an independent open source project.

Three Types of Open Source Projects

There are many types of open source projects. RethinkDB was the type of open source project where most of the work was full-time, focused work done by employees of the RethinkDB company. RethinkDB is licensed under the AGPL, but the company promised to make the software available to customers under other licenses.

Academia: I started the SageMath open source math software project in 2005, which has over 500 contributors and a relatively healthy volunteer ecosystem, with about a hundred contributors to each release and many releases each year. These are mostly volunteer contributions by academics: usually grad students, postdocs, and math professors. They contribute because SageMath is directly relevant to their research, and they often contribute state-of-the-art code that implements algorithms they have created or refined as part of their research. Sage is licensed under the GPL, and that license has worked extremely well for us. Academics sometimes even get significant grants from the NSF or the EU to support Sage development.

Companies: I also started the Cython compiler project in 2007, which has had dozens of contributors and is now the de facto standard for writing or wrapping fast code for use by Python. The developers of Cython mostly work at companies (e.g., Google) as a side project in their spare time. (Here's a message today about a new release from a Cython developer, who works at Google.) Cython is licensed under the Apache License.

What RethinkDB Will Become

RethinkDB will no longer be an open source project whose development is sponsored by a single company dedicated to the project. Will it be an academic project, a company-supported project, or dead?

A friend of mine at Oxford University surveyed his academic CS colleagues about RethinkDB, and they said they had zero interest in it. Indeed, from an academic research point of view, I agree that there is nothing interesting about RethinkDB. I am a college professor myself, and I understand these people! Academic volunteer open source contributors are definitely not going to come to RethinkDB's rescue. The value in RethinkDB is not in innovative new algorithms or ideas, but in the high-quality, carefully debugged implementations of standard algorithms (largely the work of badass German programmer Daniel Mewes). The RethinkDB devs had to carefully tune each parameter in those algorithms based on extensive automated testing, user feedback, the Jepsen tests, etc.

That leaves companies. Whether or not you like or agree with this, many companies will not touch AGPL-licensed code:
"Google open source guru Chris DiBona says that the web giant continues to ban the lightning-rod AGPL open source license within the company because doing so "saves engineering time" and because most AGPL projects are of no use to the company."
This is just the way it is -- it's psychology and culture, so deal with it. In contrast, companies very frequently embrace open source code licensed under the Apache or BSD licenses, and they keep such projects alive. The extremely popular PostgreSQL database is licensed under an almost-BSD license. MySQL is freely licensed under the GPL, but there are good reasons why people buy a commercial MySQL license from Oracle. Like RethinkDB, MongoDB is AGPL-licensed, but MongoDB is happy to sell a different license to companies.

With RethinkDB today, the only option is the AGPL. This very strongly discourages use by the only group of users and developers that has any chance of keeping RethinkDB from death. If this situation is not resolved as soon as possible, I am extremely afraid that it never will be resolved. Ever. If you care about RethinkDB, you should be afraid too. Ignoring the landscape and culture of volunteer open source projects is dangerous.

A Proposal

I don't know who can make the decision to relicense RethinkDB. I don't know what is going on with investors or who is in control. I am an outsider. Here is a proposal that might provide a way out today:

PROPOSAL: Dear RethinkDB, sell me an Apache (or BSD) license to the RethinkDB source code. Make this the last thing your company sells before it shuts down. Just do it.


Hacker News Discussion

by William Stein (noreply@blogger.com) at October 10, 2016 04:03 PM

October 05, 2016

William Stein

SageMath: "it's not research"

The University of Washington (UW) mathematics department has funding for grad students to "travel to conferences". What sort of travel funding?

  • The department has some money available.
  • The UW Graduate School has some money available: they only provide funding for students giving a talk or presenting a poster.
  • The UW GPSS has some money available: contact them directly to apply (they only provide funds for "active conference participation", which I think means giving a talk, presenting a poster, or similar).

One of my two Ph.D. students at UW asked our Grad program director: "I'll be going to Joint Mathematics Meetings (JMM) to help out at the SageMath booth. Is this a thing I can get funding for?"

ANSWER: Travel funds are primarily meant to support research, so although I appreciate people helping out at the SageMath booth, I think that's not the best use of the department's money.

I think this "it's not research" perspective on the value of mathematical software is unfortunate and shortsighted. Moreover, it's especially surprising as the person who wrote the above answer has contributed substantially to the algebraic topology functionality of Sage itself, so he knows exactly what Sage is.

Sigh. Can some blessed person with an NSF grant out there pay for this grad student's travel expenses to help with the Sage booth? Or do I have to use the handful of $10, $50, etc., donations I've received over the last few months for this purpose?

by William Stein (noreply@blogger.com) at October 05, 2016 01:13 PM

September 27, 2016

Continuum Analytics news

Continuum Analytics Joins Forces with IBM to Bring Open Data Science to the Enterprise

Tuesday, September 27, 2016

Optimized Python experience empowers data scientists to develop advanced open source analytics on Spark   
 
AUSTIN, TEXAS—September 27, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced an alliance with IBM to advance open source analytics for the enterprise. Data scientists and data engineers in open source communities can now embrace Python and R to develop analytic and machine learning models in the Spark environment through its integration with IBM's Project DataWorks. 

Combining the power of IBM's Project DataWorks with Anaconda enables organizations to build high-performance Python and R data science models and visualization applications required to compete in today’s data-driven economy. The companies will collaborate on several open source initiatives including enhancements to Apache Spark that fully leverage Jupyter Notebooks with Apache Spark – benefiting the entire data science community.

“Our strategic relationship with Continuum Analytics empowers Project DataWorks users with full access to the Anaconda platform to streamline and help accelerate the development of advanced machine learning models and next-generation analytics apps,” said Ritika Gunnar, vice president, IBM Analytics. “This allows data science professionals to utilize the tools they are most comfortable with in an environment that reinforces collaboration with colleagues of different skillsets.”

By collaborating to bring about the best Spark experience for Open Data Science in IBM's Project DataWorks, enterprises are able to easily connect their data, analytics and compute with innovative machine learning to accelerate and deploy their data science solutions. 

“We welcome IBM to the growing family of industry titans that recognize Anaconda as the defacto Open Data Science platform for enterprises,” said Michele Chambers, EVP of Anaconda Business & CMO at Continuum Analytics. “As the next generation moves from machine learning to artificial intelligence, cloud-based solutions are key to help companies adopt and develop agile solutions––IBM recognizes that. We’re thrilled to be one of the driving forces powering the future of machine learning and artificial intelligence in the Spark environment.”

IBM's Project DataWorks is the industry's first cloud-based data and analytics platform that integrates all types of data to enable AI-powered decision making. With it, companies are able to realize the full promise of data by enabling data professionals to collaborate and build cognitive solutions, combining IBM data and analytics services with a growing ecosystem of data and analytics partners - all delivered on Apache Spark. Project DataWorks is designed to allow for faster development and deployment of data and analytics solutions, with self-service user experiences to help accelerate business value.

To learn more, join Bob Picciano, SVP of IBM Analytics, and Travis Oliphant, CEO of Continuum Analytics, at the IBM DataFirst Launch Event on Sept 27, 2016, at the Hudson Mercantile Building in NYC. The event is also available on livestream.

About Continuum Analytics
Continuum Analytics is the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world. 

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations. 

Our community loves Anaconda because it empowers the entire data science team––data scientists, developers, DevOps, architects and business analysts––to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services. 

Continuum Analytics' founders and developers have created and contributed to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, Pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup. 

To learn more, visit http://www.continuum.io.

###
 
Media Contact:
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com 

by swebster at September 27, 2016 12:18 PM