## November 30, 2016

### Continuum Analytics news

#### Data Science in the Enterprise: Keys to Success

Wednesday, November 30, 2016
Travis Oliphant
Chief Executive Officer & Co-Founder
Continuum Analytics

When examining the success of one of the most influential and iconic rock bands of all time, there’s no doubt that talent played a huge role. However, it would be unrealistic to attribute the phenomenon that was The Beatles to musical talents alone. Much of their success can be credited to the behind-the-scenes work of trusted advisors, managers and producers. There were many layers beneath the surface that contributed to their incredible fame—including implementing the proper team and tools to propel them from obscurity to commercial global success.

### Open Source: Where to Start

Similar to the music industry, success in Open Data Science relies heavily on many layers, including motivated data scientists, proper tools and the right vision for how to leverage data and perspective. Open Data Science is not a single technology, but a revolution within the data science community. It is an inclusive movement that connects open source tools for data science—preparation, analytics and visualization—so they can easily work together as a connected ecosystem. The challenge lies in figuring out how to successfully navigate the ecosystem and identifying the right Open Data Science enterprise vendors to partner with for the journey.

Most organizations have come to understand the value of Open Data Science, but they often struggle with how to adopt and implement it. Some select a “DIY” method when addressing open source, choosing one of the languages or tools available at low or no cost. Others augment an open source base and build proprietary technology into existing infrastructures to address data science needs.

Most organizations will engage enterprise-grade products and services when selecting other items, such as unified communication and collaboration tools, instead of opting for short-run cost-savings. For example, using consumer-grade instant messaging and mobile phones might save money this quarter, but over time this choice will end up costing an organization much more. The extra cost comes from the labor and services needed to make up for the missing enterprise features, the missing performance for enterprise use-cases and the missing support and maintenance that are essential to successful production usage.

The same standards apply for Open Data Science and the open source that surrounds this movement. While it is tempting to try to go it alone with open source and avoid paying a vendor, there are fundamental problems with that strategy that will result in delayed deliverables, staffing challenges, maintenance headaches for software and frustration when the innovative open source communities move faster than an organization can manage or in a direction that is unexpected. All of this hurts the bottom line and can be easily avoided by finding an open source vendor that can navigate the complexity and ensure the best use of what is available in Open Data Science. In the next section, we will discuss three specific reasons it is important to choose vendors that can leverage open source effectively in the enterprise.

### Finding Success: The Importance of Choosing the Right Vendor/Partner

First, look for a vendor who is contributing significantly to the open source ecosystem. An open source vendor will not only provide enterprise solutions and services on top of existing open source, but will also produce significant open source innovations themselves—building communities like PyData, as well as contributing to open source organizations like The Apache Software Foundation, NumFOCUS or Software Freedom Conservancy. In this way, the software purchase translates directly into sustainability for the entire open source ecosystem. This will also ensure that the open source vendor is plugged into where the impactful open source communities are heading.

Second, raw open source provides a fantastic foundation of innovation, but invariably does not contain all the common features necessary to adapt to an enterprise environment. Integration with disparate data sources, enterprise databases, single sign-on systems, scale-out management tools, tools for governance and control, as well as time-saving user interfaces, are all examples of things that typically do not exist in open source or exist in a very early form that lags behind proprietary offerings. Using internal resources to provide these common, enterprise-grade additions costs more money in the long run than purchasing these features from an open source vendor.

The figure on the left below shows the kinds of ad-hoc layers that a company must typically create to adapt their applications, processes and workflows to what is available in open source. These ad-hoc layers are not unique to any one business, are hard to maintain and end up costing a lot more money than a software subscription from an open source vendor that would cover these capabilities with some of their enterprise offerings.

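[Figure: left, the ad-hoc layers a company builds to adapt its applications, processes and workflows to open source; right, the same stack with an enterprise layer provided by an open source vendor.]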

The figure on the right above shows the addition of an enterprise layer that should be provided by an open source vendor. This layer can be proprietary, which will enable the vendor to build a sustainable software business that attracts investment, while it solves the fundamental adaptation problem as well. As long as the vendor is deeply connected to open source ecosystems and is constantly aware of what part of the stack is better maintained as open source, businesses receive the best of supported enterprise software without the painful lock-in and innovation gaps of traditional proprietary-only software.

Maintaining ad-hoc interfaces to open source becomes very expensive, very quickly. Each interface is typically understood by only a few people in an organization and if they leave or move to different roles, their ability to make changes evaporates. In addition, rather than amortizing the cost of these interfaces over thousands of companies like a software vendor can do, the business pays the entire cost on its own. This discussion does not yet include the opportunity cost of tying up internal resources building and maintaining these common enterprise features instead of having those internal resources work on the software that is unique to a business. The best return from scarce software development talent is on software critical to a business that gives it a unique edge. We have also not discussed the time-to-market gaps that occur when organizations try to go it alone, rather than selecting an open source vendor who becomes a strategic partner. Engaging an open source vendor who has in-depth knowledge of the technology, is committed to growing the open source ecosystem and has the ability to make the Open Data Science ecosystem work for enterprises, saves organizations significant time and money.

Finally, working with an open source vendor provides a much needed avenue for the integration services, training and long-term support that is necessary when adapting an open source ecosystem to the enterprise. Open source communities develop for many reasons, but they are typically united in a passion for rapid innovation and continual progress. Adapting the rapid pace of this innovation to the more methodical gear of enterprise value creation requires a trusted open source vendor. Long-term support of older software releases, bug fixes that are less interesting to the community but essential to enterprises and industry-specific training for data science teams are all needed to fully leverage Open Data Science in the enterprise. The right enterprise vendor will help an enterprise obtain all of this seamlessly.

### The New World Order: Adopting Open Data Science in the Enterprise

The journey to executing successful data science in the enterprise lies in the combination of the proper resources and tools. In general, in-house IT does not typically have the expertise needed to exploit the immense possibilities inherent to Open Data Science.

Open Data Science platforms, like Anaconda, are a key mechanism to adopting Open Data Science across an organization. These platforms offer differing levels of empowerment for everyone from the citizen data scientist to the global enterprise data science team. Open Data Science in the enterprise has different needs from an individual or a small business. While the free foundational core of Anaconda may be enough for the individual data explorer or the small business looking to use marketing data to target market segments, a large enterprise will typically need much more support and enterprise features in order to successfully implement open source and therefore Open Data Science across their organization. Because of this, it is critical that larger organizations identify an enterprise open source vendor to provide both support and guidance as they implement Open Data Science. This vendor should also be able to provide that enterprise layer between the applications, processes and workflows that the data science team produces and the diverse open source ecosystem. The complexity inherent to this process of maximizing insights from data will demand proficiency from both the team and vendors, in order to harness the power of the data to transform the business to one that is first data-aware and then data-driven.

Anaconda allows enterprises to innovate faster. It exposes previously unknown insights and improves the relationship between all members of the data science team. As a platform that embraces and deeply supports open source, it helps businesses to take full advantage of both the innovation at the core of the Open Data Science movement, as well as the enterprise adaptation that is essential to leveraging the full power of open source effectively in the business. It’s time to remove the chaos from open source and use Open Data Science platforms to simplify things, so that enterprises can fully realize their own superpowers to change the world.

## November 26, 2016

### Titus Brown

#### Quickly searching all the microbial genomes, mark 2 - now with archaea, phage, fungi, and protists!

This is an update to last week's blog post, "Efficiently searching MinHash Sketch collections".

Last week, Thanksgiving travel and post-turkey somnolescence gave me some time to work more with our combined MinHash/SBT implementation. One of the main things the last post contained was a collection of MinHash signatures of all of the bacterial genomes, together with a Sequence Bloom Tree index of them that enabled fast searching.

Working with the index from last week, a few problems emerged:

• In my initial index calculation, I'd ignored non-bacterial microbes. Conveniently my colleague Dr. Jiarong (Jaron) Guo had already downloaded the viral, archaeal, and protist genomes from NCBI for me.

• The MinHashes I'd calculated contained only the filenames of the genome assemblies, and didn't contain the name or accession numbers of the microbes. This made them really annoying to use.

(See the new --name-from-first argument to sourmash compute.)

• We guessed that we wanted more sensitive MinHash sketches for all the things, which would involve re-calculating the sketches with more hashes. (The default is 500, which gives you one hash per 10,000 k-mers for a 5 Mbp genome.)

• We also decided that we wanted more k-mer sizes; the sourmash default is 31, which is pretty specific and could limit the sensitivity of genome search. k=21 would enable more sensitivity, k=51 would enable more stringency.

• I also came up with some simple ideas for using MinHash for taxonomy breakdown of metagenome samples, but I needed the number of k-mers in each hashed genome to do a good job of this. (More on this later.)

(See the new --with-cardinality argument to sourmash compute.)
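To make the sketch-density arithmetic above concrete, here is a small back-of-the-envelope calculation. It is only an illustration: the helper name `kmers_per_hash` is made up for this example, and it treats the genome as a single linear sequence, which is a simplification.

```python
# Rough sketch density: how many k-mers each retained hash stands for.
# The 5 Mbp genome and the 500-hash default come from the text above;
# the helper itself is hypothetical and not part of sourmash.

def kmers_per_hash(genome_bp: int, ksize: int, num_hashes: int) -> float:
    """Approximate number of k-mers represented by each hash in a sketch."""
    n_kmers = genome_bp - ksize + 1  # k-mers in one linear sequence
    return n_kmers / num_hashes


for ksize in (21, 31, 51):
    density = kmers_per_hash(5_000_000, ksize, 500)
    print(f"k={ksize}: roughly one hash per {density:,.0f} k-mers")
```

The k-mer size barely changes the count for a genome of this length; what changes the density is the number of hashes kept, which is why more sensitive sketches mean recalculating with more hashes.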

Unfortunately this meant I had to recalculate MinHashes for 52,000 genomes, and calculate them for 8,000 new genomes. And it wasn't going to take only 36 hours this time, because I was calculating approximately 6 times as much stuff...

Fortunately, 6 x 36 hrs still isn't very long, especially when you're dealing with pleasantly parallel low-memory computations. So I set it up to run on Friday, and ran six processes at the same time, and it finished in about 36 hours.
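For the curious, running several sketch computations side by side can be approximated with a few lines of Python. This is a minimal sketch, not the script that was actually used: it only uses `sourmash compute` flags that appear in this post, the `genomes/` directory in the usage comment is a placeholder, and `build_command` and `run_all` are names invented for this example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(path, ksizes="21,31"):
    """Build one `sourmash compute` call using only flags shown in this post."""
    return [
        "sourmash", "compute",
        "-k", ksizes,
        "-f",
        "--name-from-first",
        "--with-cardinality",
        path,
    ]


def run_all(paths, workers=6, runner=None):
    """Run one command per genome file, at most `workers` at a time."""
    if runner is None:
        def runner(cmd):
            # Raises CalledProcessError if sourmash exits with an error.
            return subprocess.run(cmd, check=True)
    commands = [build_command(p) for p in paths]
    # Threads are enough here: each one just waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))


# Usage (needs sourmash on PATH; the directory name is hypothetical):
#   import glob
#   run_all(sorted(glob.glob("genomes/*.fna.gz")))
```

Because each genome is processed independently and needs little memory, the number of workers can simply follow the number of available cores.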

Indexing the MinHash signatures also took much longer than the first batch, probably because the signature files were much larger and hence took longer to load. For k=21, it took about 5 1/4 hours, and 6.5 GB of RAM, to index the 60,000 signatures. The end index -- which includes the signatures themselves -- is around 3.2 GB for each k-mer size. (Clearly if we're going to do this for the entire SRA we'll have to optimize things a bit.)

On the search side, though, searching takes roughly the same amount of time as before, because the indexed part of the signatures isn't much larger, and the Bloom filter internal nodes are the same size as before. But we can now search at k=21, and get better named results than before, too.

For example, go grab the Shewanella MR-1 genome:

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/165/GCF_000146165.2_ASM14616v2/GCF_000146165.2_ASM14616v2_genomic.fna.gz > shewanella.fna.gz

Next, convert it into a signature:

sourmash compute -k 21,31 -f --name-from-first shewanella.fna.gz

and search!

sourmash sbt_search -k 21 microbes shewanella.fna.gz.sig

This yields:

# running sourmash subcommand: sbt_search
1.00 NC_004347.2 Shewanella oneidensis MR-1 chromosome, complete genome
0.16 NZ_JGVI01000001.1 Shewanella xiamenensis strain BC01 contig1, whole genome shotgun sequence
0.16 NZ_LGYY01000235.1 Shewanella sp. Sh95 contig_1, whole genome shotgun sequence
0.15 NZ_AKZL01000001.1 Shewanella sp. POL2 contig00001, whole genome shotgun sequence
0.15 NZ_JTLE01000001.1 Shewanella sp. ZOR0012 L976_1, whole genome shotgun sequence
0.09 NZ_AXZL01000001.1 Shewanella decolorationis S12 Contig1, whole genome shotgun sequence
0.09 NC_008577.1 Shewanella sp. ANA-3 chromosome 1, complete sequence
0.08 NC_008322.1 Shewanella sp. MR-7, complete genome

## The updated MinHash signatures & indices are available!

Our MinHash signature collection now contains:

1. 53865 bacteria genomes
2. 5463 viral genomes
3. 475 archaeal genomes
4. 177 fungal genomes
5. 72 protist genomes

for a total of 60,052 genomes.

--titus

Index building cost for k=21:

Command being timed: "/home/ubuntu/sourmash/sourmash sbt_index microbes -k 21 --traverse-directory microbe-sigs-2016-11-27/"
User time (seconds): 18815.48
System time (seconds): 80.81
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:15:09
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 6484264
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 7
Minor (reclaiming a frame) page faults: 94887308
Voluntary context switches: 5650
Involuntary context switches: 27059
Swaps: 0
File system inputs: 150624
File system outputs: 10366408
Socket messages sent: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

## November 21, 2016

### Paul Ivanov

#### November 9th, 2016

Two weeks ago, I went down to San Luis Obispo, California for a five-day Jupyter team meeting with about twenty-five others. This was the first such meeting since my return after being away for two years, and I enjoyed meeting some of the "newer" faces, as well as catching up with old friends.

It was both a productive and an emotionally challenging week, as the project proceeds along at breakneck pace on some fronts yet continues to face growing pains which come from having to scale in the human dimension.

On Wednesday, November 9th, 2016, we spent a good chunk of the day at a nearby beach: chatting, decompressing, and luckily I brought my journal with me and was able to capture the poem you will find below. I intended to read it at a local open mic the same evening, but by the time I got there with a handful of fellow Jovyans for support, all of the slots were taken. On Friday, the last day of our meeting, I got the opportunity to read it to most of the larger group. Here's a recording of that reading, courtesy of Matthias Bussonnier (thanks, Matthias!).

# November 9th, 2016

The lovely thing about the ocean is
that it
is
tireless
It never stops
incessant pendulum of salty foamy slush
Periodic and chaotic
raw, serene
Marine grandmother clock
crashing against both pier
and rock

Statuesque encampment of abandonment
recoiling with force
and blasting forth again
No end in sight
a train forever riding forth
and back
along a line
refined yet undefined
the spirit with
which it keeps time
in timeless unity of the moon's alignment

I. walk. forth.

Forth forward by the force
of obsolete contrition
the vision of a life forgotten
Excuses not
made real with sand, wet and compressed
beneath my heel and toes, yet reeling from
the blinding glimmer of our Sol
reflected by the glaze of distant hazy surf
upon whose shoulders foam amoebas roam

It's gone.
Tone deaf and muted by

anticipation
each coming wave
breaks up the pregnant pause
And here I am, barefoot in slacks and tie
experiencing sensations
of loss, rebirth and seldom
kelp bulbs popping in my soul.

## November 18, 2016

### Titus Brown

#### Efficiently searching MinHash Sketch collections

There is an update to this blog post: please see "Quickly searching all the microbial genomes, mark 2 - now with archaea, phage, fungi, and protists!"

Note: This blog post is based largely on work done by Luiz Irber. Camille Scott, Luiz Irber, Lisa Cohen, and Russell Neches all collaborated on the SBT software implementation!

Note 2: Adam Phillippy points out in the comments below that they suggested using SBTs in the mash paper, which I reviewed. Well, they were right :)

---

We've been pretty enthusiastic about MinHash Sketches over here in Davis (read here and here for background, or go look at mash directly), and I've been working a lot on applying them to metagenomes. Meanwhile, Luiz Irber has been thinking about how to build MinHash signatures for all the data.

A problem that Luiz and I both needed to solve is the question of how you efficiently search hundreds, thousands, or even millions of MinHash Sketches. I thought about this on and off for a few months but didn't come up with an obvious solution.

Luckily, Luiz is way smarter than me and quickly figured out that Sequence Bloom Trees were the right answer. Conveniently as part of my review of Solomon and Kingsford (2015) I had put together a BSD-compatible SBT implementation in Python. Even more conveniently, my students and colleagues at UC Davis fixed my somewhat broken implementation, so we had something ready to use. It apparently took Luiz around a nanosecond to write up a Sequence Bloom Tree implementation that indexed, saved, loaded, and searched MinHash sketches. (I don't want to minimize his work - that was a nanosecond on top of an awful lot of training and experience. :)

## Sequence Bloom Trees can be used to search many MinHash sketches

Briefly, an SBT is a binary tree where the leaves are collections of k-mers (here, MinHash sketches) and the internal nodes are Bloom filters containing all of the k-mers in the leaves underneath them.

Here's a nice image from Luiz's notebook: here, the leaf nodes are MinHash signatures from our sea urchin RNAseq collection, and the internal nodes are khmer Nodegraph objects containing all the k-mers in the MinHashes beneath them.

These images can be very pretty for larger collections!

The basic idea is that you build the tree once, and then to search it you prune your search by skipping over internal nodes that DON'T contain k-mers of interest. As usual for this kind of search, if you search for something that is only in a few leaves, it's super efficient; if you search for something in a lot of leaves, you have to walk over lots of the tree.
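The pruning idea can be shown with a toy tree. The following is a minimal sketch under several assumptions, not the sourmash or khmer implementation: leaves hold raw k-mer sets rather than MinHash sketches, the Bloom filter is a simple hand-rolled one with arbitrary size and hash count, and every class and function name here is invented for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size=10_000, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

    def update(self, other):
        """Union with another filter of the same size and hash count."""
        for i, bit in enumerate(other.bits):
            if bit:
                self.bits[i] = 1


class Node:
    """A leaf if `kmers` is given, otherwise an internal node over two children."""

    def __init__(self, name=None, kmers=None, left=None, right=None):
        self.name, self.kmers = name, kmers
        self.left, self.right = left, right
        self.bloom = BloomFilter()
        if kmers is not None:
            for kmer in kmers:
                self.bloom.add(kmer)
        else:
            self.bloom.update(left.bloom)
            self.bloom.update(right.bloom)


def kmer_set(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_tree(named_kmers):
    """Pair nodes level by level until a single root remains."""
    nodes = [Node(name=name, kmers=kmers) for name, kmers in named_kmers]
    while len(nodes) > 1:
        paired = [Node(left=nodes[i], right=nodes[i + 1])
                  for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:
            paired.append(nodes[-1])  # odd node moves up unpaired
        nodes = paired
    return nodes[0]


def search(node, query, threshold):
    """Return (jaccard, name) for leaves at or above the threshold."""
    if not query:
        return []
    found = sum(1 for kmer in query if kmer in node.bloom)
    if found / len(query) < threshold:
        return []  # too few query k-mers below this node: skip the subtree
    if node.kmers is not None:
        similarity = len(query & node.kmers) / len(query | node.kmers)
        return [(similarity, node.name)] if similarity >= threshold else []
    return (search(node.left, query, threshold)
            + search(node.right, query, threshold))


genomes = {"a": "ACGTACGTGGCA", "b": "TTTTGGGGCCCC",
           "c": "ACGTACGTGGTT", "d": "GATTACAGATTACA"}
tree = build_tree([(name, kmer_set(seq, 4)) for name, seq in genomes.items()])
for similarity, name in sorted(search(tree, kmer_set("ACGTACGTGGCA", 4), 0.5),
                               reverse=True):
    print(f"{similarity:.2f} {name}")
```

Pruning is safe here because a Bloom filter never reports a present k-mer as absent, so the fraction of query k-mers found at an internal node is an upper bound on the similarity of any leaf beneath it. False positives only cost extra tree walking; the exact comparison at the leaf decides what is reported.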

This idea was so obviously good that I jumped on it and integrated Luiz's SBT functionality into sourmash, our Python library for calculating and searching MinHash sketches. The pull request is still open -- more on that below -- but the PR currently adds two new functions, sbt_index and sbt_search, to index and search collections of sketches.

## Using sourmash to build and search MinHash collections

Starting from a blank Ubuntu 15.10 install, run:

sudo apt-get update && sudo apt-get -y install python3.5-dev \
python3-virtualenv python3-matplotlib python3-numpy g++ make

then create a new virtualenv,

cd
python3.5 -m virtualenv env -p python3.5 --system-site-packages
. env/bin/activate

You'll need to install a few things, including a recent version of khmer:

pip install screed pytest PyYAML
pip install git+https://github.com/dib-lab/khmer.git

Next, grab the sbt_search branch of sourmash:

cd
git clone https://github.com/dib-lab/sourmash.git -b sbt_search

and then build & install sourmash:

cd sourmash && make install

Once it's installed, you can index any collection of signatures like so:

cd ~/sourmash
sourmash sbt_index urchin demo/urchin/{var,purp}*.sig

It takes me about 4 seconds to load 70-odd sketches into an sbt index named 'urchin'.

Now, search!

This sig is in the index and takes about 1.6 seconds to find:

sourmash sbt_search urchin demo/urchin/variegatus-SRR1661406.sig

Note you can adjust the search threshold, in which case the search truncates appropriately and takes about 1 second:

sourmash sbt_search urchin demo/urchin/variegatus-SRR1661406.sig --threshold=0.3

This next sig is not in the index and the search takes about 0.2 seconds (which is basically how long it takes to load the tree structure and search the tree root).

sourmash sbt_search urchin demo/urchin/leucospilota-DRR023762.sig

How well does this scale? Suppose, just hypothetically, that you had, oh, say, a thousand bacterial genome signatures lying around and you wanted to index and search them?

mkdir bac
cd bac
curl -O http://teckla.idyll.org/~t/transfer/sigs1k.tar.gz
tar xzf sigs1k.tar.gz

# index
time sourmash sbt_index 1k *.sig
time sourmash sbt_search 1k GCF_001445095.1_ASM144509v1_genomic.fna.gz.sig

Here, the indexing takes about a minute, and the search takes about 5 seconds (mainly because there are a lot of closely related samples).

The data set sizes are nice and small -- the 1,000 signatures are 4 MB compressed and 12 MB uncompressed, the SBT index is about 64 MB, and this is all representing about 5 Gbp of genomic sequence. (We haven't put any time or effort into optimizing the index so things will only get smaller and faster.)

## How far can we push it?

There's lots of bacterial genomes out there, eh? Be an AWFUL SHAME if someone INDEXED them all for search, wouldn't it?

Jiarong Guo, a postdoc split between my lab and Jim Tiedje's lab at MSU, helpfully downloaded 52,000 bacterial genomes from NCBI for another project. So I indexed them with sourmash.

Indexing 52,000 bacterial genomes took about 36 hours on the MSU HPC, or about 2.5 seconds per genome. This produced about 1 GB of uncompressed signature files, which in tar.gz form ends up being about 208 MB.

I loaded them into an SBT like so:

curl -O http://spacegraphcats.ucdavis.edu.s3.amazonaws.com/bacteria-sourmash-signatures-2016-11-19.tar.gz
tar xzf bacteria-sourmash-signatures-2016-11-19.tar.gz
/usr/bin/time -v sourmash sbt_index bacteria --traverse-directory bacteria-sourmash-signatures-2016-11-19

The indexing step took about 53 minutes on an m4.xlarge EC2 instance, and required 4.2 GB of memory. The resulting tree was about 4 GB in size. (Download the 800 MB tar.gz here; just untar it somewhere.)

Searching all of the bacterial genomes for matches to one genome in particular took about 3 seconds (and found 31 matches). It requires only 100 MB of RAM, because it uses on-demand loading of the tree. To try it out yourself, run:

sourmash sbt_search bacteria bacteria-sourmash-signatures-2016-11-19/GCF_000006965.1_ASM696v1_genomic.fna.gz.sig

I'm sure we can speed this all up, but I have to say that's already pretty workable :).

Again, you can download the 800 MB .tar.gz containing the SBT for all bacterial genomes here: bacteria-sourmash-sbt-2016-11-19.tar.gz.

## Example use case: finding genomes close to Shewanella oneidensis MR-1

What would you use this for? Here's an example use case.

Suppose you were interested in genomes with similarity to Shewanella oneidensis MR-1.

First, go to the S. oneidensis MR-1 assembly page, click on the "Assembly:" link, and find the genome assembly .fna.gz file.

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/165/GCF_000146165.2_ASM14616v2/GCF_000146165.2_ASM14616v2_genomic.fna.gz > shewanella.fna.gz

Next, convert it into a signature:

sourmash compute -f shewanella.fna.gz

(which takes 2-3 seconds to produce shewanella.fna.gz.sig).

And, now, search with your new signature:

sourmash sbt_search bacteria shewanella.fna.gz.sig

which produces this output:

# running sourmash subcommand: sbt_search
1.00 ../GCF_000146165.2_ASM14616v2_genomic.fna.gz
0.09 ../GCF_000712635.2_SXM1.0_for_version_1_of_the_Shewanella_xiamenensis_genome_genomic.fna.gz
0.09 ../GCF_001308045.1_ASM130804v1_genomic.fna.gz
0.08 ../GCF_000282755.1_ASM28275v1_genomic.fna.gz
0.08 ../GCF_000798835.1_ZOR0012.1_genomic.fna.gz

telling us that not only is the original genome in the bacterial collection (the one with a similarity of 1!) but there are four other genomes in it with about 9% similarity. These are other (distant) strains of Shewanella. The reason the similarity is so small is that sourmash is by default looking at k-mer sizes of 31, so we're asking how many k-mers of length 31 are in common between the two genomes.
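To see why longer k-mers push the similarity down, here is a minimal sketch of a bottom-k MinHash estimate of Jaccard similarity between k-mer sets. It is an illustration under stated assumptions rather than what sourmash does internally: the hash function is MD5 truncated to 64 bits, reverse complements are ignored, the "genomes" are random strings with about 1% of positions redrawn, and all function names are invented here.

```python
import hashlib
import random


def kmer_set(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    # 64-bit hash from MD5; chosen for convenience, not for speed.
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")


def sketch(seq, k, num_hashes=500):
    """Bottom-k sketch: the num_hashes smallest distinct k-mer hashes."""
    return sorted({hash_kmer(kmer) for kmer in kmer_set(seq, k)})[:num_hashes]


def estimate_jaccard(sketch_a, sketch_b, num_hashes=500):
    """Estimate |A & B| / |A | B| from two bottom-k sketches."""
    a, b = set(sketch_a), set(sketch_b)
    smallest_of_union = sorted(a | b)[:num_hashes]
    if not smallest_of_union:
        return 0.0
    shared = sum(1 for h in smallest_of_union if h in a and h in b)
    return shared / len(smallest_of_union)


random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(20_000))
# Redraw about 1% of positions to imitate a closely related genome.
relative = "".join(random.choice("ACGT") if random.random() < 0.01 else base
                   for base in genome)

for k in (21, 31):
    estimate = estimate_jaccard(sketch(genome, k), sketch(relative, k))
    exact = (len(kmer_set(genome, k) & kmer_set(relative, k))
             / len(kmer_set(genome, k) | kmer_set(relative, k)))
    print(f"k={k}: estimated {estimate:.2f}, exact {exact:.2f}")
```

A single changed base disrupts every k-mer that overlaps it, so a longer k loses more shared k-mers per difference. That is the trade-off between sensitivity at small k and stringency at large k.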

With little modification (k-mer error trimming), this same pipeline can be used on unassembled FASTQ sequence; streaming classification of FASTQ reads and metagenome taxonomy breakdown are simple extensions and are left as exercises for the reader.

## What's next? What's missing?

This is all still early days; the code's not terribly well tested and a lot of polishing needs to happen. But it looks promising!

I still don't have a good sense for exactly how people are going to use MinHashes. A command line implementation is all well and good but some questions come to mind:

• what's the right output format? Clearly a CSV output format for the searching is in order. Do people want a scripting interface, or a command line interface, or what?
• related - what kind of structured metadata should we support in the signature files? Right now it's pretty thin, but if we do things like sketch all of the bacterial genomes and all of the SRA, we should probably make sure we put in some of the metadata :).
• what about a tagging interface so that you can subselect types of nodes to return?
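On the CSV question, one stopgap is to post-process the plain-text search output. This is a minimal sketch, assuming each result line is a similarity followed by whitespace and a name, as in the search output shown for the Shewanella example; the column names and the function `search_output_to_csv` are hypothetical and not part of sourmash.

```python
import csv
import io

SAMPLE = """\
# running sourmash subcommand: sbt_search
1.00 NC_004347.2 Shewanella oneidensis MR-1 chromosome, complete genome
0.16 NZ_JGVI01000001.1 Shewanella xiamenensis strain BC01 contig1, whole genome shotgun sequence
0.09 NC_008577.1 Shewanella sp. ANA-3 chromosome 1, complete sequence
"""


def search_output_to_csv(text):
    """Convert 'similarity name' lines into CSV with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["similarity", "name"])
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the status comment
        similarity, name = line.split(None, 1)
        writer.writerow([similarity, name])  # csv quotes names with commas
    return out.getvalue()


print(search_output_to_csv(SAMPLE))
```

Names from NCBI often contain commas, which is one reason to let the `csv` module handle quoting instead of joining fields by hand.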

If you are a potential user, what do you want to do with large collections of MinHash sketches?

On the developer side, we need to:

• test, refactor, and polish the SBT stuff;
• think about how best to pick Bloom filter sizes automatically;
• benchmark and optimize the indexing;
• make sure that we interoperate with mash;
• evaluate the SBT approach on 100s of thousands of signatures, instead of just 50,000.

and probably lots of things I'm forgetting...

--titus

p.s. Output of /usr/bin/time -v on indexing 52,000 bacterial genome signatures:

Command being timed: "sourmash sbt_index bacteria --traverse-directory bacteria-sourmash-signatures-2016-11-19"
User time (seconds): 3192.58
System time (seconds): 14.66
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 53:35.72
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4279056
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 8014404
Voluntary context switches: 972
Involuntary context switches: 5742
Swaps: 0
File system inputs: 0
File system outputs: 6576144
Socket messages sent: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

### Continuum Analytics news

#### We Are Thankful

Friday, November 18, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

It’s hard to believe it but it’s almost time to baste the turkey, mash the potatoes and take a moment to reflect on what we are thankful for this year amongst our family and friends. Good health? A job we actually enjoy? Our supportive family? While our personal reflections are of foremost importance, as a proud leader in the Open Data Science community, we’re thankful for advancements and innovations that contribute to the betterment of the world. This Thanksgiving, we give thanks to...

1. Data. Though Big Data gave us the meat with which to collect critical information, until recently, the technology needed to make sense of the huge amount of data was either disparate or accessible only to the most technologically advanced companies in the world (translation: barely anyone). Today, we have the ability to extract actionable insights from the infinite amounts of data that literally drive the way people and businesses make decisions.

2. Our data science teams. We’re thankful there is no “i” in team. While we may have all the data in the world available to us, without adding the element of intelligent human intuition, it would be devoid of the endless value it provides. Our strong, versatile team members––including data scientists, business analysts, data engineers, devops and developers––are what gets us up in the morning and out the door to work. Being a part of this tight-knit community that offers immense support makes us grateful for the opportunity to do what we do.

3. New, innovative ideas. We keep our fingers on the pulse of enterprise happenings. Our customers afford us the opportunity to contribute to incredible, previously impossible tech breakthroughs. We’re thankful for the ability to exchange ideas with colleagues and constantly stand on the edge of change.

4. The opportunity to help others change the world. From combatting rare genetic diseases and eradicating human trafficking to predicting the effects of public policy, we’re thankful for the opportunity to work with companies who are using Anaconda to bring to life amazing new solutions that truly make a difference in the world. They keep us inspired and help to fuel the seemingly endless innovation made possible by the Open Data Science community.

5. The Anaconda community. Last but not least, we are thankful for the robust, rapidly growing Anaconda community that keeps us connected with other data science teams around the globe. Collaboration is key. Helping others discover, analyze and learn by connecting curiosity and experience is one of our main passions. We are grateful for the wonderment of innovation we see passing through on a daily basis.

As the late, great Arthur C. Nielsen once said, “the price of light is less than the cost of darkness.” We agree.

Happy Thanksgiving!

## RethinkDB and sustainable business models

Three weeks ago, I spent the evening of Sept 12, 2016 with Daniel Mewes, who is the lead engineer of RethinkDB (an open source database). I was also supposed to meet with the co-founders, Slava and Michael, but they were too busy fundraising and couldn't join us. I pestered Daniel the whole evening about what RethinkDB's business model actually was. Yesterday, on October 6, 2016, RethinkDB shut down.

I met with some RethinkDB devs because an investor who runs a fund at the VC firm Andreessen-Horowitz (A16Z) had kindly invited me there to explain my commercialization plans for SageMath, Inc., and RethinkDB is one of the companies that A16Z has invested in. At first, I wasn't going to take the meeting with A16Z, since I have never met with Venture Capitalists before, and do not intend to raise VC. However, some of my advisors convinced me that VCs can be very helpful even if you never intend to take their investment, so I accepted the meeting.

In the first draft of my slides for my presentation to A16Z, I had a slide with the question: "Why do you fund open source companies like RethinkDB and CoreOS, which have no clear (to me) business model? Is it out of some sense of charity to support the open source software ecosystem?" After talking with people at Google and the RethinkDB devs, I removed that slide, since charity is clearly not the answer (I don't know if there is a better answer than "by accident").

I have used RethinkDB intensely for nearly two years, and I might be their biggest user in some sense. My product SageMathCloud, which provides web-based course management, Python, R, Latex, etc., uses RethinkDB for everything. For example, every single time you enter some text in a realtime synchronized document, a RethinkDB table gets an entry inserted in it. I have RethinkDB tables with nearly 100 million records. I gave a talk at a RethinkDB meetup, filed numerous bug reports, and have been described by them as "their most unlucky user". In short, in 2015 I bet big on RethinkDB, just like I bet big on Python back in 2004 when starting SageMath. And when visiting the RethinkDB devs in San Francisco (this year and also last year), I have said to them many times "I have a very strong vested interest in you guys not failing." My company SageMath, Inc. also pays RethinkDB for a support contract.

Sustainable business models were very much on my mind, because of my upcoming meeting at A16Z and the upcoming board meeting for my company. SageMath, Inc.'s business model involves making money from subscriptions to SageMathCloud (which is hosted on Google Cloud Platform); of course, there are tons of details about exactly how our business works, which we've been refining based on customer feedback. Though absolutely all of our software is open source, what we sell is convenience, ease of access and use, and we provide value by hosting hundreds of courses on shared infrastructure, so it is much cheaper and easier for universities to pay us rather than hosting our software themselves (which is also fairly easy). So that's our business model, and I would argue that it is working; at least our MRR is steadily increasing and is more than twice our hosting costs (we are not cash flow positive yet due to developer costs).

So far as I can determine, the business model of RethinkDB was to make money in the following ways:

1. Sell support contracts to companies (I bought one).

2. Sell a closed-source proprietary version of RethinkDB with extra features that were of interest to enterprise (they had a handful of such features, e.g., audit logs for queries).

3. Horizon would become a cloud-hosted competitor to Firebase, with unique advantages that users have the option to migrate from the cloud to their own private data center, and more customizability. This strategy depends on a trend for users to migrate away from the cloud, rather than to it, which some people at RethinkDB thought was a real trend (I disagree).

I don't know of anything else they were seriously trying right now. The closed-source proprietary version of RethinkDB also seemed like a very recent last ditch effort that had only just begun; perhaps it directly contradicted a desire to be a 100% open source company?

With enough users, it's easier to make certain business models work. I suspect RethinkDB does not have a lot of real users. Number of users tends to be roughly linearly related to mailing list traffic, and the RethinkDB mailing list has an order of magnitude less traffic compared to the SageMath mailing lists, and SageMath has around 50,000 users. RethinkDB wasn't even advertised to be production ready until just over a year ago, so even they were telling people not to use it seriously until relatively recently. The adoption cycle for database technology is slow -- people wisely wait for Aphyr's tests, benchmarks comparing with similar technology, etc. I was unusual in that I chose RethinkDB much earlier than most people would, since I love the design of RethinkDB so much. It's the first database I loved, having seen a lot over many decades.

Conclusion: RethinkDB wasn't a real business, and wouldn't become one without year(s) more runway.

I'm also very worried about the future of RethinkDB as an open source project. I don't know if the developers have experience growing an open source community of volunteers; it's incredibly hard and it's unclear they are even going to be involved. At a bare minimum, I think they must switch to a very liberal license (Apache instead of AGPL), and make everything (e.g., automated testing code, documentation, etc.) open source. It's insanely hard getting any support for open source infrastructure work -- support mostly comes from small government grants (for research software) or contributions from employees at companies (that use the software). Relicensing in a company-friendly way is thus critical.

## Company Incentives

Companies can be incentivized in various ways, including:
• to get to the next round of VC funding,
• to be a sustainable profitable business by making more money from customers than they spend, or
• to grow to have a very large number of users and somehow pivot to making money later.

When founding a company, you have a chance to choose how your company will be incentivized based on how much risk you are willing to take, the resources you have, the sort of business you are building, the current state of the market, and your model of what will happen in the future.

For me, SageMath is an open source project I started in 2004, and I'm in it for the long haul. I will make the business I'm building around SageMathCloud succeed, or I will die trying -- therefore I have very, very little tolerance for risk. Failure is not an option, and I am not looking for an exit. For me, the strategy that best matches my values is to incentivize my company to build a profitable business, since that is most likely to survive, and also to give us the freedom to maintain our long-term support for open source and pure mathematics software.

Thus for my company, neither optimizing for raising the next round of VC nor growing at all costs makes sense. You would be surprised how many people think I'm completely wrong for concluding this.

## Andreessen-Horowitz

I spent the evening with RethinkDB developers, which scared the hell out of me regarding their business prospects. They are probably the most open source friendly VC-funded company I know of, and they had given me hope that it is possible to build a successful VC-funded tech startup around open source. I prepared for my meeting at A16Z, and deleted my slide about RethinkDB.

I arrived at A16Z, and was greeted by incredibly friendly people. I was a little shocked when I saw their nuclear bomb art in the entry room, then went to a nice little office to wait. The meeting time arrived, and we went over my slides, and I explained my business model, goals, etc. They said there was no place for A16Z to invest directly in what I was planning to do, since I was very explicit that I'm not looking for an exit, and my plan about how big I wanted the company to grow in the next 5 years wasn't sufficiently ambitious. They were also worried about how small the total market cap of Mathematica and Matlab is (only a few hundred million?!). However, they generously and repeatedly offered to introduce me to more potential angel investors.

We argued about the value of outside investment to the company I am trying to build. I had hoped to get some insight or introductions related to their portfolio companies that are of interest to my company (e.g., Udacity, GitHub), but they deflected all such questions. There was also some confusion, since I showed them slides about what I'm doing, but was quite clear that I was not asking for money, which is not what they are used to. In any case, I greatly appreciated the meeting, and it really made me think. They were crystal clear that they believed I was completely wrong to not be trying to do everything possible to raise investor money.

## Basecamp

During the first year of SageMath, Inc., I was planning to raise a round of VC, and was doing everything to prepare for that. I then read some of DHH's books about Basecamp, and realized many of those arguments applied to my situation, given my values, and -- after a lot of reflection -- I changed my mind. I think Basecamp itself is mostly closed source, so they may have an advantage in building a business. SageMathCloud (and SageMath) really are 100% open source, and building a completely open source business might be harder. Our open source IP is considered worthless by investors. Witness: RethinkDB just shut down and Stripe hired just the engineers -- all the IP, customers, etc., of RethinkDB was evidently considered worthless by investors.

The day after the A16Z meeting, I met with my board, which went well (we discussed a huge range of topics over several hours). Some of the board members also tried hard to convince me that I should raise a lot more investor money.

## Will Poole: you're doomed

Two weeks ago I met with Will Poole, who is a friend of a friend, and we talked about my company and plans. I described what I was doing, that everything was open source, that I was incentivizing the company around building a business rather than raising investor money. He listened and asked a lot of follow-up questions, making it very clear he understands building a company very, very well.

His feedback was discouraging -- I said "So, you're saying that I'm basically doomed." He responded that I wasn't doomed, but might be able to run a small "lifestyle business" at best via my approach, but there was absolutely no way that what I was doing would have any impact or pay for my kids' college tuition. If this was feedback from some random person, it might not have been so disturbing, but Will Poole joined Microsoft in 1996, where he went on to run Microsoft's multibillion dollar Windows business. Will Poole is like a retired four-star general that executed a successful campaign to conquer the world; he has been around the block a few times. He tried pretty hard to convince me to make as much of SageMathCloud closed source as possible, and to try to convince my users to make content they create in SMC something that I can reuse however I want. I felt pretty shaken and convinced that I needed to close parts of SMC, e.g., the new Kubernetes-based backend that we spent all summer implementing. (Will: if you read this, though our discussion was really disturbing to me, I really appreciate it and respect you.)

My friend, who introduced me to Will Poole, introduced me to some other people and described me as that really frustrating sort of entrepreneur who doesn't want investor money. He then remarked that one of the things he learned in business school, which really surprised him, was that it is good for a company to have a lot of debt. I gave him a funny look, and he added "of course, I've never run a company".

I left that meeting with Will convinced that I would close source parts of SageMathCloud, to make things much more defensible. However, after thinking things through for several days, and talking this over with other people involved in the company, I have chosen not to close anything. This just makes our job harder. Way harder. But I'm not going to make any decisions based purely on fear. I don't care what anybody says, I do not think it is impossible to build an open source business (I think WordPress is an example), and I do not need to raise VC.

Hacker News Discussion: https://news.ycombinator.com/item?id=12663599

Chinese version: http://www.infoq.com/cn/news/2016/10/Reflection-sustainable-profit-co

### Continuum Analytics news

#### DataCamp’s Online Platform Fuels the Future of Data Science, Powered By Anaconda

Thursday, November 17, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

There’s no doubt that the role of ‘data scientist’ is nearing a fever pitch as companies become increasingly data-driven. In fact, the position ranked number one on Glassdoor’s top jobs in 2016, and in 2012, HBR dubbed it “The Sexiest Job of the 21st Century.” Yet, while more organizations are adopting data science, there exists a shortage of people with the right training and skills to fill the role. This challenge is being met by our newest partner, DataCamp, a data science learning platform focused on cultivating the next generation of data scientists.

DataCamp’s interactive learning environment today launched the first of four Anaconda-based courses taught by Anaconda experts—Interactive Visualization with Bokeh. Our experts—both in academia and in the data science industry—provide users with maximum insight. While we’re proud to partner with companies representing various verticals, it is especially thrilling to contribute toward the creation of new data scientists, including citizen data scientists, both of which are extremely valued in the business community.

Research finds that 88 percent of professionals say online learning is more helpful than in-person training; DataCamp has already trained over 620,000 aspiring data scientists. Of the four new Anaconda-based courses, two are interactive trainings. This allows DataCamp to offer students the opportunity to benefit from unprecedented breadth and depth of online learning, leading to highly skilled, next-gen data scientists.

The data science revolution is growing by the day and DataCamp is poised to meet the challenge of scarcity in the market. By offering courses tailored to an individual’s unique pace, needs and expertise, DataCamp’s courses are generating more individuals with the skills to boast ‘the sexiest job of the 21st century.’

Interested in learning more or signing up for a course? Check out DataCamp’s blog.

## November 15, 2016

### Titus Brown

#### You can make GitHub repositories archival by using Zenodo or Figshare!

Bioinformatics researchers are increasingly pointing reviewers and readers at their GitHub repositories in the Methods sections of their papers. Great! Making the scripts and source code for methods available via a public version control system is a vast improvement over the methods of yore ("e-mail me for the scripts" or "here's a tarball that will go away in 6 months").

A common point of concern, however, is that GitHub repositories are not archival. That is, you can modify, rewrite, delete, or otherwise irreversibly mess with the contents of a git repository. And, of course, GitHub could go the way of Sourceforge and Google Code at any point.

So GitHub is not a solution to the problem of making scripts and software available as part of the permanent record of a publication.

But! Never fear! The folk at Zenodo and Mozilla Science Lab (in collaboration with Figshare) have solutions for you!

I'll tell you about the Zenodo solution, because that's the one we use, but the Figshare approach should work as well.

## How Zenodo works

Briefly, at Zenodo you can set up a connection between Zenodo and GitHub where Zenodo watches your repository and produces a tarball and a DOI every time you cut a release.

For example, see https://zenodo.org/record/31258, which archives https://github.com/dib-lab/khmer/releases/tag/v2.0 and has the DOI http://doi.org/10.5281/zenodo.31258.

When we release khmer 2.1 (soon!), Zenodo will automatically detect the release, pull down the tar file of the repo at that version, and produce a new DOI.

The DOI and tarball will then be independent of GitHub and I cannot edit, modify or delete the contents of the Zenodo-produced archive from that point forward.

Yes, automatically. All of this will be done automatically. We just have to make a release.
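To make the relationship between the Zenodo record and its DOI in the khmer example concrete, here is a minimal Python sketch. It assumes that a Zenodo-minted DOI is simply the prefix `10.5281/zenodo.` followed by the numeric record ID, which is the pattern visible in the example above; the helper names (`record_id_from_url`, `doi_for_record`, `doi_link`) are made up for illustration and are not part of any Zenodo tool or API.

```python
import re

# Assumption: Zenodo-minted DOIs follow the pattern seen in the khmer
# example, i.e. this prefix followed by the numeric record ID.
ZENODO_DOI_PREFIX = "10.5281/zenodo."


def record_id_from_url(record_url: str) -> str:
    """Pull the numeric ID out of a URL like https://zenodo.org/record/31258."""
    match = re.search(r"/record/(\d+)/?$", record_url)
    if match is None:
        raise ValueError(f"not a Zenodo record URL: {record_url!r}")
    return match.group(1)


def doi_for_record(record_url: str) -> str:
    """Build the DOI string for a Zenodo record URL (hypothetical helper)."""
    return ZENODO_DOI_PREFIX + record_id_from_url(record_url)


def doi_link(doi: str) -> str:
    """Turn a DOI into a resolver link that is safe to cite in a paper."""
    return "http://doi.org/" + doi


if __name__ == "__main__":
    doi = doi_for_record("https://zenodo.org/record/31258")
    print(doi)
    print(doi_link(doi))
```

The point of citing the resolver link rather than the GitHub URL is that it keeps working even if the repository moves or disappears.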

## Yes, the DOI is permanent and Zenodo is archival!

Zenodo is an open-access archive that is recommended by Peter Suber (as is Figshare).

While I cannot quickly find a good high level summary of how DOIs and archiving and LOCKSS/CLOCKSS all work together, here is what I understand to be the case:

• Digital object identifiers are permanent and persistent. (See Wikipedia on DOIs)

• Zenodo policies say:

"Retention period

Items will be retained for the lifetime of the repository. This is currently the lifetime of the host laboratory CERN, which currently has an experimental programme defined for the next 20 years at least."

So I think this is at least as good as any other archival solution I've found.

## Why is this better than journal-specific archives and supplemental data?

Some journals request or require that you upload code and data to their own internal archive. This is often done in painful formats like PDF or XLSX, which may guarantee that a human can look at the files but does little to encourage reuse.

At least for source code and smallish data sets, having the code and data available in a version controlled repository is far superior. This is (hopefully :) the place where the code and data is actually being used by the original researchers, so having it kept in that format can only lower barriers to reuse.

And, just as importantly, getting a DOI for code and data means that people can be more granular in their citation and reference sections - they can cite the specific software they're referencing, they can point at specific versions, and they can indicate exactly which data set they're working with. This prevents readers from going down the citation network rabbit hole where they have to read the cited paper in order to figure out what data set or code is being reused and how it differs from the remixed version.

## Bonus: Why is the combination of GitHub/Zenodo/DOI better than an institutional repository?

I've had a few discussions with librarians who seem inclined to point researchers at their own institutional repositories for archiving code and data. Personally, I think having GitHub and Zenodo do all of this automatically for me is the perfect solution:

• quick and easy to configure (it takes about 3 minutes);
• polished and easy user interface;
• integrated with our daily workflow (GitHub);
• completely automatic;
• independent of whatever institution happens to be employing me today;

so I see no reason to switch to using anything else unless it solves even more problems for me :). I'd love to hear contrasting viewpoints, though!

thanks!

--titus

## November 14, 2016

### Continuum Analytics news

#### Can New Technologies Help Stop Crime In Its Tracks?

Tuesday, November 15, 2016
Peter Wang
Chief Technology Officer & Co-Founder
Continuum Analytics

Earlier this week, I shared my thoughts on crime prevention through technology with IDG Connect reporter Bianca Wright. Take a look and feel free to share your opinions in the comment section below (edited for length and clarity)!


Or, to work with Spark interactively on the Cloudera CDH cluster, we can use Jupyter Notebooks via Anaconda Enterprise Notebooks, which is a multi-user notebook server with collaboration and support for enterprise authentication. You can configure Anaconda Enterprise Notebooks to use different Anaconda parcel installations on a per-job basis.


### Get Started with Custom Anaconda Parcels in Your Enterprise

If you’re interested in generating custom Anaconda installers and parcels for Cloudera Manager, we can help! Get in touch with us by using our contact us page for more information about this functionality and our enterprise Anaconda platform subscriptions.

If you’d like to test-drive the on-premises, enterprise features of Anaconda on a bare-metal, on-premises or cloud-based cluster, get in touch with us at sales@continuum.io.

The enterprise features of the Anaconda platform, including the distributed functionality in Anaconda Scale and on-premises functionality of Anaconda Repository, are certified by Cloudera for use with Cloudera CDH 5.x.

#### Announcing the Continuum Founders Awards

Friday, October 28, 2016
Travis Oliphant
Chief Executive Officer & Co-Founder
Continuum Analytics

## Team Award

This award is presented to a team (either formal or informal) that consistently delivers on the company mission, understands the impact and importance of all aspects of the business, and exemplifies the qualities and output of a high-functioning team. The team members consistently demonstrate being humble, hungry, and smart.

Joel Hull

Erik Welch

Trent Oliphant

Jim Kitchen

This team was selected for at least the following activities that benefit all aspects of the business:

• Repeated, successful delivery on an important project at a major customer that has led the way for product sales and future contract-based work with the customer

• Internal work on Enterprise Notebooks to fix specific problems needed by a customer

• Continued coordination with Repository to enable it to be purchased by a large customer

• Erik’s work on helping with Dask

• Trent’s work on getting Anaconda installed at customer sites ensuring successful customer engagement

• The team’s initial work on productizing a successful project from their consulting project

• Coordinated the build and delivery of many client specific conda packages

## Mission, Values, Purpose (MVP) Award

This award is given to an individual who exemplifies the mission, values, and purpose of Continuum Analytics.

Stan Seibert

When Peter and Travis first organized the company values with other leaders, we each separately envisioned several members of the Continuum Team that we thought exemplified what it meant to be at Continuum. Stan was at the top of all of our lists.

Stan knows what it means to empower people to change the world. As a scientist he worked on improving neutrino detection, contributing to the experiment that was co-awarded the 2015 Nobel Prize in Physics and the 2016 Breakthrough Prize in Fundamental Physics. The Numba project has flourished under his leadership both in terms of project development as well as ensuring that funding for the project continues from government, non-profits, and companies. Stan shows the quality-first and owner-mentality of true craftsmanship in all of the projects he leads which has caused his customers to renew again and again. Stan also exemplifies continuous learning all of the time. In one example, he learned how to build a cluster from several Raspberry Pi systems --- including taking the initiative to attach an LCD display to the front of each one. His “daskmaster” has been a crowd favorite in the Continuum booth at several events.

## Customer-First Award

This award recognizes individuals who consistently demonstrate that customers matter and we will succeed to the degree that we solve problems for and support our customers and future customers.

Derek Orgeron

Atish Singh

Ian Stokes-Rees

Derek, Atish, and Ian put customers first every single day.

Derek and Atish follow up on all of our marketing events and are often the first contact our customers have with Continuum. They have the responsibility to triage opportunities and determine which are the most likely to lead to sales in product, training, and/or consulting. They are pursuing many, many customer contacts far above industry averages and doing it while remaining positive, enthusiastic, and helpful to the people they reach.

Ian Stokes-Rees has gone above and beyond to serve customers for the past year and more. He is always willing to do what makes sense. He filled in as a sales engineer as well as an implementation engineer as needed. He has worked tirelessly for Citibank to ensure success on that opportunity. He has single-handedly enabled Anaconda and SAS to work together. At multiple conferences (e.g. Strata) he is a tireless and articulate advocate for Anaconda, explaining in detail how it will help our clients.

## October 26, 2016

### Continuum Analytics news

#### Recursion Pharmaceuticals Wants to Cure Rare Genetic Diseases - and We’re Going to Help

Wednesday, October 26, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

Today we are pleased to announce that Continuum Analytics and Recursion Pharmaceuticals are teaming up to use data science in the quest to find cures for rare genetic diseases. Using Bokeh on Anaconda, Recursion is building its drug discovery assay platform to analyze layered cell images and weigh the effectiveness of different remedies. As we always say, Anaconda arms data scientists with superpowers to change the world. This is especially valuable for Recursion, since success literally means saving lives and changing the world by bringing drug remedies for rare genetic diseases to market faster than ever before.

It’s estimated that there are over 6,000 genetic disorders, yet many of these diseases represent a small market. Pharmaceutical companies aren’t usually equipped to pursue the cure for each disease. Anaconda will help Recursion by blending biology, bioinformatics and machine learning, bringing cell data to life. By identifying patterns and assessing drug remedies quickly, Recursion is using data science to discover potential drug remedies for rare genetic diseases. In English - this company is trying to cure big, bad, killer diseases using Open Data Science.

The ODS community is important to us. Working with a company in the pharmaceutical industry, an industry that is poised to convert ideas into life-saving medications, is humbling. With so many challenges, not the least of which include regulatory roadblocks and lengthy and complex R&D processes, researchers must continually adapt and innovate to speed medical advances. Playing a part in that process? That’s why we do what we do. We’re excited to welcome Recursion to the family and observe as it uses its newfound superpowers to change the world, one remedy at a time.

#### Recursion Pharmaceuticals Selects Anaconda to Create Innovative Next Generation Drug Discovery Assay Platform to Eradicate Rare Genetic Diseases

Wednesday, October 26, 2016

Open Data Science Platform Accelerates Time-to-Market for Drug Remedies

AUSTIN, TX—October 26, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that Recursion Pharmaceuticals, LLC, a drug discovery company focused on rare genetic diseases, has adopted Bokeh––a Continuum Analytics open source visualization framework that operates on the Anaconda platform. Bokeh on Anaconda makes it easy for biologists to identify genetic disease markers and assess drug efficacy when visualizing cell data, allowing for faster time-to-value for pharmaceutical companies.

“Bokeh on Anaconda enables us to perform analyses and make informative, actionable decisions that are driving real change in the treatment of rare genetic diseases,” said Blake Borgeson, CTO & co-founder at Recursion Pharmaceuticals. “By layering information and viewing images interactively, we are obtaining insights that were not previously possible and enabling our biologists to more quickly assess the efficacy of drugs. With the power of Open Data Science, we are one step closer to a world where genetic diseases are more effectively managed and more frequently cured, changing patient lives forever.”

By combining interactive, layered visualizations in Bokeh on Anaconda to show both healthy and diseased cells along with relevant data, biologists can experiment with thousands of potential drug remedies and immediately understand the effectiveness of the drug to remediate the genetic disease. Biologists realize faster insights, speeding up time-to-market for potential drug treatments.

“Recursion Pharmaceuticals’ data scientists crunch huge amounts of data to lay the foundation for some of the most advanced genetic research in the marketplace. With Anaconda, the Recursion data science team has created a breakthrough solution that allows biologists to quickly and cost effectively identify therapeutic treatments for rare genetic diseases,” said Peter Wang, CTO & co-founder at Continuum Analytics. “We are enabling companies like Recursion to harness the power of data on their terms, building solutions for both customized and universal insights that drive new value in all areas of business and science. Anaconda gives superpowers to people who change the world––and Recursion is a great example of how our Open Data Science vision is being realized and bringing solid, everyday value to critical healthcare processes.”

Data scientists at Recursion evaluate hundreds of genetic diseases, ranging from one evaluation per month to thousands in the same time frame. Bokeh on Anaconda delivers insights derived from heat maps, charts, plots and other scientific visualizations interactively and intuitively, while providing holistic data to enrich the context and allow biologists to discover potential treatments quickly. These visualizations empower the team with new ways to re-evaluate shelved pharmaceutical treatments and identify new potential uses for them. Ultimately, this creates new markets for pharmaceutical investments and helps develop new treatments for people suffering from genetic diseases.

Bokeh on Anaconda is a framework for creating versatile, interactive and browser-based visualizations of streaming data or Big Data from Python, R or Scala without writing any JavaScript. It allows for exploration, embedded visualization apps and interactive dashboards, so that users can create rich, contextual plots, graphs, charts and more to enable more comprehensive deductions from images.

Founded in 2013, Salt Lake City, Utah-based Recursion Pharmaceuticals, LLC is a drug discovery company. Recursion uses a novel drug screening platform to efficiently repurpose and reposition drugs to treat rare genetic diseases. Recursion’s novel drug screening platform combines experimental biology and bioinformatics in a massively parallel system to quickly and efficiently identify treatments for multiple rare genetic diseases. The core of the approach revolves around high-throughput automated screening using images of human cells, which allows the near simultaneous modeling of hundreds of genetic diseases. Rich data from these assays is probed using advanced statistical and machine learning approaches, and the effects of thousands of known drugs and shelved drug candidates can be investigated efficiently to identify those holding the most promise for the treatment of any one rare genetic disease.

The company’s lead candidate, a new treatment for Cerebral Cavernous Malformation, is approaching clinical trials, and the company has a rich pipeline of repurposed therapies in development for diverse genetic diseases.

Continuum Analytics is the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Continuum Analytics' founders and developers have created and contributed to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, Pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

## October 25, 2016

### Filipe Saraiva

#### My QtCon + Akademy 2016

From August 31st to September 10th I was in Berlin attending two amazing conferences: QtCon and Akademy.

QtCon brought together five communities to host their respective conferences at the same time and place, creating one big and diverse conference. Those communities were Qt, KDAB, KDE (celebrating its 20th birthday), VLC and FSFE (both celebrating their 15th birthdays).

Main conference hall of QtCon at bcc

That diversity of themes was a very interesting characteristic of QtCon. I really appreciated seeing presentations by Qt and KDAB people, and I was surprised by the topics related to the VLC community. The strong technical side of trends like Qt on mobile, Qt in IoT (including autonomous cars), the future of Qt, Qt + Python, contributing to Qt, and more caught my attention during the conference.

On VLC, I was surprised by the size of the community. I never imagined VLC had so many developers. In fact, I never imagined that VideoLAN is an umbrella for a lot of multimedia-related projects, like codecs, streaming tools, VLC ports to specific devices (including cars through Android Auto), and more. Yes, I really appreciated meeting these people and watching their presentations.

I was waiting for the VLC 3.0 release during QtCon, but unfortunately it did not happen. Of course the team is still improving this new release, and when it is finished I will have a VLC to use together with my Chromecast, so keep up the great work, coneheads!

FSFE presentations were interesting as well. In Brazil there are several talks about the political and philosophical aspects of free software at conferences like FISL and Latinoware. At QtCon, FSFE brought this type of presentation in a “European” style: sometimes the presentations seemed more pragmatic in their approach. Other FSFE presentations covered the infrastructure and organizational aspects of the foundation, a nice overview to compare with other groups like ASL.org in Brazil.

Of course, there were a lot of amazing presentations from our gearheads. I highlight the talks about KDE history, the latest Plasma Desktop news, Plasma Mobile status, KF5 on Android, and the experience of Minuet in the mobile world, among others.

The KDE Store announcement was really interesting, and I expect it will bring more attention to the KDE ecosystem when software package bundles (snap/flat/etc) become available in the store.

Another piece of software that caught my attention was Peruse, a comic book reader. I hope the developers can solve the current problems and release a mobile version of Peruse, so this software can reach a broad base of users on those platforms.

After the end of QtCon, Akademy took place at TU Berlin, on a very beautiful and comfortable campus. This phase of the conference was full of technical sessions and discussions, hacking, and fun.

I attended the Flatpak, AppStream, and Snapcraft BoFs. There were a lot of advanced technical discussions on those themes. Every Akademy I am impressed by the advanced level of the technical discussions held by the hackers in the KDE community. Really guys, you rock!

The Snapcraft BoF was a tutorial on how to use that technology to create cross-distro bundle packages. That was interesting, and I would like to test it more and take a look at Flatpak in order to choose something to create packages for Cantor.

Unfortunately I missed the BoF on Kube. I am very interested in an alternative PIM project for KDE, focused on e-mail/contacts/calendar and less demanding of computational resources. I am keeping my eyes and expectations on this project.

On the other days I basically spent my time working on Cantor and talking with our worldwide KDE fellows about several topics like KDE Edu, improvements to our Jabber/XMPP infrastructure, KDE’s 20 years, Plasma on small computers (thanks sebas for the Odroid-C1+ device), WikiToLearn (could a way to import/export Cantor worksheets to/from WikiToLearn be interesting?), and of course, beers and German food.

And what about Berlin? It was my second time in the city, and like the previous time I was excited by the multicultural atmosphere, the food (<3 pork <3) and the beers. We were in Kreuzberg, a hipster district of the city, so we could visit some bars and expat restaurants there. QtCon+Akademy had interesting events as well, like the FSFE celebration at c-base and the Akademy day trip to Peacock Island.

So, I would like to say thank you to KDE e.V. for funding my attendance at the events, thank you Petra for helping us with the hostel, and thank you to all the volunteers for working hard and making this Akademy edition a real celebration of the KDE community.

Some Brazilians in QtCon/Akademy 2016: KDHelio, Lamarque, Sandro, João, Aracele, Filipe (me)

### Matthieu Brucher

#### Book review: Weapons Of Math Destruction: How Big Data Increases Inequality and Threatens Democracy

Big data is the current hype, the thing you need to do to find the best job in the world. I started using machine learning tools a decade ago, and when I saw this book, it felt like it was answering some concerns I had. Let’s see what’s inside.

#### Content and opinions

The first two chapters set the stage for the discussions in the book. The first covers the way a model works, why people trust models, and why we want to create new ones; the second explains why we should trust the author. She does indeed have the background to understand models and the job experience to have seen firsthand how a model can be used. Of course, one caveat is that most of the elements of the book are happening in the US. Hopefully the EU will be smart and learn from the US mistakes (at least there are efforts to limit the amount of data Facebook and Google accumulate on users).

OK, let’s start with the first weapon, the one targeted at students. I was quite baffled that all this started with a random ranking from a low-profile newspaper. And now, every university tries to maximize its reputation based on numbers, not on the actual quality of new students. And are universities really spending so much money on advertisement that it drives tuition fees sky-high (which is a future crisis in waiting, by the way!)?

The second weapon is one we all know: online ads. Almost all websites survive one way or another on revenue from online advertisement. All websites are connected through links to social networks, ad agencies… and these companies churn out information and deductions based on this gigantic pile of data. If advertisement didn’t have an effect on people, there would be no market for it.

Moving on to something completely different: justice. It is something that also happens in France. We have far-right extremists who want to have stats (it is forbidden to have racial stats there) to show that some categories of the population need to be checked more often than others. It is indeed completely unfair, and also proof that we are targeting some types of crimes and not others. I found that the way the weapon worked was clearly skewed from the start. How could anyone not see the problem?

Then things get even worse with getting a job, and in the chapter after, keeping the job. Both times, the main issue is that the WMD helps companies maximize their profit and minimize their risk. There are two issues there. First, only sure prospects are going to be hired, and this is based on… stats accumulated through the years, which are racially biased. Second, once people have a job, the objective is not to optimize the happiness of the employee, even if doing so would enhance profitability.

The next two are also related: credit and insurance. It is nice to see that credit scores started as a way to remove bias; it is terrible to see that we went back there and that scores are now dictated by obscure tools. And they now even impact insurance, not to optimize one’s cost, but to optimize revenue for the insurance company. I believe in everyone paying the same amount and having the same coverage on things like health (not for driving, because we can all be better drivers, but we cannot optimize our genes!). It all leads to a really individualistic society, and it is scary.

Finally, even elections are rigged. I knew that messages were tailored to appeal to each category, but it is scary to see that this is actually used to lie. We all know that politicians are lying to us, but now they don’t even care about giving us different lies. And social networks and ad companies have even more power to make us do things as they see fit. The fact that Facebook officially publishes some of its tests on users just makes me mad.

#### Conclusion

OK, the book is full of examples of bad usage of big data. I saw firsthand in scientific applications that it is easy to make a mistake when creating a model. In my case, it was the optimization of a modeler, and more specifically the delta between each iteration. When trying to minimize the number of non-convergence issues, if we only try to find the same time step as the original app, we are missing the point: we are trying to map a proxy. The real objective is to find a new time step that would also keep the number of convergence issues low, possibly different ones.

Another example is just all these WMD actually. They are more often than not based on neural networks and deep learning algorithms (which are actually the same). We put lots of effort into making them better, but the issue is that we don’t know what they are doing (in that regard, every horror sci-fi movie with a crazy AI comes to mind, as well as Asimov’s books). This has been the case for decades, and although we know equivalent algorithms that could give us the explanation, we stay on these black boxes because they are cost-effective (we don’t have to choose the proper equivalent algorithm, we just train) and scalable (which may not be the case for the equivalent algorithms, as they don’t have the same priority in research it would seem!). The nice thing about the book is also that it underlines an issue that I hadn’t even thought about. All these algorithms try to reproduce a past behavior. But humanity is evolving, and things that were considered true in the past are no longer true (race anyone?). As such, if we give these WMD absolute power, we will just rot as a civilization and probably collapse.

I’m not against big data and machine learning. I think the current trend is clearly explained in this book and also corresponds to something I felt before this hype: let’s choose a good algorithm, let’s train the model and let’s see why it chooses some answers and not others. We may then be onto something, or we may see that it is biased and that we need to go back to the drawing board. Considering the state of big data, we definitely need to go back to the drawing board.

## October 24, 2016

### Enthought

#### Key updates include: Jupyter notebook integration, movie recording capabilities, time series animation, updated VTK compatibility, and Python 3 support

by Prabhu Ramachandran, core developer of Mayavi and director, Enthought India

The Mayavi development team is pleased to announce Mayavi 4.5.0, which is an important release both for new features and core functionality updates.

Mayavi is a general purpose, cross-platform Python package for interactive 2-D and 3-D scientific data visualization. Mayavi integrates seamlessly with NumPy (fast numeric computation library for Python) and provides a convenient Pythonic wrapper for the powerful VTK (Visualization Toolkit) library. Mayavi provides a standalone UI to help visualize data, and is easy to extend and embed in your own dialogs and UIs. For full information, please see the Mayavi documentation.

Mayavi is part of the Enthought Tool Suite of open source application development packages and is available to install through Enthought Canopy’s Package Manager (you can download Canopy here).

#### Mayavi 4.5.0 is an important release which adds the following features:

1. Jupyter notebook support: Adds basic support for displaying Mayavi images or interactive X3D scenes
2. Support for recording movies and animating time series
3. Support for the new matplotlib color schemes
4. Improvements on the experimental Python 3 support from the previous release
5. Compatibility with VTK-5.x, VTK-6.x, and VTK-7.x

For more details on the full set of changes see here.

Let’s take a look at some of these new features in more detail:

## Jupyter Notebook Support

This feature is still basic and experimental, but it is convenient. The feature allows one to embed either a static PNG image of the scene or a richer X3D scene into a Jupyter notebook. To use this feature, one should first initialize the notebook with the following:

from mayavi import mlab
mlab.init_notebook()

Subsequently, one may simply do:

s = mlab.test_plot3d()
s

This will embed a 3-D visualization producing something like this:

Embedded 3-D visualization in a Jupyter notebook using Mayavi

When the init_notebook method is called it configures the Mayavi objects so they can be rendered on the Jupyter notebook. By default the init_notebook function selects the X3D backend. This will require a network connection and also reasonable off-screen support. This currently will not work on a remote Linux/OS X server unless VTK has been built with off-screen support via OSMesa as discussed here.

For more documentation on the Jupyter support see here.

## Animating Time Series

This feature makes it very easy to animate a time series. Let us say one has a set of files that constitute a time series (files of the form some_name[0-9]*.ext). If one were to load any file that is part of this time series like so:

from mayavi import mlab
src = mlab.pipeline.open('data_01.vti')

Animating these is now very easy if one simply does the following:

src.play = True

This can also be done on the UI. There is also a convenient option to synchronize multiple time series files using the “sync timestep” option on the UI or from Python. The screenshot below highlights the new features in action on the UI:

New time series animation feature in the Python Mayavi 3D visualization library.
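
As a rough illustration of the naming convention described above, the following sketch groups file names of the form `some_name[0-9]*.ext` into time series and orders each series by its numeric index. It uses only the Python standard library and is not Mayavi's actual implementation; the helper name `group_time_series`, the regular expression, and the sample file names are assumptions made up for this example.

```python
import re
from collections import defaultdict

# Assumed pattern: a base name, a trailing run of digits, then the extension.
_SERIES_RE = re.compile(r"^(?P<base>.*?)(?P<index>\d+)(?P<ext>\.[^.]+)$")


def group_time_series(filenames):
    """Group names such as data_01.vti and data_02.vti into ordered series."""
    series = defaultdict(list)
    for name in filenames:
        match = _SERIES_RE.match(name)
        if match is None:
            continue  # no numeric index before the extension: not a series member
        key = (match.group("base"), match.group("ext"))
        series[key].append((int(match.group("index")), name))
    # Sort numerically so that index 2 comes before index 10.
    return {key: [name for _, name in sorted(items)]
            for key, items in series.items()}


# Hypothetical file names, used only for illustration.
files = ["data_10.vti", "data_01.vti", "notes.txt", "data_02.vti", "other_1.vtk"]
groups = group_time_series(files)
print(groups[("data_", ".vti")])
```

Sorting on the integer index rather than on the raw string keeps files in time order even when the numbering is not zero-padded.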

## Recording Movies

One can also create a movie (really a stack of images) while playing a time series or running any animation. On the UI, one can select a Mayavi scene and navigate to the movie tab and select the “record” checkbox. Any animations will then record screenshots of the scene. For example:

from mayavi import mlab
f = mlab.figure()
f.scene.movie_maker.record = True
mlab.test_contour3d_anim()

This will create a set of images, one for each step of the animation. A gif animation of these is shown below:

Recording movies as gif animations using Mayavi

More than 50 pull requests were merged since the last release. We are thankful to Prabhu Ramachandran, Ioannis Tziakos, Kit Choi, Stefano Borini, Gregory R. Lee, Patrick Snape, Ryan Pepper, SiggyF, and daytonb for their contributions towards this release.

## October 23, 2016

### Titus Brown

#### What is open science?

Gabriella Coleman asked me for a short, general introduction to open science for a class, and I couldn't find anything that fit her needs. So I wrote up my own perspective. Feedback welcome!

## Some background: Science advances because we share ideas and methods

Scientific progress relies on the sharing of both scientific ideas and scientific methodology - “If I have seen further it is by standing on the shoulders of Giants” (https://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants). The natural sciences advance not just when a researcher observes or understands a phenomenon, but also when we develop (and share) a new experimental technique (such as microscopy), a mathematical approach (e.g. calculus), or a new computational framework (such as multi scale modeling of chemical systems). This is most concretely illustrated by the practice of citation - when publishing, we cite the previous ideas we’re building on, the published methods we’re using, and the publicly available materials we relied upon. Science advances because of this sharing of ideas, and scientists are recognized for sharing ideas through citation and reputation.

Despite this, however, there are many barriers that lie in the way of freely sharing ideas and methods - ranging from cultural (e.g. peer review delays before publication) to economic (such as publishing behind a paywall) to methodological (for example, incomplete descriptions of procedures) to systemic (e.g. incentives to hide data and methods). Some of these barriers are well intentioned - peer review is intended to block incorrect work from being shared - while others, like closed access publishing, have simply evolved with science and are now vestigial.

## So, what is open science?

Open science is the philosophical perspective that sharing is good and that barriers to sharing should be lowered as much as possible. The practice of open science is concerned with the details of how to lower or erase the technical, social, and cultural barriers to sharing. This includes not only what I think of as “the big three” components of open science -- open access to publications, open publication and dissemination of data, and open development, dissemination, and reuse of source code -- but also practices such as social media, open peer review, posting and publishing grants, open lab notebooks, and any other methods of disseminating ideas and methods quickly.

The potential value of open science should be immediately obvious: easier and faster access to ideas, methods, and data should drive science forward faster! But open science can also aid with reproducibility and replication, decrease the effects of economic inequality in the sciences by liberating ideas from subscription paywalls, and provide reusable materials for teaching and training. And indeed, there is some evidence for many of these benefits of open science even in the short term (see How open science helps researchers succeed, McKiernan et al. 2016). This is why many funding agencies and institutions are pushing for more science to be done more openly and made available sooner - because they want to better leverage their investment in scientific progress.

## Some examples of open science

Here are a few examples of open science approaches, taken from my own experiences.

### Preprints

In biology (and many other sciences), scientists can only publish papers after they undergo one or more rounds of peer review, in which 2-4 other scientists read through the paper and check it for mistakes or overstatements. Only after a journal editor has received the reviews and decided to accept the paper does it “count”. However, in some fields, there are public sites where draft versions of papers can be publicly posted prior to peer review - these “preprint servers” work to disseminate work in advance of any formal review. The first widely used preprint server, arXiv, was created in 1991 for math and physics, and in those fields preprints now often count towards promotion and grant decisions.

The advantages of preprints are that they get the work out there, typically with a citable identifier (DOI), and allow new methods and discoveries to spread quickly. They also typically count for establishing priority - a discovery announced in a preprint is viewed as a discovery, period, unless it is retracted after peer review. The practical disadvantages are few - the appearance of double-publishing was a concern, but is no longer, as most journals allow authors to preprint their work. In practice, most preprints just act as an extension of the traditional publishing system (but see this interesting post by Matt Stephens on "pre-review" by Biostatistics). What is viewed as the major disadvantage can also be an advantage - the work is published with the names of the authors, so the reputation of the authors can be affected both positively and negatively by their work. This is what some people tell me is the major drawback to preprints for them - that the work is publicly posted without any formal vetting process, which could catch major problems with the work that weren't obvious to the authors.

I have been submitting preprints since my first paper in 1993, which was written with a physicist for whom preprinting was the default (Adami and Brown, 1994). Many of my early papers were preprinted because my collaborators were used to it. While in graduate school, I lapsed in preprinting for many years because my field (developmental biology) didn’t “do” preprints. When I started my own lab, I returned to preprinting, and submitted all of my senior author papers to preprint servers. Far from suffering any harm to my career, I have found that our ideas and our software have spread more quickly because of it - for example, by the time my first senior author paper was reviewed, another group had already built on top of it based on our preprint (see Pell et al., 2014 which was originally posted at arXiv, and Chikhi and Rizk 2013).

### Social media

There are increasingly many scientists of all career stages on Twitter and writing blogs, and they discuss their own and others’ work openly and even candidly. This has the effect of letting people restricted in their travel into social circles that would otherwise be closed to them, and accelerates cross-subject pollination of ideas. In bioinformatics, it is typical to hear about new bioinformatics or genomics methods online 6 months to a year before they are even available as a preprint. For those who participate, this results in fast dissemination and evaluation of methods and it can quickly generate a community consensus around new software.

The downsides of social media are the typical social media downsides: social media is its own club with its own cliques, however welcoming some of those cliques can be; identifiable women and people of color operate at a disadvantage here as elsewhere; cultivating a social media profile can require quite a bit of time that could be spent elsewhere; and online discussions about science can be gossipy, negative, and even unpleasant. Nonetheless there is little doubt that social media can be a useful scientific tool (see Bik and Goldstein, 2013), and can foster networking and connections in ways that don’t rely on physical presence - a major advantage to labs without significant travel funds, parents with small children, etc.

In my case, I tend to default to being open about my work on social media. I regularly write blog posts about my research and talk openly about ideas on twitter. This has led to many more international connections than I would have had otherwise, as well as a broad community of scientists that I consider personal friends and colleagues. In my field, this has been particularly important; since many of my bioinformatics colleagues tend to be housed in biology or computer science departments rather than any formal computational biology program, the online world of social media also serves as an excellent way of discovering colleagues and maintaining collegiality in an interdisciplinary world, completely independent of its use for spreading ideas and building reputation.

### Posting grants

While reputation is the key currency of advancement in science, good ideas are fodder for this advancement. Ideas are typically written up in the most detail in grant proposals - requests for funding from government agencies or private foundations. The ideas in grant proposals are guarded jealously, with many professors refusing to share grant proposals even within their labs. A few people (myself included) have taken to publicly posting grants when they are submitted, for a variety of reasons (see Ethan White's blog post for details).

In my case, I posted my grants in the hopes of engaging with a broader community to discuss the ideas in my grant proposal; while I haven’t found this engagement, the grants did turn out to be useful for junior faculty who are confused about formatting and tone and are looking for examples of successful (or unsuccessful) grants. More recently, I have found that people are more than happy to skim my grants and tell me about work outside my field or even unpublished work that bears on my proposal. For example, with my most recent proposal, I discovered a number of potential collaborators within 24 hours of posting my draft.

### Why not open science?

The open science perspective - "more sharing, more better" - is slowly spreading, but there are many challenges that are delaying its spread.

One challenge of open science is that sharing takes effort, while the immediate benefits of that sharing largely go to people other than the producer of the work being shared. Open data is a perfect example of this: it takes time and effort to clean up and publish data, and the primary benefit of doing so will be realized by other people. The same is true of software. Another challenge is that the positive consequences of sharing, such as serendipitous discoveries and collaboration, cannot be accurately evaluated or pitched to others in the short term - it requires years, and sometimes decades, to make progress on scientific problems, and the benefits of sharing do not necessarily appear on demand or in the short term.

Another block to open science is that many of the mechanisms of sharing are themselves somewhat new, and are rejected in unthinking conservatism of practice. In particular, most senior scientists entered science at a time when the Internet was young and the basic modalities and culture of communicating and sharing over the Internet hadn’t yet been developed. Since the pre-Internet practices work for them, they see no reason to change. Absent a specific reason to adopt new practices, they are unlikely to invest time and energy in adopting new practices. This can be seen in the rapid adoption of e-mail and web sites for peer review (making old practices faster and cheaper) in comparison to the slow and incomplete adoption of social media for communicating about science (which is seen by many scientists as an additional burden on their time, energy, and focus).

Metrics for evaluating products that can be shared are also underdeveloped. For example, it is often hard to track or summarize the contributions that a piece of software or a data set makes to advancing a field, because until recently it was hard to cite software and data. More, there is no good technical way to track software that supports other software, or data sets that are combined in a larger study or meta-study, so many of the indirect products of software and data may go underreported.

Intellectual property law also causes problems. For example, in the US, the Bayh-Dole Act stands in the way of sharing ideas early in their development. Bayh-Dole was intended to spur innovation by granting universities the intellectual property rights to their research discoveries and encouraging them to develop them, but I believe that it has also encouraged people to keep their ideas secret until they know if they are valuable. But in practice most academic research is not directly useful, and moreover it costs a significant amount of money to productize, so most ideas are never developed commercially. In effect this simply discourages early sharing of ideas.

Finally, there are also commercial entities that profit exorbitantly from restricting access to publications. Several academic publishers, including Elsevier and MacMillan, have profit margins of 30-40%! (Here, see Mike Taylor on The obscene profits of commercial scholarly publishers.) (One particularly outrageous common practice is to charge a single lump sum for access to a large number of journals each year, and only provide access to the archives in the journals through that current subscription - in effect making scientists pay annually for access to their own archival literature.) These corporations are invested in the current system and have worked politically to block government efforts towards encouraging open science.

Oddly, non-profit scientific societies have also lobbied to restrict access to scientific literature; here, their argument appears to be that the journal subscription fees support work done by the societies. Of note, this appears to be one of the reasons why an early proposal for an open access system didn't realize its full promise. For more on this, see Kling et al., 2001, who point out that the assumption that the scientific societies accurately represent the interests and goals of their constituents and of science itself is clearly problematic.

The overall effect of the subscription gateways resulting from closed access is to simply make it more difficult for scientists to access literature; in the last year or so, this fueled the rise of Sci-Hub, an illegal open archive of academic papers. This archive is heavily used by academics with subscriptions because it is easier to search and download from Sci-Hub than it is to use publishers' Web sites (see Justin Peters' excellent breakdown in Slate).

### A vision for open science

A great irony of science is that a wildly successful model of sharing and innovation - the free and open source software (FOSS) development community - emerged from academic roots, but has largely failed to affect academic practice in return. The FOSS community is an exemplar of what science could be: highly reproducible, very collaborative, and completely open. However, science has gone in a different direction. (These ideas are explored in depth in Millman and Perez 2014.)

It is easy and (I think) correct to argue that science has been corrupted by the reputation game (see e.g. Chris Chambers' blog post on 'researchers seeking to command petty empires and prestigious careers') and that people are often more concerned with job and reputation than with making progress on hard problems. The decline in public funding for science, the decrease in tenured positions (here, see Alice Dreger's article in Aeon), and the increasing corporatization of research all stand in the way of more open and collaborative science. And it can easily be argued that they stand squarely in the way of faster scientific progress.

I remain hopeful, however, because of generational change. The Internet and the rise of free content have made younger generations more aware of the value of frictionless sharing and collaboration. Moreover, as data set sizes become larger and data becomes cheaper to generate, the value of sharing data and methods becomes much more obvious. Young scientists seem much more open to casual sharing and collaboration than older scientists; it’s the job of senior scientists who believe in accelerating science to see that they are rewarded, not punished, for this.

## October 18, 2016

### Matthieu Brucher

#### Book review: Why You Love Music: From Mozart to Metallica

I have to say, I was intrigued when I saw the book. Lots of things about music seem intuitive, from movies to how it makes us feel. And the book puts a theoretical aspect on it. So definitely something I HAD to read.

#### Content and opinions

There are 15 chapters in the book, covering lots of different facets of music. The first chapter tries to associate music genre and psychological profile. It was really interesting to see that the evolution of the music we like is dictated by things we listened to in our childhood. And I have to say that my favorite music is indeed tightly correlated to the music style I preferred in my teens! The second chapter is more classic, as it tackles lyrics. Of course, it is easier to dive into a song with lyrics, even when they are misunderstood!

The third chapter is about emotions in music. I think that emotions are definitely the foremost element that composers want to convey. It seems there are basic rules, although they can differ depending on the culture (which was also interesting to know). The chapter also goes over the different mechanisms music “uses” to create emotions. Basic conclusion: it is good for you. The fourth tackles the effect of repetition. It seems that repetition is mandatory to enjoy music, and I enjoy the repetition of goose bumps moments in the songs I prefer.

The next chapters address the effect of music on our lives, starting with health. The type of music we listen to has an impact on our mood, and also indicates what shape we are in. Sad songs, and we may be soothing ourselves after something; happy songs, and we may be joyful. Some people also say that music makes people smarter; this is also covered in the book, and the conclusion didn’t surprise me that much.

Moving on to using music in movies. I thought a lot about this, and indeed the tone of the music does impact the way we feel about the different scenes in a movie. I liked the different examples that were used here. Chapter 8 was more intriguing, as it is about talent. Are we naturally gifted, or is it work? The majority of people are not talented, they are just hard-working people (which in itself may also be a talent!). There is hope for everyone!

Let’s move on to more scientific stuff for the next chapters. The explanations on sounds, waves and frequencies were simple but efficient. Of course, as I have musical training and a signal processing background, it may have been easier to figure out where the author wanted to go, but I think that the elements he mentioned and their interactions were simple enough for everyone to understand how music works. The chapter after deals with the rules of music writing. It was quite nice to see how some rules were analyzed and why they were “created” (like the big jump up, small down).

Going on in the music analysis, the next chapter is about the difference between melody and accompaniment. There are examples in this section to show what happens and explanations as to why the brain can tell the difference. The chapter after tackles the strange things that happen in a brain when it creates something that didn’t exist in the first place. The following chapter on dissonance may have been the one I enjoyed the most, as it explains something I felt for a long time: you can’t play whatever you want on the bass. The notes get murky if there are too many of them, compared to a guitar melody. The physical explanation tied everything nicely together, the jigsaw puzzle is solved!

The 14th chapter handles the effect of the way a musician plays the notes on the feeling we get. I always think of an ok drummer and a great one, or of a jazz drummer and a hitting drummer. The notes may be the same, but the message is completely different and is appreciated differently depending on the song. Finally the conclusion reminds us that we have probably used music since the dawn of humanity, and lots of our experience is derived from the usage we have made of music since then.

#### Conclusion

I don’t think I know music better now. But perhaps thanks to this book I can understand how it acts on what I feel. Maybe I over-analyze things as well. But I definitely appreciated the analysis of music’s effect on us!

## October 13, 2016

### Titus Brown

#### A shotgun metagenome workshop at the Scripps Institute of Oceanography

We just finished teaching a two day workshop at the Scripps Institute of Oceanography down at UC San Diego. Dr. Harriet Alexander, a postdoc in my lab, and I spent two days going through cloud computing, short read quality and k-mer trimming, metagenome assembly, quantification of gene abundance, mapping of reads against the assembly, making CIRCOS plots, and workflow strategies for reproducible and open science. We skipped the slicing and dicing data sets with k-mers, though -- not enough time.

Whew, I'm tired just writing all of that!

The workshop was delivered "Software Carpentry" style - interactive hands-on walk throughs of the tutorials, with plenty of room for questions and discussion and whiteboarding.

Did I mention we recorded? Yep. We recorded it. You can watch it on YouTube, in four acts: day 1, morning; day 1, afternoon; day 2, morning; and day 2, afternoon.

Great thanks to Jessica Blanton and Tessa Pierce for inviting us down and wrangling everything to make it work out!

A few things didn't work out that well.

### The materials weren't great

This was a first run of these materials, most of which were developed the week of the workshop. While most of the materials worked, there were hiccups from the last-minute nature of things.

### Amazon f-ups

Somewhat more frustrating, Amazon continues to institute checks that prevent new users from spinning up EC2 instances. It used to be that new users could sign up a bit in advance of the class and be able to start EC2 instances. Now, it seems like there's an additional verification that needs to be done AFTER the first phone verification and AFTER the first attempt to start an EC2 instance.

The workshop went something like this:

Me: "OK, now press launch, and we can wait for the machines to start up."

Student 1: "It didn't work for me. It says awaiting verification."

Student 2: "Me neither."

Chorus of students: "Me neither."

So I went and spun up 17 instances on my account and distributed the host names to all of the students via our EtherPad. Equanimity in the face of adversity...?

### We didn't get to the really interesting stuff that I wanted to teach

There was a host of stuff - genome binning, taxonomic annotation, functional annotation - that I wanted to teach but that we basically ended up not having time to write up into tutorials (and we wouldn't have had time to present, either).

## The good

The audience interaction was great. We got tons of good questions, we explored corners of metagenomics and assembly and sequencing and biology that needed to be explored, and everyone was super nice and friendly!

We wrote up the materials, so now we have them! We'll run more of these and when we do, the current materials will be there and waiting and we can write new and exciting materials!

The location was amazing, too ;). Our second day was in a little classroom overlooking the Pacific Ocean. For the whole second part of the day you could hear the waves crashing against the beach below!

## The unknown

One of the reasons that we didn't write up anything on taxonomy, or binning, or functional annotation, was that we don't really run these programs ourselves all that much. We did get some recommendations from the Interwebs, and I need to explore those, but now is the time to tell us --

• what's your favorite genome binning tool? We've had DESMAN and multi-metagenome recommended to us; any others?
• functional annotation of assemblies: what do you use? I was hoping to use ShotMap. I had previously balked at using ShotMap on assembled data, for several reasons, including its design for use on raw reads. But, after Harriet pointed out that we could quantify the Prokka-annotated genes from contigs, I may give ShotMap a try with that approach. I still have to figure out how to feed the gene abundance into ShotMap, though.
• What should I use for taxonomic assignment? Sheila Podell, the creator of DarkHorse, was in the audience and we got to talk a bit, and I was impressed with the approach, so I may give DarkHorse a try. There are also k-mer based approaches like MetaPalette that I want to try, but my experience so far has been that they are extremely database intensive and somewhat fragile. I'd also like to try marker gene approaches like PhyloSift. What tools are people using? Any strong personal recommendations?
• What tool(s) do people use to do abundance calculations for genes in their metagenome? I can think of a few basic types of approaches --

...but I'm at a loss for specific software to use. Any help appreciated - just leave a comment or e-mail me at titus@idyll.org.

--titus

## October 11, 2016

### Fabian Pedregosa

#### A fast, fully asynchronous variant of the SAGA algorithm

My friend Rémi Leblond has recently uploaded to ArXiv our preprint on an asynchronous version of the SAGA optimization algorithm.

The main contribution is to develop a parallel (fully asynchronous, no locks) variant of the SAGA algorithm. This is a stochastic variance-reduced method for general optimization, specially adapted for problems that arise frequently in machine learning such as (regularized) least squares and logistic regression. Besides the specification of the algorithm, we also provide a convergence proof and convergence rates. Furthermore, we fix some subtle technical issues present in previous literature (proving things in the asynchronous setting is hard!).

The core of the asynchronous algorithm is similar to Hogwild!, a popular asynchronous variant of stochastic gradient descent (SGD). The main difference is that instead of using SGD as a building block, we use SAGA. This has many advantages (and poses some challenges): faster (exponential!) rates of convergence and convergence to arbitrary precision with a fixed step size (hence clear stopping criterion), to name a few.
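To make the building block concrete, here is a minimal sketch of plain sequential SAGA on a small least-squares problem. It is not the asynchronous variant from the preprint: the function name `saga_least_squares`, the synthetic data, and the step size 1/(3L) are assumptions chosen for illustration only.

```python
import random


def saga_least_squares(A, b, n_epochs=200, seed=0):
    """Sequential SAGA for min_x (1/n) * sum_i 0.5 * (a_i . x - b_i)**2."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    # Fixed step size 1 / (3 L), with L the largest per-sample
    # Lipschitz constant (assumed choice for this sketch).
    L = max(sum(v * v for v in row) for row in A)
    step = 1.0 / (3.0 * L)

    x = [0.0] * d
    # For least squares the gradient of f_i is residual_i * a_i, so the
    # gradient memory only needs one scalar per sample.
    residuals = [-b_i for b_i in b]  # residuals evaluated at x = 0
    grad_avg = [sum(residuals[i] * A[i][j] for i in range(n)) / n
                for j in range(d)]

    for _ in range(n_epochs * n):
        i = rng.randrange(n)
        a_i = A[i]
        new_res = sum(a * x_j for a, x_j in zip(a_i, x)) - b[i]
        delta = new_res - residuals[i]
        for j in range(d):
            # Variance-reduced direction:
            # fresh gradient - stored gradient + average of stored gradients.
            x[j] -= step * (delta * a_i[j] + grad_avg[j])
            grad_avg[j] += delta * a_i[j] / n
        residuals[i] = new_res
    return x


# Noiseless synthetic problem, so the minimizer is x_true itself.
data_rng = random.Random(42)
x_true = [1.0, -2.0, 0.5]
A = [[data_rng.gauss(0.0, 1.0) for _ in x_true] for _ in range(50)]
b = [sum(a * x for a, x in zip(row, x_true)) for row in A]
x_hat = saga_least_squares(A, b)
print([round(v, 4) for v in x_hat])
```

Note that the step size stays constant throughout, which is what the fixed-step convergence mentioned above refers to; plain SGD would need a decaying step to reach the same precision.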

The speedups obtained versus the sequential version are quite impressive. For example, we commonly observe 5x-7x speedups using 10 cores.

I will be with Rémi presenting this work at the NIPS OPT-ML workshop.

### Continuum Analytics news

#### Move Over Data Keepers: Open Source is Here to Stay

Tuesday, October 11, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

If you walked by our booth at Strata + Hadoop 2016 this year, you may have witnessed a larger-than-life character, the almighty data keeper. He was hard to miss as he stood seven feet tall, wearing floor-length robes.

For those who weren’t in attendance, the natural thing to ask is who is this data keeper, and what does he do? Let me explain.

Before the creation of open source and the Anaconda platform, data was once locked away and hidden from the world - protected by the data keepers. These data keepers were responsible for maintaining data and providing access only to those who could comprehend the complicated languages such as base SAS®.

For years, this exclusivity kept data from penetrating the outside world and allowing it to be used for good. However, as technology advanced, data began to become more accessible, taking power away from the data keepers, and giving it instead to the empowered data scientists and eventually to citizen data scientists. This technology movement is referred to as the “open data science revolution” - resulting in the creation of an open source world that allows everyone and anyone to participate and interact with data.

As the open data science community began to grow, members joined together to solve complex problems by utilizing different tools and languages. This collaboration is what enabled the creation of the Anaconda platform. Anaconda is currently being used by millions of innovators from all over the world in diverse industries (from science to business to healthcare) to come up with solutions to make the world a better place.

Thanks to open source, data keepers are no longer holding data under lock and key - data is now completely accessible, enabling those in open source communities the opportunity to utilize data for good.

#### Continuum Analytics Launches AnacondaCrew Partner Program to Empower Data Scientists with Superpowers

Wednesday, October 12, 2016

Leading Open Data Science platform company is working with its ecosystem to drive transformation in a data-enabled world

AUSTIN, TEXAS—October 12, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced the launch of the AnacondaCrew Partner Program. Following the recent momentum in revenue and Anaconda growth of over three million downloads, this program is focused on enabling partners to leverage the power of Open Data Science technology to accelerate time-to-value for enterprises by empowering modern data science teams.

The AnacondaCrew program is designed to drive mutual business growth and financial performance for Technology, Service and OEM partners to take advantage of the fast-growing data science market.

The AnacondaCrew Partner Program includes:

• Technology Partners offering a hardware platform, cloud service or software that integrates with the Anaconda platform
• Service Partners delivering Anaconda-based solutions and services to enterprises
• OEM Partners using Anaconda, an enterprise-grade Python platform, to embed into their application, hardware or appliance

"We are extremely excited about our training partnership with Continuum Analytics,” said Jonathan Cornelissen, CEO at DataCamp. “We can now combine the best instructors in the field with DataCamp’s interactive learning environment to create the new standard for online Python for Data Science learning."

In the last year, Continuum has quickly grown the AnacondaCrew Partner Program to include a dozen of the best-known modern data science partners in the ecosystem, including Cloudera, DataCamp, Intel, Microsoft, NVIDIA, Docker and others.

“As a market leader, Anaconda is uniquely positioned to embrace openness through the open source community and a vast ecosystem of partners focused on helping customers solve problems that change the world,” said Michele Chambers, EVP of Anaconda and CMO at Continuum Analytics. “Our fast growing AnacondaCrew Partner Program delivers an enterprise-ready connected ecosystem that makes it easy for customers to embark on the journey to Open Data Science and realize returns on their Big Data investments.”

Continuum Analytics is the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Continuum Analytics' founders and developers have created and contributed to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, Pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

###

Media Contact:
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

#### Esri Selects Anaconda to Enhance GIS Applications with Open Data Science

Wednesday, October 12, 2016

Streamlined access to Python simplifies and accelerates development of deep location-based analytics for improved operations and intelligent decision-making

AUSTIN, TEXAS—October 12, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced a new partnership with leading geographic information system (GIS) provider, Esri. By embedding Anaconda into Esri's flagship desktop application, ArcGIS Pro, organizations now have increased accessibility to Python for developing more powerful, location-centric analytics applications.

The integration of Anaconda into ArcGIS Pro 1.3 enables GIS professionals to build detailed maps using the most current data and perform deep analysis to apply geography to problem solving and decision making for customers in dozens of industries. Anaconda provides a strong value-added component––particularly for Esri’s scientific and governmental customers who have to coordinate code across multiple machines or deploy software through centralized IT systems. Now, developers using ArcGIS Pro can easily integrate open source libraries into projects, create projects in multiple versions of Python and accelerate the process of installing nearly all publicly available Python packages.

“Python has a rich ecosystem of pre-existing code packages that users can leverage in their own script tools from within ArcGIS. But, managing packages can prove complex and time-consuming, especially when developing for multiple projects at once or trying to share code with others,” said Debra Parish, manager of global business strategies at Esri. “Anaconda solves these challenges and lets users easily create projects in multiple versions of Python. It really makes lives easier, especially for developers who deal with complex issues and appreciate the ease and agility Anaconda adds to the Python environment.”

ArcGIS for Desktop, which includes ArcGIS Pro, boasts the most powerful mapping software in the world. Used by Fortune 500 companies, national and local governments, public utilities and tech start-ups around the world, ArcGIS Pro’s mapping platform uncovers trends, patterns and spatial connections to provide actionable insights leading to data-informed business decisions. Additionally, ArcGIS Pro is accessible to developers to create and manage geospatial apps, regardless of developer experience.

“At Continuum Analytics, we know that data science is a team sport and collaboration is critical to the success of any analytics project. Anaconda empowers Esri developers with an accelerated path to open source Python projects and deeper analytics,” said Travis Oliphant, CEO and co-founder at Continuum Analytics. “More importantly, we see this as a partnering of two great communities, both offering best-in-class technology and recognizing that Open Data Science is a powerful solution to problem solving and decision making for organizations of all sizes.”

Since 1969, Esri has been giving customers around the world the power to think and plan geographically. As the market leader in GIS technology, Esri software is used in more than 350,000 organizations worldwide including each of the 200 largest cities in the United States, most national governments, more than two-thirds of Fortune 500 companies, and more than 7,000 colleges and universities. Esri applications, running on more than one million desktops and thousands of web and enterprise servers, provide the backbone for the world's mapping and spatial analysis. Esri is the only vendor that provides complete technical solutions for desktop, mobile, server, and Internet platforms. Visit us at esri.com/news.

Copyright © 2016 Esri. All rights reserved. Esri, the Esri globe logo, GIS by Esri, Story Map Journal, esri.com, and @esri.com are trademarks, service marks, or registered marks of Esri in the United States, the European Community, or certain other jurisdictions. Other companies and products or services mentioned herein may be trademarks, service marks, or registered marks of their respective mark owners.

###

Media Contacts:
Anaconda--
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

Esri--
Karen Richardson, Public Relations Manager, Esri
Mobile: +1 587-873-0157
Email: krichardson@esri.com

## October 10, 2016

### Enthought

#### Geophysical Tutorial: Facies Classification using Machine Learning and Python

By Brendon Hall, Enthought Geosciences Applications Engineer
Coordinated by Matt Hall, Agile Geoscience

ABSTRACT

There has been much excitement recently about big data and the dire need for data scientists who possess the ability to extract meaning from it. Geoscientists, meanwhile, have been doing science with voluminous data for years, without needing to brag about how big it is. But now that large, complex data sets are widely available, there has been a proliferation of tools and techniques for analyzing them. Many free and open-source packages now exist that provide powerful additions to the geoscientist’s toolbox, much of which used to be only available in proprietary (and expensive) software platforms.

One of the best examples is scikit-learn, a collection of tools for machine learning in Python. What is machine learning? You can think of it as a set of data-analysis methods that includes classification, clustering, and regression. These algorithms can be used to discover features and trends within the data without being explicitly programmed, in essence learning from the data itself.

Well logs and facies classification results from a single well.

In this tutorial, we will demonstrate how to use a classification algorithm known as a support vector machine to identify lithofacies based on well-log measurements. A support vector machine (or SVM) is a type of supervised-learning algorithm, which needs to be supplied with training data to learn the relationships between the measurements (or features) and the classes to be assigned. In our case, the features will be well-log data from nine gas wells. These wells have already had lithofacies classes assigned based on core descriptions. Once we have trained a classifier, we will use it to assign facies to wells that have not been described.
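The train-then-assign workflow described above can be sketched in a few lines. The tutorial itself uses scikit-learn; to stay dependency-free, this sketch swaps in a hand-written linear SVM trained by batch subgradient descent on the hinge loss, with only two classes. The data is synthetic, and `make_wells`, `train_linear_svm` and `predict` are hypothetical helper names, not part of the tutorial.

```python
import random


def train_linear_svm(X, y, lam=0.01, lr=0.1, n_iters=200):
    """Fit w, b by batch subgradient descent on the L2-regularized hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iters):
        grad_w = [lam * w_j for w_j in w]
        grad_b = 0.0
        for x_i, y_i in zip(X, y):
            margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
            if margin < 1.0:  # this sample violates the margin
                for j in range(d):
                    grad_w[j] -= y_i * x_i[j] / n
                grad_b -= y_i / n
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, X):
    """Assign class +1 or -1 from the sign of the decision function."""
    return [1 if sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b >= 0 else -1
            for x_i in X]


def make_wells(rng, n_per_class):
    """Synthetic stand-in for labeled well-log samples (two features, two facies)."""
    X, y = [], []
    for label, center in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 0.7), rng.gauss(center, 0.7)])
            y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_wells(rng, 100)  # wells with facies already described
X_new, y_new = make_wells(rng, 50)       # wells treated as not yet described
w, b = train_linear_svm(X_train, y_train)
y_pred = predict(w, b, X_new)
accuracy = sum(p == t for p, t in zip(y_pred, y_new)) / len(y_new)
print(f"held-out accuracy: {accuracy:.2f}")
```

Real well-log data has more features and more than two facies, so a library implementation with feature scaling and a multi-class strategy is the practical choice; the sketch only shows the shape of the supervised-learning loop.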

See the tutorial in The Leading Edge here.

## What is RethinkDB?

RethinkDB is an INCREDIBLE, high-quality, polished open source realtime database that is easy to deploy, shard, and replicate, and supports a reactive client programming model, which is useful for collaborative web-based applications. Shockingly, the 7-year-old company that created RethinkDB has just shut down. I am the CEO of a company, SageMath, Inc., that uses RethinkDB very heavily, so I have a strong interest in RethinkDB surviving as an independent open source project.

## Three Types of Open Source Projects

There are many types of open source projects. RethinkDB was the type where most of the work has been full-time, focused work done by employees of the RethinkDB company. RethinkDB is licensed under the AGPL, but the company promised to make the software available to customers under other licenses.

Academia: I started the SageMath open source math software project in 2005, which has over 500 contributors, and a relatively healthy volunteer ecosystem, with about a hundred contributors to each release, and many releases each year. These are mostly volunteer contributions by academics: usually grad students, postdocs, and math professors. They contribute because SageMath is directly relevant to their research, and they often contribute state-of-the-art code that implements algorithms they have created or refined as part of their research. Sage is licensed under the GPL, and that license has worked extremely well for us. Academics sometimes even get significant grants from the NSF or the EU to support Sage development.

Companies: I also started the Cython compiler project in 2007, which has had dozens of contributors and is now the de facto standard for writing or wrapping fast code for use by Python. The developers of Cython mostly work at companies (e.g., Google) and contribute as a side project in their spare time. (Here's a message today about a new release from a Cython developer, who works at Google.) Cython is licensed under the Apache License.

## What RethinkDB Will Become

RethinkDB will no longer be an open source project whose development is sponsored by a single company dedicated to the project. Will it be an academic project, a company-supported project, or dead?

A friend of mine at Oxford University surveyed his academic CS colleagues about RethinkDB, and they said they had zero interest in it. Indeed, from an academic research point of view, I agree that there is nothing interesting about RethinkDB. I myself am a college professor, and understand these people! Academic volunteer open source contributors are definitely not going to come to RethinkDB's rescue. The value in RethinkDB is not in the innovative new algorithms or ideas, but in the high quality carefully debugged implementations of standard algorithms (largely the work of bad ass German programmer Daniel Mewes). The RethinkDB devs had to carefully tune each parameter in those algorithms based on extensive automated testing, user feedback, the Jepsen tests, etc.

That leaves companies. Whether or not you like or agree with this, many companies will not touch AGPL licensed code:
"Google open source guru Chris DiBona says that the web giant continues to ban the lightning-rod AGPL open source license within the company because doing so "saves engineering time" and because most AGPL projects are of no use to the company."

With RethinkDB today, the only option is AGPL. This very strongly discourages use by the only possible group of users and developers that has any chance of keeping RethinkDB from death. If this situation is not resolved as soon as possible, I am extremely afraid that it never will be resolved. Ever. If you care about RethinkDB, you should be afraid too. Ignoring the landscape and culture of volunteer open source projects is dangerous.

## A Proposal

I don't know who can make the decision to relicense RethinkDB. I don't know what is going on with investors or who is in control. I am an outsider. Here is a proposal that might provide a way out today:

PROPOSAL: Dear RethinkDB, sell me an Apache (or BSD) license to the RethinkDB source code. Make this the last thing your company sells before it shuts down. Just do it.

Hacker News Discussion

## October 05, 2016

### William Stein

#### SageMath: "it's not research"

The University of Washington (UW) mathematics department has funding for grad students to "travel to conferences". What sort of travel funding?

• The department has some money available.
• The UW Graduate school has some money available: They only provide funding for students giving a talk or presenting a poster.
• The UW GPSS has some money available: contact them directly to apply (they only provide funds for "active conference participation", which I think means giving a talk, presenting a poster, or similar)

One of my two Ph.D. students at UW asked our Grad program director: "I'll be going to Joint Mathematics Meetings (JMM) to help out at the SageMath booth. Is this a thing I can get funding for?"

ANSWER: Travel funds are primarily meant to support research, so although I appreciate people helping out at the SageMath booth, I think that's not the best use of the department's money.

I think this "it's not research" perspective on the value of mathematical software is unfortunate and shortsighted. Moreover, it's especially surprising as the person who wrote the above answer has contributed substantially to the algebraic topology functionality of Sage itself, so he knows exactly what Sage is.

Sigh. Can some blessed person with an NSF grant out there pay for this grad student's travel expenses to help with the Sage booth? Or do I have to use the handful of $10, $50, etc., donations I've gotten over the last few months for this purpose?

## September 27, 2016

### Continuum Analytics news

#### Continuum Analytics Joins Forces with IBM to Bring Open Data Science to the Enterprise

Tuesday, September 27, 2016

Optimized Python experience empowers data scientists to develop advanced open source analytics on Spark

AUSTIN, TEXAS—September 27, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced an alliance with IBM to advance open source analytics for the enterprise. Data scientists and data engineers in open source communities can now embrace Python and R to develop analytic and machine learning models in the Spark environment through its integration with IBM's Project DataWorks.

Combining the power of IBM's Project DataWorks with Anaconda enables organizations to build high-performance Python and R data science models and visualization applications required to compete in today’s data-driven economy. The companies will collaborate on several open source initiatives including enhancements to Apache Spark that fully leverage Jupyter Notebooks with Apache Spark – benefiting the entire data science community.

“Our strategic relationship with Continuum Analytics empowers Project DataWorks users with full access to the Anaconda platform to streamline and help accelerate the development of advanced machine learning models and next-generation analytics apps,” said Ritika Gunnar, vice president, IBM Analytics. “This allows data science professionals to utilize the tools they are most comfortable with in an environment that reinforces collaboration with colleagues of different skillsets.”

By collaborating to bring about the best Spark experience for Open Data Science in IBM's Project DataWorks, enterprises are able to easily connect their data, analytics and compute with innovative machine learning to accelerate and deploy their data science solutions.

“We welcome IBM to the growing family of industry titans that recognize Anaconda as the de facto Open Data Science platform for enterprises,” said Michele Chambers, EVP of Anaconda Business & CMO at Continuum Analytics. “As the next generation moves from machine learning to artificial intelligence, cloud-based solutions are key to help companies adopt and develop agile solutions––IBM recognizes that. We’re thrilled to be one of the driving forces powering the future of machine learning and artificial intelligence in the Spark environment.”

IBM's Project DataWorks is the industry’s first cloud-based data and analytics platform that integrates all types of data to enable AI-powered decision making. With this, companies are able to realize the full promise of data by enabling data professionals to collaborate and build cognitive solutions by combining IBM data and analytics services and a growing ecosystem of data and analytics partners - all delivered on Apache Spark. Project DataWorks is designed to allow for faster development and deployment of data and analytics solutions with self-service user experiences to help accelerate business value.

To learn more, join Bob Picciano, SVP of IBM Analytics and Travis Oliphant, CEO of Continuum Analytics at the IBM DataFirst Launch Event on Sept 27, 2016, Hudson Mercantile Building in NYC. The event is also available on livestream.

Our community loves Anaconda because it empowers the entire data science team––data scientists, developers, DevOps, architects and business analysts––to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created and contributed to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, Pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

###

Media Contact:
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

## September 23, 2016

### Continuum Analytics news

#### Why Your Company Needs a Chief Data Science Officer

Friday, September 23, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

This article was originally posted in CMSWire and has been edited for length and clarity.

Ten years ago, the Chief Data Science Officer (CDSO) role was non-existent. It came into being when D.J. Patil was named the first US Chief Data Scientist by President Obama in 2015. A product of the Chief Technology Officer (CTO), who is responsible for focusing on scientific and technological matters within an organization including the company’s hardware and software, the CDSO takes technology within the enterprise to a whole new level. Companies have been motivated to get on the data train since 1990, when they began implementing big data collection.

However, making sense of the data was a challenge. With no dedicated person to own and manage the huge piles of datasets being collected, organizations began to flail and sink under the weight of all their data. It didn’t happen overnight, but data scientists and, subsequently, the role of the CDSO, came to life once companies realized that proper data analysis was key to finding correlations needed to spot business trends and, ultimately, exploit the power of big data to deliver value.

The CDSO role confirms the criticality of collecting data properly to capitalize on it and make certain it is stored securely in the event of a disaster or emergency (some businesses have yet to recover data following Hurricane Sandy).

Fast forward to 2016. Big data has exploded, but companies are still struggling with how best to organize around it — as an activity, a business function and a capability.

But what exactly can we achieve with it?

CDSOs (and their teams of data scientists) bring the skill set needed to apply analytics to the business, explain how to use data to create a competitive advantage, and find true value in data by acting on it.

### Empowering Data Science Teams

Today, businesses are equipped with data science teams made up of a variety of roles––business analysts, machine learning experts, data engineers and more.

With the CDSO at the helm, the data science team can collaborate and centralize these skills, becoming a hub of intelligence and adding value to each business they serve. With a multifaceted perspective on data science as a whole, the CDSO allows for more innovative ideas and solutions for companies.

### Staying Cost Efficient

It’s no secret that how businesses handle data has a direct impact on the bottom line. An interesting example occurred at DuPont, a company that defines itself as “a science company dedicated to solving challenging global problems” and is well known for its distribution of Corian solid surface countertops across the world. When asked if it believed it was covering its entire total addressable market (TAM), company executives were definitive in their response: a resounding yes.

Executives knew they had covered every region in the market and had great insight into analytics via distributors. What they hadn’t taken into consideration, however, was the vast amount of data embedded within end-customer insights. Without knowing exactly where the product was being installed, DuPont literally had no insight into the locations it had not saturated.

DuPont took this information and created countertops that embedded sensors driven by Internet of Things (IoT) technology. By not simply relying on the data provided by its suppliers, DuPont seized the opportunity to increase its pool of knowledge significantly, by adding data science into its product.

This is just one example of how data science and the CDSO can implement previously non-existent processes and drive increased business intelligence in the most beneficial way –– with increased value to its re-sellers and a direct impact on revenue.

### Changing the World

There is no room for doubt: it’s proven that innovation in the field of Open Data Science has led to the need for a CDSO to derive as much value from data as possible and help companies make an impact on the world.

John Deere, a 180-year-old company, is now revolutionizing farming by incorporating “smart farms.” Big data and IoT solutions allow farmers to make educated decisions based on real-time analysis of captured data. Giving this company the ability to put its data to good use resulted in industry-wide and, in many areas, worldwide positive changes, which is another reason why technology driven by the CDSO is an integral part of any organization.

The need for an executive-level decision maker proves to be an essential piece of the puzzle. The CDSO deserves a seat at the executive table to empower data science teams, drive cost efficiency and previously unimagined results and, most importantly, help companies change the world.

## September 22, 2016

### Matthew Rocklin

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project.

All code in this post is experimental. It should not be relied upon. People looking to deploy dask.distributed on a cluster should refer to the documentation instead.

Dask is deployed today on the following systems in the wild:

• SGE
• SLURM
• Torque
• Condor
• LSF
• Mesos
• Marathon
• Kubernetes
• SSH and custom scripts
• … there may be more. This is what I know of first-hand.

These systems provide users access to cluster resources and ensure that many distributed services / users play nicely together. They’re essential for any modern cluster deployment.

The people deploying Dask on these cluster resource managers are power-users; they know how their resource managers work and they read the documentation on how to set up Dask clusters. Generally these users are pretty happy; however, we should lower this barrier so that non-power-users with access to a cluster resource manager can use Dask on their cluster just as easily.

Unfortunately, there are a few challenges:

1. Several cluster resource managers exist, each with significant adoption. Finite developer time stops us from supporting all of them.
2. Policies for scaling out vary widely. For example we might want a fixed number of workers, or we might want workers that scale out based on current use. Different groups will want different solutions.
3. Individual cluster deployments are highly configurable. Dask needs to get out of the way quickly and let existing technologies configure themselves.

This post talks about some of these issues. It does not contain a definitive solution.

## Example: Kubernetes

For example, both Olivier Grisel (INRIA, scikit-learn) and Tim O’Donnell (Mount Sinai, Hammer lab) publish instructions on how to deploy Dask.distributed on Kubernetes.

These instructions are well organized. They include Dockerfiles, published images, Kubernetes config files, and instructions on how to interact with cloud providers’ infrastructure. Olivier and Tim both obviously know what they’re doing and care about helping others to do the same.

Tim (who came second) wasn’t aware of Olivier’s solution and wrote up his own. Tim was capable of doing this but many beginners wouldn’t be.

One solution would be to include a prominent registry of solutions like these within Dask documentation so that people can find quality references to use as starting points. I’ve started a list of resources here: dask/distributed #547. Comments pointing to other resources would be most welcome.

However, even if Tim did find Olivier’s solution I suspect he would still need to change it. Tim has different software and scalability needs than Olivier. This raises the question of “What should Dask provide and what should it leave to administrators?” It may be that the best we can do is to support copy-paste-edit workflows.

What is Dask-specific, resource-manager specific, and what needs to be configured by hand each time?

In order to explore this topic of separable solutions I built a small adaptive deployment system for Dask.distributed on Marathon, an orchestration platform on top of Mesos.

This solution does two things:

1. It scales a Dask cluster dynamically based on the current use. If there are more tasks in the scheduler then it asks for more workers.
2. It deploys those workers using Marathon.

To encourage replication, these two different aspects are solved in two different pieces of code with a clean API boundary.

1. A backend-agnostic piece for adaptivity that says when to scale workers up and how to scale them down safely
2. A Marathon-specific piece that deploys or destroys dask-workers using the Marathon HTTP API

This combines a policy, adaptive scaling, with a backend, Marathon, such that either can be replaced easily. For example, we could replace the adaptive policy with a fixed one that always keeps N workers online, or we could replace Marathon with Kubernetes or Yarn.

My hope is that this demonstration encourages others to develop third party packages. The rest of this post will be about diving into this particular solution.

The distributed.deploy.Adaptive class wraps around a Scheduler and determines when we should scale up and by how many nodes, and when we should scale down, specifying which idle workers to release.

The current policy is fairly straightforward:

1. If there are unassigned tasks or any stealable tasks and no idle workers, or if the average memory use is over 50%, then increase the number of workers by a fixed factor (defaults to two).
2. If there are idle workers and the average memory use is below 50% then reclaim the idle workers with the least data on them (after moving data to nearby workers) until we’re near 50%
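As a rough illustration, the two rules above can be written as a pure function. This is a sketch and not the code in distributed.deploy.Adaptive: the function name, its arguments and the returned tuples are all assumptions made for this example, rule 1 is read as "(pending tasks and no idle workers) or memory over 50%", and the steps of moving data off idle workers and stopping once memory use nears 50% are left to the caller.

```python
def adaptive_decision(n_workers, unassigned_tasks, stealable_tasks,
                      idle_workers, avg_memory_fraction, scale_factor=2):
    """Return ('up', target_count), ('down', release_order) or ('hold', None).

    idle_workers maps worker address -> number of bytes stored on that worker.
    All names here are hypothetical; this is not the Adaptive class itself.
    """
    pending = unassigned_tasks > 0 or stealable_tasks > 0
    if (pending and not idle_workers) or avg_memory_fraction > 0.5:
        # Rule 1: grow the cluster by a fixed factor (at least one worker)
        return 'up', max(1, n_workers * scale_factor)
    if idle_workers and avg_memory_fraction < 0.5:
        # Rule 2: candidates for release, those holding the least data first
        return 'down', sorted(idle_workers, key=idle_workers.get)
    return 'hold', None


print(adaptive_decision(4, 10, 0, {}, 0.2))                 # ('up', 8)
print(adaptive_decision(4, 0, 0, {'a': 100, 'b': 5}, 0.2))  # ('down', ['b', 'a'])
```

Keeping the decision free of side effects makes it easy to test a policy without starting a scheduler.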

Think this policy could be improved or have other thoughts? Great. It was easy to implement and entirely separable from the main code so you should be able to edit it easily or create your own. The current implementation is about 80 lines (source).

However, this Adaptive class doesn’t actually know how to perform the scaling. Instead it depends on being handed a separate object, with two methods, scale_up and scale_down:

    class MyCluster(object):
        def scale_up(self, n):
            """
            Bring the total count of workers up to n

            This function/coroutine should bring the total number of workers up to
            the number n.
            """
            raise NotImplementedError()

        def scale_down(self, workers):
            """
            Remove workers from the cluster

            Given a list of worker addresses this function should remove those
            workers from the cluster.
            """
            raise NotImplementedError()

This cluster object contains the backend-specific bits of how to scale up and down, but none of the adaptive logic of when to scale up and down. The single-machine LocalCluster object serves as a reference implementation.
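To make the interface concrete, here is a toy backend that only tracks worker names in memory. InMemoryCluster and its worker naming scheme are invented for this sketch; a real backend would start and stop processes.

```python
class InMemoryCluster(object):
    """Toy backend (hypothetical): 'workers' are just names kept in a list."""

    def __init__(self):
        self.workers = []
        self._counter = 0

    def scale_up(self, n):
        # Bring the total count of workers up to n (never removes any)
        while len(self.workers) < n:
            self.workers.append('worker-%d' % self._counter)
            self._counter += 1

    def scale_down(self, workers):
        # Remove exactly the workers named by the caller
        doomed = set(workers)
        self.workers = [w for w in self.workers if w not in doomed]


cluster = InMemoryCluster()
cluster.scale_up(3)
cluster.scale_down(['worker-1'])
print(cluster.workers)  # ['worker-0', 'worker-2']
```

Note that scale_up takes a target total while scale_down takes specific workers, because only the policy knows which workers are safe to release.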

So we combine this adaptive scheme with a deployment scheme. We’ll use a tiny Dask-Marathon deployment library available here.

    from distributed import Scheduler

    s = Scheduler()
    mc = MarathonCluster(s, cpus=1, mem=4000,

This combines a policy, Adaptive, with a deployment scheme, Marathon, in a composable way. The Adaptive cluster watches the scheduler and calls the scale_up/down methods on the MarathonCluster as necessary.

## Marathon code

Because we’ve isolated all of the “when” logic to the Adaptive code, the Marathon specific code is blissfully short and specific. We include a slightly simplified version below. There is a fair amount of Marathon-specific setup in the constructor and then simple scale_up/down methods below:

    import uuid

    from marathon import MarathonClient, MarathonApp
    from marathon.models.container import MarathonContainer

    class MarathonCluster(object):
        # docker_image and marathon_address are supplied by the caller
        def __init__(self, scheduler, docker_image, marathon_address,
                     executable='dask-worker',
                     name=None, cpus=1, mem=4000, **kwargs):
            self.scheduler = scheduler

            # Create Marathon App to run dask-worker
            args = [
                executable,
                scheduler.address,  # address the workers connect to
                '--name', '$MESOS_TASK_ID',  # use Mesos task ID as worker name
                '--worker-port', '$PORT_WORKER',
                '--nanny-port', '$PORT_NANNY',
                '--http-port', '$PORT_HTTP'
            ]

            # Ask Marathon for three dynamically assigned TCP ports
            ports = [{'port': 0,
                      'protocol': 'tcp',
                      'name': port_name}
                     for port_name in ['worker', 'nanny', 'http']]

            # Cap the worker's memory limit at 60% of the container's memory
            args.extend(['--memory-limit',
                         str(int(mem * 0.6 * 1e6))])

            kwargs['cmd'] = ' '.join(args)
            container = MarathonContainer({'image': docker_image})

            app = MarathonApp(instances=0,
                              container=container,
                              port_definitions=ports,
                              cpus=cpus, mem=mem, **kwargs)

            # Connect and register app
            self.client = MarathonClient(marathon_address)
            self.app = self.client.create_app(name or 'dask-%s' % uuid.uuid4(), app)

        def scale_up(self, instances):
            self.client.scale_app(self.app.id, instances=instances)

        def scale_down(self, workers):
            for w in workers:
                # Kill the Marathon task behind each worker and shrink the app
                self.client.kill_task(self.app.id,
                                      self.scheduler.worker_info[w]['name'],
                                      scale=True)

This isn’t trivial; you need to know about Marathon for this to make sense, but fortunately you don’t need to know much else. My hope is that people familiar with other cluster resource managers will be able to write similar objects and will publish them as third party libraries as I have with this Marathon solution here: https://github.com/mrocklin/dask-marathon (thanks go to Ben Zaitlen for setting up a great testing harness for this and getting everything started).

Similarly, we can design new policies for deployment. You can read more about the policies for the Adaptive class in the documentation or the source (about eighty lines long). I encourage people to implement and use other policies and contribute back those policies that are useful in practice.
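For instance, the fixed policy mentioned earlier (always keep N workers online) needs very little code against the same scale_up/scale_down interface. The FixedSize class, its adapt method and the StubCluster used to exercise it are hypothetical names for this sketch, not part of distributed.

```python
class FixedSize(object):
    """Policy sketch (hypothetical): keep exactly n workers online."""

    def __init__(self, cluster, n):
        self.cluster = cluster
        self.n = n

    def adapt(self, current_workers):
        # Called periodically with the addresses of the workers currently alive
        current_workers = list(current_workers)
        if len(current_workers) < self.n:
            self.cluster.scale_up(self.n)
        elif len(current_workers) > self.n:
            self.cluster.scale_down(current_workers[self.n:])


class StubCluster(object):
    """Records the calls a policy makes, standing in for a real backend."""

    def __init__(self):
        self.calls = []

    def scale_up(self, n):
        self.calls.append(('scale_up', n))

    def scale_down(self, workers):
        self.calls.append(('scale_down', list(workers)))


stub = StubCluster()
policy = FixedSize(stub, n=2)
policy.adapt(['a'])            # too few workers
policy.adapt(['a', 'b', 'c'])  # one too many
print(stub.calls)  # [('scale_up', 2), ('scale_down', ['c'])]
```

Because the policy only talks to the two methods, the same class would work unchanged with a Marathon, Kubernetes or in-memory backend.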

## Final thoughts

We laid out a problem:

• How does a distributed system support a variety of cluster resource managers and a variety of scheduling policies while remaining sensible?

We proposed two solutions:

1. Maintain a registry of links to solutions, supporting copy-paste-edit practices
2. Develop an API boundary that encourages separable development of third party libraries.

It’s not clear that either solution is sufficient, or that the current implementation of either solution is any good. This is an important problem though, as Dask.distributed is, today, still mostly used by super-users. I would like to engage community creativity here as we search for a good solution.

## September 20, 2016

### Matthieu Brucher

#### Audio Toolkit: Handling denormals

While following a discussion on KVR, I thought about adding support for denormal handling in Audio Toolkit.

# What are denormals?

Denormals, or denormal numbers, are numbers that can’t be represented the “usual” way in floating point representation. When this happens, the floating point units can’t be as fast as with the usual representation. These numbers are very small, almost 0, but not exactly 0. This can often happen in audio processing at the end of the processing of a clip, and sometimes during computation for a handful of values.
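A small sketch can make this concrete for IEEE 754 double precision, where a denormal is a nonzero value whose exponent field is all zeros. The helper name is_denormal is made up for this example, and it assumes a platform that does not flush denormals to zero by default.

```python
import struct
import sys


def is_denormal(x):
    """True if x is a nonzero double whose biased exponent field is all zeros."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0 and mantissa != 0


smallest_normal = sys.float_info.min   # smallest "usual" positive double
decayed = smallest_normal / 4          # e.g. the tail of a decaying signal

print(is_denormal(smallest_normal))  # False
print(is_denormal(decayed))          # True: tiny, but still not zero
print(is_denormal(0.0))              # False
```

A signal that keeps being multiplied by a factor below 1 passes through this range before it finally reaches zero, which is why the end of a clip is a typical trouble spot.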

In the past, on AMD CPUs, the FPU would even use the denormal process for bigger values than on the Intel CPUs, which led to poorer performance. This doesn’t happen anymore AFAIK, but if your application is slow, you may want to take a look at a profile and determine if you could have an issue there. Denormal behavior can be detected by an abnormal ratio of floating point operations per cycle (the number is too low).

# Flush to zero on different platforms

There are different ways of avoiding denormals. A not-so-good one is to add background noise to the operations. The issue is choosing the amount of noise (random or constant), and the fact that not all algorithms can handle the added noise.

The better solution is to use the CPU facilities for this. x86 processors have internal flags that can be used to flush denormals to zero. Unfortunately, the API is different on all platforms.

On Windows, the following function is used:

    _controlfp_s(&previous_state, _DN_FLUSH, _MCW_DN);

On Linux, gcc provides SSE intrinsics for the same purpose:

    _mm_setcsr(_mm_getcsr() | _MM_DENORMALS_ZERO_ON);

And finally, OS X has yet a different way of doing things.

    fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);

The ARM platform is different again. The default compiler also has an API, but if you are compiling with GCC, flush to zero is activated through the command line option -funsafe-math-optimizations.

# Using flush to zero

The flag needs to be set before processing anything, and the same functions then need to be called again afterwards to put the state back as it was before. This ensures that the calling code gets back the FPU state it started with. What your functions can handle (arbitrary noise for small values) may not be acceptable for other applications.
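The save, set and restore pattern can be sketched independently of any platform API. In the sketch below the FPU control register is replaced by a plain dictionary, so fpu_state, flush_to_zero and process are stand-ins invented for illustration and not Audio Toolkit's interface; in C or C++ the same shape would wrap the platform calls listed above.

```python
from contextlib import contextmanager

# Stand-in for the real FPU control register (an assumption for this sketch)
fpu_state = {'flush_to_zero': False}


@contextmanager
def flush_to_zero():
    previous = fpu_state['flush_to_zero']      # remember the caller's setting
    fpu_state['flush_to_zero'] = True
    try:
        yield
    finally:
        fpu_state['flush_to_zero'] = previous  # restore even if processing raises


def process(block):
    with flush_to_zero():
        # The flag is only active while the block is being processed
        return [0.5 * x for x in block]


print(process([2.0, 4.0]))         # [1.0, 2.0]
print(fpu_state['flush_to_zero'])  # False: the caller's state is back
```

Restoring in a finally clause (or a destructor in C++) matters because an exception in the middle of processing would otherwise leak the changed state to the caller.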

The total amount of change in terms of performance may not be impressive. Using the functions to change FPU state means that there is an overhead (that may not be important, but an overhead nonetheless) and that the algorithms will behave slightly differently. So flushing to zero is about compromise.

# Conclusion

Flushing denormals to zero may not be mandatory, but having the option to enable it is neat. So this is now available in Audio Toolkit 1.3.2.

## September 19, 2016

### Continuum Analytics news

#### Continuum Analytics to Speak at Strata + Hadoop World, September 27 & 28, in New York City

Tuesday, September 27, 2016

Topics include successfully navigating Open Data Science on Hadoop; demystifying the secret to quickly building intelligent apps with Bokeh, Python

NEW YORK, NY. - September 27, 2016 - Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that Peter Wang, co-founder and CTO, will be speaking at Strata + Hadoop World 2016 on September 28 at 4:35pm (EDT), along with software engineers, Bryan Van de Ven and Sarah Bird, on September 27 at 1:30pm (EDT). Strata + Hadoop World is the largest conference of its kind in the world, bringing together the data industry’s most influential minds to exchange ideas, examine case studies and shape the future of business and technology.

Wang's talk, titled, “Successful Open Data Science on Hadoop: From Sandbox to Production,”  will explore the next generation of hardware and cloud topologies. It will also highlight how Anaconda, the leading Open Data Science platform, continues to incorporate the latest innovations available for data scientists to work while preserving the ability for IT to manage and operate their production environment.

Van de Ven and Bird’s tutorial, “Interactive Data Applications in Python,” will provide attendees with a how-to on creating interactive visualizations that leverage Python using the Bokeh server efficiently, and tips for deploying and sharing the newly created data applications.

WHO: Peter Wang, CTO and co-founder, Continuum Analytics
WHAT: Session: “Successful Open Data Science on Hadoop: From Sandbox to Production”
WHEN: September 28, 4:35pm-5:15pm
WHERE: 1 E 09
REGISTER: HERE

WHO: Bryan Van de Ven and Sarah Bird, software engineers, Continuum Analytics
WHAT: Tutorial: “Interactive Data Applications in Python”
WHEN: September 27, 1:30-5pm
WHERE: 1 E 15/1 E 16
REGISTER: HERE

###

Continuum Analytics’ Anaconda is the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world. Anaconda is trusted by leading businesses worldwide and across industries––financial services, government, health and life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda helps data science teams discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage Open Data Science environments and harness the power of the latest open source analytic and technology innovations. Visit http://www.continuum.io

###

Media Contact:
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

## September 16, 2016

### Enthought

#### Canopy Data Import Tool: New Updates

In May of 2016 we released the Canopy Data Import Tool, a significant new feature of our Canopy graphical analysis environment software. With the Data Import Tool, users can now quickly and easily import CSVs and other structured text files into Pandas DataFrames through a graphical interface, manipulate the data, and create reusable Python scripts to speed future data wrangling.

Watch a 2-minute demo video to see how the Canopy Data Import Tool works:

With the latest version of the Data Import Tool released this month (v. 1.0.4), we’ve added new capabilities and enhancements, including:

1. The ability to select and import a specific table from among multiple tables on a webpage,
2. Intelligent alerts regarding the saved state of exported Python code, and
3. Unlimited file sizes supported for import.

New: Choosing from multiple tables on a webpage

The latest release of the Canopy Data Import Tool supports the selection of a specific table from a webpage for import, such as this Wikipedia page

In addition to CSVs and structured text files, the Canopy Data Import Tool (the Tool) provides the ability to load tables from a webpage. If the webpage contains multiple tables, by default the Tool loads the first table.

With this release, we provide the user with the ability to choose from multiple tables to import using a scrollable index parameter to select the table of interest for import.

For example, let’s try to load a table from the Demography of the UK wiki page using the Tool. In total, there are 10 tables on that wiki page.

• As you can see in the screenshot below, the Tool initially loads the first table on the wiki page.
• However, we are interested in loading the table ‘Vital statistics since 1960’, which is the fifth table on the page. (Note that indexing starts at 0. For a quick history lesson on why Python uses zero-based indexing, see Guido van Rossum’s explanation here.)
• After the initial read-in, we can click on the ‘Table index on page’ scroll bar, choose ‘4’ and click on ‘Refresh Data’ to load the table of interest in the Data Import Tool.

See how the Canopy Data Import Tool loads a table from a webpage and prepares the data for manipulation and interaction:

The Data Import Tool allows you to select a specific table from a webpage where multiple are present, with a simple drop-down menu. Once you’ve selected your table, you can readily toggle between 3 views: the Pandas DataFrame generated by the Tool, the raw data and the corresponding auto-generated Python code. Subsequently, you can export the DataFrame to the IPython console for plotting and further analysis.

• Further, as you can see, the first row contains column names and the first column looks like an index for the DataFrame. Therefore, you can select the ‘First row is column names’ checkbox and again click on ‘Refresh Data’ to prompt the Tool to re-read the table but, this time, use the data in the first row as column names. Then, you can right-click on the first column and select the ‘Set as Index’ option to make column 0 the index of the DataFrame.
• You can toggle between the DataFrame, Raw Data and Python Code tabs in the Tool, to peek at the raw data being loaded by the Tool and the corresponding Python code auto-generated by the Tool.
• Finally, you can click on the ‘Use DataFrame’ button, in the bottom right, to send the DataFrame to the IPython kernel in the Canopy User Environment, for plotting and further analysis.
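The selection steps above (pick a table by zero-based index, treat the first row as column names, use the first column as the index) can be mimicked in a few lines of standard-library Python. This is only an illustration of the logic, not the code the Tool generates; TableCollector and load_table are hypothetical names, the page is a synthetic stand-in for the wiki page, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append([])
        elif tag == 'tr' and self.tables:
            self.tables[-1].append([])
        elif tag in ('td', 'th'):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self.tables[-1][-1].append(''.join(self._cell).strip())
            self._cell = None


def load_table(html, table_index=0):
    parser = TableCollector()
    parser.feed(html)
    header, *rows = parser.tables[table_index]  # first row is column names
    # First column becomes the index; remaining columns are keyed by name
    return {row[0]: dict(zip(header[1:], row[1:])) for row in rows}


# Synthetic page with 10 small tables
page = ''.join(
    '<table><tr><th>Year</th><th>Value</th></tr>'
    '<tr><td>%d</td><td>t%d</td></tr></table>' % (1960 + i, i)
    for i in range(10))
print(load_table(page, table_index=4))  # fifth table: {'1964': {'Value': 't4'}}
```

Because indexing starts at 0, the fifth table on the page is the one at index 4.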

New: Keeping track of exported Python scripts

The Tool generates Python commands for all operations performed by the user and provides the user with the ability to save the generated Python script. With this new update, the Tool keeps track of the saved and current states of the generated Python script and intelligently alerts the user if he/she clicks on the ‘Use DataFrame’ button without saving changes in the Python script.

New: Unlimited file sizes supported for import

In the initial release, we chose to limit the file sizes that can be imported using the Tool to 70 MB, to ensure optimal performance. With this release, we removed that restriction and allow files of any size to be uploaded with the Tool. For files over 70 MB we now provide the user with a warning that interaction, manipulation and operations on the imported DataFrame might be slower than normal, and allow them to select whether to continue or begin with a smaller subset of data to develop a script to be applied to the larger data set.

Along with the feature additions discussed above, based on continued user feedback, we implemented a number of UI/UX improvements and bug fixes in this release. For a complete list of changes introduced in version 1.0.4 of the Data Import Tool, please refer to the Release Notes page in the Tool’s documentation. If you have any feedback regarding the Data Import Tool, we’d love to hear from you at canopy.support@enthought.com.

Download Canopy and start a free 7-day trial of the Data Import Tool.

See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging:

## September 14, 2016

### Continuum Analytics news

#### Working Efficiently with Big Data in Text Formats Using Free Software

Monday, September 12, 2016
David Mertz
Continuum Analytics

One of our first commercial software products at Continuum Analytics was a product called IOPro which we have sold continuously since 2012. Now, we will be releasing the code under a liberal open source license.

Following the path of widely adopted projects like conda, Blaze, Dask, odo, Numba, Bokeh, datashader, DataShape, DyND and other software that Continuum has created, we hope that the code in IOPro becomes valuable to open source communities and data scientists worldwide.

However, we do not only hope this code is useful to you—we also hope you and your colleagues will be able to enhance, refine and develop the code further to increase its utility for the entire Python world.

For existing IOPro customers, we will be providing a free of charge license upon renewal until we release the open source version.

### What IOPro Does

IOPro loads NumPy arrays and pandas DataFrames directly from files, SQL databases and NoSQL stores—including ones with millions (or billions) of rows. It provides a drop-in replacement for NumPy data loading functions but dramatically improves performance and starkly reduces memory overhead.

The key concept in our code is that we access data via adapters which are like enhanced file handles or database cursors.  An adapter does not read data directly into memory, but rather provides a mechanism to use familiar NumPy/pandas slicing syntax to load manageable segments of a large dataset.  Moreover, an adapter provides fine-grained control over exactly how data is eventually read into memory, whether using custom patterns for how a line of data is parsed, choosing the precise data type of a textually represented number, or exposing data as "calculated fields" (that is, "virtual columns").
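To give a feel for the adapter idea, here is a minimal sketch in plain Python. It is not IOPro's or TextAdapter's API: the LineAdapter class, its constructor arguments and the calculated dictionary are assumptions for illustration. Rows are parsed only when sliced, each column has its own converter, and a calculated field is appended as a virtual column.

```python
import io
from itertools import islice


class LineAdapter(object):
    """Lazy view over delimited text (hypothetical); rows are parsed on slicing."""

    def __init__(self, open_source, converters, delimiter=','):
        self.open_source = open_source  # callable returning a fresh file object
        self.converters = converters    # one callable per column
        self.delimiter = delimiter
        self.calculated = {}            # name -> function(parsed_row)

    def __getitem__(self, rows):
        if not isinstance(rows, slice):
            rows = slice(rows, rows + 1)
        # Non-negative indices only: islice cannot count from the end
        with self.open_source() as f:
            wanted = islice(f, rows.start, rows.stop, rows.step)
            return [self._parse(line) for line in wanted]

    def _parse(self, line):
        fields = line.rstrip('\n').split(self.delimiter)
        row = [convert(text) for convert, text in zip(self.converters, fields)]
        extras = [func(row) for func in self.calculated.values()]
        return row + extras             # calculated fields act as virtual columns


text = '1,2.5\n2,3.5\n3,4.5\n4,5.5\n'
adapter = LineAdapter(lambda: io.StringIO(text), converters=[int, float])
adapter.calculated['total'] = lambda row: row[0] + row[1]
print(adapter[1:3])  # [[2, 3.5, 5.5], [3, 4.5, 7.5]]
```

This version still scans linearly up to the requested rows; only the requested segment is parsed and held in memory.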

As well as local CSV, JSON or other textual data sources, IOPro can load data from Amazon S3 buckets.  When accessing large datasets—especially ones too large to load into memory—from files that do not have fixed record sizes, IOPro's indexing feature allows users to seek to a specific collection of records tens, hundreds or thousands of times faster than is possible with a linear scan.
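The speedup from indexing comes from paying for one linear scan up front and then seeking directly to a record. The following sketch shows the idea with a list of byte offsets; the function names are made up for this example and the real index format used by IOPro is not described here.

```python
import io


def build_line_index(f):
    """Scan once and record the byte offset at which each record starts."""
    offsets = []
    position = f.tell()
    for line in iter(f.readline, b''):
        offsets.append(position)
        position += len(line)
    return offsets


def read_record(f, offsets, n):
    """Jump straight to record n instead of scanning from the start."""
    f.seek(offsets[n])
    return f.readline().rstrip(b'\n')


data = io.BytesIO(b'alpha,1\nbeta,22\ngamma,333\n')
index = build_line_index(data)
print(index)                        # [0, 8, 16]
print(read_record(data, index, 2))  # b'gamma,333'
```

With variable-length records there is no way to compute where record n starts, so without an index every lookup costs a scan of everything before it.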

### Our Release Schedule

The initial release of our open source code will be of the TextAdapter component that makes up the better part of the code in IOPro.  This code will be renamed, straightforwardly enough, as TextAdapter. The project will live at https://github.com/ContinuumIO/TextAdapter.  We will make this forked project available by October 15, 2016 under a BSD 3-Clause License.

Additionally, we will release the database adapters by December 31, 2016. That project will live at https://github.com/ContinuumIO/DBAdapter.

If you are a current paid customer of IOPro, and are due for renewal before January 1, 2017, your Anaconda Ambassador will get in touch with you to provide a license free of charge, so you do not experience any downtime.

Thank you to prior contributors at Continuum, especially Jay Bourque (jayvius), but notably also Francesc Alted (FrancescAlted), Óscar Villellas Guillén (ovillellas), Michael Kleehammer (mkleehammer) and Ilan Schnell (ilanschnell) for their wonderful contributions.  Any remaining bugs are my responsibility alone as current maintainer of the project.

### The Blaze Ecosystem

As part of the open source release of TextAdapter, we plan to integrate TextAdapter into the Blaze ecosystem.  Blaze itself, as well as odo, provides translation between data formats and querying of data within a large variety of formats. Putting TextAdapter clearly in this ecosystem will let an adapter act as one such data format, and hence leverage the indexing speedups and data massaging that TextAdapter provides.

### Other Open Source Tools

Other open source projects for interacting with large datasets either compete with or complement these capabilities.

• ParaText from Wise Technology looks like a very promising approach to accelerating raw reads of CSV data.  It doesn't currently provide regular expression matching, nor data typing as rich as IOPro's, but the raw reads are shockingly fast. Most importantly, perhaps, ParaText does not address indexing, so as fast as it is at linear scans, it remains stuck with big-O inefficiencies that TextAdapter addresses.  I personally think that (optionally) utilizing the underlying reader of ParaText as a layer underneath TextAdapter would be a wonderful combination.  Information about ParaText can be found at http://www.wise.io/tech/paratext.

Database access is almost always I/O bound rather than CPU bound, and hence the likely wins are by switching to asynchronous frameworks.  This does involve using a somewhat different programming style than synchronous adapters, but some recent ones look amazingly fast.  I am not yet sure whether it is worthwhile to create IOPro style adapters around these asyncio-based interfaces.

• asyncpg is a database interface library designed specifically for PostgreSQL and Python/asyncio. asyncpg is an efficient, clean implementation of PostgreSQL server binary protocol. Information about asyncpg can be found at https://magicstack.github.io/asyncpg/current/.

We will continue to monitor and reply to issues and discussion about these successor projects at their GitHub repositories - all questions should be addressed at one of the following:

## September 13, 2016

### Titus Brown

#### A draft genome for the tule elk

The tule elk (Cervus elaphus nannodes) is a California-endemic subspecies that underwent a major genetic bottleneck when its numbers were reduced to as few as 3 individuals in the 1870s (McCullough 1969; Meredith et al. 2007). Since then, the population has grown to an estimated 4,300 individuals which currently occur in 22 distinct herds (Hobbs 2014). Despite their higher numbers today, the historical loss of genetic diversity combined with the increasing fragmentation of remaining habitat pose a significant threat to the health and management of contemporary populations. As populations become increasingly fragmented by highways, reservoirs, and other forms of human development, risks intensify for genetic impacts associated with inbreeding. By some estimates, up to 44% of remaining genetic variation could be lost in small isolated herds in just a few generations (Williams et al. 2004). For this reason, the Draft Elk Conservation and Management Plan and California Wildlife Action Plan prioritize research aimed at facilitating habitat connectivity, as well as stemming genetic diversity loss and habitat fragmentation (Hobbs 2014; CDFW 2015).

We obtained 377,980,276 raw reads (i.e., 300 bp sequences from random points in the genome), containing a total of 113.394 Gbp of sequence, or approximately 40X coverage of the tule elk genome. More than 98% of these data passed quality filtering. The reads (and coverage) were distributed approximately equally among the 4 elk, resulting in approximately 10X coverage for each of the 4 elk.
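
As a quick sanity check on these figures, the total sequence and the per-animal coverage follow directly from the read count and read length. A minimal sketch using only the numbers quoted above (the variable names are illustrative):

```python
# Figures quoted in the text above.
raw_reads = 377_980_276      # number of raw reads
read_length_bp = 300         # length of each read, in bp
total_coverage = 40          # approximate fold coverage reported
n_elk = 4                    # animals sequenced

total_bp = raw_reads * read_length_bp
total_gbp = total_bp / 1e9
per_elk_coverage = total_coverage / n_elk

print(f"total sequence: {total_gbp:.3f} Gbp")
print(f"coverage per elk: {per_elk_coverage:.0f}X")
```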

...

The tule elk reads were de novo assembled into 602,862 contiguous sequences ("contigs") averaging 3,973 bp in length (N50 = 6,885 bp, maximum contig length = 72,391 bp), for a total genome sequence size of 2.395 billion bp (Gbp). All scaffolds and raw reads will be made publicly available on Genbank or a similar public database pending publication. Alignment of all elk reads back to these contigs revealed 3,571,069 polymorphic sites (0.15% of sites). Assuming a similar ratio of heterozygous (in individuals) to polymorphic (among the 4 elk) sites as we observed in the subsample aligned to the sheep genome, this would translate to a genome-wide heterozygosity of approximately 5e-4, which was about 5 times higher than that observed in the 25% of the genome mapping to the sheep genome. This magnitude of heterozygosity is in line with other bottlenecked mammal populations, including several of the island foxes (Urocyon littoralis), cheetah (Acinonyx jubatus), Tasmanian devil (Sarcophilus harrisii), and mountain gorilla (Gorilla beringei beringei; Robinson et al. 2016 and references therein). Although these interspecific comparisons provide a general reference, heterozygosity can vary substantially according to life-history, as well as demographic history, and does not necessarily imply a direct relationship to genetic health. Therefore, sequencing the closely related Rocky Mountain (C. elaphus nelsoni) and Roosevelt (C. elaphus roosevelti) elk in the future is necessary to provide the most meaningful comparison to the tule elk heterozygosity reported here.
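
The assembly figures above are mutually consistent, which a few lines of arithmetic can confirm. This sketch assumes the reported mean contig length is rounded, so the product only approximates the stated genome size:

```python
# Figures quoted in the text above.
n_contigs = 602_862
mean_contig_bp = 3_973
polymorphic_sites = 3_571_069

# Approximate total assembly size (the mean contig length is a rounded value).
assembly_bp = n_contigs * mean_contig_bp
polymorphic_fraction = polymorphic_sites / assembly_bp

print(f"assembly size: {assembly_bp / 1e9:.3f} Gbp")
print(f"polymorphic sites: {polymorphic_fraction:.2%}")
```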

Note, assembly method details are available on GitHub.

#### Publishing Open Source Research Software in JOSS - an experience report

Our first JOSS submission (paper? package?) is about to be accepted and I wanted to enthuse about the process a bit.

JOSS, the Journal of Open Source Software, is a place to publish your research software packages. Quoting from the about page,

The Journal of Open Source Software (JOSS) is an academic journal with a formal peer review process that is designed to improve the quality of the software submitted. Upon acceptance into JOSS, a CrossRef DOI is minted and we list your paper on the JOSS website.

How is JOSS different?

In essentially all other academic journals, when you publish software you have to write a bunch of additional stuff about what the software does and how it works and why it's novel or exciting. This is true even in some of the newer models for software publication like F1000Research, which hitherto took the prize for least obnoxious software publication process.

JOSS takes the attitude that what the software does should be laid out in the software documentation. JOSS also has the philosophy that since software is the product perhaps the software itself should be reviewed rather than the software advertisement (aka scientific paper). (Note, I'm a reviewer for JOSS, and I'm totally in cahoots with most of the ed board, but I don't speak for JOSS in any way.)

To put it more succinctly, with JOSS the focus is on the software itself, not on ephemera associated with the software.

## The review experience

I submitted our sourmash project a few months back. Sourmash was a little package I'd put together to do MinHash sketch calculations on DNA, and it wasn't defensible as a novel package. Frankly, it's not that scientifically interesting either. But it's a potentially useful reimplementation of mash, and we'd already found it useful internally. So I submitted it to JOSS.
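
For readers unfamiliar with MinHash, here is a rough sketch of the idea: hash every k-mer of a sequence, keep only the smallest few hashes as a signature, and compare signatures to estimate how similar two sequences are. This is not sourmash's or mash's implementation; the function names, the default k-mer size and sketch size, and the use of SHA-1 as the hash are all assumptions made for illustration.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of all length-k substrings (k-mers) of a DNA sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 is a stand-in hash for this sketch)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(sequence, k=5, size=20):
    """Keep only the `size` smallest k-mer hashes as a compact signature."""
    hashes = sorted({hash_kmer(kmer) for kmer in kmers(sequence, k)})
    return set(hashes[:size])


def estimate_jaccard(sketch_a, sketch_b, size=20):
    """Estimate Jaccard similarity from two bottom-`size` sketches."""
    smallest_of_union = sorted(sketch_a | sketch_b)[:size]
    if not smallest_of_union:
        return 0.0
    shared = sum(1 for h in smallest_of_union if h in sketch_a and h in sketch_b)
    return shared / len(smallest_of_union)


a = minhash_sketch("ACGTACGTTGCAACGTTAGCCGATAGCTAGGCTA")
b = minhash_sketch("ACGTACGTTGCAACGTTAGCCGATAGCTAGGCTT")
print(estimate_jaccard(a, a))  # a sketch compared with itself
print(estimate_jaccard(a, b))  # sequences differing only in the last base
```

The point of the sketch is that only a small, fixed number of hashes per sequence has to be stored and compared, regardless of sequence length.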

As you can see from the JOSS checklist, the reviewer checklist is both simple and reasonably comprehensive. Jeremy Kahn undertook to do the review, and found a host of big and small problems, ranging from licensing confusion to versioning issues to straight up install bugs. Nonetheless his initial review was pretty positive. (Most of the review items were filed as issues on the sourmash repository, which you can see referenced inline in the review/pull request.)

After his initial review, I addressed most of the issues and he did another round of review, where he recommended acceptance after fixing up some of the docs and details.

Probably the biggest impact of Jeremy's review was my realization that we needed to adopt a formal release checklist, which I did by copying Michael Crusoe's detailed and excellent checklist from khmer. This made doing an actual release much saner. But a lot of confusing stuff got cleared up and a few install and test bugs were removed as well.

So, basically, the review did what it should have done - checked our assumptions and found big and little nits that needed to be cleaned up. It was by no means a gimme, and I think it improved the package tremendously.

+1 for JOSS!

## Some thoughts on where JOSS fits

There are plenty of situations where a focus solely on the software isn't appropriate. With our khmer project, we publish new data structures and algorithms, apply our approaches to challenging data sets, benchmark various approaches, and describe the software suite at a high level. But in none of these papers did anyone really review the software (although some of the reviewers on the F1000 Research paper did poke it with a stick).

JOSS fills in a nice niche here where we could receive a third-party review of the software itself. While I think Jeremy Kahn did an especially exemplary review of sourmash and we could not expect such a deep review of the much larger khmer package, a broad review from a third-party perspective at each major release point would be most welcome. So I will plan on a JOSS submission for each major release of khmer, whether or not we also advertise the release elsewhere.

I suppose people might be concerned about publishing software in multiple ways and places, and how that's going to affect citation metrics. I have to say I don't have any concerns about salami slicing or citation inflation here, because software is still largely ignored by Serious Scientists and that's the primary struggle here. (Our experience is that people systematically mis-cite us (despite ridiculously clear guidelines) and my belief is that software and methods are generally undercited. I worry more about that than getting undue credit for software!)

JOSS is already seeing a fair amount of activity and, after my experience, if I see that something was published there, I will be much more likely to recommend it to others. I suggest you all check it out, if not as a place to publish yourself, then as a place to find better quality software.

--titus

### NeuralEnsemble

#### Neo 0.5.0-alpha1 released

We are pleased to announce the first alpha release of Neo 0.5.0.

Neo is a Python library which provides data structures for working with electrophysiology data, whether from biological experiments or from simulations, together with a large library of input-output modules to allow reading from a large number of different electrophysiology file formats (and to write to a somewhat smaller subset, including HDF5 and Matlab).

For Neo 0.5, we have taken the opportunity to simplify the Neo object model. Although this will require an initial time investment for anyone who has written code with an earlier version of Neo, the benefits will be greater simplicity, both in your own code and within the Neo code base, which should allow us to move more quickly in fixing bugs, improving performance and adding new features. For details of what has changed and what has been added, see the Release notes.

If you are already using Neo for your data analysis, we encourage you to give the alpha release a try. The more feedback we get about the alpha release, the quicker we can find and fix bugs. If you do find a bug, please create a ticket. If you have questions, please post them on the mailing list or in the comments below.

Documentation:
Licence:
Modified BSD
Source code:
https://github.com/NeuralEnsemble/python-neo

### Matthieu Brucher

#### Playing with a Bela (1): Turning it on and compiling Audio Toolkit

I now have some time to play with this baby:
Beagleboard with Bela extension
The CPU may not be blazingly fast, but I hope I can still do something with it. The goal of this series will be to try different algorithms and see how they behave on the platform.

# Setting everything up

I got the Bela with the Kickstarter campaign. Although I could have used it as soon as I got it, I didn’t have enough time to really dig into it. Now is the time.

First, I had to update the Bela image to the latest available one. This one allows you to connect to the Internet directly from the Ethernet port, which is required if you need to get source code from the Internet or update the card. So, a nice change.

The root account is the one advised in the wiki, but I would suggest creating a user account, protecting the root account so that you can’t log in with it (especially if the board is plugged into your private network!) and making your user account a sudoer.

Once this is done, let’s tackle Audio Toolkit compilation.

# Compiling Audio Toolkit

For this step, you need to start by getting the latest gcc, cmake, libeigen3-dev and the boost libraries (libboost-all-dev or just system, timer and test). Now we have all the dependencies.

Get the develop branch of Audio Toolkit (a future release including the latest updates and ARM support is coming soon) from GitHub: https://github.com/mbrucher/AudioTK and launch cmake to generate the Makefiles.

Although the C++11 flag is activated by default, that is not the case for the other flags that the ARM board requires. On top of it, we need -march=native -mfpu=neon -funsafe-math-optimizations. The first option triggers ARM code generation for the Beagleboard platform; the second one allows the use of NEON intrinsics and floating point instructions. The last one is the interesting one: it allows the compiler to optimize some math operations, such as denormal processing, by flushing denormals to zero (effectively what FE_DFL_DISABLE_SSE_DENORMS_ENV does on OS X or _MM_DENORMALS_ZERO_ON does on GCC with x86).

The compilation takes time, but it finishes with all the libraries and tests.

# Conclusion

I now have the basic structure ready. The compilation of Audio Toolkit is slow, but I hope the code itself will be fast enough. Let’s keep this for the next post in this series.

### Matthew Rocklin

This post compares two Python distributed task processing systems, Dask.distributed and Celery.

Disclaimer: technical comparisons are hard to do well. I am biased towards Dask and ignorant of correct Celery practices. Please keep this in mind. Critical feedback by Celery experts is welcome.

Celery is a distributed task queue built in Python and heavily used by the Python community for task-based workloads.

Dask is a parallel computing library popular within the PyData community that has grown a fairly sophisticated distributed task scheduler. This post explores if Dask.distributed can be useful for Celery-style problems.

Comparing technical projects is hard both because authors have bias, and also because the scope of each project can be quite large. This allows authors to gravitate towards the features that show off their own strengths. Fortunately a Celery user asked on GitHub how Dask compares, and they listed a few concrete features:

1. Handling multiple queues
2. Canvas (celery’s workflow)
3. Rate limiting
4. Retrying

These provide an opportunity to explore the Dask/Celery comparison from the bias of a Celery user rather than from the bias of a Dask developer.

In this post I’ll point out a couple of large differences, then go through the Celery hello world in both projects, and then address how these requested features are implemented or not within Dask. This anecdotal comparison over a few features should give us a general comparison.

## Biggest difference: Worker state and communication

First, the biggest difference (from my perspective) is that Dask workers hold onto intermediate results and communicate data between each other while in Celery all results flow back to a central authority. This difference was critical when building out large parallel arrays and dataframes (Dask’s original purpose) where we needed to engage our worker processes’ memory and inter-worker communication bandwidths. Computational systems like Dask do this; more data-engineering-oriented systems like Celery/Airflow/Luigi don’t. This is the main reason why Dask wasn’t built on top of Celery/Airflow/Luigi originally.

That’s not a knock against Celery/Airflow/Luigi by any means. Typically they’re used in settings where this doesn’t matter and they’ve focused their energies on several features that Dask similarly doesn’t care about or do well. Tasks usually read data from some globally accessible store like a database or S3 and either return very small results, or place larger results back in the global store.

The question on my mind now is: can Dask be a useful solution in more traditional loose task scheduling problems where projects like Celery are typically used? What are the benefits and drawbacks?

## Hello World

To start we do the First steps with Celery walk-through both in Celery and Dask and compare the two:

### Celery

I follow the Celery quickstart, using Redis instead of RabbitMQ because it’s what I happen to have handy.

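# tasks.py (assumes a Redis server on localhost)
from celery import Celery

app = Celery('tasks', broker='redis://localhost', backend='redis://localhost')

@app.task
def add(x, y):
    return x + y

$ redis-server
$ celery -A tasks worker --loglevel=info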

In [2]: %time add.delay(1, 1).get()  # submit and retrieve roundtrip
CPU times: user 60 ms, sys: 8 ms, total: 68 ms
Wall time: 567 ms
Out[2]: 2

In [3]: %%time
...: futures = [add.delay(i, i) for i in range(1000)]
...: results = [f.get() for f in futures]
...:
CPU times: user 888 ms, sys: 72 ms, total: 960 ms
Wall time: 1.7 s

### Dask

We do the same workload with dask.distributed’s concurrent.futures interface, using the default single-machine deployment.

In [1]: from distributed import Client

In [2]: c = Client()

In [3]: from operator import add

In [4]: %time c.submit(add, 1, 1).result()
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 20.7 ms
Out[4]: 2

In [5]: %%time
...: futures = [c.submit(add, i, i) for i in range(1000)]
...: results = c.gather(futures)
...:
CPU times: user 328 ms, sys: 12 ms, total: 340 ms
Wall time: 369 ms

### Comparison

• Functions: In Celery you register computations ahead of time on the server. This is good if you know what you want to run ahead of time (such as is often the case in data engineering workloads) and don’t want the security risk of allowing users to run arbitrary code on your cluster. It’s less pleasant for users who want to experiment. In Dask we choose the functions to run on the user side, not on the server side. This ends up being pretty critical in data exploration but may be a hindrance in more conservative/secure compute settings.
• Setup: In Celery we depend on other widely deployed systems like RabbitMQ or Redis. Dask depends on lower-level Tornado TCP IOStreams and Dask’s own custom routing logic. This makes Dask trivial to set up, but also probably less durable. Redis and RabbitMQ have both solved lots of problems that come up in the wild and leaning on them inspires confidence.
• Performance: They both operate with sub-second latencies and millisecond-ish overheads. Dask is marginally lower-overhead but for data engineering workloads differences at this level are rarely significant. Dask is an order of magnitude lower-latency, which might be a big deal depending on your application. For example if you’re firing off tasks from a user clicking a button on a website 20ms is generally within interactive budget while 500ms feels a bit slower.

## Simple Dependencies

Often tasks depend on the results of other tasks. Both systems have ways to help users express these dependencies.

### Celery

The apply_async method has a link= parameter that can be used to call tasks after other tasks have run. For example we can compute (1 + 2) + 3 in Celery as follows:

### Dask

With the Dask concurrent.futures API, futures can be used within submit calls and dependencies are implicit.

We could also use the dask.delayed decorator to annotate arbitrary functions and then use normal-ish Python.

import dask

@dask.delayed
def add(x, y):
    return x + y

y = add(add(1, 2), 3)  # (1 + 2) + 3, built lazily
y.compute()

### Comparison

I prefer the Dask solution, but that’s subjective.

## Complex Dependencies

### Celery

Celery includes a rich vocabulary of terms to connect tasks in more complex ways including groups, chains, chords, maps, starmaps, etc. More detail is available in their docs for Canvas, the system they use to construct complex workflows: http://docs.celeryproject.org/en/master/userguide/canvas.html

For example here we chord many adds and then follow them with a sum.

In [2]: from celery import chord

In [3]: %time chord(add.s(i, i) for i in range(100))(tsum.s()).get()
CPU times: user 172 ms, sys: 12 ms, total: 184 ms
Wall time: 1.21 s
Out[3]: 9900

### Dask

Dask’s trick of allowing futures in submit calls actually goes pretty far. Dask doesn’t really need any additional primitives. It can do all of the patterns expressed in Canvas fairly naturally with normal submit calls.

In [4]: %%time
...: futures = [c.submit(add, i, i) for i in range(100)]
...: total = c.submit(sum, futures)
...: total.result()
...:
CPU times: user 52 ms, sys: 0 ns, total: 52 ms
Wall time: 60.8 ms

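# The same pattern with dask.delayed (here add is decorated with @dask.delayed)
futures = [add(i, i) for i in range(100)]
total = dask.delayed(sum)(futures)
total.compute()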
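
Because Dask's client mirrors the standard library's concurrent.futures interface, the same fan-out-then-reduce shape can be tried locally with a plain thread pool. The sketch below is only an analogue, not Dask code: standard-library futures cannot be passed as arguments to other tasks, so the dependency has to be resolved explicitly with `.result()`.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add

with ThreadPoolExecutor(max_workers=4) as pool:
    # Fan out: one small task per pair, as in the chord example.
    futures = [pool.submit(add, i, i) for i in range(100)]
    # Reduce: unlike Dask, stdlib futures must be resolved before being passed on.
    total = pool.submit(sum, [f.result() for f in futures])
    result = total.result()

print(result)  # 9900, matching the chord output above
```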

## Multiple Queues

In Celery there is a notion of queues to which tasks can be submitted and that workers can subscribe. An example use case is having “high priority” workers that only process “high priority” tasks. Every worker can subscribe to the high-priority queue but certain workers will subscribe to that queue exclusively:

celery -A my-project worker -Q high-priority  # only subscribe to high priority
celery -A my-project worker -Q celery,high-priority  # subscribe to both
celery -A my-project worker -Q celery,high-priority
celery -A my-project worker -Q celery,high-priority

This is like the TSA pre-check line or the express lane in the grocery store.

Dask has a couple of topics that are similar or could fit this need in a pinch, but nothing that is strictly analogous.

First, for the common case above, tasks have priorities. These are typically set by the scheduler to minimize memory use but can be overridden directly by users to give certain tasks precedence over others.

Second, you can restrict tasks to run on subsets of workers. This was originally designed for data-local storage systems like the Hadoop FileSystem (HDFS) or clusters with special hardware like GPUs but can be used in the queues case as well. It’s not quite the same abstraction but could be used to achieve the same results in a pinch. For each task you can restrict the pool of workers on which it can run.

The relevant docs for this are here: http://distributed.readthedocs.io/en/latest/locality.html#user-control
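
The priority idea in the first point can be pictured with a small priority queue. This is a toy model rather than Dask's scheduler: the `submit` and `run_all` helpers and the task names are hypothetical, and the convention that a lower number runs sooner is chosen only for this sketch.

```python
import heapq
import itertools

_counter = itertools.count()   # tie-breaker that preserves submission order
pending = []                   # heap of (priority, order, task name)


def submit(name, priority=0):
    """Queue a task; in this sketch a lower priority number runs sooner."""
    heapq.heappush(pending, (priority, next(_counter), name))


def run_all():
    """Pop tasks in priority order and return the order they would run in."""
    order = []
    while pending:
        _, _, name = heapq.heappop(pending)
        order.append(name)
    return order


submit("nightly-report")
submit("resize-image")
submit("password-reset", priority=-1)   # the "express lane" task
execution_order = run_all()
print(execution_order)
```

The express-lane task runs first even though it was submitted last, while the other two keep their submission order.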

## Retrying

Celery allows tasks to retry themselves on a failure.

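@app.task(bind=True)
def flaky_task(self, *args):  # hypothetical task name
    try:
        ...  # work that may fail
    except Exception as exc:
        raise self.retry(exc=exc)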

Sadly Dask currently has no support for this (see open issue). All functions are considered pure and final. If a task errs the exception is considered to be the true result. This could change though; it has been requested a couple of times now.

Until then users need to implement retry logic within the function (which isn’t a terrible idea regardless).

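def my_function(*args, n_retries=3):  # hypothetical name
    for i in range(n_retries):
        try:
            ...  # work that may fail
            return
        except Exception:
            pass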
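
The retry loop above can also be factored into a reusable helper so that the task function itself stays simple. This is a minimal sketch in pure Python; `call_with_retries` and `flaky` are hypothetical names, not part of Dask or Celery.

```python
import time


def call_with_retries(func, *args, n_retries=3, delay=0.0, exceptions=(Exception,)):
    """Call func(*args), retrying up to n_retries times on the given exceptions."""
    if n_retries < 1:
        raise ValueError("n_retries must be at least 1")
    last_error = None
    for attempt in range(n_retries):
        try:
            return func(*args)
        except exceptions as exc:
            last_error = exc
            time.sleep(delay)          # optional pause between attempts
    raise last_error                   # every attempt failed


attempts = []


def flaky(x):
    """Hypothetical task that fails twice before succeeding."""
    attempts.append(x)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return x * 2


value = call_with_retries(flaky, 21)
print(value, len(attempts))
```

With Dask, such a helper could be submitted like any other function, for example `c.submit(call_with_retries, flaky, 21)`, so the retries happen on the worker.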

## Rate Limiting

Celery lets you specify rate limits on tasks, presumably to help you avoid getting blocked from hammering external APIs.

@app.task(rate_limit='1000/h')  # example limit
def query_external_api(...):
    ...

Dask definitely has nothing built in for this, nor is it planned. However, this could be done externally to Dask fairly easily. For example, Dask supports mapping functions over arbitrary Python Queues. If you send in a queue then all current and future elements in that queue will be mapped over. You could easily handle rate limiting in Pure Python on the client side by rate limiting your input queues. The low latency and overhead of Dask makes it fairly easy to manage logic like this on the client-side. It’s not as convenient, but it’s still straightforward.

>>> from queue import Queue

>>> q = Queue()

>>> out = c.map(query_external_api, q)
>>> type(out)
Queue
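
One way to rate limit the input queue on the client side is a small feeder thread that releases items at a fixed pace. The sketch below uses only the standard library; `feed_rate_limited` is a hypothetical helper, and in practice the queue `q` would be the one handed to `c.map` as shown above.

```python
import threading
import time
from queue import Queue


def feed_rate_limited(items, out_queue, per_second):
    """Put items on out_queue no faster than per_second items each second."""
    interval = 1.0 / per_second
    for item in items:
        out_queue.put(item)
        time.sleep(interval)


q = Queue()
start = time.monotonic()
feeder = threading.Thread(target=feed_rate_limited, args=(range(5), q, 50))
feeder.start()
feeder.join()
elapsed = time.monotonic() - start

received = [q.get() for _ in range(5)]
print(received, f"{elapsed:.2f}s")
```

Here the feeder is joined only so that the elapsed time can be measured; in a real client it would keep running in the background while results are consumed.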

## Final Thoughts

Based on this very shallow exploration of Celery, I’ll foolishly claim that Dask can handle Celery workloads, if you’re not diving into deep API. However all of that deep API is actually really important. Celery evolved in this domain and developed tons of features that solve problems that arise over and over again. This history saves users an enormous amount of time. Dask evolved in a very different space and has developed a very different set of tricks. Many of Dask’s tricks are general enough that they can solve Celery problems with a small bit of effort, but there’s still that extra step. I’m seeing people applying that effort to problems now and I think it’ll be interesting to see what comes out of it.

Going through the Celery API was a good experience for me personally. I think that there are some good concepts from Celery that can inform future Dask development.

## September 12, 2016

### Matthew Rocklin

conda install dask distributed -c conda-forge
or

The last few months have seen a number of important user-facing features:

• Executor is renamed to Client
• Workers can spill excess data to disk when they run out of memory
• The Client.compute and Client.persist methods for dealing with dask collections (like dask.dataframe or dask.delayed) gain the ability to restrict sub-components of the computation to different parts of the cluster with a workers= keyword argument.
• IPython kernels can be deployed on the worker and schedulers for interactive debugging.
• The Bokeh web interface has gained new plots and improved the visual styling of old ones.

Additionally there are beta features in current development. These features are available now, but may change without warning in future versions. Experimentation and feedback by users comfortable with living on the bleeding edge is most welcome:

• Clients can publish named datasets on the scheduler to share between them
• Workers can restart themselves in new software environments provided by the user

There have also been significant internal changes. Other than increased performance these changes should not be directly apparent.

• The scheduler was refactored to a more state-machine like architecture. Doc page
• Short-lived connections are now managed by a connection pool
• Work stealing has changed and grown more responsive: Doc page
• General resilience improvements

The rest of this post will contain very brief explanations of the topics above. Some of these topics may become blogposts of their own at some point. Until then I encourage people to look at the distributed scheduler’s documentation, which is separate from dask’s normal documentation and so may contain new information for some readers (Google Analytics reports about 5-10x the readership on http://dask.readthedocs.org than on http://distributed.readthedocs.org).

## Major Changes and Features

### Rename Executor to Client

The term Executor was originally chosen to coincide with the concurrent.futures Executor interface, which is what defines the behavior for the .submit, .map, .result methods and Future object used as the primary interface.

Unfortunately, this is the same term used by projects like Spark and Mesos for “the low-level thing that executes tasks on each of the workers” causing significant confusion when communicating with other communities or for transitioning users.

In response we rename Executor to a somewhat more generic term, Client, to designate its role as the thing users interact with to control their computations.

>>> from distributed import Executor  # Old
>>> e = Executor()                    # Old

>>> from distributed import Client    # New
>>> c = Client()                      # New

Executor remains an alias for Client and will continue to be valid for some time, but there may be some backwards incompatible changes for internal use of executor= keywords within methods. Newer examples and materials will all use the term Client.

### Workers Spill Excess Data to Disk

When workers get close to running out of memory they can send excess data to disk. This is not on by default and instead requires adding the --memory-limit=auto option to dask-worker.

This will eventually become the default (and is now when using LocalCluster) but we’d like to see how things progress and phase it in slowly.

Generally this feature should improve robustness and allow the solution of larger problems on smaller clusters, although with a performance cost. Dask’s policies to reduce memory use through clever scheduling remain in place, so in the common case you should never need this feature, but it’s nice to have as a failsafe.
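
To make the idea concrete, here is a toy store that keeps a bounded number of values in memory and pickles the least recently used ones to disk. It is a sketch under simplifying assumptions, not Dask's implementation: the `SpillingStore` class is hypothetical, and it limits the number of items rather than the number of bytes.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class SpillingStore:
    """Keep at most `max_in_memory` values in RAM; pickle older ones to disk."""

    def __init__(self, max_in_memory, directory):
        self.max_in_memory = max_in_memory
        self.directory = directory
        self.memory = OrderedDict()   # key -> value, least recently used first
        self.on_disk = {}             # key -> path of the pickled value
        self._next_id = 0             # used to build unique file names

    def _spill(self, key, value):
        path = os.path.join(self.directory, f"item-{self._next_id}.pkl")
        self._next_id += 1
        with open(path, "wb") as f:
            pickle.dump(value, f)
        self.on_disk[key] = path

    def __setitem__(self, key, value):
        if key in self.on_disk:                  # drop any stale spilled copy
            os.remove(self.on_disk.pop(key))
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.max_in_memory:
            self._spill(*self.memory.popitem(last=False))

    def __getitem__(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        path = self.on_disk.pop(key)             # KeyError if the key is unknown
        with open(path, "rb") as f:
            value = pickle.load(f)
        os.remove(path)
        self[key] = value                        # reload; may spill another key
        return value


with tempfile.TemporaryDirectory() as tmp:
    store = SpillingStore(max_in_memory=2, directory=tmp)
    for name in ["a", "b", "c"]:
        store[name] = name * 3
    spilled = sorted(store.on_disk)      # "a" was the oldest, so it went to disk
    value = store["a"]                   # read back from disk
    now_spilled = sorted(store.on_disk)  # "b" was spilled to make room
    print(spilled, value, now_spilled)
```

Reading a spilled value costs a disk round trip, which is the performance cost mentioned above.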

### Enable restriction of valid workers for compute and persist methods

Expert users of the distributed scheduler will be aware of the ability to restrict certain tasks to run only on certain computers. This tends to be useful when dealing with GPUs or with special databases or instruments only available on some machines.

Previously this option was available only on the submit, map, and scatter methods, forcing people to use the more immediate interface. Now the dask collection interface functions compute and persist support this keyword as well.

### IPython Integration

You can start IPython kernels on the workers or scheduler and then access them directly using either IPython magics or the QTConsole. This tends to be valuable when things go wrong and you want to interactively debug on the worker nodes themselves.

Start IPython on the Scheduler

>>> client.start_ipython_scheduler()  # Start IPython kernel on the scheduler
>>> %scheduler scheduler.processing   # Use IPython magics to inspect scheduler
{'127.0.0.1:3595': ['inc-1', 'inc-2'],

Start IPython on the Workers

>>> info = client.start_ipython_workers()  # Start IPython kernels on all workers
>>> list(info)
['127.0.0.1:3595', '127.0.0.1:53589']
>>> %remote info['127.0.0.1:3595'] worker.active  # Use IPython magics
{'inc-1', 'inc-2'}

### Bokeh Interface

The Bokeh web interface to the cluster continues to evolve both by improving existing plots and by adding new plots and new pages.

For example the progress bars have become more compact and shrink down dynamically to respond to additional bars.

And we’ve added in extra tables and plots to monitor workers, such as their memory use and current backlog of tasks.

## Experimental Features

The features described below are experimental and may change without warning. Please do not depend on them in stable code.

### Publish Datasets

You can now save collections on the scheduler, allowing you to come back to the same computations later or allow collaborators to see and work off of your results. This can be useful in the following cases:

1. There is a dataset from which you frequently base all computations, and you want that dataset always in memory and easy to access without having to recompute it each time you start work, even if you disconnect.
2. You want to send results to a colleague working on the same Dask cluster and have them get immediate access to your computations without having to send them a script and without them having to repeat the work on the cluster.

Example: Client One

>>> df2 = df[df.balance < 0]
>>> df2 = client.persist(df2)

>>> df2.head()
      name  balance
0    Alice     -100
1      Bob     -200
2  Charlie     -300
3   Dennis     -400
4    Edith     -500

>>> client.publish_dataset(accounts=df2)

Example: Client Two

>>> client.list_datasets()
['accounts']

>>> df = client.get_dataset('accounts')
>>> df.head()
      name  balance
0    Alice     -100
1      Bob     -200
2  Charlie     -300
3   Dennis     -400
4    Edith     -500

### Tasks Can Submit Other Tasks

You can now submit tasks to the cluster that themselves submit more tasks. This allows the submission of highly dynamic workloads that can shape themselves depending on future computed values without ever checking back in with the original client.

This is accomplished by starting new local Clients within the task that can interact with the scheduler.

def func():
    from distributed import local_client
    with local_client() as c2:
        future = c2.submit(...)

c = Client(...)
future = c.submit(func)
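
The shape of this pattern can be tried without a cluster using a standard-library thread pool, where a running task submits further work to the same pool. This is only an analogue of the Dask feature, and the function names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)


def square(x):
    return x * x


def sum_of_squares(n):
    """A task that submits further tasks to the same pool and combines them."""
    inner = [pool.submit(square, i) for i in range(n)]
    return sum(f.result() for f in inner)


outer = pool.submit(sum_of_squares, 4)
answer = outer.result()
pool.shutdown()
print(answer)  # 0 + 1 + 4 + 9 = 14
```

Note that the outer task occupies a worker thread while it waits, so with a pool of a single worker this example would deadlock.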

There are a few straightforward use cases for this, like iterative algorithms with stopping criteria, but also many novel use cases including streaming and monitoring systems.

### Restart Workers in Redeployable Python Environments

You can now zip up and distribute full Conda environments, and ask dask-workers to restart themselves, live, in that environment. This involves the following:

1. Create a conda environment locally (or any redeployable directory including a python executable)
2. Zip up that environment and use the existing dask.distributed network to copy it to all of the workers
3. Shut down all of the workers and restart them within the new environment

This helps users to experiment with different software environments with a much faster turnaround time (typically tens of seconds) than asking IT to install libraries or building and deploying Docker containers (which is also a fine solution). Note that the typical solution of uploading individual Python scripts or egg files has been around for a while; see the API docs for upload_file.

## Acknowledgements

Since version 1.12.0 on August 18th the following people have contributed commits to the dask/distributed repository:

• Dave Hirschfeld
• dsidi
• Jim Crist
• Joseph Crail
• Loïc Estève
• Martin Durant
• Matthew Rocklin
• Min RK
• Scott Sievert

#### Where to Write Prose?

Code is only as good as its prose.

Like many programmers I spend more time writing prose than code. This is great; writing clean prose focuses my thoughts during design and disseminates understanding so that people see how a project can benefit them.

However, I now question how and where I should write and publish prose. When communicating to users there are generally two options:

1. Blogposts
2. Documentation

Given that developer time is finite we need to strike some balance between these two activities. I used to blog frequently, then I switched to almost only documentation, and I think I’m probably about to swing back a bit. Here’s why:

## Blogposts

Blogposts excel at generating interest, informing people of new functionality, and providing workable examples that people can copy and modify. I used to blog about Dask (my current software project) pretty regularly here on my blog and continuously got positive feedback from it. This felt great.

However, blogging about evolving software also generates debt. Such blogs grow stale and inaccurate, so when they’re the only source of information about a project, users grow confused when they try things that no longer work, and they’re stuck without a clear reference to turn to. Basing core understanding on blogs can be a frustrating experience.

## Documentation

So I switched from writing blogposts to spending a lot of time writing technical documentation. This was a positive move. User comprehension seemed to increase, the questions I was fielding were of a far higher level than before.

Documentation gets updated as features mature. New pages assimilate cleanly and obsolete pages get cleaned up. Documentation is generally more densely linked than linear blogs, and readers tend to explore more deeply within the website. Comparing the Google Analytics results for my blog and my documentation show significantly increased engagement, both with longer page views as well as longer chains of navigation throughout the site. Documentation seems to engage readers more strongly than do blogs (at least more strongly than my blog).

However, documentation doesn’t get in front of people the same way that blogs do. No one subscribes to receive documentation updates. Doc pages for new features rarely end up on Reddit or Hacker News. The way people pass around blog links encourages Google to point people there far more often than to doc pages. There is no way for interested users to keep up with the latest news except by subscribing to fairly dry release e-mails.

Blogposts are way sexier. This feels a little shallow if you’re not into sales and marketing, but let’s remember that software dies without users and that users are busy people who have to be stimulated into taking the time to learn new things.

## Current Plan

I still think it’s wise for core developers to focus 80% of their prose time on documentation, especially for new or in-flux features that haven’t had a decent amount of time for users to provide feedback.

However, I personally hope to blog more about concepts or timely experiences that have to do with development, if not the features themselves. For example, right now I’m building a Mesos-powered Scheduler for Dask.distributed. I’ll probably write about the experiences of a developer meeting Mesos for the first time, but I probably won’t include a how-to of using Dask with Mesos.

I also hope to find some way to polish existing doc pages into blogposts once they have proven to be fairly stable. This mostly involves finding a meaningful and reproducible example to work through.

## Feedback

I would love to hear how other projects handle this tension between timely and timeless documentation.

## September 08, 2016

### Continuum Analytics news

#### Democratization of Compute: More Flops, More Users & Solving More Challenges

Thursday, September 8, 2016
Mike Lee
Technical, Enterprise and Cloud Compute Segment Manager, Developer Products Division
Intel Corporation

The past decade has seen compute capacity at the cluster scale grow faster than Moore’s Law. The relentless pursuit of exascale systems and beyond brings broad advances in the availability of a large amount of compute power to developers and users on “everyday” systems. Call it “trickle down” high performance computing if you like, but the effects are profound in the amount of computation that can be accessed. A teraflop system today can be easily had in a workstation, ready and able to tackle scientific compute problems and financial modeling exercises, and to plow through huge amounts of data for machine learning.

Programming these high performance systems used to be the domain of native language developers who work in Fortran or C/C++, scaling up and out with distributed computing via the Message Passing Interface (MPI) to take advantage of cluster computing. While these languages are still the mainstay of high performance computing, scripting languages, such as Python, have been adopted by a broad community of users for their ease of use and short learning curve. While enabling more users to do computing is a good thing, there is a limitation that makes it difficult for users of Python to get good performance. Namely, the global interpreter lock, or “GIL,” lets only one thread execute Python code at a time, which keeps pure Python programs from exploiting the parallelism of modern multicore/many-core and multithreaded CPUs. If only there were a way to make it easy and seamless to get performance from Python, we could broaden the availability of compute power to more users.

My colleagues at Intel in engineering and product marketing teams examined this limitation and saw that there were some solutions out there that were challenging to implement—thus began our close association with Continuum Analytics, a leader in the Open Data Science and Python community, to make these performance enhancements widely available to all. Collaboration with Continuum Analytics has helped us bring the Intel® Distribution for Python powered by Anaconda to the Python community, which leverages the Intel® Performance Libraries, such as Intel® Math Kernel Library, Intel® Data Analytics Library, Intel® MPI Library and Intel® Threading Building Blocks. The collaboration between Intel and Continuum Analytics helps provide a path to greater performance for Python developers and users.

And today, we are happy to announce a major milestone in our journey with the Intel Distribution. After a year in beta, the Distribution is now available in its first public version as the Intel® Distribution for Python 2017. It's been a wild ride—the thrills of successful compiles and builds, the agony of managing dependencies, chasing down the bugs, the race to meet project deadlines, the highs of good press, the lows of post release reported errors—but overall, we have the satisfaction of having delivered a solid product.

Our work is not done. We will continue to push the boundaries of performance to enable more flops to more users to solve more computing challenges. Live long and Python!


#### Correlators for molecular and stochastic dynamics

Time correlations are among the most important data that one can obtain from molecular and stochastic dynamics. The two common ways to obtain them are post-processing and on-line analysis. Here I review several algorithms to compute correlations from numerical data (naive, Fourier transform and blocking scheme), with illustrations from Langevin dynamics, using Python.

### Introduction

I conduct molecular and stochastic simulations of colloidal particles. One important quantity to extract from these simulations is the autocorrelation of the velocity. The reasoning also applies to other types of correlation functions; I am just focusing on this one to provide context.

There are several procedures to compute correlation functions from numerical data, but I did not find a review that brings them together, so I am making my own. The examples will use the Langevin equation and the tools from the SciPy stack when appropriate.

We trace the appearance of the methods using textbooks on molecular simulation when available or other references (articles, software documentation) when appropriate. If I missed a foundational reference, let me know in the comments or by email.

In 1987, Allen and Tildesley mention in their book Computer Simulation of Liquids [Allen1987] the direct algorithm for the autocorrelation $$C_{AA}(\tau = j\Delta t) = \frac{1}{N-j} \sum_i A_i A_{i+j}$$ where I use $A_i$ to denote the observable $A$ at time $i\Delta t$ ($\Delta t$ is the sampling interval). Typically, $\tau$ is called the lag time or simply the lag. The number of operations is $O(N_\textrm{cor}N)$, where $N$ is the number of items in the time series and $N_\textrm{cor}$ the number of correlation points that are computed. By storing the last $N_\textrm{cor}$ values of $A$, this algorithm is suitable for use during a simulation run. Allen and Tildesley then mention the Fast Fourier Transform (FFT) version of the algorithm, which is more efficient given its $N\log N$ scaling. The FFT algorithm is based on the convolution theorem: the correlation is a convolution with the time-reversed data, and a convolution becomes a simple multiplication in frequency space, which is much faster than the direct algorithm. The signal has to be zero-padded to avoid circular correlations due to the finiteness of the data. The requirements of the FFT method in terms of storage and the number of points to obtain for $C_{AA}$ influence which algorithm gives the fastest result.
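
To make the role of the zero-padding concrete, here is a minimal sketch (not taken from any of the cited books) that evaluates the estimator $C_{AA}$ above both as a direct sum and through NumPy's FFT routines. The function names `autocorr_direct` and `autocorr_fft` are mine, and the test signal is plain random noise.

```python
import numpy as np

def autocorr_direct(a):
    """Direct estimate: C[j] = 1/(N-j) * sum_i a[i] * a[i+j]."""
    n = len(a)
    return np.array([np.dot(a[:n - j], a[j:]) / (n - j) for j in range(n)])

def autocorr_fft(a):
    """Same estimate via the FFT; padding to 2N avoids circular wrap-around."""
    n = len(a)
    f = np.fft.rfft(a, n=2 * n)                      # FFT of the zero-padded signal
    raw = np.fft.irfft(f * np.conj(f), n=2 * n)[:n]  # un-normalized sums, lags 0..N-1
    return raw / (n - np.arange(n))                  # normalize by the number of pairs

rng = np.random.default_rng(0)
a = rng.standard_normal(256)
max_diff = np.max(np.abs(autocorr_direct(a) - autocorr_fft(a)))
print(max_diff)  # should be at the level of floating-point rounding
```

Without the padding (that is, with `n=n` in both transforms), the inverse transform would return the circular correlation, in which the end of the series is correlated with its beginning.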

Frenkel and Smit in their book Understanding Molecular Simulation [Frenkel2002] introduce what they call an "Order-n algorithm to measure correlations". The principle is to store the last $N_\textrm{cor}$ values for $A$ with a sampling interval $\Delta t$ and also the last $N_\textrm{cor}$ values for $A$ with a sampling interval $l \Delta t$ where $l$ is the block size, and recursively store the data with an interval $l$, $l^2$, $l^3$, etc. In their algorithm, the data is also averaged over the corresponding time interval and the observable is thus coarse-grained during the procedure.

A variation on this blocking scheme is used by Colberg and Höfling [Colberg2011], where no averaging is performed. Ramírez, Sukumaran, Vorselaars and Likhtman [Ramirez2010] propose a more flexible blocking scheme in which the block length and the duration of averaging can be set independently. They provide an estimate of the systematic and statistical errors induced by the choice of these parameters. This is the multiple tau correlator.

The "multiple-tau" correlator has since been implemented in two popular Molecular Dynamics packages.

The direct and FFT algorithms are available in NumPy and SciPy respectively.

In the field of molecular simulation, the FFT method was neglected for a long time but was put forward by the authors of the nMOLDYN software suite to analyze Molecular Dynamics simulations [Kneller1995].

### The direct algorithm and the implementation in NumPy

The discrete-time correlation $c_j$ from a numerical time series $s_i$ is defined as

$$c_j = \sum_i s_{i} s_{i+j}$$

where $j$ is the lag in discrete time steps, $s_i$ is the time series and the sum runs over all available indices. The values in $s$ represent a sampling with a time interval $\Delta t$ and we use interchangeably $s_i$ and $s(i\Delta t)$.

Note that this definition omits the normalization. What one is interested in is the normalized correlation function

$$c'_j = \frac{1}{N-j} \sum_i s_{i} s_{i+j}$$

NumPy provides a function numpy.correlate that computes explicitly the correlation of a scalar time series. For small sets of data it is sufficiently fast as it is actually calling a compiled routine. The time it takes grows quadratically as a function of the input size, which makes it unsuitable for many practical applications.

Note that the routine computes the un-normalized correlation function and that forgetting to normalize the result, or normalizing it with $N$ instead of $N-j$, will give incorrect results. numpy.correlate (with argument mode='full') returns an array of length $2N-1$ that contains the negative times as well as the positive times. For an autocorrelation, half of the array can be discarded.
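
As an illustration of these two points, the sketch below keeps only the non-negative lags of the `mode='full'` output and divides each by $N-j$. The helper name `normalized_autocorrelation` is mine and the four-point series is only there so that the result can be checked by hand.

```python
import numpy as np

def normalized_autocorrelation(s):
    """Normalized autocorrelation c'_j for lags j = 0..N-1."""
    n = len(s)
    full = np.correlate(s, s, mode='full')  # length 2N-1, lags -(N-1)..(N-1)
    positive = full[n - 1:]                 # lag 0 sits at index N-1
    return positive / (n - np.arange(n))    # divide by the number of terms N-j

s = np.array([1.0, 2.0, 3.0, 4.0])
c = normalized_autocorrelation(s)
print(c)  # lags 0..3: 30/4, 20/3, 11/2, 4/1
```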

Below, we test the CPU time scaling with respect to the input data size and compare it with the expected $O(N^2)$ scaling, as the method computes all possible $N$ values for the correlation.

In [2]:
```python
import time

import numpy as np
import matplotlib.pyplot as plt

n_data = 8*8**np.arange(6)
time_data = []
for n in n_data:
    data = np.random.random(n)
    t0 = time.time()
    cor = np.correlate(data, data, mode='full')
    time_data.append(time.time()-t0)
time_data = np.array(time_data)
```
In [3]:
```python
plt.plot(n_data, time_data, marker='o')
# reference curve: N^2 scaling, anchored on the largest data set
plt.plot(n_data, n_data**2*time_data[-1]/(n_data[-1]**2))
plt.loglog()
plt.title('Performance of the direct algorithm')
plt.xlabel('Size N of time series')
plt.ylabel('CPU time')
```

### SciPy's Fourier transform implementation

The SciPy package provides the routine scipy.signal.fftconvolve for FFT-based convolutions. Like NumPy's correlate routine, it outputs negative and positive time correlations, in the un-normalized form.

SciPy relies on the standard FFTPACK library to perform FFT operations.

Below, we test the CPU time scaling with respect to the input data size and compare it with the expected $O(N\log N)$ scaling. The maximum length for the data is already an order of magnitude larger than for the direct algorithm, for which the CPU time would be prohibitive at this size.

In [4]:
```python
import scipy.signal

n_data = []
time_data = []
for i in range(8):
    n = 4*8**i
    data = np.random.random(n)
    t0 = time.time()
    cor = scipy.signal.fftconvolve(data, data[::-1], mode='full')
    n_data.append(n)
    time_data.append(time.time()-t0)
n_data = np.array(n_data)
time_data = np.array(time_data)
```
In [5]:
```python
plt.plot(n_data, time_data, marker='o')
# reference curve: N log N scaling, anchored on the largest data set
plt.plot(n_data, n_data*np.log(n_data)*time_data[-1]/(n_data[-1]*np.log(n_data[-1])))
plt.loglog()
plt.xlabel('Size N of time series')
plt.ylabel('CPU time')
```

### Comparison of the direct and the FFT approach

It is important to note, as mentioned in [Kneller1995], that the direct and FFT algorithms give the same result, up to rounding errors. This is shown below by plotting the difference between the two signals.

In [6]:
```python
n = 2**14
sample_data = np.random.random(n)

direct_correlation = np.correlate(sample_data, sample_data, mode='full')
fft_correlation = scipy.signal.fftconvolve(sample_data, sample_data[::-1])

plt.plot(direct_correlation-fft_correlation)
```

### Blocking scheme for correlations

Both schemes mentioned above require a storage of data that scales as $O(N)$. This includes the storage of the simulation data and the storage in RAM of the results (at least during the computation; the result can be truncated afterwards if needed).

An alternative strategy is to store blocks of correlations, each successive block representing a coarser version of the data and correlation. The principle behind the blocking schemes is to use a decreasing time resolution to store correlation information for longer times. As the variations in the correlation typically decay with time, it makes sense to store less and less detail.

There are nice diagrams in the references cited above but for the sake of completeness, I will try my own diagram here.

• The black full circle in the signal ("s") is taken at discrete time $i=41$ and fills the $b=0$ block at position $41\mod 6 = 5$.
• The next point is the empty circle; it fills block $b=0$ at position $42\mod 6=0$. As $42\mod l^b=0$ (for $l=6$ and $b=1$), the empty circle is also copied to block $b=1$, at position $(42 / l^b)\mod l=1$.

In the following,

• $l$ is the length of the blocks
• $B$ is the total number of blocks

The signal blocks contain samples of the signal, limited in time and at different timescales. The procedure to fill the signal blocks is quite simple:

1. Store the current value of the signal, $s_i$, into the 0-th signal block, at the position $i\mod l$.
2. Once every $l$ steps (that is, when $i\mod l=0$), also store the signal in the next block, at position $(i/l)\mod l$.
3. Apply this procedure recursively up to block $B-1$, dividing $i$ by $l$ at every step.

Example of the application of step 3: store the signal in block $b$ when $i\mod l^b=0$, that is every $l$ steps for block 1, every $l^2$ steps for block 2, etc.

The time of sampling for the blocks is $0$, $1$, $2$, etc. for block 0, then $0$, $l$, $2l$, etc. for block 1, and so on.

The procedure to compute the correlation blocks is to compute the correlation of every newly added data point with the $l-1$ other values in the same block and with itself (for lag 0). This computation is carried out at the same time as the signal blocks are filled; otherwise past data would already have been discarded. This algorithm can thus be executed online, while the simulation is running, or while reading the data file in a single pass.

I do not review the different averaging schemes, as they do not change much to the understanding of the blocking schemes. For this, see [Ramirez2010].

The implementation of a non-averaging blocking scheme (compare to papers) is provided below in plain Python.
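
Before the full correlator, the filling rule alone can be checked in isolation. The small helper below (`block_positions` is a name of my own, not part of the implementation that follows) lists the blocks and positions that receive sample $i$, and reproduces the two points discussed above for $l=6$.

```python
def block_positions(i, l, B):
    """Return the (block, position) pairs in which sample index i is stored."""
    positions = []
    for b in range(B):
        if i % l**b != 0:   # block b only receives every l**b-th sample
            break
        positions.append((b, (i // l**b) % l))
    return positions

print(block_positions(41, l=6, B=3))  # [(0, 5)]
print(block_positions(42, l=6, B=3))  # [(0, 0), (1, 1)]
print(block_positions(72, l=6, B=3))  # [(0, 0), (1, 0), (2, 2)]
```

The early `break` relies on the fact that if $i$ is not a multiple of $l^b$, it cannot be a multiple of any higher power of $l$ either.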

In [7]:
```python
def mylog_corr(data, B=3, l=32):
    cor = np.zeros((B, l))   # correlation blocks
    val = np.zeros((B, l))   # signal blocks
    count = np.zeros(B)      # number of samples accumulated per block

    for i, new_data in enumerate(data):
        for b in range(B):
            if i % (l**b) != 0:  # only enter block if i modulo l^b == 0
                break
            normed_idx = (i // l**b) % l
            val[b, normed_idx] = new_data  # fill value block
            # correlate block b
            # wait for current block to have been filled at least once
            if i > l**(b+1):
                for j in range(l):
                    cor[b, j] += new_data*val[b, (normed_idx - j) % l]
                count[b] += 1
    return count, cor
```

### Example with Langevin dynamics

Having an overview of the available correlators, I now present a use case with the Langevin equation. I generate data using a first-order Euler scheme (for brevity, this should be avoided in actual studies) and apply the FFT and the blocking scheme.

The theoretical result $C(\tau) = T e^{-\gamma \tau}$ is also plotted for reference.

In [8]:
```python
N = 400000
x, v = 0, 0
dt = 0.01
interval = 10
T = 2
gamma = 0.1
x_data = []
v_data = []
for i in range(N):
    for j in range(interval):
        x += v*dt
        v += np.random.normal(0, np.sqrt(2*gamma*T*dt)) - gamma*v*dt
    x_data.append(x)
    v_data.append(v)
x_data = np.array(x_data)
v_data = np.array(v_data)
```
In [9]:
```python
B = 5  # number of blocks
l = 8  # length of the blocks
c, cor = mylog_corr(v_data, B=B, l=l)
```
In [10]:
```python
for b in range(B):
    t = dt*interval*np.arange(l)*l**b
    plt.plot(t[1:], cor[b,1:]/c[b], color='g', marker='o', markersize=10, lw=2)

fft_cor = scipy.signal.fftconvolve(v_data, v_data[::-1])[N-1:]
fft_cor /= (N - np.arange(N))
t = dt*interval*np.arange(N)
plt.plot(t, fft_cor, 'k-', lw=2)
plt.plot(t, T*np.exp(-gamma*t))
plt.xlabel(r'lag $\tau$')
plt.ylabel(r'$C_{vv}(\tau)$')
plt.xscale('log')
```

### Summary

The choice of a correlator will depend on the situation at hand. Allen and Tildesley already mention the tradeoffs between disk and RAM memory requirements and CPU time. The table below reviews the main practical properties of the correlators. "Online" means that the algorithm can be used during a simulation run. "Typical cost" is the "Big-O" number of operations. "Accuracy of data" is given in terms of the number of sampling points for the correlation value $c_j$.

| Algorithm | Typical cost | Storage | Online use | Accuracy of data |
|---|---|---|---|---|
| Direct | $N^2$ or $N_\textrm{cor} N$ | $N_\textrm{cor}$ | for small $N_\textrm{cor}$ | $N-j$ points for $c_j$ |
| FFT | $N\log N$ | $O(N)$ | no | $N-j$ points for $c_j$ |
| Blocking | $N\ B$ | $N\ B$ | yes | $N/l^b$ points in block $b$ |

A distinct advantage of the direct and FFT methods is that they are readily available. If your data is small (up to $N\approx 10^2 - 10^3$) the direct method is a good choice; beyond that, the FFT method will outperform it significantly. An upcoming version of the SciPy library will even provide the choice of method as a keyword argument to the function scipy.signal.correlate, making it even more accessible.

Now, for the evaluation of correlation functions in Molecular Dynamics simulations, the blocking scheme is the only practical solution for very long simulations for which both short- and long-time behaviours are of interest. To arrive in RMPCDMD!

### References

• [Allen1987] M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids (Clarendon Press, 1987).
• [Frenkel2002] D. Frenkel and B. Smit, Understanding molecular simulation: From algorithms to applications (Academic Press, 2002).
• [Ramirez2010] J. Ramírez, S. K. Sukumaran, B. Vorselaars and A. E. Likhtman, Efficient on the fly calculation of time correlation functions in computer simulations, J. Chem. Phys. 133, 154103 (2010).
• [Colberg2011] P. H. Colberg and Felix Höfling, Highly accelerated simulations of glassy dynamics using GPUs: Caveats on limited floating-point precision, Comp. Phys. Comm. 182, 1120 (2011).
• [Kneller1995] Kneller, Keiner, Kneller and Schiller, nMOLDYN: A program package for a neutron scattering oriented analysis of Molecular Dynamics simulations, Comp. Phys. Comm. 91, 191 (1995).

## September 07, 2016

### Continuum Analytics news

#### Continuum Analytics Teams Up with Intel for Python Distribution Powered by Anaconda

Thursday, September 8, 2016

Offers speed, agility and an optimized Python experience for data scientists

AUSTIN, TEXAS—September 8, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, is pleased to announce a technical collaboration with Intel resulting in the Intel® Distribution for Python powered by Anaconda. Intel Distribution for Python powered by Anaconda was recently announced by Intel and will be delivered as part of Intel® Parallel Studio XE 2017 software development suite. With a common distribution for the Open Data Science community that increases Python and R performance up to 100X, Intel has empowered enterprises to build a new generation of intelligent applications that drive immediate business value. Combining the power of the Intel® Math Kernel Library (MKL) and Anaconda’s Open Data Science platform allows organizations to build the high performance analytic modeling and visualization applications required to compete in today’s data-driven economies.

By combining two de facto standards, Intel MKL and Anaconda, into a single performance-boosted Python and R distribution, enterprises can meet and exceed performance targets for next-generation data science applications. The platform includes packages and technology that are accessible for beginner Python developers, yet powerful enough to tackle data science projects for big data. Anaconda offers support for advanced analytics, numerical computing, just-in-time compilation, profiling, parallelism, interactive visualization, collaboration and other analytic needs.

“While Python has been widely used by data scientists as an easy-to-use programming language, it was often at the expense of performance,” said Mike Lee, technical, enterprise and cloud compute segment manager, Developer Products Division at Intel Corporation. “The Intel Distribution for Python powered by Anaconda provides multiple methods and techniques to accelerate and scale Python applications to achieve near native code performance.”

With the out-of-box distribution, Python applications immediately realize gains and can be tuned to optimize performance using the Intel® VTune™ Amplifier performance profiler. Python workloads can take advantage of multi-core Intel architectures and clusters using parallel thread scheduling and efficient communication with Intel MPI and Anaconda Scale through optimized Intel® Performance Libraries and Anaconda packages.

“Our focus on delivering high performance data science deployments to enterprise customers was a catalyst for the collaboration with Intel who is powering the smart and connected digital world,” said Michele Chambers, EVP of Anaconda & CMO at Continuum Analytics. “Today’s announcement of Intel’s Python distribution based on Anaconda illustrates both companies’ commitment to empowering Open Data Science through a common distribution that makes it easy to move intelligent applications from sandboxes to production environments.”

The Intel Distribution for Python powered by Anaconda is designed for everyone from seasoned high-performance developers to data scientists looking to speed up workflows and deliver an easy-to-install, performance-optimized Python experience to meet enterprise needs. The collaboration enables users to accelerate Python performance on modern Intel architectures, adding simplicity and speed to applications through Intel’s performance libraries.
This distribution makes it easy to install packages using conda and pip and access individual Intel-optimized packages hosted on Anaconda Cloud through conda.

Features include:

• Anaconda Distribution that has been downloaded over 3M times and is the de facto standard Python distribution for Microsoft Azure ML and Cloudera Hadoop
• Intel Math Kernel performance-accelerated Python computation packages like NumPy, SciPy, scikit-learn
• Anaconda Scale, which makes it easy to parallelize workloads using Directed Acyclic Graphs (DAGs)

Intel Distribution for Python powered by Anaconda is delivered as part of the Intel Parallel Studio XE 2017. The new distribution is available for free and includes forum support. For more information, please visit https://software.intel.com/en-us/intel-distribution-for-python.

For additional information about Anaconda, please visit: https://www.continuum.io/anaconda-overview and the Continuum Analytics’ Partner Program, visit https://www.continuum.io/partners. Also, check out this blog by Intel’s Mike Lee: https://www.continuum.io/blog/developer-blog/democratization-compute-intel.

About Intel

Intel (NASDAQ: INTC) expands the boundaries of technology to make the most amazing experiences possible. Information about Intel can be found at newsroom.intel.com and intel.com.

Intel, the Intel logo, Core, and Ultrabook are trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

About Continuum Analytics

Continuum Analytics is the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world. With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems.
Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations. Our community loves Anaconda because it empowers the entire data science team––data scientists, developers, DevOps, data engineers and business analysts––to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

To learn more about Continuum Analytics, visit www.continuum.io.

### Media Contact:

Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

## September 03, 2016

### Randy Olson

#### Python 2.7 still reigns supreme in pip installs

The Python 2 vs. Python 3 divide has long been a thorn in the Python community’s side. On one hand, Python package developers face the challenge of supporting two incompatible versions of Python, which is time that could be better

## September 01, 2016

### Continuum Analytics news

#### What’s Old and New with Conda Build

Thursday, September 1, 2016
Michael Sarahan
Continuum Analytics

Conda build 2.0 has just been released. This marks an important evolution towards much greater test coverage and a stable API. With this release, it’s a great time to revisit some of conda-build’s features and best practices.

### Quick Recap of Conda Build

Fundamentally, conda build is a tool to help you package your software so that it is easy to distribute using the conda package manager. Conda build takes input in the form of yaml files and shell/batch scripts (“recipes”), and outputs conda packages. Conda build also includes utilities for quickly generating recipes from external repositories, such as PyPI, CPAN, or CRAN.

During each build process, conda build has 4 different phases that occur: rendering, building, post-processing/packaging, and testing.

Rendering takes your input meta.yaml file, fills in any jinja2 templates, and applies selectors. The end result is a python object of the MetaData class, as defined in metadata.py. Source code for your package may be downloaded during rendering if it is necessary to provide information for rendering the recipe (for example, if the version is obtained from source, rather than provided in meta.yaml).

The build step creates the build environment (also called the build “prefix”), and runs the build.sh (Linux/Mac) or bld.bat (Windows) scripts.

Post-processing looks at which files in the build prefix are new - ones that were not there when the build prefix was created. These are the files that are packaged up into the .tar.bz2 file. Other inspection tasks, such as detecting files containing prefixes that need replacement at install time, are also done in the post-processing phase.

Finally, the test phase creates a test environment, installs the created package, and runs any tests that you specify, either in meta.yaml, or in run_test.bat (Windows), run_test.sh (Linux, Mac), or run_test.py (all platforms).

### Meta.yaml

Meta.yaml is the core of any conda recipe. It describes the package’s name, version, source location, and build/test environment specifications. Full documentation on meta.yaml is at http://conda.pydata.org/docs/building/meta-yaml.html.

Let’s step through the options available to you.

We’ll mention Jinja2 templating and selectors a few times in here. If you’re not familiar with these, just ignore them for now. These are described in much greater detail at the end of the article.

### Software Sources

Conda build will happily obtain source code from local filesystems, http/https URLs, git repositories, mercurial repositories, and subversion repositories. Syntax for each of these is described at http://conda.pydata.org/docs/building/meta-yaml.html#source-section.

Presently, Jinja2 template variables are populated only for git and mercurial repositories. These are described at http://conda.pydata.org/docs/building/environment-vars.html. Future work will add Jinja2 template variables for the remaining version control systems.

As a general guideline, use tarballs (http/https URLs) with hashes (SHA preferably) where available. Version control system (VCS) tags can be moved to other commits, and your packages are less guaranteed to be repeatable. Failing this, using VCS hash values is also highly repeatable. Finally, with tarballs, it is better to paste a hash provided by your download site than it is to compute it yourself. If the download site does not provide one, you can compute a hash with openssl. Openssl is a requirement of miniconda so it is already available in every conda environment.

```
openssl dgst -sha256 <path to file>
```

### Build Options

The “build” section of meta.yaml includes options that change some build-related options in conda build. Here you can skip certain platforms, control prefix replacement, exclude the recipe from being packaged, add entry points, and more.

### Requirements

In the requirements section, you define conda packages that should be installed before build, and before running your package. It is important to list your requirements for build here, because conda build does not allow you to download requirements using pip. This restriction ensures that builds are easier to reproduce.
If you are missing dependencies and pip tries to install them, you will see a traceback.

When you need a particular version of something, you can apply version constraints to your specification. This is often called “pinning.” There are 3 kinds of pinning: exact, “globbing,” and boolean logic. Each pinning is an additional string after the package specification in meta.yaml. For example:

```yaml
requirements:
  build:
    - python 2.7.12
```

For exact pinning, you specify the exact version you want. This should be used sparingly, as it can quickly make your package over-constrained and hard to install. Globbing uses the * character to allow any sub-version to be installed. For example, with semantic versioning, to allow bug fix releases, one could specify a version such as 1.2.* - no major or minor releases allowed. Not all packages use semantic versioning, though. Finally, boolean expressions of versions are valid. To allow a range of versions, you can use pinnings such as >=1.6.21,<1.7.

There are some packages that need to be defined in a special way. For example, packages that compile with NumPy’s C API need the same version of NumPy at runtime that was used at build time. If your package uses NumPy via Cython, or if any part of your extension code includes numpy.h, then this probably applies to you. The special syntax for NumPy is:

```yaml
requirements:
  build:
    - numpy x.x
  run:
    - numpy x.x
```

There is a lot of discussion around extending this to other packages, because it is common with compiled code to have build time versions determine runtime compatibility. This discussion is active at https://github.com/conda/conda-build/issues/1142 and is slated for the next major conda-build release.

The build string (that little bit of text in your output package name, like np110py27) is determined by default by the contents of your run requirements. You can change the build string manually in meta.yaml, but doing so disables conda’s automatic addition.
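
To illustrate how the three kinds of pinning differ, here is a toy matcher written for this purpose. It is not conda's matching code: `as_tuple`, `check_clause` and `satisfies` are hypothetical helpers, versions are assumed to be purely numeric and dot-separated, and only the comma ("and") form of boolean logic is handled.

```python
import operator
from fnmatch import fnmatch

# Comparison operators, two-character symbols listed before one-character ones
OPS = [('>=', operator.ge), ('<=', operator.le), ('==', operator.eq),
       ('!=', operator.ne), ('>', operator.gt), ('<', operator.lt)]

def as_tuple(version):
    """Turn '1.6.21' into (1, 6, 21); assumes numeric, dot-separated versions."""
    return tuple(int(part) for part in version.split('.'))

def check_clause(version, clause):
    """Check one clause: a comparison, an exact version, or a glob such as 1.2.*"""
    for symbol, compare in OPS:
        if clause.startswith(symbol):
            return compare(as_tuple(version), as_tuple(clause[len(symbol):]))
    return fnmatch(version, clause)

def satisfies(version, pin):
    """Every comma-separated clause of the pin must hold."""
    return all(check_clause(version, clause) for clause in pin.split(','))

print(satisfies('2.7.12', '2.7.12'))         # exact pinning
print(satisfies('1.2.5', '1.2.*'))           # globbing
print(satisfies('1.6.25', '>=1.6.21,<1.7'))  # boolean logic
```

All three calls print `True`, while `satisfies('1.3.0', '1.2.*')` and `satisfies('1.7.0', '>=1.6.21,<1.7')` are both `False`, which is the "no major or minor releases allowed" behaviour described above.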
### Test

By default, testing occurs automatically after the package is built. If the tests fail, the package is moved into the “broken” folder, rather than the normal output folder for your platform.

Tests have been confusing for many people for some time. If your package did not include the test files, it was difficult to figure out how to get your tests to run. Conda build 2.0 adds a new key to the test section, “source_files,” that accepts a list of files and/or folders that will be copied from your source folder into your test folder at test time. These specifications are done with Python’s glob, so any glob pattern will work.

    test:
      source_files:
        - tests
        - some_important_test_file.txt
        - data/*.h5

### Selectors

Selectors are used to limit part of your meta.yaml file. Selectors exist for Python version, platform, and architecture. Selectors are parsed and applied after jinja2 templates, so you may use jinja2 templates for more dynamic selectors. The full list of available selectors is at http://conda.pydata.org/docs/building/meta-yaml.html#preprocessing-selectors.

### Jinja Templating

Templates are not a new feature, but they are not always well understood. Templates are placeholders that are dynamically filled with content when your recipe is loaded by conda build. They are heavily used at conda-forge, where they make updating recipes easier:

    {% set version = "1.0.0" %}
    package:
      name: my_test_package
      version: {{ version }}
    source:
      url: http://some.url/package-{{ version }}.tar.gz

Using templates this way means that you only have to change the version once, and it applies to multiple places.
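To make the "change it once" point concrete, here is a minimal Python sketch of what template expansion does to the recipe above. It is a deliberately tiny stand-in written with regular expressions, not Jinja2 and not what conda build runs internally; the `render` function is hypothetical and only understands `{% set name = "value" %}` and `{{ name }}`.

```python
import re

def render(recipe_text):
    """Expand {% set name = "value" %} and {{ name }} placeholders."""
    variables = {}

    def remember(match):
        # Record the variable; the set statement itself renders to nothing.
        variables[match.group(1)] = match.group(2)
        return ""

    text = re.sub(r'\{%\s*set\s+(\w+)\s*=\s*"([^"]*)"\s*%\}\n?', remember, recipe_text)
    # Replace every {{ name }} with the value recorded above.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: variables[m.group(1)], text)

RECIPE = '''{% set version = "1.0.0" %}
package:
  name: my_test_package
  version: {{ version }}
source:
  url: http://some.url/package-{{ version }}.tar.gz
'''

rendered = render(RECIPE)
print(rendered)
```

Editing the single `set` line changes both the `version` field and the download URL in the rendered text.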
Jinja templates also support running Python code to do interesting things, such as getting versions from a setup.py file: {% set data = load_setup_py_data() %} package: name: conda-build-test-source-setup-py-data version: {{ data.get('version') }} # source will be downloaded prior to filling in jinja templates # Example assumes that this folder has setup.py in it source: path_url: ../ The Python code that is actually reading the setup.py file (load_setup_py_data) is part of conda build (jinja_context.py). Presently, we do not have an extension scheme. That will be part of future work, so that users can customize their recipes with their own Python functions. ### Binary Prefix Length A somewhat esoteric aspect of relocatability is that binaries on Linux and Mac have prefix strings embedded in them that tell the binary where to go look for shared libraries. At build time, conda build detects these prefixes, and makes a note of where they are. At install time, conda uses that list to replace those prefixes with the appropriate prefix for the new environment that it is installing into. Historically, the length of these embedded prefixes has been 80 characters. Conda build 2.0 increases this length to 255 characters. Unfortunately, to fully take advantage of this change, all packages that would be installed into an environment need to have been built by conda build 2.0 to have the longer prefix. In practice, this means rebuilding many of the lower-level dependencies. To aid in this effort, conda build has added a tool: conda inspect prefix-lengths <package path> [more packages] [--min-prefix-length <value, default 255>] More concretely: conda inspect prefix-lengths ~/miniconda2/pkgs/*.tar.bz2 This is presently not relevant to Windows, though conda build does now record binary prefixes on Windows, especially for pip-created entry point executables, so that they can function correctly. 
These entry point executables consist of a program, the prefix, and the entry point script all rolled into a single executable. The prefix length does not matter, because the binary can simply be recreated with any arbitrary prefix by concatenating the pieces together.

### Conda Build API

Finally, the other large feature of conda build 2.0 has been the creation of a public API. This is a promise to our users that the interfaces will not change without a bump to the major version number. It is also an opportunity to divide the command line interface into smaller, more testable chunks. The CLI will still be available, and users will now have the API as a different, more guaranteed-stable option. The full API is at https://github.com/conda/conda-build/blob/master/conda_build/api.py. A quick mapping of legacy CLI commands to interesting API functions is the following:

| Command line interface command | Python API functions |
| --- | --- |
| conda build | api.build |
| conda build --output | api.get_output_file_path |
| conda render | api.output_yaml |
| conda sign | api.sign, api.verify, api.keygen, api.import_sign_key |
| conda skeleton | api.skeletonize; api.list_skeletons |
| conda develop | api.develop |
| conda inspect | api.test_installable; api.inspect_linkages; api.inspect_objects; api.inspect_prefix_length |
| conda index | api.update_index |
| conda metapackage | api.create_metapackage |

### Implementation Details of Potential Interest

• Non-global Config: conda build 1.x used a global instance of the conda_build.config.Config class. This has been replaced by passing a local Config instance across all system calls. This allows for more direct customization of API calls, and obviates the need to create ArgParse namespace objects to interact with conda-build.

• Build id and Build folder: conda build 1.x stored environments with other conda environments, and stored the build “work” folder and test work (test_tmp) folder in the conda-bld folder (by default).
Conda-build 2.0 assigns a build id to each build, consisting of the recipe name joined with the number of milliseconds since the epoch. While a name collision is theoretically possible here, it should be unlikely. Both the environments and the work folders have moved into folders named with the build id. Each build is thus self-contained, and multiple builds can run at once (in separate processes). The monotonically increasing build ids facilitate reuse of source with the “--dirty” build option.

#### Introducing GeoViews

Thursday, September 1, 2016
Jim Bednar
Continuum Analytics
Philipp Rudiger
Continuum Analytics

GeoViews is a new Python library that makes it easy to explore and visualize geographical, meteorological, oceanographic, weather, climate, and other real-world data. GeoViews was developed by Continuum Analytics, in collaboration with the Met Office. GeoViews is completely open source, freely available under a BSD license for both commercial and non-commercial use, and can be obtained as described at the GitHub site.

GeoViews is built on the HoloViews library for building flexible visualizations of multidimensional data. GeoViews adds a family of geographic plot types, transformations, and primitives based primarily on the Cartopy library, plotted using either the Matplotlib or Bokeh packages. GeoViews objects are just like HoloViews objects, except that they have an associated geographic projection based on cartopy.crs. For instance, you can overlay temperature data with coastlines using simple expressions like gv.Image(temperature)*gv.feature.coastline, and easily embed these in complex, multi-figure layouts and animations using both GeoViews and HoloViews primitives, while always being able to access the raw data underlying any plot.
This post shows you how GeoViews makes it simple to use point, path, shape, and multidimensional gridded data on both geographic and non-geographic coordinate systems.

    import numpy as np
    import xarray as xr
    import pandas as pd
    import holoviews as hv
    import geoviews as gv
    import iris
    import cartopy
    from cartopy import crs
    from cartopy import feature as cf
    from geoviews import feature as gf

    hv.notebook_extension('bokeh','matplotlib')
    %output backend='matplotlib'
    %opts Feature [projection=crs.Robinson()]

# Built-in geographic features

GeoViews provides a library of basic features based on cartopy.feature that are useful as backgrounds for your data, to situate it in a geographic context. Like all HoloViews Elements (objects that display themselves), these GeoElements can easily be laid out alongside each other using '+' or overlaid together using '*':

    gf.coastline + gf.ocean + gf.ocean*gf.land*gf.coastline

Other Cartopy features not included by default can be used similarly by explicitly wrapping them in a gv.Feature GeoViews Element object:

    %%opts Feature.Lines (facecolor='none' edgecolor='gray')
    graticules = gv.Feature(cf.NaturalEarthFeature(category='physical', name='graticules_30',scale='110m'), group='Lines')
    graticules

The '*' operator used above is a shorthand for hv.Overlay, which can be used to show the full set of feature primitives provided:

    %%output size=450
    features = hv.Overlay([gf.ocean, gf.land, graticules, gf.rivers, gf.lakes, gf.borders, gf.coastline])
    features

# Projections

GeoViews allows incoming data to be specified in any coordinate system supported by Cartopy's crs module. This data is then transformed for display in another coordinate system, called the Projection. For instance, the features above are displayed in the Robinson projection, which was declared at the start of the notebook.
Some of the other available projections include:

    projections = [crs.RotatedPole, crs.Mercator, crs.LambertCylindrical, crs.Geostationary, crs.Gnomonic, crs.PlateCarree, crs.Mollweide, crs.Miller, crs.LambertConformal, crs.AlbersEqualArea, crs.Orthographic, crs.Robinson]

When using matplotlib, any of the available coordinate systems from cartopy.crs can be used as output projections, and we can use hv.Layout (what '+' is shorthand for) to show each of them:

    hv.Layout([features.relabel(group=p.__name__)(plot=dict(projection=p())) for p in projections]).display('all').cols(3)

The Bokeh backend currently only supports a single output projection type, Web Mercator, but as long as you can use that projection, it offers full interactivity, including panning and zooming to see detail (after selecting tools using the menu at the right of the plot):

    %%output backend='bokeh'
    %%opts Overlay [width=600 height=500 xaxis=None yaxis=None] Feature.Lines (line_color='gray' line_width=0.5)
    features

# Tile Sources

As you can see if you zoom in closely to the above plot, the shapes and outlines are limited in resolution, due to the need to have relatively small files that can easily be downloaded to your local machine. To provide more detail when needed for zoomed-in plots, geographic data is often divided up into separate tiles that can be downloaded individually and then combined to cover the required area. GeoViews lets you use any tile provider supported by Matplotlib (via cartopy) or Bokeh, which lets you add image or map data underneath any other plot.
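To give a feel for how a location and zoom level pick out one tile, here is a small Python sketch of the widely used Web Mercator "XYZ" tile numbering. It assumes the provider follows that scheme (as OpenStreetMap-style servers do); the helper names `tile_for` and `tile_url` are hypothetical and are not part of GeoViews or Bokeh, which handle this for you.

```python
import math

def tile_for(lon, lat, zoom):
    """Return the (x, y) indices of the tile containing a lon/lat point."""
    n = 2 ** zoom                                   # tiles per side at this zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern or southern edge stay inside the grid.
    return min(x, n - 1), min(max(y, 0), n - 1)

def tile_url(template, lon, lat, zoom):
    """Fill a {Z}/{X}/{Y} URL template for the tile covering a point."""
    x, y = tile_for(lon, lat, zoom)
    return (template.replace("{Z}", str(zoom))
                    .replace("{X}", str(x))
                    .replace("{Y}", str(y)))
```

Each extra zoom level doubles the number of tiles per side, which is why a viewer only needs to fetch the handful of tiles covering the current extent.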
For instance, different sets of tiles at an appropriate resolution will be selected for this plot, depending on the extent selected:

    %%output dpi=200
    url = 'https://map1c.vis.earthdata.nasa.gov/wmts-geo/wmts.cgi'
    gv.WMTS(url, layer='VIIRS_CityLights_2012', crs=crs.PlateCarree(), extents=(0, -60, 360, 80))

Tile servers are particularly useful with the Bokeh backend, because the data required as you zoom in isn't requested until you actually do the zooming, which allows a single plot to cover the full range of zoom levels provided by the tile server.

    %%output backend='bokeh'
    %%opts WMTS [width=450 height=250 xaxis=None yaxis=None]
    from bokeh.models import WMTSTileSource
    from bokeh.tile_providers import STAMEN_TONER
    tiles = {'OpenMap': WMTSTileSource(url='http://c.tile.openstreetmap.org/{Z}/{X}/{Y}.png'),
             'ESRI': WMTSTileSource(url='https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg'),
             'Wikipedia': WMTSTileSource(url='https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png'),
             'Stamen Toner': STAMEN_TONER}
    hv.NdLayout({name: gv.WMTS(wmts, extents=(0, -90, 360, 90), crs=crs.PlateCarree())
                 for name, wmts in tiles.items()}, kdims=['Source']).cols(2)

If you select the "wheel zoom" tool in the Bokeh tools menu at the upper right of the above figure, you can use your scroll wheel to zoom into all of these plots at once, comparing the level of detail available at any location for each of these tile providers. Any WMTS tile provider that accepts URLs with an x and y location and a zoom level should work with Bokeh; you can find more at openstreetmap.org.

# Point data

Bokeh, Matplotlib, and GeoViews are mainly intended for plotting data, not just maps, and so the above tile sources and cartopy features are typically in the background of the actual data being plotted.
When there is a data layer, the extent of the data will determine the extent of the plot, and so extent will not need to be provided explicitly as in the previous examples. The simplest kind of data to situate geographically is point data: longitude and latitude coordinates for locations on the Earth's surface. GeoViews makes it simple to overlay such plots onto Cartopy features, tile sources, or other geographic data. For instance, let's load a dataset of all the major cities in the world with their population counts over time: cities = pd.read_csv('./assets/cities.csv', encoding="ISO-8859-1") population = gv.Dataset(cities, kdims=['City', 'Country', 'Year']) cities.tail() City Country Latitude Longitude Year Population 10025 Valencia Venezuela (Bolivarian Republic of) 10.17 -68.00 2050.0 2266000.0 10026 Al-Hudaydah Yemen 14.79 42.94 2050.0 1854000.0 10027 Sana'a' Yemen 15.36 44.20 2050.0 4382000.0 10028 Ta'izz Yemen 13.57 44.01 2050.0 1743000.0 10029 Lusaka Zambia -15.42 28.17 2050.0 2047000.0 Now we can convert this text-based dataset to a set of visible points mapped by the latitude and longitude, and containing the population, country, and city name as values. The longitudes and latitudes in the dataframe are supplied in simple Plate Carree coordinates, which we will need to declare explicitly, since each value is just a number with no inherently associated units. 
The .to conversion interface lets us do this succinctly, giving us points that are instantly visualizable either on their own or in a geographic context:

    cities = population.to(gv.Points, kdims=['Longitude', 'Latitude'], vdims=['Population', 'City', 'Country'], crs=crs.PlateCarree())

    %%output backend='bokeh'
    %%opts Overlay [width=600 height=300 xaxis=None yaxis=None]
    %%opts Points (size=0.005 cmap='inferno') [tools=['hover'] color_index=2]
    gv.WMTS(WMTSTileSource(url='https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png')) * cities

Note that since we did not assign the Year dimension to the points key or value dimensions, it is automatically assigned to a HoloMap, rendering the data as an animation using a slider widget. Because this is a Bokeh plot, you can also zoom into any location to see more geographic detail, which will be requested dynamically from the tile provider (try it!). The matplotlib version of the same plot, using the Cartopy Ocean feature to provide context, provides a similar widget to explore the Year dimension not mapped onto the display:

    %%output size=200
    %%opts Feature [projection=crs.PlateCarree()]
    %%opts Points (s=0.000007 cmap='inferno') [color_index=2]
    gf.ocean * cities[::4]

However, the matplotlib version doesn't provide any interactivity within the plot; the output is just a series of PNG images encoded into the web page, with one image selected for display at any given time using the Year widget.

# Shapes

Points are zero-dimensional objects with just a location. It is also important to be able to work with one-dimensional paths (such as borders and roads) and two-dimensional areas (such as land masses and regions). GeoViews provides special GeoElements for paths (gv.Path) and polygons (gv.Polygons).
The GeoElement types are extensions of the basic HoloViews Elements hv.Path and hv.Polygons that add support for declaring geographic coordinate systems via the crs parameter and support for choosing the display coordinate systems via the projection parameter. Like their HoloViews equivalents, gv.Path and gv.Polygons accept lists of NumPy arrays or Pandas dataframes, which is good for working with low-level data.

In practice, the higher-level GeoElements gv.Shape (which wraps around a shapely shape object) and gv.Feature (which wraps around a Cartopy Feature object) are more convenient, because they make it simple to work with large collections of shapes. For instance, the various features like gf.ocean and gf.coastline introduced above are gv.Feature types, based on cartopy.feature.OCEAN and cartopy.feature.COASTLINE, respectively.

We can easily access the individual shapely objects underlying these features if we want to work with them separately. Here we will get the geometry corresponding to the Australian continent and display it using shapely's inbuilt SVG repr (not yet a HoloViews plot, just a bare SVG displayed by Jupyter directly):

    land_geoms = list(gf.land.data.geometries())
    land_geoms[21]

Instead of letting shapely render it as an SVG, we can now wrap it in the gv.Shape object and let matplotlib or bokeh render it, alone or with other GeoViews or HoloViews objects:

    %%opts Points (color="black")
    %%output dpi=120
    australia = gv.Shape(land_geoms[21], crs=crs.PlateCarree())
    australia * hv.Points([(133.870,-23.700)]) * hv.Text(133.870,-21.5, 'Alice Springs')

The above plot uses HoloViews elements (notice the prefix hv.), which do not have any information about coordinate systems, and so the plot works properly only because it was specified as PlateCarree coordinates (bare longitude and latitude values).
You can use other projections safely as long as you specify the coordinate system for the Text and Points objects explicitly, which requires using GeoViews Elements (prefix gv.):

    %%opts Points (color="black")
    pc=crs.PlateCarree()
    australia(plot=dict(projection=crs.Mollweide(central_longitude=133.87))) * \
    gv.Points([(133.870,-23.700)],crs=pc) * gv.Text(133.870,-21.5, 'Alice Springs',crs=pc)

You can see why the crs parameter is important if you change the above cell to omit ,crs=pc from gv.Points and gv.Text; neither the dot nor the text label will then be in the correct location, because they won't be transformed to match the Mollweide projection used for the rest of the plot.

Multiple shapes can be combined into an NdOverlay object either explicitly:

    %output dpi=120 size=150
    %%opts NdOverlay [aspect=2]
    hv.NdOverlay({i: gv.Shape(s, crs=crs.PlateCarree()) for i, s in enumerate(land_geoms)})

or by loading a collection of shapes from a shapefile, such as this collection of UK electoral district boundaries:

    %%opts NdOverlay [aspect=0.75]
    shapefile='./assets/boundaries/boundaries.shp'
    gv.Shape.from_shapefile(shapefile, crs=crs.PlateCarree())

One common use for shapefiles is to create choropleth maps, where each part of the geometry is assigned a value that will be used to color it. Constructing a choropleth by combining a bunch of shapes one by one can be a lot of effort and is error prone, but is straightforward when using a shapefile that assigns standardized codes to each shape.
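The reason standardized codes make choropleths straightforward is that coloring reduces to a keyed join between shapes and data. The following is a minimal Python sketch of that idea using plain dictionaries; `join_on_code` is a hypothetical helper, not a GeoViews function, and the toy sample reuses two referendum rows and one shape code that appear in this post.

```python
def join_on_code(shape_records, rows, on="code", value="leaveVoteshare"):
    """Pair each shape record with the value from the data row sharing its code."""
    values = {row[on]: row[value] for row in rows}
    joined, unmatched = [], []
    for record in shape_records:
        code = record[on]
        if code in values:
            joined.append((code, values[code]))
        else:
            unmatched.append(code)       # shape with no data: nothing to color it by
    return joined, unmatched

# Toy sample: attributes of two shapes and two rows of data.
shape_records = [{"code": "E06000001"}, {"code": "E07000007"}]
rows = [
    {"code": "E06000001", "name": "Hartlepool", "leaveVoteshare": 69.599998},
    {"code": "E06000002", "name": "Middlesbrough", "leaveVoteshare": 65.5},
]

joined, unmatched = join_on_code(shape_records, rows)
```

Reporting unmatched codes separately is a design choice of this sketch; it makes gaps between the shapefile and the dataset visible instead of silently dropping shapes.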
For instance, the shapefile for the above UK plot assigns a well-defined geographic code for each electoral district's MultiPolygon shapely object: shapes = cartopy.io.shapereader.Reader(shapefile) list(shapes.records())[0] <Record: <shapely.geometry.multipolygon.MultiPolygon object at 0x11786a0d0>, {'code': 'E07000007'}, <fields>> To make a choropleth map, we just need a dataset with values indexed using these same codes, such as this dataset of the 2016 EU Referendum result in the UK: referendum = pd.read_csv('./assets/referendum.csv') referendum = hv.Dataset(referendum) referendum.data.head() leaveVoteshare regionName turnout name code 0 4.100000 Gibraltar 83.500000 Gibraltar BS0005003 1 69.599998 North East 65.500000 Hartlepool E06000001 2 65.500000 North East 64.900002 Middlesbrough E06000002 3 66.199997 North East 70.199997 Redcar and Cleveland E06000003 4 61.700001 North East 71.000000 Stockton-on-Tees E06000004 To make it simpler to match up the data with the shape files, you can use the .from_records method of the gv.Shape object to build a gv.Shape overlay that automatically merges the data and the shapes to show the percentage of each electoral district who voted to leave the EU: %%opts NdOverlay [aspect=0.75] Shape (cmap='viridis') gv.Shape.from_records(shapes.records(), referendum, on='code', value='leaveVoteshare', index=['name', 'regionName'], crs=crs.PlateCarree()) As usual, the matplotlib output is static, but the Bokeh version of the same data is interactive, allowing both zooming and panning within the geographic area and revealing additional data such as the county name and numerical values when hovering over each shape: %%output backend='bokeh' %%opts Shape (cmap='viridis') [xaxis=None yaxis=None tools=['hover'] width=400 height=500] gv.Shape.from_records(shapes.records(), referendum, on='code', value='leaveVoteshare', index='name', crs=crs.PlateCarree(), group='EU Referendum') As you can see, essentially the same code as was needed for the 
static Matplotlib version now provides a fully interactive view of this dataset.

For the remaining sections, let's set some default parameters:

    %opts Image [colorbar=True] Curve [xrotation=60] Feature [projection=crs.PlateCarree()]
    hv.Dimension.type_formatters[np.datetime64] = '%Y-%m-%d'
    %output dpi=100 size=100

Here in this blog post we will use only a limited number of frames and plot sizes, to avoid bloating the web page with too much data, but when working on a live server one can append widgets='live' to the %output line above. In live mode, plots are rendered dynamically using Python based on user interaction, which allows agile exploration of large, multidimensional parameter spaces, without having to precompute a fixed set of plots.

# Gridded data

In addition to point, path, and shape data, GeoViews is designed to make full use of multidimensional gridded (raster) datasets, such as those produced by satellite sensing, systematic land-based surveys, and climate simulations. This data is often stored in netCDF files that can be read into Python with the xarray and Iris libraries. HoloViews and GeoViews can use data from either library in all of their objects, along with NumPy arrays, Pandas data frames, and Python dictionaries. In each case, the data can be left stored in its original, native format, wrapped in a HoloViews or GeoViews object that provides instant interactive visualizations.

To get started, let's load a dataset (originally taken from iris-sample-data) containing surface temperature data indexed by 'longitude', 'latitude', and 'time':

    xr_dataset = gv.Dataset(xr.open_dataset('./sample-data/ensemble.nc'), crs=crs.PlateCarree(), kdims=['latitude','longitude','time'], vdims=['surface_temperature'])
    xr_dataset

    :Dataset [latitude,longitude,time] (surface_temperature)

Here there is one "value dimension", i.e. surface temperature, whose value can be obtained for any combination of the three "key dimensions" (coordinates) longitude, latitude, and time.
We can quickly build an interactive tool for exploring how this data changes over time:

    surf_temp = xr_dataset.to(gv.Image, ['longitude', 'latitude']) * gf.coastline
    surf_temp[::2]

Here the slider for 'time' was generated automatically, because we instructed HoloViews to lay out two of the key dimensions as x and y coordinates in an Image when we called .to(), with the value dimension 'surface_temperature' mapping onto the color of the image pixels by default, but we did not specify what should be done with the remaining 'time' key dimension. HoloViews is designed to make everything visualizable somehow, so it automatically generates a slider to cover this "extra" dimension, allowing you to explore how the surface_temperature values change over time. In a static page like this blog post, each frame will be embedded into the page; in a live Jupyter notebook, however, it is trivial to explore large datasets and render each frame dynamically.

You could instead have told HoloViews to lay out the remaining dimension spatially (as faceted plots), in which case the slider will disappear because there is no remaining dimension to explore. As an example, here let's grab just the first three frames, then lay them out spatially:

    surf_temp[::2].layout()

## Normalization

By default, HoloViews will normalize items displayed together as frames in a slider or animation, applying a single colormap across all items of the same type sharing the same dimensions, so that differences are clear. In this particular dataset, the range changes relatively little, so that even if we turn off such normalization in layouts (or in animation frames using {+framewise}) the results are similar:

    %%opts Image {+axiswise}
    surf_temp[::2].layout()

Here you can see that each frame has a different range in the color bar, but it's a subtle effect. If we really want to highlight changes over a certain range of interest, we can set explicit normalization limits.
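The difference between a shared range, a per-frame range, and explicit limits can be sketched in plain Python. This is a conceptual illustration only, not how HoloViews implements normalization; the function names and the toy temperature values are made up for the example.

```python
def normalize(value, low, high):
    """Map value into [0, 1] for colormap lookup, clipping outside [low, high]."""
    fraction = (value - low) / float(high - low)
    return min(1.0, max(0.0, fraction))

def shared_range(frames):
    """One (low, high) across all frames, so colors are comparable between frames."""
    values = [v for frame in frames for v in frame]
    return min(values), max(values)

def per_frame_ranges(frames):
    """A separate (low, high) for each frame, so each frame uses the full colormap."""
    return [(min(frame), max(frame)) for frame in frames]

# Two toy frames of temperatures in kelvin (illustrative values only).
frames = [[280.0, 300.0, 317.0], [275.0, 295.0, 310.0]]

print(shared_range(frames))        # one range for every frame
print(per_frame_ranges(frames))    # one range per frame
print(normalize(290.0, 300.0, 317.0))  # below an explicit lower limit, so clipped
```

With an explicit lower limit, everything beneath it collapses to the bottom of the colormap, leaving the whole color range for the interval of interest.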
For this data, let's find the maximum temperature in the dataset, and use it to set a normalization range by using the redim method:

    max_surface_temp = xr_dataset.range('surface_temperature')[1]
    print max_surface_temp
    xr_dataset.redim(surface_temperature=dict(range=(300, max_surface_temp))).\
    to(gv.Image,['longitude', 'latitude'])[::2] * gf.coastline(style=dict(edgecolor='white'))

    317.331787109

Now we can see a clear cooling effect over time, as the yellow and white areas close to the top of the normalization range (317K) vanish in the Americas and in Africa. Values outside this range are clipped to the ends of the color map.

## Non-Image views of gridded data

gv.Image Elements are a common way to view gridded data, but the .to() conversion interface supports other types as well, such as filled or unfilled contours and points:

    %%output size=100 dpi=100
    %%opts Points [color_index=2 size_index=None] (cmap='jet')
    hv.Layout([xr_dataset.to(el,['longitude', 'latitude'])[::5, 0:50, -30:40] * gf.coastline
               for el in [gv.FilledContours, gv.LineContours, gv.Points]]).cols(3)

## Non-geographic views of gridded data

So far we have focused entirely on geographic views of gridded data, plotting the data on a projection involving longitude and latitude. However, the .to() conversion interface is completely general, allowing us to slice and dice the data in any way we like. To illustrate this, let's load an expanded version of the above surface temperature dataset that adds an additional 'realization' dimension.

    kdims = ['realization', 'longitude', 'latitude', 'time']
    xr_ensembles = xr.open_dataset('./sample-data/ensembles.nc')
    dataset = gv.Dataset(xr_ensembles, kdims=kdims, vdims=['surface_temperature'], crs=crs.PlateCarree())
    dataset

    :Dataset [realization,longitude,latitude,time] (surface_temperature)

The realization is effectively a certain set of modelling parameters that leads to different predicted values for the temperatures at given times.
We can see this clearly if we map the data onto a temperature versus time plot:

    %%output backend='bokeh'
    %%opts Curve [xrotation=25] NdOverlay [width=600 height=400 legend_position='left']
    sliced = dataset.select(latitude=(0, 5), longitude=(0,10))
    sliced.to(hv.Curve, 'time').overlay('realization')

Here there is no geographic organization to the visualization, because we selected non-geographic coordinates to display. Just as before, the key dimensions not selected for display have become sliders, but in this case the leftover dimensions are longitude and latitude. (Realization would also be left over and thus generate a slider, if it hadn't been mapped onto an overlay above.)

Because this is a static web page, we selected only a small portion of the data to be available in the above plot, i.e. all data points in the range 0,10 for latitude and longitude. If this code were running on a live Python server, one could instead access all the data dynamically:

    hv.Layout([dataset.to(hv.Curve, 'time', dynamic=True).overlay('realization')])

We can also make non-geographic 2D plots, for instance as a HeatMap over time and realization, again at a specified longitude and latitude:

    %%opts HeatMap [show_values=False colorbar=True]
    sliced.to(hv.HeatMap, ['realization', 'time'])

In general, any HoloViews Element type (of which there are many!) can be used for non-geographic dimensions selected in this way, while any GeoViews GeoElement type can be used for geographic data.

## Reducing and aggregating gridded data

So far all the conversions shown have incorporated each of the available coordinate dimensions, either explicitly as dimensions in the plot, or implicitly as sliders for the leftover dimensions. However, instead of revealing all the data individually in this way, we often want to see the spread of values along one or more dimensions, pooling all the other dimensions together.
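Pooling can be pictured as grouping every value by the one dimension you keep and ignoring the rest. The sketch below shows that with plain Python dictionaries; `pool` and `spread` are hypothetical helpers and the records are toy values, so this illustrates the concept without reflecting how HoloViews stores or reduces gridded data.

```python
from collections import defaultdict

def pool(records, keep, value="surface_temperature"):
    """Group values by the `keep` dimension, pooling over every other dimension."""
    pooled = defaultdict(list)
    for record in records:
        pooled[record[keep]].append(record[value])
    return dict(pooled)

def spread(values):
    """Minimum, median, and maximum: a rough summary of the spread of values."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[middle]
    else:
        median = (ordered[middle - 1] + ordered[middle]) / 2.0
    return ordered[0], median, ordered[-1]

# Toy records: one temperature per (time, latitude, longitude) combination.
records = [
    {"time": "t0", "latitude": 0, "longitude": 0, "surface_temperature": 300.0},
    {"time": "t0", "latitude": 5, "longitude": 0, "surface_temperature": 298.0},
    {"time": "t0", "latitude": 5, "longitude": 10, "surface_temperature": 296.0},
    {"time": "t1", "latitude": 0, "longitude": 0, "surface_temperature": 301.0},
    {"time": "t1", "latitude": 5, "longitude": 0, "surface_temperature": 299.0},
]

by_time = pool(records, "time")
summary = {t: spread(vals) for t, vals in by_time.items()}
```

Swapping `"time"` for another key pools over a different set of dimensions without changing anything else.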
A simple example of this is a box plot where we might want to see the spread of surface_temperature on each day, pooled across all latitude and longitude coordinates. To pool across particular dimensions, we can explicitly declare the "map" dimensions, which are the key dimensions of the HoloMap container rather than those of the Elements contained in the HoloMap. By declaring an empty list of mdims, we can tell the conversion interface '.to()' to pool across all dimensions except the particular key dimension(s) supplied, in this case the 'time' (plot A) and 'realization' (plot B): %%opts BoxWhisker [xrotation=25 bgcolor='w'] hv.Layout([dataset.to.box(d, mdims=[]) for d in ['time', 'realization']]) This approach also gives us access to other statistical plot types. For instance, with the seaborn library installed, we can use the Distribution Element, which visualizes the data as a kernel density estimate. In this way we can visualize how the distribution of surface temperature values varies over time and the model realizations. 
We do this by omitting 'latitude' and 'longitude' from the list of mdims, generating a lower-dimensional view into the data, where a temperature histogram is shown for every 'realization' and 'time', using GridSpace:

    %opts GridSpace [shared_xaxis=True fig_size=150]
    %opts Distribution [bgcolor='w' show_grid=False xticks=[220, 300]]
    import seaborn
    dataset.to.distribution(mdims=['realization', 'time']).grid()

### Selecting a particular coordinate

To examine one particular coordinate, we can select it, cast the data to Curves, reindex the data to drop the now-constant latitude and longitude dimensions, and overlay the remaining 'realization' dimension:

    %%opts NdOverlay [xrotation=25 aspect=1.5 legend_position='right' legend_cols=2] Curve (color=Palette('Set1'))
    dataset.select(latitude=0, longitude=0).to(hv.Curve, ['time']).reindex().overlay()

### Aggregating coordinates

Another option is to aggregate over certain dimensions, so that we can get an idea of distributions of temperatures across all latitudes and longitudes. Here we compute the mean temperature and standard deviation by latitude and longitude, casting the resulting collapsed view to a Spread Element:

    %%output backend='bokeh'
    lat_agg = dataset.aggregate('latitude', np.mean, np.std)
    lon_agg = dataset.aggregate('longitude', np.mean, np.std)
    hv.Spread(lat_agg) * hv.Curve(lat_agg) + hv.Spread(lon_agg) * hv.Curve(lon_agg)

As you can see, with GeoViews and HoloViews it is very simple to select precisely which aspects of complex, multidimensional datasets you want to focus on. See holoviews.org and geo.holoviews.org to get started!

## August 31, 2016

### Paul Ivanov

#### Jupyter's Gravity

I'm switching jobs. For the past two years I've been working with the great team at Disqus as a member of the backend and data teams.
Before that, I spent a half-dozen years mostly not working on my thesis at UC Berkeley but instead contributing to the scientific Python ecosystem, especially matplotlib, IPython, and the IPython notebook, which is now called Jupyter. So when Bloomberg reached out to me with a compelling position to work on those open-source projects again from their SF office, such a tremendous opportunity was hard to pass up. You could say Jupyter has a large gravitational pull that's hard to escape, but you'd be a huge nerd. ;) I have a lot to catch up on, but I'm really excited and looking forward to contributing on these fronts again!

## August 25, 2016

### Continuum Analytics news

#### Celebrating U.S. Women's Equality Day with Women in Tech

Thursday, August 25, 2016

August 26 is recognized as Women's Equality Day in the United States, celebrating the addition of the 19th Amendment to the Constitution in 1920, which granted women the right to vote. This amendment was the culmination of an immense movement in women's rights, dating all the way back to the first women's rights convention in Seneca Falls, New York, in 1848.

To commemorate this day, we decided to reach out to influential, successful and all around superstar women in technology to ask them one question: If women were never granted the right to vote, how do you think the landscape of women in STEM would be different?

### Katy Huff, @katyhuff

"If women were never granted the right to vote, I think it's fair to say that other important movements on the front lines of women's rights would not have followed either. Without that basic recognition of equality -- the ability to participate in democracy -- would we have ever seen Title VII of the Civil Rights Act (1964) or Title IX of the Education Amendments (1972)? Surely not. And without them, women could legally be discriminated against when seeking an education and then again later when seeking employment.
There wouldn't merely be a minority of women in tech (as is currently the case) - there would be a super-minority. If there were any women at all able to compete for these lucrative jobs, that tiny minority could legally be paid less than their colleagues and looked upon as second class citizens without any more voice in the workplace than in their own democracy."

## katy_womensday.png

### Renee M. P. Teate, @BecomingDataSci

"If women were never granted the right to vote in the U.S., the landscape of women in STEM would be very different, because the landscape of our entire country would be different. Voting is a basic right in a democracy, and it is important to allow citizens of all races, sexes/genders, religions, wealth statuses, and backgrounds to participate in electing our leaders, and therefore shaping the laws of our country. When anyone is excluded from participating, they are not represented and can be more easily marginalized or treated unfairly under the law. The 19th amendment gave women not only a vote and a voice, but "full legal status" as citizens. That definitely impacts our roles in the workplace and in STEM, because if the law doesn't treat you as a whole and valued participant, you can't expect peers or managers to, either. Additionally, if the law doesn't offer equal protection to everyone, discrimination would run (even more) rampant and there might be no legal recourse for incidents such as sexual harassment in the workplace. A celebration of women is important within STEM fields, because it wasn't long ago that women were not seen as able to be qualified for many careers in STEM, including roles hired by public/governmental organizations like NASA that are funded by taxpayers and report to our elected officials. Even today, there are many prejudices against women, including beliefs by some that women are inferior at performing jobs such as computer programming and scientific research.
There are also institutional biases in both our educational system and the workplace that we still need to work on. When women succeed despite these additional barriers (not to mention negative comments by unsupportive people and other detractors), that is worth celebrating. Though there are still many issues relating to bias against women and people of color in STEM, without the basic right to vote we would be even further behind on the quest for equality in the U.S. than we are today."

## womensday_renee.png

### Carol Willing, @WillingCarol

"From the 19th amendment ratification to now, several generations of women have made their contributions to technical fields. These women celebrated successes, failures, disappointments, hopes, and dreams. Sometimes, as a person in tech, I wonder if my actions make a difference on others. Is it worth the subtle putdowns, assumptions about my ability, and, at times, overt bullying to continue working as an engineer and software developer? Truthfully, sometimes the answer is no, but most days my feeling is “YES! I have a right to study and work on technical problems that I find fascinating.” My daughter, my son, and you have that right too. Almost a decade ago, I watched the movie “Iron Jawed Angels” with my middle school daughter, her friend, and a friend of mine who taught middle school history. The movie was tough to watch. We were struck by the sacrifice made by suffragettes, Alice Paul and Lucy Burns, amid the brutal abuse from others that did not want women to vote. A powerful reminder that we can’t control the actions of others, but we can stand up for ourselves and our right to be engineers, developers, managers, and researchers in technical fields. Your presence in tech and your contributions make a difference to humanity now and tomorrow."
## womensday_carol.png

### Jasmine Sandhu, @sandhujasmine

"It's a numbers game, if more people have an opportunity to contribute to a field, you have a lot more talent, many more ideas and that many more people working on solutions and new ideas. The "Science" in STEM is key - an informed citizenry that asks for evidence when confronted with the many pseudoscientific claims that we navigate in everyday life is critical. It is important for all of us to learn the scientific method and see its relevance in day to day life, so we 'ask for evidence' when people around us make claims about our diet, about our health, our civic discourse, our politics. Similarly, I wish I had learned statistics since childhood. It is an idea with which we should be very comfortable. Randomness is a part of our daily lives and being able to make decisions and take risks based less on how we feel about things and be able to analyze critically the options would be wonderful. Of course, education has a far greater impact in our lives than simply the demographic that we represent in a field. I'm still struck by the pseudoscience books aimed at little girls (astrology) and the scientific books targeting the boys (astronomy) - of course, this is an anecdotal example, but in the US we still hear about girls losing interest in science and math in middle school. Hard to believe this is the case in the 21st century. Living in a place like Seattle in the 21st century has enabled opportunities for me that don't exist for a lot of women in the world. I work remotely in a technical field which gives me freedom to structure my day to care for my daughter, live close to my family which is my support structure, and earn well enough to provide for my daughter and I. STEM fields offer yet more opportunities for all people, including women."

## jasminewomensday.png

We loved hearing the perspectives of these women in STEM.
If you'd like to share your response, please respond in the comments below, or tweet us @ContinuumIO! We've also created a special Anaconda graphic to celebrate, which you can see below. If you're currently at PyData Chicago, find the NumFOCUS table to grab a sticker!

## anaconda_sticker.png

### Happy Women's Equality Day!

-Team Anaconda

#### Succeeding in the New World Order of Data

Thursday, August 25, 2016
Travis Oliphant
Chief Executive Officer & Co-Founder
Continuum Analytics

## LaptopScreen 2.jpg

"If you want to understand function, study structure." Sage advice from Francis Crick, who revolutionized genetics with his Nobel Prize winning co-discovery of the structure of DNA, launching more than six decades of fruitful research.

Crick was referring to biology, but today's companies competing in the Big Data space should heed his advice. With change at a pace this intense, understanding and optimizing one’s data science infrastructure (and therefore functionality) makes all the difference. But, what’s the best way to do that?

Fortunately, there's an ideal solution for evolving in a rapidly-changing context while generating competitive insights from today's deluge of data.

That solution is an emerging movement called Open Data Science, which uses open source software to drive cutting-edge analytics that go far beyond what traditional proprietary data software can provide.

### Shoring up Your Infrastructure

Open Data Science draws its power from four fundamental principles: accessibility, innovation, interoperability and transparency. These ensure source code that’s accessible for the whole team, free from licensing restrictions or vendor release schedules, and that works seamlessly with other tools.

Because open source libraries are free, the barrier to entry is very low, allowing teams to dive in and freely experiment without the concerns of a massive financial commitment up front, which encourages innovation.
Although transitioning to a new analytics infrastructure is never trivial, the community spirit of open source software and Open Data Science's commitment to interoperability make it quite manageable.

Anaconda, for example, provides over 720 well-tested Python libraries for the demands of today's data science, all available from a single install. Business analysts can be brought on board with Anaconda Fusion, providing access to data analysis functions in Python within the familiar Excel interface.

With connectors to other languages, integration of legacy code, HPC and parallel computing, as well as visualizations easily deployed to the web, there’s no limit to what can be achieved with Open Data Science.

### Navigating Potential Pitfalls

With traditional solutions, unforeseen limits can bring the train to a screeching halt. I know of a large government project that convened many experts to creatively solve problems using data. The agency had invested in a many-node compute cluster with attached GPUs. But when the experts arrived, the software installed was not inclusive and allowed less than a third of them to actually use it.

Organizations cannot simply buy the latest monolithic tech from vendors and expect data science to just happen. The software must enable data scientists and play to their strengths, not only to the needs of IT operations.

Unlike proprietary offerings, Open Data Science has evolved along with the Big Data revolution and, to a significant extent, driven it. Its toolset is designed with compatibilities that drive progress.

### Setting up Your Scaffolding

Making the shift to an Open Data Science infrastructure is more than just choosing software and databases. It must also include people.

Companies should provision the time and resources necessary to set up new organizational structures and provide budgets to enable these groups to work effectively.
A pilot data-exploration team, a center of excellence or an emerging technology team are all examples of models that enable organizations to begin to uncover the opportunity in their data. As the organization grows, individual roles may change or new ones may emerge.

Details of which toolsets to use will need to be hammered out. Many developers are already familiar with common Open Data Science applications, such as data notebooks like Jupyter, while others may require more of a learning curve to implement.

Choices such as programming languages will vary by developers' preferences and particular needs. Python is commonly used, and for good reason. It is, by far, the dominant language for scientific computing, and it integrates beautifully with Open Data Science.

Finally, well-managed migration is critical to success. Open Data Science allows for a number of options, from "co-existence" of Open Data Science with current infrastructure to piecemeal, or even full migration, all depending on a company's tolerance for risk or willingness to commit. Legacy code can also be retained and integrated with Open Data Science wrappers, allowing old but debugged and stable code-bases to serve new duty in a modern analytics environment.

### Taking Data Science to a New Level

When genetics boomed as a science in the 1950s, new insights were always on the way. But, to get the ball rolling, biologists needed to understand DNA's structure and exploit that understanding. Francis Crick and others began the process, and society continues to benefit.

Data Science is similarly poised on the cusp of an astounding future. Those organizations that understand their analytics infrastructure will excel in that new world, with Open Data Science as the instrument for success.
### Jake Vanderplas

#### Conda: Myths and Misconceptions

I've spent much of the last decade using Python for my research, teaching Python tools to other scientists and developers, and developing Python tools for efficient data manipulation, scientific and statistical computation, and visualization. The Python-for-data landscape has changed immensely since I first installed NumPy and SciPy via a flickering CRT display. Among the new developments since those early days, the one with perhaps the broadest impact on my daily work has been the introduction of conda, the open-source cross-platform package manager first released in 2012.

In the four years since its initial release, many words have been spilt introducing conda and espousing its merits, but one thing I have consistently noticed is the number of misconceptions that seem to remain in the (often fervent) discussions surrounding this tool. I hope in this post to do a small part in putting these myths and misconceptions to rest.

I've tried to be as succinct as I can, but if you want to skim this article and get the gist of the discussion, you can read each heading along with the bold summary just below it.

### Myth #1: Conda is a distribution, not a package manager

Reality: Conda is a package manager; Anaconda is a distribution. Although Conda is packaged with Anaconda, the two are distinct entities with distinct goals.

A software distribution is a pre-built and pre-configured collection of packages that can be installed and used on a system. A package manager is a tool that automates the process of installing, updating, and removing packages. Conda, with its "conda install", "conda update", and "conda remove" sub-commands, falls squarely under the second definition: it is a package manager.

Perhaps the confusion here comes from the fact that Conda is tightly coupled to two software distributions: Anaconda and Miniconda.
Anaconda is a full distribution of the central software in the PyData ecosystem, and includes Python itself along with binaries for several hundred third-party open-source projects. Miniconda is essentially an installer for an empty conda environment, containing only Conda and its dependencies, so that you can install what you need from scratch.

But make no mistake: Conda is as distinct from Anaconda/Miniconda as is Python itself, and (if you wish) can be installed without ever touching Anaconda/Miniconda. For more on each of these, see the conda FAQ.

### Myth #2: Conda is a Python package manager

Reality: Conda is a general-purpose package management system, designed to build and manage software of any type from any language. As such, it also works well with Python packages.

Because conda arose from within the Python (more specifically PyData) community, many mistakenly assume that it is fundamentally a Python package manager. This is not the case: conda is designed to manage packages and dependencies within any software stack. In this sense, it's less like pip, and more like a cross-platform version of apt or yum.

If you use conda, you are probably already taking advantage of many non-Python packages; the following command will list the ones in your environment:

    $ conda search --canonical | grep -v 'py\d\d'

On my system, there are 350 results: these are packages within my Conda/Python environment that are fundamentally unmanageable by Python-only tools like pip & virtualenv.
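To make the filtering step concrete, here is a minimal Python sketch of what the `grep -v 'py\d\d'` stage in the command above does: it drops every canonical name whose build string carries a Python-version tag such as `py35`. The package names in the list are hypothetical samples chosen for illustration, not real output from `conda search`.

```python
import re

# Hypothetical canonical names in name-version-build form; real output will differ.
canonical_names = [
    "numpy-1.11.1-py35_0",
    "openssl-1.0.2h-1",
    "scipy-0.18.0-np111py27_0",
    "zlib-1.2.8-3",
    "mkl-11.3.3-0",
]

# Same pattern as the grep call: "py" followed by two digits.
PY_BUILD = re.compile(r"py\d\d")


def non_python_packages(names):
    """Keep only names without a pyXY tag, mirroring `grep -v`."""
    return [name for name in names if not PY_BUILD.search(name)]


print(non_python_packages(canonical_names))
```

With these sample names, only the OpenSSL, zlib and MKL entries survive the filter, which is the kind of package that Python-only tools have no way to track.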

### Myth #3: Conda and pip are direct competitors

Reality: Conda and pip serve different purposes, and only directly compete in a small subset of tasks: namely installing Python packages in isolated environments.

Pip, which stands for Pip Installs Packages, is Python's officially-sanctioned package manager, and is most commonly used to install packages published on the Python Package Index (PyPI). Both pip and PyPI are governed and supported by the Python Packaging Authority (PyPA).

In short, pip is a general-purpose manager for Python packages; conda is a language-agnostic cross-platform environment manager. For the user, the most salient distinction is probably this: pip installs Python packages within any environment; conda installs any package within conda environments. If all you are doing is installing Python packages within an isolated environment, conda and pip+virtualenv are mostly interchangeable, modulo some difference in dependency handling and package availability. By isolated environment I mean a conda-env or virtualenv, in which you can install packages without modifying your system Python installation.
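As a rough illustration of the "isolated environment" idea, the sketch below guesses which kind of environment a Python interpreter lives in. It relies on two assumptions: that conda environments contain a `conda-meta` directory under their prefix, and that virtualenv/venv make `sys.prefix` differ from the base prefix. The function name `describe_environment` is made up for this example, and the check is a heuristic rather than an official API.

```python
import os
import sys


def describe_environment(prefix=None, base_prefix=None):
    """Guess which kind of isolated environment a Python prefix belongs to."""
    if prefix is None:
        prefix = sys.prefix
    if base_prefix is None:
        # venv sets sys.base_prefix; older virtualenv releases set sys.real_prefix.
        base_prefix = getattr(sys, "real_prefix",
                              getattr(sys, "base_prefix", prefix))
    # Heuristic (assumption): conda keeps per-environment metadata in `conda-meta`.
    if os.path.isdir(os.path.join(prefix, "conda-meta")):
        return "conda environment"
    # In a virtualenv/venv the active prefix differs from the base interpreter's.
    if prefix != base_prefix:
        return "virtualenv/venv"
    return "system or other Python"


print(describe_environment())
```

Running this from inside a conda-env, a virtualenv, and a plain system Python should give three different answers, which is a quick way to see where `pip install` or `conda install` would put packages.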

Even setting aside Myth #2, if we focus on just installation of Python packages, conda and pip serve different audiences and different purposes. If you want to, say, manage Python packages within an existing system Python installation, conda can't help you: by design, it can only install packages within conda environments. If you want to, say, work with the many Python packages which rely on external dependencies (NumPy, SciPy, and Matplotlib are common examples), while tracking those dependencies in a meaningful way, pip can't help you: by design, it manages Python packages and only Python packages.

Conda and pip are not competitors, but rather tools focused on different groups of users and patterns of use.

### Myth #4: Creating conda in the first place was irresponsible & divisive

Reality: Conda's creators pushed Python's standard packaging to its limits for over a decade, and only created a second tool when it was clear it was the only reasonable way forward.

According to the Zen of Python, when doing anything in Python "There should be one – and preferably only one – obvious way to do it." So why would the creators of conda muddy the field by introducing a new way to install Python packages? Why didn't they contribute back to the Python community and improve pip to overcome its deficiencies?

As it turns out, that is exactly what they did. Prior to 2012, the developers of the PyData/SciPy ecosystem went to great lengths to work within the constraints of the package management solutions developed by the Python community. As far back as 2001, the NumPy project forked distutils in an attempt to make it handle the complex requirements of a NumPy distribution. They bundled a large portion of NETLIB into a single monolithic Python package (you might know this as SciPy), in effect creating a distribution-as-python-package to circumvent the fact that Python's distribution tools cannot manage these extra-Python dependencies in any meaningful way. An entire generation of scientific Python users spent countless hours struggling with the installation hell created by this exercise of forcing a square peg into a round hole – and those were just the ones lucky enough to be using Linux. If you were on Windows, forget about it. To read some of the details about these pain-points and how they led to Conda, I'd suggest Travis Oliphant's 2013 blog post on the topic.

But why didn't Conda's creators just talk to the Python packaging folks and figure out these challenges together? As it turns out, they did.

The genesis of Conda came after Guido van Rossum was invited to speak at the inaugural PyData meetup in 2012; in a Q&A on the subject of packaging difficulties, he told us that when it comes to packaging, "it really sounds like your needs are so unusual compared to the larger Python community that you're just better off building your own" (See video of this discussion). Even while following this nugget of advice from the BDFL, the PyData community continued dialog and collaboration with core Python developers on the topic: one more public example of this was the invitation of CPython core developer Nick Coghlan to keynote at SciPy 2014 (See video here). He gave an excellent talk which specifically discusses pip and conda in the context of the "unsolved problem" of software distribution, and mentions the value of having multiple means of distribution tailored to the needs of specific users.

Far from insinuating that Conda is divisive, Nick and others at the Python Packaging Authority officially recognize conda as one of many important redistributors of Python code, and are working hard to better enable such tools to work seamlessly with the Python Package Index.

### Myth #5: conda doesn't work with virtualenv, so it's useless for my workflow

Reality: You actually can install (some) conda packages within a virtualenv, but it is better to use Conda's own environment manager: it is fully compatible with pip and has several advantages over virtualenv.

virtualenv/venv are utilities that allow users to create isolated Python environments that work with pip. Conda has its own built-in environment manager that works seamlessly with both conda and pip, and in fact has several advantages over virtualenv/venv:

• conda environments integrate management of different Python versions, including installation and updating of Python itself. Virtualenvs must be created upon an existing, externally managed Python executable.
• conda environments can track non-Python dependencies; for example, seamlessly managing dependencies and parallel versions of essential tools like LAPACK or OpenSSL.
• Rather than environments built on symlinks – which break the isolation of the virtualenv and can be flimsy at times for non-Python dependencies – conda-envs are true isolated environments within a single executable path.
• While virtualenvs are not compatible with conda packages, conda environments are entirely compatible with pip packages. First conda install pip, and then you can pip install any available package within that environment. You can even explicitly list pip packages in conda environment files, meaning the full software stack is entirely reproducible from a single environment metadata file.
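To illustrate the last point in the list above, here is a small Python sketch that writes out a conda environment file with pip packages listed under a nested `pip:` entry. The environment name, the package names, and the helper `environment_yaml` are all hypothetical; only the overall layout of the file (a `name`, a `dependencies` list, and a `pip:` sub-list) is meant to reflect the environment file format.

```python
def environment_yaml(name, conda_deps, pip_deps):
    """Render a minimal environment file as text (hypothetical helper)."""
    lines = ["name: " + name, "dependencies:"]
    # Packages installed by conda, including pip itself.
    lines += ["  - " + dep for dep in conda_deps]
    if pip_deps:
        # Packages installed by pip inside the same environment.
        lines.append("  - pip:")
        lines += ["    - " + dep for dep in pip_deps]
    return "\n".join(lines) + "\n"


text = environment_yaml(
    "demo-env",                     # hypothetical environment name
    ["python=3.5", "numpy", "pip"],
    ["some-pypi-only-package"],     # placeholder for a package only on PyPI
)

with open("environment.yml", "w") as handle:
    handle.write(text)

print(text)
```

The resulting file can then be passed to `conda env create -f environment.yml`, so that both the conda-managed and pip-managed parts of the stack are recorded in one place.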

That said, if you would like to use conda within your virtualenv, it is possible:

    $ virtualenv test_conda
    $ source test_conda/bin/activate
    $ pip install conda
    $ conda install numpy

This installs conda's MKL-enabled NumPy package within your virtualenv. I wouldn't recommend this: I can't find documentation for this feature, and the result seems to be fairly brittle – for example, trying to conda update python within the virtualenv fails in a very ungraceful and unrecoverable manner, seemingly related to the symlinks that underlie virtualenv's architecture. This appears not to be some fundamental incompatibility between conda and virtualenv, but rather related to some subtle inconsistencies in the build process, and thus is potentially fixable (see conda Issue 1367 and anaconda Issue 498, for example).

If you want to avoid these difficulties, a better idea would be to pip install conda and then create a new conda environment in which to install conda packages. For someone accustomed to pip/virtualenv/venv command syntax who wants to try conda, the conda docs include a translation table between conda and pip/virtualenv commands.

### Myth #6: Now that pip uses wheels, conda is no longer necessary

Reality: wheels address just one of the many challenges that prompted the development of conda, and wheels have weaknesses that Conda's binaries address.

One difficulty which drove the creation of Conda was the fact that pip could distribute only source code, not pre-compiled binary distributions, an issue that was particularly challenging for users building extension-heavy modules like NumPy and SciPy. After Conda had solved this problem in its own way, pip itself added support for wheels, a binary format designed to address this difficulty within pip. With this issue addressed within the common tool, shouldn't Conda early-adopters now flock back to pip?

Not necessarily. Distribution of cross-platform binaries was only one of the many problems solved within conda. Compiled binaries spotlight the other essential piece of conda: the ability to meaningfully track non-Python dependencies. Because pip's dependency tracking is limited to Python packages, the main way of doing this within wheels is to bundle released versions of dependencies with the Python package binary, which makes updating such dependencies painful (recent security updates to OpenSSL come to mind). Additionally, conda includes a true dependency resolver, a component which pip currently lacks.
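To show what a dependency resolver has to do, here is a toy backtracking resolver in Python. It is a sketch for illustration only and is not conda's actual algorithm; the package index, the version numbers, and the `resolve` function are invented for this example.

```python
# Hypothetical index: package -> {version: {dependency: allowed versions}}.
INDEX = {
    "app":     {"1.0": {"numpy": {"1.10", "1.11"}, "openssl": {"1.0.2"}}},
    "numpy":   {"1.10": {"mkl": {"11.3"}}, "1.11": {"mkl": {"11.3"}}},
    "mkl":     {"11.3": {}},
    "openssl": {"1.0.1": {}, "1.0.2": {}},
    "legacy":  {"0.9": {"numpy": {"1.10"}}},
}


def version_key(version):
    """Order versions numerically, so that 1.11 sorts above 1.10 and 1.9."""
    return tuple(int(part) for part in version.split("."))


def resolve(requests, chosen=None):
    """Pick one version per package so that every constraint holds.

    `requests` is a list of (package, allowed_versions or None) pairs.
    Returns a {package: version} dict, or None if nothing satisfies them.
    """
    chosen = dict(chosen or {})
    if not requests:
        return chosen
    (pkg, allowed), rest = requests[0], requests[1:]
    if pkg in chosen:
        # Already pinned: the earlier choice must satisfy this constraint too.
        if allowed is None or chosen[pkg] in allowed:
            return resolve(rest, chosen)
        return None
    # Try the newest version first, backtracking to older ones on conflict.
    for version in sorted(INDEX[pkg], key=version_key, reverse=True):
        if allowed is not None and version not in allowed:
            continue
        trial = dict(chosen)
        trial[pkg] = version
        deps = list(INDEX[pkg][version].items())
        result = resolve(deps + rest, trial)
        if result is not None:
            return result
    return None


print(resolve([("app", None)]))
print(resolve([("app", None), ("legacy", None)]))
```

Asking for `app` alone selects the newest NumPy in this toy index, while asking for `app` and `legacy` together forces the search to back off to the older NumPy that both can accept. An installer that processes requirements one at a time without this kind of search can end up with an inconsistent set of versions.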

For scientific users, conda also allows things like linking builds to optimized linear algebra libraries, as Continuum does with its freely-provided MKL-enabled NumPy/SciPy. Conda can even distribute non-Python build requirements, such as gcc, which greatly streamlines the process of building other packages on top of the pre-compiled binaries it distributes. If you try to do this using pip's wheels, you better hope that your system has compilers and settings compatible with those used to originally build the wheel in question.

### Myth #7: conda is not open source; it is tied to a for-profit company who could start charging for the service whenever they want

Reality: conda (the package manager and build system) is 100% open-source, and Anaconda (the distribution) is nearly there as well.

In the open source world, there is (sometimes quite rightly) a fundamental distrust of for-profit entities, and the fact that Anaconda was created by Continuum Analytics and is a free component of a larger enterprise product causes some to worry.

Let's set aside the fact that Continuum is, in my opinion, one of the few companies really doing open software the right way (a topic for another time). Ignoring that, the fact is that Conda itself – the package manager that provides the utilities to build, distribute, install, update, and manage software in a cross-platform manner – is 100% open-source, available on GitHub and BSD-Licensed. Even for Anaconda (the distribution), the EULA is simply a standard BSD license, and the toolchain used to create Anaconda is also 100% open-source. In short, there is no need to worry about intellectual property issues when using Conda.

If the Anaconda/Miniconda distributions still worry you, rest assured: you don't need to install Anaconda or Miniconda to get conda, though those are convenient avenues to its use. As we saw above, you can "pip install conda" to install it via PyPI without ever touching Continuum's website.

### Myth #8: But Conda packages themselves are closed-source, right?

Reality: though conda's default channel is not yet entirely open, there is a community-led effort (Conda-Forge) to make conda packaging & distribution entirely open.

Historically, the package build process for the default conda channel has not been as open as it could be, and the process of getting a build updated has mostly relied on knowing someone at Continuum. Rumor is that this was largely because the original conda package creation process was not as well-defined and streamlined as it is today.

But this is changing. Continuum is making the effort to open their package recipes, and I've been told that only a few dozen of the 500+ packages remain to be ported. These few recipes are the only remaining piece of the Anaconda distribution that are not entirely open.

If that's not enough, there is a new community-led – not Continuum affiliated – project, introduced in early 2016, called conda-forge that contains tools for the creation of community-driven builds for any package. Packages are maintained in the open via GitHub, with binaries automatically built using free CI tools like TravisCI for Mac OSX builds, AppVeyor for Windows builds, and CircleCI for Linux builds. All the metadata for each package lives in a GitHub repository, and package updates are accomplished through merging a GitHub pull request (here is an example of what a package update looks like in conda-forge).

Conda-forge is entirely community-founded and community-led, and while conda-forge is probably not yet mature enough to completely replace the default conda channel, Continuum's founders have publicly stated that this is a direction they would support. You can read more about the promise of conda-forge in Wes McKinney's recent blog post, conda-forge and PyData's CentOS moment.

### Myth #9: OK, but if Continuum Analytics folds, conda won't work anymore right?

Reality: nothing about Conda inherently ties it to Continuum Analytics; the company serves the community by providing free hosting of build artifacts. All software distributions need to be hosted by somebody, even PyPI.

It's true that even conda-forge publishes its package builds to http://anaconda.org/, a website owned and maintained by Continuum Analytics. But there is nothing in Conda that requires this site. In fact, the creation of Custom Channels in conda is well-documented, and there would be nothing to stop someone from building and hosting their own private distribution using Conda as a package manager (conda index is the relevant command). Given the openness of conda recipes and build systems on conda-forge, it would not be all that hard to mirror all of conda-forge on your own server if you have reason to do so.

If you're still worried about Continuum Analytics – a for-profit company – serving the community by hosting conda packages, you should probably be equally worried about Rackspace – a for-profit company – serving the community by hosting the Python Package Index. In both cases, a for-profit company is integral to the current manifestation of the community's package management system. But in neither case would the demise of that company threaten the underlying architecture of the build & distribution system, which is entirely free and open source. If either Rackspace or Continuum were to disappear, the community would simply have to find another host and/or financial sponsor for the open distribution it relies on.

### Myth #10: Everybody should abandon (conda | pip) and use (pip | conda) instead!

Reality: pip and conda serve different needs, and we should be focused less on how they compete and more on how they work together.

As mentioned in Myth #3, Conda and pip are different projects with different intended audiences: pip installs Python packages within any environment; conda installs any package within conda environments. Given the lofty ideals raised in the Zen of Python, one might hope that pip and conda could somehow be combined, so that there would be one and only one obvious way of installing packages.

But this will never happen. The goals of the two projects are just too different. Unless the pip project is broadly re-scoped, it will never be able to meaningfully install and track all the non-Python packages that conda does: the architecture is Python-specific and (rightly) Python-focused. Pip, along with PyPI, aims to be a flexible publication & distribution platform and manager for Python packages, and it does phenomenally well at that.

Likewise, unless the conda package is broadly re-scoped, it will never make sense for it to replace pip/PyPI as a general publishing & distribution platform for Python code. At its very core, conda concerns itself with the type of detailed dependency tracking that is required for robustly running a complex multi-language software stack across multiple platforms. Every installation artifact in conda's repositories is tied to an exact dependency chain: by design, it wouldn't allow you to, say, substitute Jython for Python in a given package. You could certainly use conda to build a Jython software stack, but each package would require a new Jython-specific installation artifact – that is what is required to maintain the strict dependency chain that conda users rely on. Pip is much more flexible here, but one cost of that is its inability to precisely define and resolve dependencies as conda does.
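One way to see how an artifact is tied to an exact dependency chain is to look at its canonical name. The Python sketch below assumes the name-version-build layout, with a build string that embeds a Python tag such as `py35`; the sample name and the two helper functions are hypothetical and serve only as an illustration.

```python
import re


def split_canonical(name):
    """Split 'name-version-build' from the right, since names may contain dashes."""
    package, version, build = name.rsplit("-", 2)
    return package, version, build


def python_tag(build):
    """Return the Python version encoded in a build string, or None if absent."""
    match = re.search(r"py(\d)(\d)", build)
    return "{}.{}".format(*match.groups()) if match else None


# Hypothetical artifact name, used only as an example.
package, version, build = split_canonical("scikit-learn-0.17.1-np111py35_1")
print(package, version, build, python_tag(build))
```

Under that assumption, the same package version built against a different interpreter is a different artifact with a different build string, which is why swapping in another Python implementation would require a fresh set of builds.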

Finally, the focus on pip vs. conda entirely misses the broad swath of purpose-designed redistributors of Python code. From platform-specific package managers like apt, yum, macports, and homebrew, to cross-platform tools like bento, buildout, hashdist, and spack, there is a wide range of specific packaging solutions aimed at installing Python (and other) packages for particular users. It would be more fruitful for us to view these, as the Python Packaging Authority does, not as competitors to pip/PyPI, but as downstream tools that can take advantage of the heroic efforts of all those who have developed and maintained pip, PyPI, and the associated toolchain.

## Where to Go from Here?

So it seems we're left with two packaging solutions which are distinct, yet have broad overlap for many Python users (i.e. when installing Python packages in isolated environments). So where should the community go from here? I think the main thing we can do is make sure the projects (1) work together as well as possible, and (2) learn from each other's successes.

### Conda

As mentioned above, conda already has a fully open toolchain, and is on a steady trend toward fully open packages (but is not entirely there just yet). An obvious direction is to push forward on community development and maintenance of the conda stack via conda-forge, perhaps eventually using it to replace conda's current default channel.

As we push forward on this, I believe the conda and conda-forge community could benefit from imitating the clear and open governance model of the Python Packaging Authority. For example, PyPA has an open governance model with explicit goals, a clear roadmap for new developments and features, well-defined channels of communication and discussion, and community oversight of the full pip/PyPI system from the ground up.

With conda and conda-forge, on the other hand, the code (and soon all recipes) is open, but the model for governance and control of the system is far less explicit. Given the importance of conda particularly in the PyData community, it would benefit all of us to clarify this somehow – perhaps under the umbrella of the NumFOCUS organization.

That being said, folks involved with conda-forge have told me that this is currently being addressed by the core team, including generation of governing documents, a code of conduct, and framework for enhancement proposals.

### PyPI/pip

While the Python Package Index seems to have its governance in order, there are aspects of conda/conda-forge that I think would benefit it. For example, currently most Python packages can be loaded to conda-forge with just a few steps:

1. Post a public code release somewhere on the web (on github, bitbucket, PyPI, etc.)
2. Create a recipe/metadata file that points to this code and lists dependencies
3. Open a pull request on conda-forge/staged-recipes

And that's it. Once the pull request is merged, the binary builds on Windows, OSX, and Linux are automatically created and loaded to the conda-forge channel. Additionally, managing and updating the package takes place transparently via github, where package updates can be reviewed by collaborators and tested by CI systems before they go live.

I find this process far preferable to the (by comparison relatively opaque and manual) process of publishing to PyPI, which is mostly done by a single user working in private at a local terminal. Perhaps PyPI could take advantage of conda-forge's existing build system, and create an option to automatically build multi-platform wheels and source distributions, and automatically push them to PyPI in a single transparent command. It is definitely a possibility.

## Postscript: Which Tool Should I Use?

I hope I've convinced you that conda and pip both have a role to play within the Python community. With that behind us, which should you use if you're starting out? The answer depends on what you want to do:

If you have an existing system Python installation and you want to install packages in or on it, use pip+virtualenv. For example, perhaps you used apt or another system package manager to install Python, along with some packages linked to system tools that are not (yet) easily installable via conda or pip. Pip+virtualenv will allow you to install new Python packages and build environments on top of that existing distribution, and you should be able to rely on your system package manager for any difficult-to-install dependencies.

If you want to flexibly manage a multi-language software stack and don't mind using an isolated environment, use conda. Conda's multi-language dependency management and cross-platform binary installations can do things in this situation that pip cannot do. A huge benefit is that for most packages, the result will be immediately compatible with multiple operating systems.

If you want to install Python packages within an isolated environment, pip+virtualenv and conda+conda-env are mostly interchangeable. This is the overlap region where both tools shine in their own way. That being said, I tend to prefer conda in this situation: Conda's uniform, cross-platform, full-stack management of multiple parallel Python environments with robust dependency management has proven to be an incredible time-saver in my research, my teaching, and my software development work. Additionally, I find that my needs and the needs of my colleagues more often stray into areas of conda's strengths (management of non-Python tools and dependencies) than into areas of pip's strengths (environment-agnostic Python package management).

As an example, years ago I spent nearly a quarter with a colleague trying to install the complicated (non-Python) software stack that powers the megaman package, which we were developing together. The result of all our efforts was a single non-reproducible working stack on a single machine. Then conda-forge was introduced. We went through the process again, this time creating a conda recipe, from which a conda-forge feedstock was built. We now have a cross-platform solution that will install a working version of the package and its dependencies with a single command, in seconds, on nearly any computer. If there is a way to build and distribute software with that kind of dependency graph seamlessly with pip+PyPI, I haven't seen it.

If you've read this far, I hope you've found this discussion useful. My own desire is that we as a community can continue to rally around both these tools, improving them for the benefit of current and future users. Python packaging has improved immensely in the last decade, and I'm excited to see where it will go from here.

Thanks to Filipe Fernandez, Aaron Meurer, Bryan van de Ven, and Phil Elson for helpful feedback on early drafts of this post. As always, any mistakes are my own.

### Matthew Rocklin

#### Supporting Users in Open Source

What are the social expectations of open source developers to help users understand their projects? What are the social expectations of users when asking for help?

As part of developing Dask, an open source library with growing adoption, I directly interact with users over GitHub issues for bug reports, StackOverflow for usage questions, and a mailing list and live Gitter chat for community conversation. Dask is blessed with awesome users. These are researchers doing very cool work of high impact and with novel use cases. They report bugs and usage questions with such skill that it’s clear that they are Veteran Users of open source projects.

## Veteran Users are Heroes

It’s not easy being a veteran user. It takes a lot of time to distill a bug down to a reproducible example, or a question into an MCVE, or to read all of the documentation to make sure that a conceptual question definitely isn’t answered in the docs. And yet this effort really shines through and it’s incredibly valuable to making open source software better. These distilled reports are arguably more important than fixing the actual bug or writing the actual documentation.

Bugs occur in the wild, in code that is half related to the developer’s library (like Pandas or Dask) and half related to the user’s application. The veteran user works hard to pull away all of their code and data, creating a gem of an example that is trivial to understand and run anywhere that still shows off the problem.

This way the veteran user can show up with their problem to the development team and say “here is something that you will quickly understand to be a problem.” On the developer side this is incredibly valuable. They learn of a relevant bug and immediately understand what’s going on, without having to download someone else’s data or understand their domain. This switches from merely convenient to strictly necessary when the developers deal with 10+ such reports a day.

## Novice Users need help too

However there are a lot of novice users out there. We have all been novice users once, and even if we are veterans today we are probably still novices at something else. Knowing what to do and how to ask for help is hard. Having the guts to walk into a chat room where people will quickly see that you’re a novice is even harder. It’s like using public transit in a deeply foreign language. Respect is warranted here.

I categorize novice users into two groups:

1. Experienced technical novices, who are very experienced in their field and technical things generally, but who don’t yet have a thorough understanding of open source culture and how to ask questions smoothly. They’re entirely capable of behaving like a veteran user if pointed in the right directions.
2. Novice technical novices, who don’t yet have the ability to distill their problems into the digestible nuggets that open source developers expect.

In the first case of technically experienced novices, I’ve found that being direct works surprisingly well. I used to be apologetic in asking people to submit MCVEs. Today I’m more blunt but surprisingly I find that this group doesn’t seem to mind. I suspect that this group is accustomed to operating in situations where other people’s time is very costly.

The second case, novice novice users, is more challenging for individual developers to handle one-by-one, both because novices are more common, and because solving their problems often requires more time commitment. Instead open source communities often depend on broadcast and crowd-sourced solutions, like documentation, StackOverflow, or meetups and user groups. For example in Dask we strongly point people towards StackOverflow in order to build up a knowledge-base of question-answer pairs. Pandas has done this well; almost every Pandas question you Google leads to a StackOverflow post, handling 90% of the traffic and improving the lives of thousands. Many projects simply don’t have the human capital to hand-hold individuals through using the library.

In a few projects there are enough generous and experienced users that they’re able to field questions from individual users. SymPy is a good example here. I learned open source programming within SymPy. Their community was broad enough that they were able to hold my hand as I learned Git, testing, communication practices and all of the other soft skills that we need to be effective in writing great software. The support structure of SymPy is something that I’ve never experienced anywhere else.

## My Apologies

I’ve found myself becoming increasingly impolite when people ask me for certain kinds of extended help with their code. I’ve been trying to track down why this is and I think that it comes from a mismatch of social contracts.

Large parts of technical society have an (entirely reasonable) belief that open source developers are available to answer questions about how we use their project. This was probably true in popular culture, where our stereotypical image of an open source developer was working out of their basement long into the night on things that relatively few enthusiasts bothered with. They were happy to engage and had the free time in which to do it.

In some ways things have changed a lot. We now have paid professionals building software that is used by thousands or millions of users. These professionals easily charge consulting fees of hundreds of dollars per hour for exactly the kind of assistance that people show up expecting for free under the previous model. These developers have to answer for how they spend their time when they’re at work, and when they’re not at work they now have families and kids that deserve just as much attention as their open source users.

Both of these cultures, the creative do-it-yourself basement culture and the more corporate culture, are important to the wonderful surge we’ve seen in open source software. How do we balance them? Should developers, like doctors or lawyers, perform pro-bono work as part of their profession? Should grants specifically include paid time for community engagement and outreach? Should users, as part of receiving help, feel an obligation to improve documentation or stick around and help others?

## Solutions?

I’m not sure what to do here. I feel an obligation to remain connected with users from a broad set of applications, even those that companies or grants haven’t decided to fund. However at the same time I don’t know how to say “I’m sorry, I simply don’t have the time to help you with your problem.” in a way that feels at all compassionate.

I think that people should still ask questions. I think that we need to foster an environment in which developers can say “Sorry. Busy.” more easily. I think that we as a community need better resources to teach novice users to become veteran users.

One positive approach is to honor veteran users, and through this public praise to encourage other users to “up their game”, much as developers do today with coding skills. There are thousands of blogposts about how to develop code well, and people strive tirelessly to improve themselves. My hope is that by attaching the language of skill, like the term “veteran”, to user behaviors we can create an environment where people are proud of how cleanly they can raise issues and how clearly they can describe questions for documentation. Doing this well is critical for a project’s success and requires substantial effort and personal investment.

## August 23, 2016

### Enthought

#### Webinar: Introducing the NEW Python Integration Toolkit for LabVIEW

LabVIEW is a software platform made by National Instruments, used widely in industries such as semiconductors, telecommunications, aerospace, manufacturing, electronics, and automotive for test and measurement applications. In August 2016, Enthought released the Python Integration Toolkit for LabVIEW, which is a “bridge” between the LabVIEW and Python environments.

In this webinar, we’ll demonstrate:

1. How the new Python Integration Toolkit for LabVIEW from Enthought seamlessly brings the power of the Python ecosystem of scientific and engineering tools to LabVIEW
2. Examples of how you can extend LabVIEW with Python, including using Python for signal and image processing, cloud computing, web dashboards, machine learning, and more

Quickly and efficiently access scientific and engineering tools for signal processing, machine learning, image and array processing, web and cloud connectivity, and much more. With only minimal coding on the Python side, this extraordinarily simple interface provides access to all of Python’s capabilities.

Try it with your data, free for 30 days

Download a free 30 day trial of the Python Integration Toolkit for LabVIEW from the National Instruments LabVIEW Tools Network.

How LabVIEW users can benefit from Python:

• High-level, general purpose programming language ideally suited to the needs of engineers, scientists, and analysts
• Huge, international user base representing industries such as aerospace, automotive, manufacturing, military and defense, research and development, biotechnology, geoscience, electronics, and many more
• Tens of thousands of available packages, ranging from advanced 3D visualization frameworks to nonlinear equation solvers
• Simple, beginner-friendly syntax and fast learning curve

## August 21, 2016

### Titus Brown

#### Lessons on doing science from my father, Gerry Brown

(This is an invited chapter for a memorial book about my father. You can also read my remembrances from the day after he passed away.)

Dr. Gerald E. Brown was a well known nuclear physicist and astrophysicist who worked at Stony Brook University from 1968 until his death in 2013. He was internationally active in physics research from the late 1950s onwards, ran an active research group at Stony Brook until 2009, and supervised nearly a hundred PhD students during his life. He was also my father.

It's hard to write about someone who is owned, in part, by so many people. I came along late in my father's life (he was 48 when I was born), and so I didn't know him that well as an adult. However, growing up with a senior professor as a father had a huge impact on my scientific career, which I can recognize even more clearly now that I'm a professor myself.

Gerry (as I called him) didn't spend much time directly teaching his children about his work. When I was invited to write something for his memorial book, it was suggested that I write about what he had taught me about being a scientist. I found myself stymied, because to the best of my recollection we had never talked much about the practice of science. When I mentioned this to my oldest brother, Hans, we shared a good laugh -- he had exactly the same experience with our father, 20 years before me!

Most of what Gerry taught me was taught by example. Below are some of the examples that I remember most clearly, and of which I'm the most proud. While I don't know if either of my children will become scientists, if they do, I hope they take these to heart -- I can think of few better scientific legacies to pass on to them from my father.

## Publishing work that is interesting (but perhaps not correct) can make for a fine career.

My father was very proud of his publishing record, but not because he was always (or even frequently) right. In fact, several people told me that he was somewhat notorious for having a 1-in-10 "hit rate" -- he would come up with many crazy ideas, of which only about 1 in 10 would be worth pursuing. However, that 1 in 10 was enough for him to have a long and successful career. That this drove some people nuts was merely an added bonus in his view.

Gerry was also fond of publishing controversial work. Several times he told me he was proudest of the papers that caused the most discussion and collected the most rebuttals. He wryly noted that these papers often gathered many citations, even when they turned out to be incorrect.

## The best collaborations are both personal and professional friendships.

The last twenty-five years of Gerry's life were dominated by a collaboration with Hans Bethe on astrophysics, and they traveled to Pasadena every January until the early 2000s to work at the California Institute of Technology. During this month they lived in the same apartment, worked together closely, and met with a wide range of people on campus to explore scientific ideas; they also went on long hikes in the mountains above Pasadena (chronicled by Chris Adami in "Three Weeks with Hans Bethe"). These close interactions not only fueled his research for the remainder of the year, but emanated from a deep personal friendship. It was clear that, to Gerry, there was little distinction between personal and professional in his research life.

## Science is done by people, and people need to be supported.

Gerry was incredibly proud of his mentoring record, and did his best to support his students, postdocs, and junior colleagues both professionally and personally. He devoted the weeks around Christmas each year to writing recommendation letters for junior colleagues. He spent years working to successfully nominate colleagues to the National Academy of Sciences. He supported junior faculty with significant amounts of his time and sometimes by forgoing his own salary to boost theirs. While he never stated it explicitly, he considered most ideas somewhat ephemeral, and thought that his real legacy -- and the legacy most worth having -- lay in the students and colleagues who would continue after him.

## Always treat the administrative staff well.

Gerry was fond of pointing out that the secretaries and administrative staff had more practical power than most faculty, and that it was worth staying on their good side. This was less a statement of calculated intent and more an observation that many students, postdocs, and faculty treated non-scientists with less respect than they deserved. He always took the time to interact with them on a personal level, and certainly seemed to be well liked for it. I've been told by several colleagues who worked with Gerry that this was a lesson that they took to heart in their own interactions with staff, and it has also served me well.

## Hard work is more important than brilliance.

One of Gerry's favorite quotes was "Success is 99% perspiration, 1% inspiration", a statement attributed to Thomas Edison. According to Gerry, he simply wasn't as smart as many of his colleagues, but he made up for it by working very hard. I have no idea how modest he was being -- he was not always known for modesty -- but he certainly worked very hard, spending 10-14 hours a day writing in his home office, thinking in the garden, or meeting with colleagues at work. While I try for more balance in my work and life myself, he demonstrated to me that sometimes grueling hard work is a necessity when tackling tricky problems: for example, my Earthshine publications came after a tremendously unpleasant summer working on some incredibly messy and very tedious analysis code, but without the resulting analysis we wouldn't have been able to advance the project (which continues today, almost two decades later).

## Experiments should talk to theory, and vice versa.

Steve Koonin once explained to me that Gerry was a phenomenologist -- a theorist who worked well with experimental data -- and that this specialty was fairly rare because it required communicating effectively across subdisciplines. Gerry wasn't attracted to deep theoretical work and complex calculations, and in any case liked to talk to experimentalists too much to be a good theorist -- for example, some of our most frequent dinner guests when I was growing up were Peter Braun-Munzinger and Johanna Stachel, both experimentalists. So he chose to work at the interface of theory and experiment, where he could develop and refine his intuition based on competing world views emanating from the theorists (who sought clean mathematical solutions) and experimentalists (who had real data that needed to be reconciled with theory). I have tried to pursue a similar strategy in computational biology.

## Computers and mathematical models are tools, but the real insight comes from intuition.

Apart from some early experience with punch cards at Yale in the 1950s, Gerry avoided computers and computational models completely in his own research (although his students, postdocs and collaborators used them, of course). I am told that his theoretical models were often relatively simple approximations, and he himself often said that his work with Hans Bethe proceeded by choosing the right approximation for the problem at hand -- something at which Bethe excelled. Their choice of approximation was guided by intuition about the physical nature of the problem as much as by mathematical insight, and they could often use a few lines of the right equations to reach results similar to complex computational and mathematical models. This search for simple models and the utility of physical intuition in his research characterized many of our conversations, even when I became more mathematically trained.

## Teaching is largely about conveying intuition.

Once a year, Gerry would load up a backpack with mason jars full of thousands of pennies, and bring them into his Statistical Mechanics class. This was needed for one of his favorite exercises -- a hands-on demonstration of the Law of Large Numbers and the Central Limit Theorem, which lie at the heart of thermodynamics and statistical mechanics. He would have students flip 100 coins and record the average, and then do it again and again, and have the class plot the distributions of results. The feedback he got was that this was a very good way of viscerally communicating the basics of statistical mechanics to students, because it built their intuition about how averages really worked. This approach has carried through to my own teaching and training efforts, where I always try to integrate hands-on practice with more theoretical discussion.
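The penny exercise is easy to imitate on a computer. The sketch below is not Gerry's classroom procedure, only a minimal stand-in that assumes fair coins and a fixed random seed; the function names are made up for illustration.

```python
import random
import statistics


def average_of_flips(n_coins, rng):
    """Flip n_coins fair coins and return the fraction that land heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_coins))
    return heads / n_coins


def repeat_experiment(n_coins, n_trials, seed=0):
    """Repeat the flip-and-average experiment n_trials times."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    return [average_of_flips(n_coins, rng) for _ in range(n_trials)]


def text_histogram(values, bin_width=0.02, scale=10):
    """Print a crude histogram: one row per bin, one '#' per `scale` results."""
    counts = {}
    for v in values:
        k = round(v / bin_width)  # integer bin index
        counts[k] = counts.get(k, 0) + 1
    for k in sorted(counts):
        print(f"{k * bin_width:4.2f} {'#' * (counts[k] // scale)}")


if __name__ == "__main__":
    for n_coins in (10, 100, 1000):
        averages = repeat_experiment(n_coins, n_trials=1000)
        print(n_coins, "coins: mean", round(statistics.mean(averages), 3),
              "spread", round(statistics.pstdev(averages), 3))
    text_histogram(repeat_experiment(100, n_trials=1000))
```

For fair coins the spread of the averages should shrink like 0.5 / sqrt(n), so about 0.05 for 100 coins, while the histogram takes on the familiar bell shape.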

## Benign neglect is a good default for mentoring.

Gerry was perhaps overly fond of the concept of "benign neglect" in parenting, in that much of my early upbringing was at the hands of my mother with only occasional input from him. However, in his oft-stated experience (and now mine as well), leaving smart graduate students and postdocs to their own devices most of the time was far better than trying to actively manage (or interfere in) their research for them. I think of it this way: if I tell my students what to do and I'm wrong (which is likely, research being research), then they either do it (and I suffer for having misdirected them) or they don't do it (and then I get upset at them for ignoring me). But if I don't tell my students what to do, then they usually figure out something better for themselves, or else get stuck and then come to me to discuss it. The latter two outcomes are much better from a mentoring perspective than the former two.

## Students need to figure it out for themselves

One of the most embarrassing (in retrospect) interactions I had with my father was during a long car ride where he tried to convince me that when x was a negative number, -x was positive. At the time, I didn't agree with this at all, which was probably because I was a stubborn 7-year-old. While it took me a few more years to understand this concept, by the time I was a math major I did have the concept down! Regardless, in this, and many other interactions around science, he never browbeat me about it or got upset at my stupidity or stubbornness. I believe this carried through to his interactions with his students. In fact, the only time I ever heard him express exasperation was with colleagues who were acting badly.

## A small nudge at the right moment is sometimes all that is needed.

A pivotal moment in my life came when Gerry introduced me to Mark Galassi, a physics graduate student who also was the systems administrator for the UNIX systems in the Institute for Theoretical Physics at Stony Brook; Mark found out I was interested in computers and gave me access to the computer system. This was one of the defining moments in my research life, as my research is entirely computational! Similarly, when I took a year off from college, my father put me in touch with Steve Koonin, who needed a systems administrator for a new project; I ended up working with the Earthshine project, which was a core part of my research for several years. And when I was trying to decide what grad schools to apply to, Gerry suggested I ask Hans Bethe and Steve Koonin what they thought was the most promising area of research for the future -- their unequivocal answer was "biology!" This drove me to apply to biology graduate schools, get a PhD in biology, and ultimately led to my current faculty position. In all these cases, I now recognize the application of a light touch at the right moment, rather than the heavy-handed guidance that he must have desperately wanted to give at times.

## Conclusions

There are many more personal stories that could be told about Gerry Brown, including his (several, and hilarious) interactions with the East German secret police during the Cold War, his (quite bad) jokes, his (quite good) cooking, and his (enthusiastic) ballroom dancing, but I will save those for another time. I hope that his friends and colleagues will see him in the examples above, and will remember him fondly.

## August 19, 2016

### Continuum Analytics news

Tuesday, August 16, 2016
Matthew Rocklin
Continuum Analytics

### Introduction

Institutions use software differently than individuals. Over the last few months I’ve had dozens of conversations about using Dask within larger organizations like universities, research labs, private companies, and non-profit learning systems. This post provides a very coarse summary of those conversations and extracts common questions. I’ll then try to answer those questions.

Note: some of this post will be necessarily vague at points. Some companies prefer privacy. All details here are either in public Dask issues or have come up with enough institutions (say at least five) that I’m comfortable listing the problem here.

### Common story

Institution X, a university/research lab/company/… has many scientists/analysts/modelers who develop models and analyze data with Python, the PyData stack like NumPy/Pandas/SKLearn, and a large amount of custom code. These models/data sometimes grow to be large enough to need a moderately large amount of parallel computing.

Fortunately, Institution X has an in-house cluster acquired for exactly this purpose of accelerating modeling and analysis of large computations and datasets. Users can submit jobs to the cluster using a job scheduler like SGE/LSF/Mesos/Other.

However the cluster is still under-utilized and the users are still asking for help with parallel computing. Either users aren’t comfortable using the SGE/LSF/Mesos/Other interface, it doesn’t support sufficiently complex/dynamic workloads, or the interaction times aren’t good enough for the interactive use that users appreciate.

There was an internal effort to build a more complex/interactive/Pythonic system on top of SGE/LSF/Mesos/Other but it’s not particularly mature and definitely isn’t something that Institution X wants to pursue. It turned out to be a harder problem than expected to design/build/maintain such a system in-house. They’d love to find an open source solution that was well featured and maintained by a community.

The Dask.distributed scheduler looks like it’s 90% of the system that Institution X needs. However there are a few open questions:

• How do we integrate dask.distributed with the SGE/LSF/Mesos/Other job scheduler?
• How can we grow and shrink the cluster dynamically based on use?
• How do users manage software environments on the workers?
• How secure is the distributed scheduler?
• What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?
• How do we handle multiple concurrent users and priorities?
• How does this compare with Spark?

So for the rest of this post I’m going to answer these questions. As usual, few of the answers will be of the form “Yes Dask can solve all of your problems.” These are open questions, not the questions that were easy to answer. We’ll get into what’s possible today and how we might solve these problems in the future.

### How do we integrate dask.distributed with SGE/LSF/Mesos/Other?

It’s not difficult to deploy dask.distributed at scale within an existing cluster using a tool like SGE/LSF/Mesos/Other. In many cases there is already a researcher within the institution doing this manually by running dask-scheduler on some static node in the cluster and launching dask-worker a few hundred times with their job scheduler and a small job script.

The goal now is to formalize this process for the individual version of SGE/LSF/Mesos/Other used within the institution while also developing and maintaining a standard Pythonic interface so that all of these tools can be maintained cheaply by Dask developers into the foreseeable future. In some cases Institution X is happy to pay for the development of a convenient “start dask on my job scheduler” tool, but they are less excited about paying to maintain it forever.

We want Python users to be able to say something like the following:

```python
# SGECluster is a proposed interface, not an existing class
c = SGECluster(nworkers=200, **options)
e = Executor(c)
```

… and have this same interface be standardized across different job schedulers.
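One way such a standardized interface could look is sketched below. Everything here is hypothetical: `JobSchedulerCluster`, its constructor arguments and the `scale_up` method are assumptions made for illustration, not part of dask.distributed. By default the sketch only records the commands it would submit, so it runs without a cluster; passing something like `subprocess.check_call` as `submit` would actually hand them to the job scheduler.

```python
from abc import ABC, abstractmethod


class JobSchedulerCluster(ABC):
    """Hypothetical common interface for launching dask-workers."""

    def __init__(self, scheduler_address, nworkers=0, submit=None):
        self.scheduler_address = scheduler_address
        self.submitted = []  # commands recorded in dry-run mode
        # `submit` receives one command as a list of strings
        self._submit = submit or self.submitted.append
        self.nworkers = 0
        self.scale_up(nworkers)

    @abstractmethod
    def worker_command(self):
        """Return the command that asks the job scheduler for one worker."""

    def scale_up(self, n):
        """Request n additional workers from the job scheduler."""
        for _ in range(n):
            self._submit(self.worker_command())
        self.nworkers += n


class SGECluster(JobSchedulerCluster):
    def worker_command(self):
        # "qsub -b y" submits a command directly rather than a job script
        return ["qsub", "-b", "y", "dask-worker", self.scheduler_address]


class LSFCluster(JobSchedulerCluster):
    def worker_command(self):
        return ["bsub", "dask-worker", self.scheduler_address]
```

Only `worker_command` differs between schedulers in this sketch, which is the property that would keep per-scheduler maintenance cheap.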

### How can we grow and shrink the cluster dynamically based on use?

Alternatively, we could have a single dask.distributed deployment running 24/7 that scales itself up and down dynamically based on current load. Again, this is entirely possible today if you want to do it manually (you can add and remove workers on the fly) but we should add some signals to the scheduler like the following:

• “I’ve been idling for a while, please reclaim workers”

and connect these signals to a manager that talks to the job scheduler. This removes an element of control from the users and places it in the hands of a policy that IT can tune to play more nicely with their other services on the same network.
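As a rough illustration of the kind of policy such a manager could apply, the sketch below turns two load measurements into a recommended change in worker count. The function name, its parameters, and the thresholds are all assumptions made for this example; none of them come from Dask.

```python
def recommend_worker_change(pending_tasks, nworkers, idle_seconds,
                            tasks_per_worker=10, idle_timeout=300,
                            min_workers=1, max_workers=100):
    """Return how many workers to add (positive) or remove (negative)."""
    # Scale up: aim for at most `tasks_per_worker` pending tasks per worker,
    # without exceeding the ceiling that IT has configured.
    if pending_tasks > nworkers * tasks_per_worker:
        target = -(-pending_tasks // tasks_per_worker)  # ceiling division
        return max(0, min(target, max_workers) - nworkers)
    # Scale down: nothing queued and idle long enough, so fall back to the minimum.
    if pending_tasks == 0 and idle_seconds >= idle_timeout:
        return min(0, min_workers - nworkers)
    # Otherwise leave the cluster as it is.
    return 0
```

The tunable arguments are where an IT policy would live: raising `idle_timeout` keeps workers around longer, while lowering `max_workers` caps how much of the shared cluster Dask may claim.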

### How do users manage software environments on the workers?

Today Dask assumes that all users and workers share the exact same software environment. There are some small tools to send updated .py and .egg files to the workers but that’s it.

Generally Dask trusts that the full software environment will be handled by something else. This might be a network file system (NFS) mount on traditional cluster setups, or it might be handled by moving docker or conda environments around by some other tool like knit for YARN deployments or something more custom. For example Continuum sells proprietary software that does this.

Getting the standard software environment set up generally isn’t such a big deal for institutions. They typically have some system in place to handle this already. Where things become interesting is when users want to use drastically different environments from the system environment, like using Python 2 vs Python 3 or installing a bleeding-edge scikit-learn version. They may also want to change the software environment many times in a single session.

The best solution I can think of here is to pass around fully downloaded conda environments using the dask.distributed network (it’s good at moving large binary blobs throughout the network) and then teach the dask-workers to bootstrap themselves within this environment. We should be able to tear everything down and restart things within a small number of seconds. This requires some work: first to make relocatable conda binaries (which is usually fine but is not always fool-proof due to links) and then to help the dask-workers learn to bootstrap themselves.

Somewhat related, Hussain Sultan of Capital One recently contributed a dask-submit command to run scripts on the cluster: http://distributed.readthedocs.io/en/latest/submitting-applications.html

### How secure is the distributed scheduler?

Dask.distributed is incredibly insecure. It allows anyone with network access to the scheduler to execute arbitrary code in an unprotected environment. Data is sent in the clear. Any malicious actor can both steal your secrets and then cripple your cluster.

This is entirely the norm however. Security is usually handled by other services that manage computational frameworks like Dask.

For example we might rely on Docker to isolate workers from destroying their surrounding environment and rely on network access controls to protect data access.

Because Dask runs on Tornado, a serious networking library and web framework, there are some things we can do easily, like enabling SSL, authentication, etc. However, I hesitate to jump into providing “just a little bit of security” without going all the way, for fear of providing a false sense of security. In short, I have no plans to work on this without a lot of encouragement. Even then I would strongly recommend that institutions couple Dask with tools intended for security. I believe that is common practice for distributed computational systems generally.

Workers can come and go. Clients can come and go. The state in the scheduler is currently irreplaceable and no attempt is made to back it up. There are a few things you could imagine here:

1. Backup state and recent events to some persistent storage so that state can be recovered in case of catastrophic loss
2. Have a hot failover node that gets a copy of every action that the scheduler takes
3. Have multiple peer schedulers operate simultaneously in a way that they can pick up slack from lost peers
4. Have clients remember what they have submitted and resubmit when a scheduler comes back online

Option 4 is currently the most feasible and gets us most of the way there. However, options 2 or 3 would probably be necessary if Dask were ever to run as critical infrastructure in a giant institution. We’re not there yet.
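Option 4 can be pictured with a small sketch: the client keeps its own record of what it submitted and replays that record against a fresh scheduler. `ResubmittingClient` and `ToyScheduler` are hypothetical stand-ins written for this example, not Dask classes, and the toy scheduler simply runs each task as soon as it arrives.

```python
class ResubmittingClient:
    """Hypothetical client that remembers its submissions so it can replay them."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.submitted = {}  # key -> (function, args), in submission order

    def submit(self, key, func, *args):
        self.submitted[key] = (func, args)
        self.scheduler.accept(key, func, args)

    def reconnect(self, new_scheduler):
        # The new scheduler starts empty, so replay everything we remember.
        self.scheduler = new_scheduler
        for key, (func, args) in self.submitted.items():
            new_scheduler.accept(key, func, args)


class ToyScheduler:
    """Stand-in for a real scheduler: runs tasks immediately, keeps results in memory."""

    def __init__(self):
        self.results = {}

    def accept(self, key, func, args):
        self.results[key] = func(*args)


client = ResubmittingClient(ToyScheduler())
client.submit("x", sum, [1, 2, 3])
client.submit("y", max, [4, 9, 2])

replacement = ToyScheduler()  # the old scheduler's state is gone
client.reconnect(replacement)
```

The cost of this approach is recomputation: the replacement scheduler has to run everything again, which is why it only gets most of the way there.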

As of recent work spurred on by Stefan van der Walt at UC Berkeley/BIDS the scheduler can now die and come back and everyone will reconnect. The state for computations in flight is entirely lost but the computational infrastructure remains intact so that people can resubmit jobs without significant loss of service.

Dask has a bit of a harder time with this topic because it offers a persistent stateful interface. This problem is much easier for distributed database projects that run ephemeral queries off of persistent storage, return the results, and then clear out state.

### What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?

The short answer is no. Other than number of cores and available RAM all workers are considered equal to each other (except when the user explicitly specifies otherwise).

However this problem and problems like it have come up a lot lately. Here are a few examples of similar cases:

1. Multiple data centers geographically distributed around the country
2. Multiple racks within a single data center
3. Multiple workers that have GPUs that can move data between each other easily
4. Multiple processes on a single machine

Having some notion of hierarchical worker group membership or inter-worker preferred relationships is probably inevitable long term. As with all distributed scheduling questions the hard part isn’t deciding that this is useful, or even coming up with a sensible design, but rather figuring out how to make decisions on the sensible design that are foolproof and operate in constant time. I don’t personally see a good approach here yet but expect one to arise as more high priority use cases come in.
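To make the idea of hierarchical group membership concrete, here is a minimal sketch in which each worker carries a location tuple ordered from coarse to fine. The function names, the three-level hierarchy, and the cost values are assumptions chosen for illustration only.

```python
def transfer_cost(location_a, location_b, level_costs=(100, 10, 1)):
    """Estimate the relative cost of moving data between two workers.

    Locations are tuples ordered from coarse to fine, for example
    (data_center, rack, machine). The cost is set by the coarsest
    level at which the two locations differ.
    """
    for a, b, cost in zip(location_a, location_b, level_costs):
        if a != b:
            return cost
    return 0  # same machine


def cheapest_worker(data_location, workers):
    """Pick the worker whose location is cheapest to reach from the data."""
    return min(workers, key=lambda name: transfer_cost(data_location, workers[name]))


workers = {
    "w1": ("east", "rack1", "host1"),
    "w2": ("east", "rack2", "host7"),
    "w3": ("west", "rack1", "host3"),
}
```

Note that `cheapest_worker` scans every worker, so it runs in time linear in the cluster size. That is exactly the kind of per-decision cost the scheduler cannot afford, which is why the hard part is the constant-time decision rather than the model itself.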

### How do we handle multiple concurrent users and priorities?

There are several sub-questions here:

• Can multiple users use Dask on my cluster at the same time?

Yes, either by spinning up separate scheduler/worker sets or by sharing the same set.

• If they’re sharing the same workers then won’t they clobber each other’s data?

This is very unlikely. Dask is careful about naming tasks, so it’s very unlikely that the two users will submit conflicting computations that compute to different values but occupy the same key in memory. However if they both submit computations that overlap somewhat then the scheduler will nicely avoid recomputation. This can be very nice when you have many people doing slightly different computations on the same hardware. This works in the same way that Git works.

• If they’re sharing the same workers then won’t they clobber each other’s resources?

Yes, this is definitely possible. If you’re concerned about this then you should give everyone their own scheduler/workers (which is easy and standard practice). There is not currently much user management built into Dask.
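The point about task naming can be sketched with content-derived keys: if a key is computed from the function and its inputs, two users who submit the same work get the same key and the result is computed once. Dask has its own machinery for this; the version below uses only the standard library, and `task_key`, `run`, and `total` are hypothetical names for this example.

```python
import hashlib
import pickle


def task_key(func, *args):
    """Derive a deterministic key from a function's identity and its arguments."""
    payload = pickle.dumps((func.__module__, func.__qualname__, args))
    return func.__name__ + "-" + hashlib.sha1(payload).hexdigest()


cache = {}   # shared result store, keyed by task key
calls = []   # keys that actually triggered a computation


def run(func, *args):
    key = task_key(func, *args)
    if key not in cache:  # identical work from another user is reused
        calls.append(key)
        cache[key] = func(*args)
    return cache[key]


def total(values):
    return sum(values)


alice = run(total, (1, 2, 3))
bob = run(total, (1, 2, 3))    # same function and inputs: served from the cache
carol = run(total, (1, 2, 4))  # different inputs: new key, new computation
```

This mirrors the Git comparison above: identical content maps to an identical name, so sharing is safe and duplicates cost nothing extra.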

### How does this compare with Spark?

At an institutional level Spark seems to primarily target ETL + Database-like computations. While Dask modules like Dask.bag and Dask.dataframe can happily play in this space this doesn’t seem to be the focus of recent conversations.

Recent conversations are almost entirely around supporting interactive custom parallelism (lots of small tasks with complex dependencies between them) rather than the big Map->Filter->Groupby->Join abstractions you often find in a database or Spark. That’s not to say that these operations aren’t hugely important; there is a lot of selection bias here. The people I talk to are people for whom Spark/Databases are clearly not an appropriate fit. They are tackling problems that are way more complex, more heterogeneous, and with a broader variety of users.

I usually describe this situation with an analogy comparing “Big data” systems to human transportation mechanisms in a city. Here we go:

• A Database is like a train: it goes between a set of well defined points with great efficiency, speed, and predictability. These are popular and profitable routes that many people travel between (e.g. business analytics). You do have to get from home to the train station on your own (ETL), but once you’re in the database/train you’re quite comfortable.
• Spark is like an automobile: it takes you door-to-door from your home to your destination with a single tool. While this may not be as fast as the train for the long-distance portion, it can be extremely convenient to do ETL, Database work, and some machine learning all from the comfort of a single system.
• Dask is like an all-terrain-vehicle: it takes you out of town on rough ground that hasn’t been properly explored before. This is a good match for the Python community, which typically does a lot of exploration into new approaches. You can also drive your ATV around town and you’ll be just fine, but if you want to do thousands of SQL queries then you should probably invest in a proper database or in Spark.

Again, there is a lot of selection bias here. If what you want is a database then you should probably get a database. Dask is not a database.

This is also wildly over-simplifying things. Databases like Oracle have lots of ETL and analytics tools, Spark is known to go off road, etc. I obviously have a bias towards Dask. You really should never trust an author of a project to give a fair and unbiased view of the capabilities of the tools in the surrounding landscape.

### Conclusion

That’s a rough sketch of current conversations and open problems for “How Dask might evolve to support institutional use cases.” It’s really quite surprising just how prevalent this story is among the full spectrum from universities to hedge funds.

The problems listed above are by no means halting adoption. I’m not listing the 100 or so questions that are answered with “yes, that’s already supported quite well”. Right now I’m seeing Dask being adopted by individuals and small groups within various institutions. Those individuals and small groups are pushing that interest up the stack. It’s still several months before any 1000+ person organization adopts Dask as infrastructure, but the speed at which momentum is building is quite encouraging.

I’d also like to thank the several nameless people who exercise Dask on various infrastructures at various scales on interesting problems and have reported serious bugs. These people don’t show up on the GitHub issue tracker but their utility in flushing out bugs is invaluable.

As interest in Dask grows it’s interesting to see how it will evolve. Culturally Dask has managed to simultaneously cater to both the open science crowd as well as the private-sector crowd. The project gets both financial support and open source contributions from each side. So far there hasn’t been any conflict of interest (everyone is pushing in roughly the same direction) which has been a really fruitful experience for all involved I think.

This post was originally published by Matt Rocklin on his website, matthewrocklin.com

#### Mining Data Science Treasures with Open Source

Wednesday, August 17, 2016
Travis Oliphant
Chief Executive Officer & Co-Founder
Continuum Analytics


Data Science is a goldmine of potential insights for any organization, but unearthing those insights can be resource-intensive, requiring systems and teams to work seamlessly and effectively as a unit.

Integrating resources isn’t easy. Traditionally, businesses chose vendors with all-in-one solutions to cover the task. This approach may seem convenient, but what freedoms must be sacrificed in order to achieve it?

This vendor relationship resembles the troubled history of coal mining towns in the Old West, where one company would own everything for sale in the town. Workers were paid with vouchers that could only be redeemed at company-owned shops.

The old folk tune "Sixteen Tons" stated it best: "I owe my soul to the company store."

As with any monopoly, these vendors have no incentive to optimize products and services. There's only one option available: take it or leave it. But, just as some miners would leave these towns and make their way across the Wild West, many companies have chosen to forge their own trails with the freedom of Open Data Science, and they've never looked back.

### Open Data Science: Providing Options

Innovation and flexibility are vital to the evolving field of Data Science, so any alternative to the locked-in vendor approach is attractive. Fortunately, Open Data Science provides the perfect ecosystem of options for true innovation.

Sometimes vendors provide innovation, such as with the infrastructure surrounding linear programming. This doesn’t mean they’re able to provide an out-of-the-box solution for all teams — adapting products to different businesses and industries requires work.

Most of the real innovation is emerging from the open source world. The tremendous popularity of Python and R, for example, bolsters innovation on all kinds of analytics approaches.

Given the wish to avoid a “mining town scenario” and the burgeoning innovation in Open Data Science, why are so many companies still reluctant to adopt it?

### Companies Should Not Hesitate to Embrace Open Source

There are several reasons companies balk at Open Data Science solutions:

• Licensing. Open source has many licenses: Apache, BSD, GPL, MIT, etc. This wide array of choices can produce analysis paralysis. In some cases, such as GPL, there is a requirement to make source code available for redistribution.
• Diffuse contact. Unlike with vendor products, open source doesn’t provide a single point of contact. It’s a non-hierarchical effort. Companies have to manage keeping software current, and this can feel overwhelming without a steady guide they can rely on.
• Education. With rapid change, companies find it difficult to stay on top of the many acronyms, project names, and new techniques required with each piece of open source software.

Fortunately, these concerns are completely surmountable. Most licenses are appropriate for commercial applications, and many companies are finding open source organizations to act as a contact point within the Open Data Science world — with the added benefit of a built-in guide to the ever-changing landscape of open source, thereby also solving the problem of education.

### The Best Approach for Starting an Open Data Science Initiative

There are several tactics organizations can use to effectively adopt Open Data Science.

For instance, education backed by a serious training program is crucial. One focus here has to be on reproducibility. Team members should know how the latest graph was produced and how to generate the next iteration of it.

Much of this requires understanding the architecture of the applications one is using, so dependency management is important. Anything that makes the process transparent to the team will promote understanding and engagement.

Flexible governance models are also valuable, allowing intelligent assessment of open source pros and cons. For example, it shouldn’t be difficult to create an effective policy on what sort of open source licenses work best.

Finally, committing resources to successful adaptation and change management should be central to any Open Data Science arrangement. This will almost always require coding to integrate solutions into one’s workflow. But this effort also supports retention: companies that shun open source risk losing developers who seek the cutting edge.

### Tackle the Future Before You're Left Behind

Unlike the old mining monopolies, Data Science is a rapidly changing field with many competing participants. Companies that do not commit resources to education, understanding, governance and change management risk getting left behind as newer companies commit fully to open source.

Well-managed open source analytics environments, like Anaconda, provide a compatible and updated suite of modern analytics programs. Combine this with a steady guide to traverse the changing landscape of Open Data Science, and the Data Science gold mine becomes ripe for the taking.