September 23, 2016

Continuum Analytics news

Why Your Company Needs a Chief Data Science Officer

Friday, September 23, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

This article was originally posted in CMSWire and has been edited for length and clarity. 

Ten years ago, the Chief Data Science Officer (CDSO) role was non-existent. It came into being when D.J. Patil was named the first US Chief Data Scientist by President Obama in 2015. An outgrowth of the Chief Technology Officer (CTO) role, which is responsible for scientific and technological matters within an organization, including the company's hardware and software, the CDSO takes technology within the enterprise to a whole new level. Companies have been motivated to get on the data train since 1990, when they began implementing big data collection.

However, making sense of the data was a challenge. With no dedicated person to own and manage the huge piles of datasets being collected, organizations began to flail and sink under the weight of all their data. It didn’t happen overnight, but data scientists and, subsequently, the role of the CDSO, came to life once companies realized that proper data analysis was key to finding correlations needed to spot business trends and, ultimately, exploit the power of big data to deliver value.

The CDSO role confirms the criticality of collecting data properly to capitalize on it and make certain it is stored securely in the event of a disaster or emergency (some businesses have yet to recover data following Hurricane Sandy).

Fast forward to 2016. Big data has exploded, but companies are still struggling with how best to organize around it — as an activity, a business function and a capability.

But what exactly can we achieve with it?

CDSOs (and their teams of data scientists) provide the skills needed to apply analytics to the business, explain how to use data to create a competitive advantage and surpass competitors, and understand how to find true value in data by acting on it.

Empowering Data Science Teams

Today, businesses are equipped with data science teams made up of a variety of roles––business analysts, machine learning experts, data engineers and more.

With the CDSO at the helm, the data science team can collaborate and centralize these skills, becoming a hub of intelligence and adding value to each business they serve. With a multifaceted perspective on data science as a whole, the CDSO allows for more innovative ideas and solutions for companies.

Staying Cost Efficient

It’s no secret that how businesses handle data has a direct impact on the bottom line. An interesting example occurred at DuPont, a company that defines itself as “a science company dedicated to solving challenging global problems” and is well known for its distribution of Corian solid surface countertops across the world. When asked if it believed it was covering its entire total addressable market (TAM), company executives were definitive in their response: a resounding yes.

Executives knew they had covered every region in the market and had great insight into analytics via distributors. What they hadn't taken into consideration, however, was the vast amount of data embedded within end-customer insights. Without knowing exactly where the product was being installed, DuPont literally had no insight into the locations it had not yet saturated.

DuPont took this information and created countertops that embedded sensors driven by Internet of Things (IoT) technology. By not simply relying on the data provided by its suppliers, DuPont seized the opportunity to increase its pool of knowledge significantly, by adding data science into its product.

This is just one example of how data science and the CDSO can implement previously non-existent processes and drive increased business intelligence in the most beneficial way –– with increased value to its re-sellers and a direct impact on revenue.

Changing the World

There is no room for doubt: innovation in the field of Open Data Science has created the need for a CDSO who can derive as much value from data as possible and help companies make an impact on the world.

John Deere, a 180-year-old company, is now revolutionizing farming by incorporating "smart farms." Big data and IoT solutions allow farmers to make educated decisions based on real-time analysis of captured data. Putting its data to good use has resulted in positive changes industry-wide and, in many areas, worldwide, which is another reason why technology driven by the CDSO is an integral part of any organization.

The need for an executive-level decision maker proves to be an essential piece of the puzzle. The CDSO deserves a seat at the executive table to empower data science teams, drive cost efficiency and previously unimagined results and, most importantly, help companies change the world.

 

by swebster at September 23, 2016 03:35 PM

September 22, 2016

Matthew Rocklin

Dask Cluster Deployments

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

All code in this post is experimental. It should not be relied upon. For people looking to deploy dask.distributed on a cluster, please refer to the documentation instead.

Dask is deployed today on the following systems in the wild:

  • SGE
  • SLURM
  • Torque
  • Condor
  • LSF
  • Mesos
  • Marathon
  • Kubernetes
  • SSH and custom scripts
  • … there may be more. This is what I know of first-hand.

These systems provide users access to cluster resources and ensure that many distributed services / users play nicely together. They’re essential for any modern cluster deployment.

The people deploying Dask on these cluster resource managers are power-users; they know how their resource managers work and they read the documentation on how to set up Dask clusters. Generally these users are pretty happy; however, we should reduce this barrier so that non-power-users with access to a cluster resource manager can use Dask on their cluster just as easily.

Unfortunately, there are a few challenges:

  1. Several cluster resource managers exist, each with significant adoption. Finite developer time stops us from supporting all of them.
  2. Policies for scaling out vary widely. For example we might want a fixed number of workers, or we might want workers that scale out based on current use. Different groups will want different solutions.
  3. Individual cluster deployments are highly configurable. Dask needs to get out of the way quickly and let existing technologies configure themselves.

This post talks about some of these issues. It does not contain a definitive solution.

Example: Kubernetes

For example, both Olivier Grisel (INRIA, scikit-learn) and Tim O'Donnell (Mount Sinai, Hammer lab) publish instructions on how to deploy Dask.distributed on Kubernetes.

These instructions are well organized. They include Dockerfiles, published images, Kubernetes config files, and instructions on how to interact with cloud providers’ infrastructure. Olivier and Tim both obviously know what they’re doing and care about helping others to do the same.

Tim (who came second) wasn’t aware of Olivier’s solution and wrote up his own. Tim was capable of doing this but many beginners wouldn’t be.

One solution would be to include a prominent registry of solutions like these within the Dask documentation so that people can find quality references to use as starting points. I've started a list of resources at dask/distributed #547; comments pointing to other resources would be most welcome.

However, even if Tim did find Olivier’s solution I suspect he would still need to change it. Tim has different software and scalability needs than Olivier. This raises the question of “What should Dask provide and what should it leave to administrators?” It may be that the best we can do is to support copy-paste-edit workflows.

What is Dask-specific, resource-manager specific, and what needs to be configured by hand each time?

Adaptive Deployments

In order to explore this topic of separable solutions I built a small adaptive deployment system for Dask.distributed on Marathon, an orchestration platform on top of Mesos.

This solution does two things:

  1. It scales a Dask cluster dynamically based on the current use. If there are more tasks in the scheduler then it asks for more workers.
  2. It deploys those workers using Marathon.

To encourage replication, these two different aspects are solved in two different pieces of code with a clean API boundary.

  1. A backend-agnostic piece for adaptivity that says when to scale workers up and how to scale them down safely
  2. A Marathon-specific piece that deploys or destroys dask-workers using the Marathon HTTP API

This combines a policy, adaptive scaling, with a backend, Marathon, such that either can be replaced easily. For example, we could replace the adaptive policy with a fixed one to always keep N workers online, or we could replace Marathon with Kubernetes or Yarn.

My hope is that this demonstration encourages others to develop third party packages. The rest of this post will be about diving into this particular solution.

Adaptivity

The distributed.deploy.Adaptive class wraps around a Scheduler and determines when we should scale up (and by how many nodes) and when we should scale down (specifying which idle workers to release).

The current policy is fairly straightforward:

  1. If there are unassigned tasks or any stealable tasks and no idle workers, or if the average memory use is over 50%, then increase the number of workers by a fixed factor (defaults to two).
  2. If there are idle workers and the average memory use is below 50%, then reclaim the idle workers with the least data on them (after moving their data to nearby workers) until we're near 50%.

Think this policy could be improved or have other thoughts? Great. It was easy to implement and entirely separable from the main code so you should be able to edit it easily or create your own. The current implementation is about 80 lines (source).
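
As a rough sketch of the shape of such a policy (this is not the actual Adaptive implementation; the scheduler attributes, helper names and the 50% threshold below are placeholders standing in for the real bookkeeping):

def should_scale_up(scheduler, memory_threshold=0.50):
    # Placeholder logic: pending or stealable work with no idle workers,
    # or average memory use above the threshold
    pending = scheduler.n_unassigned_tasks > 0 or scheduler.n_stealable_tasks > 0
    no_idle = scheduler.n_idle_workers == 0
    memory_pressure = scheduler.average_memory_fraction > memory_threshold
    return (pending and no_idle) or memory_pressure

def workers_to_retire(scheduler, memory_threshold=0.50):
    # Placeholder logic: while memory use is comfortably below the threshold,
    # release idle workers holding the least data (their data moves to peers first)
    if scheduler.average_memory_fraction >= memory_threshold:
        return []
    return sorted(scheduler.idle_workers, key=lambda w: w.data_bytes)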

However, this Adaptive class doesn’t actually know how to perform the scaling. Instead it depends on being handed a separate object, with two methods, scale_up and scale_down:

class MyCluster(object):
    def scale_up(self, n):
        """
        Bring the total count of workers up to ``n``

        This function/coroutine should bring the total number of workers up to
        the number ``n``.
        """
        raise NotImplementedError()

    def scale_down(self, workers):
        """
        Remove ``workers`` from the cluster

        Given a list of worker addresses this function should remove those
        workers from the cluster.
        """
        raise NotImplementedError()

This cluster object contains the backend-specific bits of how to scale up and down, but none of the adaptive logic of when to scale up and down. The single-machine LocalCluster object serves as reference implementation.

So we combine this adaptive scheme with a deployment scheme. We'll use a tiny Dask-Marathon deployment library, available here.

from dask_marathon import MarathonCluster
from distributed import Scheduler
from distributed.deploy import Adaptive

s = Scheduler()
mc = MarathonCluster(s, cpus=1, mem=4000,
                     docker_image='mrocklin/dask-distributed')
ac = Adaptive(s, mc)

This combines a policy, Adaptive, with a deployment scheme, Marathon, in a composable way. The Adaptive cluster watches the scheduler and calls the scale_up/down methods on the MarathonCluster as necessary.

Marathon code

Because we’ve isolated all of the “when” logic to the Adaptive code, the Marathon specific code is blissfully short and specific. We include a slightly simplified version below. There is a fair amount of Marathon-specific setup in the constructor and then simple scale_up/down methods below:

import uuid

from marathon import MarathonClient, MarathonApp
from marathon.models.container import MarathonContainer


class MarathonCluster(object):
    def __init__(self, scheduler,
                 executable='dask-worker',
                 docker_image='mrocklin/dask-distributed',
                 marathon_address='http://localhost:8080',
                 name=None, cpus=1, mem=4000, **kwargs):
        self.scheduler = scheduler

        # Create Marathon App to run dask-worker
        args = [
            executable,
            scheduler.address,
            '--nthreads', str(cpus),
            '--name', '$MESOS_TASK_ID',  # use Mesos task ID as worker name
            '--worker-port', '$PORT_WORKER',
            '--nanny-port', '$PORT_NANNY',
            '--http-port', '$PORT_HTTP'
        ]

        ports = [{'port': 0,
                  'protocol': 'tcp',
                  'name': name}
                 for name in ['worker', 'nanny', 'http']]

        args.extend(['--memory-limit',
                     str(int(mem * 0.6 * 1e6))])

        kwargs['cmd'] = ' '.join(args)
        container = MarathonContainer({'image': docker_image})

        app = MarathonApp(instances=0,
                          container=container,
                          port_definitions=ports,
                          cpus=cpus, mem=mem, **kwargs)

        # Connect and register app
        self.client = MarathonClient(marathon_address)
        self.app = self.client.create_app(name or 'dask-%s' % uuid.uuid4(), app)

    def scale_up(self, instances):
        self.client.scale_app(self.app.id, instances=instances)

    def scale_down(self, workers):
        for w in workers:
            self.client.kill_task(self.app.id,
                                  self.scheduler.worker_info[w]['name'],
                                  scale=True)

This isn't trivial; you need to know about Marathon for this to make sense, but fortunately you don't need to know much else. My hope is that people familiar with other cluster resource managers will be able to write similar objects and will publish them as third party libraries, as I have with this Marathon solution here: https://github.com/mrocklin/dask-marathon (thanks go to Ben Zaitlen for setting up a great testing harness for this and getting everything started).

Adaptive Policies

Similarly, we can design new policies for deployment. You can read more about the policies for the Adaptive class in the documentation or the source (about eighty lines long). I encourage people to implement and use other policies and contribute back those policies that are useful in practice.
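
For example, the fixed policy mentioned earlier (always keep N workers online) could be sketched against the same cluster interface. Everything below is illustrative and not part of distributed itself:

class FixedSize(object):
    """Toy policy: keep exactly ``n`` workers online, regardless of load."""
    def __init__(self, cluster, n):
        self.cluster = cluster
        self.n = n

    def adjust(self, current_workers):
        # ``current_workers`` is the list of worker addresses currently registered
        if len(current_workers) < self.n:
            self.cluster.scale_up(self.n)
        elif len(current_workers) > self.n:
            self.cluster.scale_down(current_workers[self.n:])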

Final thoughts

We laid out a problem

  • How does a distributed system support a variety of cluster resource managers and a variety of scheduling policies while remaining sensible?

We proposed two solutions:

  1. Maintain a registry of links to solutions, supporting copy-paste-edit practices
  2. Develop an API boundary that encourages separable development of third party libraries.

It's not clear that either solution is sufficient, or that the current implementation of either solution is any good. This is an important problem though, as Dask.distributed is, today, still mostly used by super-users. I would like to engage community creativity here as we search for a good solution.

September 22, 2016 12:00 AM

September 20, 2016

Matthieu Brucher

Audio Toolkit: Handling denormals

While following a discussion on KVR, I thought about adding support for denormals handling in Audio Toolkit.

What are denormals?

Denormals, or denormal numbers, are numbers that can't be represented the "usual" way in floating point representation. When this happens, the floating point units can't be as fast as with the usual representation. These numbers are very small, almost 0, but not exactly 0. This can often happen in audio processing at the end of processing a clip, and sometimes during computation for a handful of values.

In the past, on AMD CPUs, the FPU would even use the denormal process for bigger values than on Intel CPUs, which led to poorer performance. This doesn't happen anymore AFAIK, but if your application is slow, you may want to take a look at a profile and determine if you have an issue there. Denormal behavior can be detected by an abnormal ratio of floating point operations per cycle (the number is too low).

Flush to zero on different platforms

There are different ways of avoiding denormals. A not-so-good one is to add background noise to the operations. The issues are deciding what amount to add (random or constant) and the fact that not all algorithms can handle it.

The better solution is to use the CPU facilities for this. x86 processors have internal flags that can be used to flush denormals to zero. Unfortunately, the API is different on every platform.

On Windows, the following function is used:

_controlfp_s(&previous_state, _MCW_DN, _DN_FLUSH);

On Linux, as an extension of C99, gcc added this function:

_mm_setcsr(_mm_getcsr() | (_MM_DENORMALS_ZERO_ON));

And finally, OS X has yet a different way of doing things.

fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);

The ARM platform is different again. The default compiler also has an API, but if you are compiling with GCC, flush to zero is activated through the command line option -funsafe-math-optimizations.

Using flush to zero

These functions need to be called to set the state before processing anything, and then called again afterwards to restore the state as it was before. This ensures that calling code sees the same FPU state. What your functions can handle (arbitrary noise for small values) may not be acceptable for other applications.

The overall performance gain may not be impressive. Using these functions to change the FPU state means that there is an overhead (perhaps not an important one, but an overhead nonetheless) and that the algorithms will behave slightly differently. So flushing to zero is a compromise.

Conclusion

Flushing denormals to zero may not be mandatory, but having the option to enable it is neat. So this is now available in Audio Toolkit 1.3.2.


by Matt at September 20, 2016 07:11 AM

September 16, 2016

Enthought

Canopy Data Import Tool: New Updates

In May of 2016 we released the Canopy Data Import Tool, a significant new feature of our Canopy graphical analysis environment software. With the Data Import Tool, users can now quickly and easily import CSVs and other structured text files into Pandas DataFrames through a graphical interface, manipulate the data, and create reusable Python scripts to speed future data wrangling.

Watch a 2-minute demo video to see how the Canopy Data Import Tool works:

With the latest version of the Data Import Tool released this month (v. 1.0.4), we’ve added new capabilities and enhancements, including:

  1. The ability to select and import a specific table from among multiple tables on a webpage,
  2. Intelligent alerts regarding the saved state of exported Python code, and
  3. Unlimited file sizes supported for import.

Download Canopy and start a free 7 day trial of the data import tool

New: Choosing from multiple tables on a webpage

Example of page with multiple tables for selection

The latest release of the Canopy Data Import Tool supports the selection of a specific table from a webpage for import, such as this Wikipedia page.

In addition to CSVs and structured text files, the Canopy Data Import Tool (the Tool) provides the ability to load tables from a webpage. If the webpage contains multiple tables, by default the Tool loads the first table.

With this release, we provide the user with the ability to choose from multiple tables to import using a scrollable index parameter to select the table of interest for import.

Example: loading and working with tables from a Wikipedia page

For example, let’s try to load a table from the Demography of the UK wiki page using the Tool. In total, there are 10 tables on that wiki page.

  • As you can see in the screenshot below, the Tool initially loads the first table on the wiki page.
  • However, we are interested in loading the table 'Vital statistics since 1960', which is the fifth table on the page (note that indexing starts at 0; for a quick history lesson on why Python uses zero-based indexing, see Guido van Rossum's explanation here).
  • After the initial read-in, we can click on the ‘Table index on page’ scroll bar, choose ‘4’ and click on ‘Refresh Data’ to load the table of interest in the Data Import Tool.

See how the Canopy Data Import Tool loads a table from a webpage and prepares the data for manipulation and interaction:

The Data Import Tool allows you to select a specific table from a webpage where multiple are present, with a simple drop-down menu. Once you've selected your table, you can readily toggle between 3 views: the Pandas DataFrame generated by the Tool, the raw data, and the corresponding auto-generated Python code (a rough plain-pandas equivalent is sketched after the list below). Subsequently, you can export the DataFrame to the IPython console for plotting and further analysis.

  • Further, as you can see, the first row contains column names and the first column looks like an index for the Data Frame. Therefore, you can select the ‘First row is column names’ checkbox and again click on ‘Refresh Data’ to prompt the Tool to re-read the table but, this time, use the data in the first row as column names. Then, we can right-click on the first column and select the ‘Set as Index’ option to make column 0 the index of the DataFrame.
  • You can toggle between the DataFrame, Raw Data and Python Code tabs in the Tool, to peek at the raw data being loaded by the Tool and the corresponding Python code auto-generated by the Tool.
  • Finally, you can click on the ‘Use DataFrame’ button, in the bottom right, to send the DataFrame to the IPython kernel in the Canopy User Environment, for plotting and further analysis.
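
For reference, the code the Tool generates corresponds closely to what you could write by hand with pandas. A rough equivalent of the steps above (the URL and table index come from the example; the exact generated code may differ) is:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom'

# read_html returns one DataFrame per table on the page;
# index 4 is the fifth table, 'Vital statistics since 1960'
df = pd.read_html(url, header=0)[4]

# Use the first column as the index of the DataFrame
df = df.set_index(df.columns[0])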

New: Keeping track of exported Python scripts

The Tool generates Python commands for all operations performed by the user and provides the ability to save the generated Python script. With this new update, the Tool keeps track of the saved and current states of the generated Python script and intelligently alerts the user if they click on the 'Use DataFrame' button without saving changes in the Python script.

New: Unlimited file sizes supported for import

In the initial release, we chose to limit the file sizes that can be imported using the Tool to 70 MB, to ensure optimal performance. With this release, we removed that restriction and allow files of any size to be uploaded with the tool. For files over 70 MB we now provide the user with a warning that interaction, manipulation and operations on the imported Data Frame might be slower than normal, and allow them to select whether to continue or begin with a smaller subset of data to develop a script to be applied to the larger data set.

Additions and Fixes

Along with the feature additions discussed above, based on continued user feedback, we implemented a number of UI/UX improvements and bug fixes in this release. For a complete list of changes introduced in version 1.0.4 of the Data Import Tool, please refer to the Release Notes page in the Tool’s documentation. If you have any feedback regarding the Data Import Tool, we’d love to hear from you at canopy.support@enthought.com.

Additional resources:

Download Canopy and start a free 7 day trial of the data import tool

See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging:

 

by admin at September 16, 2016 04:47 PM

September 14, 2016

Continuum Analytics news

Working Efficiently with Big Data in Text Formats Using Free Software

Monday, September 12, 2016
David Mertz
Continuum Analytics

One of our first commercial software products at Continuum Analytics was a product called IOPro which we have sold continuously since 2012. Now, we will be releasing the code under a liberal open source license. 

Following the path of widely adopted projects like conda, Blaze, Dask, odo, Numba, Bokeh, datashader, DataShape, DyND and other software that Continuum has created, we hope that the code in IOPro becomes valuable to open source communities and data scientists worldwide.

However, we do not only hope this code is useful to you—we also hope you and your colleagues will be able to enhance, refine and develop the code further to increase its utility for the entire Python world.

For existing IOPro customers, we will be providing a free of charge license upon renewal until we release the open source version. 

What IOPro Does

IOPro loads NumPy arrays and pandas DataFrames directly from files, SQL databases and NoSQL stores—including ones with millions (or billions) of rows. It provides a drop-in replacement for NumPy data loading functions but dramatically improves performance and starkly reduces memory overhead.

The key concept in our code is that we access data via adapters which are like enhanced file handles or database cursors.  An adapter does not read data directly into memory, but rather provides a mechanism to use familiar NumPy/pandas slicing syntax to load manageable segments of a large dataset.  Moreover, an adapter provides fine-grained control over exactly how data is eventually read into memory, whether using custom patterns for how a line of data is parsed, choosing the precise data type of a textually represented number, or exposing data as "calculated fields" (that is, "virtual columns").

As well as local CSV, JSON or other textual data sources, IOPro can load data from Amazon S3 buckets.  When accessing large datasets—especially ones too large to load into memory—from files that do not have fixed record sizes, IOPro's indexing feature allows users to seek to a specific collection of records tens, hundreds or thousands of times faster than is possible with a linear scan.
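
To make the adapter pattern concrete, here is a rough sketch of the kind of usage IOPro supports. The file names are made up and the method names are approximate recollections of the IOPro API, so treat this as illustrative only and consult the TextAdapter documentation for the real interface:

import iopro

# Open a large CSV through an adapter; nothing is read into memory yet
adapter = iopro.text_adapter('transactions.csv', parser='csv')

# Control how textual fields are typed when they are materialized
adapter.set_field_types({0: 'u8', 1: 'f8'})

# Familiar slicing syntax pulls only a manageable segment into a NumPy array
first_chunk = adapter[0:100000]

# Build an index once, then seek directly to distant records later
# without a linear scan through the whole file
adapter.create_index('transactions.idx')
middle_chunk = adapter[5000000:5001000]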

Our Release Schedule

The initial release of our open source code will be of the TextAdapter component that makes up the better part of the code in IOPro.  This code will be renamed, straightforwardly enough, as TextAdapter. The project will live at https://github.com/ContinuumIO/TextAdapter.  We will make this forked project available by October 15, 2016 under a BSD 3-Clause License. 

Additionally, we will release the database adapters by December 31, 2016. That project will live at https://github.com/ContinuumIO/DBAdapter.

If you are a current paid customer of IOPro, and are due for renewal before January 1, 2017, your Anaconda Ambassador will get in touch with you to provide a license free of charge, so you do not experience any downtime. 

Thank you to prior contributors at Continuum, especially Jay Bourque (jayvius), but notably also Francesc Alted (FrancescAlted), Óscar Villellas Guillén (ovillellas), Michael Kleehammer (mkleehammer) and Ilan Schnell (ilanschnell) for their wonderful contributions.  Any remaining bugs are my responsibility alone as current maintainer of the project.

The Blaze Ecosystem

As part of the open source release of TextAdapter, we plan to integrate TextAdapter into the Blaze ecosystem.  Blaze itself, as well as odo, provides translation between data formats and querying of data within a large variety of formats. Putting TextAdapter clearly in this ecosystem will let an adapter act as one such data format, and hence leverage the indexing speedups and data massaging that TextAdapter provides.

Other Open Source Tools

Other open source projects for interacting with large datasets provide either competitors or collaborative capabilities.  

  • ParaText from Wise Technology looks like a very promising approach to accelerating raw reads of CSV data. It doesn't currently provide regular expression matching or as rich data typing as IOPro, but the raw reads are shockingly fast. Most importantly, perhaps, ParaText does not address indexing, so as fast as it is at a linear scan, it remains stuck with big-O inefficiencies that TextAdapter addresses. I personally think that (optionally) utilizing the underlying reader of ParaText as a layer underneath TextAdapter would be a wonderful combination. Information about ParaText can be found at http://www.wise.io/tech/paratext.

Database access is almost always I/O bound rather than CPU bound, and hence the likely wins are by switching to asynchronous frameworks.  This does involve using a somewhat different programming style than synchronous adapters, but some recent ones look amazingly fast.  I am not yet sure whether it is worthwhile to create IOPro style adapters around these asyncio-based interfaces.

  • asyncpg is a database interface library designed specifically for PostgreSQL and Python/asyncio. asyncpg is an efficient, clean implementation of PostgreSQL server binary protocol. Information about asyncpg can be found at https://magicstack.github.io/asyncpg/current/.

  • Motor presents a callback- or Future-based API for non-blocking access to MongoDB from Tornado or asyncio. Information about Motor can be found at http://motor.readthedocs.io/en/stable/.

We will continue to monitor and reply to issues and discussion about these successor projects at their GitHub repositories - all questions should be addressed at https://github.com/ContinuumIO/TextAdapter or https://github.com/ContinuumIO/DBAdapter.

by ryanwh at September 14, 2016 02:42 PM

September 13, 2016

Titus Brown

A draft genome for the tule elk

(Please contact us at bnsacks@ucdavis.edu if you are interested in access to any of this data. We're still working out how and when to make it public.)

The tule elk (Cervus elaphus nannodes) is a California-endemic subspecies that underwent a major genetic bottleneck when its numbers were reduced to as few as 3 individuals in the 1870s (McCullough 1969; Meredith et al. 2007). Since then, the population has grown to an estimated 4,300 individuals which currently occur in 22 distinct herds (Hobbs 2014). Despite their higher numbers today, the historical loss of genetic diversity combined with the increasing fragmentation of remaining habitat pose a significant threat to the health and management of contemporary populations. As populations become increasingly fragmented by highways, reservoirs, and other forms of human development, risks intensify for genetic impacts associated with inbreeding. By some estimates, up to 44% of remaining genetic variation could be lost in small isolated herds in just a few generations (Williams et al. 2004). For this reason, the Draft Elk Conservation and Management Plan and California Wildlife Action Plan prioritize research aimed at facilitating habitat connectivity, as well as stemming genetic diversity loss and habitat fragmentation (Hobbs 2014; CDFW 2015).


We obtained 377,980,276 raw reads (i.e., 300 bp sequences from random points in the genome), containing a total of 113.394 Gbp of sequence, or approximately 40X coverage of the tule elk genome. More than 98% of these data passed quality filtering. The reads (and coverage) were distributed approximately equally among the 4 elk, resulting in approximately 10X coverage for each of the 4 elk.

...

The tule elk reads were de novo assembled into 602,862 contiguous sequences ("contigs") averaging 3,973 bp in length (N50 = 6,885 bp, maximum contig length = 72,391 bp), for a total genome sequence size of 2.395 billion bp (Gbp). All scaffolds and raw reads will be made publicly available on Genbank or a similar public database pending publication. Alignment of all elk reads back to these contigs revealed 3,571,069 polymorphic sites (0.15% of sites). Assuming a similar ratio of heterozygous (in individuals) to polymorphic (among the 4 elk) sites as we observed in the subsample aligned to the sheep genome, this would translate to a genome-wide heterozygosity of approximately 5e-4, which was about 5 times higher than that observed in the 25% of the genome mapping to the sheep genome. This magnitude of heterozygosity is in line with other bottlenecked mammal populations, including several of the island foxes (Urocyon littoralis), cheetah (Acinonyx jubatus), Tasmanian devil (Sarcophilus harrisii), and mountain gorilla (Gorilla beringei beringei; Robinson et al. 2016 and references therein). Although these interspecific comparisons provide a general reference, heterozygosity can vary substantially according to life-history, as well as demographic history, and does not necessarily imply a direct relationship to genetic health. Therefore, sequencing the closely related Rocky Mountain (C. elaphus nelsoni) and Roosevelt (C. elaphus roosevelti) elk in the future is necessary to provide the most meaningful comparison to the tule elk heterozygosity reported here.


Note, assembly method details are available on github.

by Ben Sacks, Zach Lounsberry, Jessica Mizzi, C. Titus Brown, at September 13, 2016 10:00 PM

Publishing Open Source Research Software in JOSS - an experience report

Our first JOSS submission (paper? package?) is about to be accepted and I wanted to enthuse about the process a bit.

JOSS, the Journal of Open Source Software, is a place to publish your research software packages. Quoting from the about page,

The Journal of Open Source Software (JOSS) is an academic journal with a formal peer review process that is designed to improve the quality of the software submitted. Upon acceptance into JOSS, a CrossRef DOI is minted and we list your paper on the JOSS website.

How is JOSS different?

In essentially all other academic journals, when you publish software you have to write a bunch of additional stuff about what the software does and how it works and why it's novel or exciting. This is true even in some of the newer models for software publication like F1000Research, which hitherto took the prize for least obnoxious software publication process.

JOSS takes the attitude that what the software does should be laid out in the software documentation. JOSS also has the philosophy that since software is the product perhaps the software itself should be reviewed rather than the software advertisement (aka scientific paper). (Note, I'm a reviewer for JOSS, and I'm totally in cahoots with most of the ed board, but I don't speak for JOSS in any way.)

To put it more succinctly, with JOSS the focus is on the software itself, not on ephemera associated with the software.

The review experience

I submitted our sourmash project a few months back. Sourmash was a little package I'd put together to do MinHash sketch calculations on DNA, and it wasn't defensible as a novel package. Frankly, it's not that scientifically interesting either. But it's a potentially useful reimplementation of mash, and we'd already found it useful internally. So I submitted it to JOSS.

As you can see from the JOSS checklist, the reviewer checklist is both simple and reasonably comprehensive. Jeremy Kahn undertook to do the review, and found a host of big and small problems, ranging from licensing confusion to versioning issues to straight up install bugs. Nonetheless his initial review was pretty positive. (Most of the review items were filed as issues on the sourmash repository, which you can see referenced inline in the review/pull request.)

After his initial review, I addressed most of the issues and he did another round of review, where he recommended acceptance after fixing up some of the docs and details.

Probably the biggest impact of Jeremy's review was my realization that we needed to adopt a formal release checklist, which I did by copying Michael Crusoe's detailed and excellent checklist from khmer. This made doing an actual release much saner. But a lot of confusing stuff got cleared up and a few install and test bugs were removed as well.

So, basically, the review did what it should have done - checked our assumptions and found big and little nits that needed to be cleaned up. It was by no means a gimme, and I think it improved the package tremendously.

+1 for JOSS!

Some thoughts on where JOSS fits

There are plenty of situations where a focus solely on the software isn't appropriate. With our khmer project, we publish new data structures and algorithms, apply our approaches to challenging data sets, benchmark various approaches, and describe the software suite at a high level. But in none of these papers did anyone really review the software (although some of the reviewers on the F1000 Research paper did poke it with a stick).

JOSS fills a nice niche here where we can receive a third-party review of the software itself. While I think Jeremy Kahn did an especially exemplary review of sourmash, and we could not expect such a deep review of the much larger khmer package, a broad review from a third-party perspective at each major release point would be most welcome. So I will plan on a JOSS submission for each major release of khmer, whether or not we also advertise the release elsewhere.

I suppose people might be concerned about publishing software in multiple ways and places, and how that's going to affect citation metrics. I have to say I don't have any concerns about salami slicing or citation inflation here, because software is still largely ignored by Serious Scientists and that's the primary struggle here. (Our experience is that people systematically mis-cite us (despite ridiculously clear guidelines) and my belief is that software and methods are generally undercited. I worry more about that than getting undue credit for software!)

JOSS is already seeing a fair amount of activity and, after my experience, if I see that something was published there, I will be much more likely to recommend it to others. I suggest you all check it out, if not as a place to publish yourself, then as a place to find better quality software.

--titus

by C. Titus Brown at September 13, 2016 10:00 PM

NeuralEnsemble

Neo 0.5.0-alpha1 released

We are pleased to announce the first alpha release of Neo 0.5.0.

Neo is a Python library which provides data structures for working with electrophysiology data, whether from biological experiments or from simulations, together with a large library of input-output modules to allow reading from a large number of different electrophysiology file formats (and to write to a somewhat smaller subset, including HDF5 and Matlab).

For Neo 0.5, we have taken the opportunity to simplify the Neo object model. Although this will require an initial time investment for anyone who has written code with an earlier version of Neo, the benefits will be greater simplicity, both in your own code and within the Neo code base, which should allow us to move more quickly in fixing bugs, improving performance and adding new features. For details of what has changed and what has been added, see the Release notes.

If you are already using Neo for your data analysis, we encourage you to give the alpha release a try. The more feedback we get about the alpha release, the quicker we can find and fix bugs. If you do find a bug, please create a ticket. If you have questions, please post them on the mailing list or in the comments below.

Documentation:
http://neo.readthedocs.io/en/neo-0.5.0alpha1/
Licence:
Modified BSD
Source code:
https://github.com/NeuralEnsemble/python-neo

by Andrew Davison (noreply@blogger.com) at September 13, 2016 01:24 PM

Matthieu Brucher

Playing with a Bela (1): Turning it on and compiling Audio Toolkit

I have now some time to play with this baby:
Beagleboard with Bela extension
The CPU may not be blazingly fast, but I hope I can still do something with it. The goal of this series will be to try different algorithms and see how they behave on the platform.

Setting everything up

I got the Bela with the Kickstarter campaign. Although I could have used it as soon as I got it, I didn’t have enough time to really dig into it. Now is the time.

First, I had to update the Bela image with the latest available one. This image allows you to connect to the Internet directly through the Ethernet port, which is required if you need to get source code from the Internet or update the card. A nice change.

The root account is the one advised in the wiki, but I would suggest creating a user account, locking down the root account so that you can't log in with it (especially if the board is plugged into your private network!) and making your user account a sudoer.

Once this is done, let’s tackle Audio Toolkit compilation.

Compiling Audio Toolkit

For this step, you need to start by getting the latest gcc, cmake, libeigen3-dev and the boost libraries (libboost-all-dev or just system, timer and test). Now we have all the dependencies.

Get the develop branch of Audio Toolkit (there will be a future release with the last updates soon that will support ARM code) from github: https://github.com/mbrucher/AudioTK and launch cmake to build the Makefiles.

While the C++11 flag is activated by default, that is not the case for the other flags that the ARM board requires. On top of it, we need -march=native -mfpu=neon -funsafe-math-optimizations. The first option triggers ARM code generation for the Beagleboard platform, the second allows use of the NEON intrinsics and floating point instructions. The last one is the interesting one: it allows some math operations to be optimized, like flushing denormals to zero (effectively what FE_DFL_DISABLE_SSE_DENORMS_ENV does on OS X, or _MM_DENORMALS_ZERO_ON does on x86 with GCC).

The compilation takes time, but it finishes with all the libraries and tests.

Conclusion

I now have the basic structure ready. The compilation of Audio Toolkit is slow, but I hope the code itself will be fast enough. Let's keep that for the next post in this series.


by Matt at September 13, 2016 07:12 AM

Matthew Rocklin

Dask and Celery

This post compares two Python distributed task processing systems, Dask.distributed and Celery.

Disclaimer: technical comparisons are hard to do well. I am biased towards Dask and ignorant of correct Celery practices. Please keep this in mind. Critical feedback by Celery experts is welcome.

Celery is a distributed task queue built in Python and heavily used by the Python community for task-based workloads.

Dask is a parallel computing library popular within the PyData community that has grown a fairly sophisticated distributed task scheduler. This post explores if Dask.distributed can be useful for Celery-style problems.

Comparing technical projects is hard, both because authors have biases and because the scope of each project can be quite large. This allows authors to gravitate towards the features that show off their strengths. Fortunately, a Celery user asked on GitHub how Dask compares and listed a few concrete features:

  1. Handling multiple queues
  2. Canvas (celery’s workflow)
  3. Rate limiting
  4. Retrying

These provide an opportunity to explore the Dask/Celery comparison from the bias of a Celery user rather than from the bias of a Dask developer.

In this post I'll point out a couple of large differences, then go through the Celery hello world in both projects, and then address how these requested features are implemented or not within Dask. This anecdotal look at a few features should give us a general sense of how the projects compare.

Biggest difference: Worker state and communication

First, the biggest difference (from my perspective) is that Dask workers hold onto intermediate results and communicate data between each other while in Celery all results flow back to a central authority. This difference was critical when building out large parallel arrays and dataframes (Dask’s original purpose) where we needed to engage our worker processes’ memory and inter-worker communication bandwidths. Computational systems like Dask do this, more data-engineering systems like Celery/Airflow/Luigi don’t. This is the main reason why Dask wasn’t built on top of Celery/Airflow/Luigi originally.

That’s not a knock against Celery/Airflow/Luigi by any means. Typically they’re used in settings where this doesn’t matter and they’ve focused their energies on several features that Dask similarly doesn’t care about or do well. Tasks usually read data from some globally accessible store like a database or S3 and either return very small results, or place larger results back in the global store.
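
A small sketch of what this difference means in practice (the scheduler address and the load/process functions are placeholders): the intermediate future x stays in worker memory and is handed directly to the dependent task instead of flowing back through the client.

from distributed import Client

client = Client('scheduler-address:8786')   # hypothetical scheduler address

def load(filename):
    ...                                     # e.g. read a large partition

def process(data):
    ...                                     # e.g. an expensive transformation

x = client.submit(load, 'part-001.csv')     # result of load() stays on a worker
y = client.submit(process, x)               # x moves worker-to-worker, not via the client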

The question on my mind now is: can Dask be a useful solution in more traditional loose task scheduling problems where projects like Celery are typically used? What are the benefits and drawbacks?

Hello World

To start we do the First steps with Celery walk-through both in Celery and Dask and compare the two:

Celery

I follow the Celery quickstart, using Redis instead of RabbitMQ because it’s what I happen to have handy.

# tasks.py

from celery import Celery

app = Celery('tasks', broker='redis://localhost', backend='redis')

@app.task
def add(x, y):
    return x + y
$ redis-server
$ celery -A tasks worker --loglevel=info
In [1]: from tasks import add

In [2]: %time add.delay(1, 1).get()  # submit and retrieve roundtrip
CPU times: user 60 ms, sys: 8 ms, total: 68 ms
Wall time: 567 ms
Out[2]: 2

In [3]: %%time
...: futures = [add.delay(i, i) for i in range(1000)]
...: results = [f.get() for f in futures]
...:
CPU times: user 888 ms, sys: 72 ms, total: 960 ms
Wall time: 1.7 s

Dask

We do the same workload with dask.distributed’s concurrent.futures interface, using the default single-machine deployment.

In [1]: from distributed import Client

In [2]: c = Client()

In [3]: from operator import add

In [4]: %time c.submit(add, 1, 1).result()
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 20.7 ms
Out[4]: 2

In [5]: %%time
...: futures = [c.submit(add, i, i) for i in range(1000)]
...: results = c.gather(futures)
...:
CPU times: user 328 ms, sys: 12 ms, total: 340 ms
Wall time: 369 ms

Comparison

  • Functions: In Celery you register computations ahead of time on the server. This is good if you know what you want to run ahead of time (such as is often the case in data engineering workloads) and don't want the security risk of allowing users to run arbitrary code on your cluster. It's less pleasant for users who want to experiment. In Dask we choose the functions to run on the user side, not on the server side. This ends up being pretty critical in data exploration but may be a hindrance in more conservative/secure compute settings.
  • Setup: In Celery we depend on other widely deployed systems like RabbitMQ or Redis. Dask depends on lower-level Tornado TCP IOStreams and Dask's own custom routing logic. This makes Dask trivial to set up, but also probably less durable. Redis and RabbitMQ have both solved lots of problems that come up in the wild and leaning on them inspires confidence.
  • Performance: They both operate with sub-second latencies and millisecond-ish overheads. Dask is marginally lower-overhead but for data engineering workloads differences at this level are rarely significant. Dask is an order of magnitude lower-latency, which might be a big deal depending on your application. For example if you’re firing off tasks from a user clicking a button on a website 20ms is generally within interactive budget while 500ms feels a bit slower.

Simple Dependencies

The question asked about Canvas, Celery’s dependency management system.

Often tasks depend on the results of other tasks. Both systems have ways to help users express these dependencies.

Celery

The apply_async method has a link= parameter that can be used to call tasks after other tasks have run. For example we can compute (1 + 2) + 3 in Celery as follows:

add.apply_async((1, 2), link=add.s(3))

Dask.distributed

With the Dask concurrent.futures API, futures can be used within submit calls and dependencies are implicit.

x = c.submit(add, 1, 2)
y = c.submit(add, x, 3)

We could also use the dask.delayed decorator to annotate arbitrary functions and then use normal-ish Python.

import dask

@dask.delayed
def add(x, y):
    return x + y

x = add(1, 2)
y = add(x, 3)
y.compute()

Comparison

I prefer the Dask solution, but that’s subjective.

Complex Dependencies

Celery

Celery includes a rich vocabulary of terms to connect tasks in more complex ways including groups, chains, chords, maps, starmaps, etc.. More detail here in their docs for Canvas, the system they use to construct complex workflows: http://docs.celeryproject.org/en/master/userguide/canvas.html

For example here we chord many adds and then follow them with a sum.

In [1]: from tasks import add, tsum  # I had to add a sum method to tasks.py

In [2]: from celery import chord

In [3]: %time chord(add.s(i, i) for i in range(100))(tsum.s()).get()
CPU times: user 172 ms, sys: 12 ms, total: 184 ms
Wall time: 1.21 s
Out[3]: 9900

Dask

Dask’s trick of allowing futures in submit calls actually goes pretty far. Dask doesn’t really need any additional primitives. It can do all of the patterns expressed in Canvas fairly naturally with normal submit calls.

In [4]: %%time
...: futures = [c.submit(add, i, i) for i in range(100)]
...: total = c.submit(sum, futures)
...: total.result()
...:
CPU times: user 52 ms, sys: 0 ns, total: 52 ms
Wall time: 60.8 ms

Or with Dask.delayed

futures = [add(i, i) for i in range(100)]  # the dask.delayed add from above
total = dask.delayed(sum)(futures)
total.compute()

Multiple Queues

In Celery there is a notion of queues to which tasks can be submitted and that workers can subscribe. An example use case is having “high priority” workers that only process “high priority” tasks. Every worker can subscribe to the high-priority queue but certain workers will subscribe to that queue exclusively:

celery -A my-project worker -Q high-priority  # only subscribe to high priority
celery -A my-project worker -Q celery,high-priority  # subscribe to both
celery -A my-project worker -Q celery,high-priority
celery -A my-project worker -Q celery,high-priority

This is like the TSA pre-check line or the express lane in the grocery store.

Dask has a couple of features that are similar or could fit this need in a pinch, but nothing that is strictly analogous.

First, for the common case above, tasks have priorities. These are typically set by the scheduler to minimize memory use but can be overridden directly by users to give certain tasks precedence over others.

Second, you can restrict tasks to run on subsets of workers. This was originally designed for data-local storage systems like the Hadoop FileSystem (HDFS) or clusters with special hardware like GPUs but can be used in the queues case as well. It’s not quite the same abstraction but could be used to achieve the same results in a pinch. For each task you can restrict the pool of workers on which it can run.

The relevant docs for this are here: http://distributed.readthedocs.io/en/latest/locality.html#user-control
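
As a quick illustration (the worker hostnames are hypothetical), a task can be pinned to a subset of workers at submit time:

# Restrict this task to a "high priority" pool of named workers
future = c.submit(sum, [1, 2, 3],
                  workers=['worker-1.example.com', 'worker-2.example.com'])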

Retrying Tasks

Celery allows tasks to retry themselves on a failure.

@app.task(bind=True)
def send_twitter_status(self, oauth, tweet):
    try:
        twitter = Twitter(oauth)
        twitter.update_status(tweet)
    except (Twitter.FailWhaleError, Twitter.LoginError) as exc:
        raise self.retry(exc=exc)

# Example from http://docs.celeryproject.org/en/latest/userguide/tasks.html#retrying

Sadly, Dask currently has no support for this (see open issue). All functions are considered pure and final. If a task errs, the exception is considered to be the true result. This could change, though; it has been requested a couple of times now.

Until then users need to implement retry logic within the function (which isn’t a terrible idea regardless).

def send_twitter_status(oauth, tweet, n_retries=5):
    # Plain function submitted to Dask; the retry logic lives inside the task itself
    for i in range(n_retries):
        try:
            twitter = Twitter(oauth)
            twitter.update_status(tweet)
            return
        except (Twitter.FailWhaleError, Twitter.LoginError):
            if i == n_retries - 1:
                raise  # out of retries; the exception becomes the task's result

Rate Limiting

Celery lets you specify rate limits on tasks, presumably to help you avoid getting blocked from hammering external APIs:

@app.task(rate_limit='1000/h')
def query_external_api(...):
    ...

Dask definitely has nothing built in for this, nor is it planned. However, this could be done externally to Dask fairly easily. For example, Dask supports mapping functions over arbitrary Python Queues. If you send in a queue then all current and future elements in that queue will be mapped over. You could easily handle rate limiting in Pure Python on the client side by rate limiting your input queues. The low latency and overhead of Dask makes it fairly easy to manage logic like this on the client-side. It’s not as convenient, but it’s still straightforward.

>>> from queue import Queue

>>> q = Queue()

>>> out = c.map(query_external_api, q)
>>> type(out)
Queue

Final Thoughts

Based on this very shallow exploration of Celery, I’ll foolishly claim that Dask can handle Celery workloads, if you’re not diving into deep API. However all of that deep API is actually really important. Celery evolved in this domain and developed tons of features that solve problems that arise over and over again. This history saves users an enormous amount of time. Dask evolved in a very different space and has developed a very different set of tricks. Many of Dask’s tricks are general enough that they can solve Celery problems with a small bit of effort, but there’s still that extra step. I’m seeing people applying that effort to problems now and I think it’ll be interesting to see what comes out of it.

Going through the Celery API was a good experience for me personally. I think that there are some good concepts from Celery that can inform future Dask development.

September 13, 2016 12:00 AM

September 12, 2016

Matthew Rocklin

Dask Distributed Release 1.13.0

I’m pleased to announce a release of Dask’s distributed scheduler, dask.distributed, version 1.13.0.

conda install dask distributed -c conda-forge
or
pip install dask distributed --upgrade

The last few months have seen a number of important user-facing features:

  • Executor is renamed to Client
  • Workers can spill excess data to disk when they run out of memory
  • The Client.compute and Client.persist methods for dealing with dask collections (like dask.dataframe or dask.delayed) gain the ability to restrict sub-components of the computation to different parts of the cluster with a workers= keyword argument.
  • IPython kernels can be deployed on the worker and schedulers for interactive debugging.
  • The Bokeh web interface has gained new plots and improved the visual styling of old ones.

Additionally there are beta features in current development. These features are available now, but may change without warning in future versions. Experimentation and feedback by users comfortable with living on the bleeding edge is most welcome:

  • Clients can publish named datasets on the scheduler to share between them
  • Tasks can launch other tasks
  • Workers can restart themselves in new software environments provided by the user

There have also been significant internal changes. Other than increased performance these changes should not be directly apparent.

  • The scheduler was refactored to a more state-machine like architecture. Doc page
  • Short-lived connections are now managed by a connection pool
  • Work stealing has changed and grown more responsive: Doc page
  • General resilience improvements

The rest of this post will contain very brief explanations of the topics above. Some of these topics may become blogposts of their own at some point. Until then I encourage people to look at the distributed scheduler's documentation, which is separate from dask's normal documentation and so may contain new information for some readers (Google Analytics reports about 5-10x the readership on http://dask.readthedocs.org compared to http://distributed.readthedocs.org).

Major Changes and Features

Rename Executor to Client

http://distributed.readthedocs.io/en/latest/api.html

The term Executor was originally chosen to coincide with the concurrent.futures Executor interface, which is what defines the behavior for the .submit, .map, .result methods and Future object used as the primary interface.

Unfortunately, this is the same term used by projects like Spark and Mesos for “the low-level thing that executes tasks on each of the workers” causing significant confusion when communicating with other communities or for transitioning users.

In response we rename Executor to a somewhat more generic term, Client to designate its role as the thing users interact with to control their computations.

>>> from distributed import Executor  # Old
>>> e = Executor()                    # Old

>>> from distributed import Client    # New
>>> c = Client()                      # New

Executor remains an alias for Client and will continue to be valid for some time, but there may be some backwards incompatible changes for internal use of executor= keywords within methods. Newer examples and materials will all use the term Client.

Workers Spill Excess Data to Disk

http://distributed.readthedocs.io/en/latest/worker.html#spill-excess-data-to-disk

When workers get close to running out of memory they can send excess data to disk. This is not on by default and instead requires adding the --memory-limit=auto option to dask-worker.

dask-worker scheduler:8786                      # Old
dask-worker scheduler:8786 --memory-limit=auto  # New

This will eventually become the default (and already is when using LocalCluster), but we’d like to see how things progress and phase it in slowly.

Generally this feature should improve robustness and allow the solution of larger problems on smaller clusters, although with a performance cost. Dask’s policies to reduce memory use through clever scheduling remain in place, so in the common case you should never need this feature, but it’s nice to have as a failsafe.

Enable restriction of valid workers for compute and persist methods

http://distributed.readthedocs.io/en/latest/locality.html#user-control

Expert users of the distributed scheduler will be aware of the ability to restrict certain tasks to run only on certain computers. This tends to be useful when dealing with GPUs or with special databases or instruments only available on some machines.

Previously this option was available only on the submit, map, and scatter methods, forcing people to use the more immediate interface. Now the dask collection interface functions compute and persist support this keyword as well.
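For instance, a minimal sketch of restricting a persisted dataframe to a couple of machines might look like the following. The scheduler address, worker addresses, and file path are placeholders, and the exact accepted forms of the workers= argument are described in the locality documentation linked above.

from dask.distributed import Client
import dask.dataframe as dd

client = Client('scheduler-address:8786')   # placeholder scheduler address

df = dd.read_csv('s3://my-bucket/*.csv')

# Keep this collection on the machines that have the special resource
# (GPU, database, instrument, ...). The addresses below are made up.
df = client.persist(df, workers=['192.168.1.100', '192.168.1.101'])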

IPython Integration

http://distributed.readthedocs.io/en/latest/ipython.html

You can start IPython kernels on the workers or scheduler and then access them directly using either IPython magics or the QTConsole. This tends to be valuable when things go wrong and you want to interactively debug on the worker nodes themselves.

Start IPython on the Scheduler

>>> client.start_ipython_scheduler()  # Start IPython kernel on the scheduler
>>> %scheduler scheduler.processing   # Use IPython magics to inspect scheduler
{'127.0.0.1:3595': ['inc-1', 'inc-2'],
 '127.0.0.1:53589': ['inc-2', 'add-5']}

Start IPython on the Workers

>>> info = e.start_ipython_workers()  # Start IPython kernels on all workers
>>> list(info)
['127.0.0.1:4595', '127.0.0.1:53589']
>>> %remote info['127.0.0.1:3595'] worker.active  # Use IPython magics
{'inc-1', 'inc-2'}

Bokeh Interface

http://distributed.readthedocs.io/en/latest/web.html

The Bokeh web interface to the cluster continues to evolve both by improving existing plots and by adding new plots and new pages.

dask progress bar

For example, the progress bars have become more compact and shrink down dynamically to respond to additional bars.

And we’ve added in extra tables and plots to monitor workers, such as their memory use and current backlog of tasks.

Experimental Features

The features described below are experimental and may change without warning. Please do not depend on them in stable code.

Publish Datasets

http://distributed.readthedocs.io/en/latest/publish.html

You can now save collections on the scheduler, allowing you to come back to the same computations later or allow collaborators to see and work off of your results. This can be useful in the following cases:

  1. There is a dataset from which you frequently base all computations, and you want that dataset always in memory and easy to access without having to recompute it each time you start work, even if you disconnect.
  2. You want to send results to a colleague working on the same Dask cluster and have them get immediate access to your computations without having to send them a script and without them having to repeat the work on the cluster.

Example: Client One

from dask.distributed import Client
client = Client('scheduler-address:8786')

import dask.dataframe as dd
df = dd.read_csv('s3://my-bucket/*.csv')
df2 = df[df.balance < 0]
df2 = client.persist(df2)

>>> df2.head()
      name  balance
0    Alice     -100
1      Bob     -200
2  Charlie     -300
3   Dennis     -400
4    Edith     -500

client.publish_dataset(accounts=df2)

Example: Client Two

>>> from dask.distributed import Client
>>> client = Client('scheduler-address:8786')

>>> client.list_datasets()
['accounts']

>>> df = client.get_dataset('accounts')
>>> df.head()
      name  balance
0    Alice     -100
1      Bob     -200
2  Charlie     -300
3   Dennis     -400
4    Edith     -500

Launch Tasks from Tasks

http://distributed.readthedocs.io/en/latest/task-launch.html

You can now submit tasks to the cluster that themselves submit more tasks. This allows the submission of highly dynamic workloads that can shape themselves depending on future computed values without ever checking back in with the original client.

This is accomplished by starting new local Clients within the task that can interact with the scheduler.

def func():
    from distributed import local_client
    with local_client() as c2:
        future = c2.submit(...)

c = Client(...)
future = c.submit(func)

There are a few straightforward use cases for this, like iterative algorithms with stopping criteria, but also many novel use cases, including streaming and monitoring systems.
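As a concrete (and entirely made-up) sketch of the iterative case, assuming a scheduler running at the placeholder address below, a task can drive a fixed-point iteration from inside the cluster and stop on its own criterion:

from distributed import Client

def refine(x):
    # One step of a toy fixed-point iteration (Newton's method for sqrt(2))
    return 0.5 * (x + 2.0 / x)

def run_until_converged(x, tol=1e-12, max_iter=100):
    # Runs on a worker: each refinement step is submitted back to the cluster
    # through a local client, and the loop stops once the update is small enough
    from distributed import local_client
    with local_client() as c:
        for _ in range(max_iter):
            new_x = c.submit(refine, x).result()
            if abs(new_x - x) < tol:
                break
            x = new_x
    return x

client = Client('scheduler-address:8786')   # placeholder scheduler address
result = client.submit(run_until_converged, 1.0).result()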

Restart Workers in Redeployable Python Environments

You can now zip up and distribute full Conda environments, and ask dask-workers to restart themselves, live, in that environment. This involves the following:

  1. Create a conda environment locally (or any redeployable directory including a python executable)
  2. Zip up that environment and use the existing dask.distributed network to copy it to all of the workers
  3. Shut down all of the workers and restart them within the new environment

This helps users to experiment with different software environments with a much faster turnaround time (typically tens of seconds) than asking IT to install libraries or building and deploying Docker containers (which is also a fine solution). Note that the typical solution of uploading individual Python scripts or egg files has been around for a while; see the API docs for upload_file.
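For comparison, that older approach is a one-liner; the module name and scheduler address below are placeholders:

from distributed import Client

client = Client('scheduler-address:8786')   # placeholder scheduler address
client.upload_file('my_module.py')          # copied to every worker and importable there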

Acknowledgements

Since version 1.12.0 on August 18th, the following people have contributed commits to the dask/distributed repository:

  • Dave Hirschfeld
  • dsidi
  • Jim Crist
  • Joseph Crail
  • Loïc Estève
  • Martin Durant
  • Matthew Rocklin
  • Min RK
  • Scott Sievert

September 12, 2016 12:00 AM

Where to Write Prose?

Code is only as good as its prose.

Like many programmers I spend more time writing prose than code. This is great; writing clean prose focuses my thoughts during design and disseminates understanding so that people see how a project can benefit them.

However, I now question how and where I should write and publish prose. When communicating to users there are generally two options:

  1. Blogposts
  2. Documentation

Given that developer time is finite we need to strike some balance between these two activities. I used to blog frequently, then I switched to almost only documentation, and I think I’m probably about to swing back a bit. Here’s why:

Blogposts

Blogposts excel at generating interest, informing people of new functionality, and providing workable examples that people can copy and modify. I used to blog about Dask (my current software project) pretty regularly here on my blog and continuously got positive feedback from it. This felt great.

However, blogging about evolving software also generates debt. Such blogposts grow stale and inaccurate, and so when they’re the only source of information about a project, users grow confused when they try things that no longer work, and they’re stuck without a clear reference to turn to. Basing core understanding on blogs can be a frustrating experience.

Documentation

So I switched from writing blogposts to spending a lot of time writing technical documentation. This was a positive move. User comprehension seemed to increase, and the questions I was fielding were of a far higher level than before.

Documentation gets updated as features mature. New pages assimilate cleanly and obsolete pages get cleaned up. Documentation is generally more densely linked than linear blogs, and readers tend to explore more deeply within the website. Comparing the Google Analytics results for my blog and my documentation shows significantly increased engagement, both with longer page views and with longer chains of navigation throughout the site. Documentation seems to engage readers more strongly than do blogs (at least more strongly than my blog).

However, documentation doesn’t get in front of people the same way that blogs do. No one subscribes to receive documentation updates. Doc pages for new features rarely end up on Reddit or Hacker News. The way people pass around blog links encourages Google to point people there far more often than to doc pages. There is no way for interested users to keep up with the latest news except by subscribing to fairly dry release e-mails.

Blogposts are way sexier. This feels a little shallow if you’re not into sales and marketing, but let’s remember that software dies without users and that users are busy people who have to be stimulated into taking the time to learn new things.

Current Plan

I still think it’s wise for core developers to focus 80% of their prose time on documentation, especially for new or in-flux features that haven’t had a decent amount of time for users to provide feedback.

However, I personally hope to blog more about concepts or timely experiences that have to do with development, if not the features themselves. For example, right now I’m building a Mesos-powered Scheduler for Dask.distributed. I’ll probably write about the experiences of a developer meeting Mesos for the first time, but I probably won’t include a how-to of using Dask with Mesos.

I also hope to find some way to polish existing doc pages into blogposts once they have proven to be fairly stable. This mostly involves finding a meaningful and reproducible example to work through.

Feedback

I would love to hear how other projects handle this tension between timely and timeless documentation.

September 12, 2016 12:00 AM

September 08, 2016

Continuum Analytics news

Democratization of Compute: More Flops, More Users & Solving More Challenges

Thursday, September 8, 2016
Mike Lee
Technical, Enterprise and Cloud Compute Segment Manager, Developer Products Division
Intel Corporation

The past decade has seen compute capacity at the cluster scale grow faster than Moore’s Law. The relentless pursuit of exascale systems and beyond brings broad advances in the availability of a large amount of compute power to developers and users on “everyday” systems. Call it “trickle down” high performance computing if you like, but the effects are profound in the amount of computation that can be accessed. A teraflop system today can easily be had in a workstation, ready and able to tackle scientific compute problems, financial modeling exercises and plow through huge amounts of data for machine learning.

Programming these high performance systems used to be the domain of native-language developers working in Fortran or C/C++ and scaling up and out with distributed computing via the Message Passing Interface (MPI) to take advantage of clusters. While these languages are still the mainstay of high performance computing, scripting languages such as Python have been adopted by a broad community of users for their ease of use and short learning curve. Giving more users the ability to do computing is a good thing, but there is a limitation that makes it difficult for Python users to get good performance: the global interpreter lock, or “GIL,” which keeps execution single-threaded and prevents the parallelism needed to take advantage of modern multicore/many-core, multithreaded CPUs. If only there were a way to make it easy and seamless to get performance from Python, we could broaden the availability of compute power to more users.

My colleagues at Intel in engineering and product marketing teams examined this limitation and saw that there were some solutions out there that were challenging to implement—thus began our close association with Continuum Analytics, a leader in the Open Data Science and Python community, to make these performance enhancements widely available to all. Collaboration with Continuum Analytics has helped us bring the Intel® Distribution for Python powered by Anaconda to the Python community, which leverages the Intel® Performance Libraries, such as Intel® Math Kernel Library, Intel® Data Analytics Library, Intel® MPI Library and Intel® Threading Building Blocks. The collaboration between Intel and Continuum Analytics helps provide a path to greater performance for Python developers and users.

And today, we are happy to announce a major milestone in our journey with the Intel Distribution. After a year in beta, the Distribution is now available in its first public version as the Intel® Distribution for Python 2017. It's been a wild ride—the thrills of successful compiles and builds, the agony of managing dependencies, chasing down the bugs, the race to meet project deadlines, the highs of good press, the lows of post release reported errors—but overall, we have the satisfaction of having delivered a solid product.
                                    
Our work is not done. We will continue to push the boundaries of performance to enable more flops to more users to solve more computing challenges. Live long and Python!

Questions about the Intel® Distribution for Python powered by Anaconda? Read our FAQ.

by swebster at September 08, 2016 01:06 PM

Pierre de Buyl

Correlators for molecular and stochastic dynamics

License: CC-BY

Time correlations represent some of the most important data that one can obtain from molecular and stochastic dynamics. The two common methods to obtain them are post-processing and on-line analysis. Here I review several algorithms to compute correlations from numerical data: naive, Fourier transform, and a blocking scheme, with illustrations from Langevin dynamics, using Python.

Introduction

I conduct molecular and stochastic simulations of colloidal particles. One important quantity to extract from these simulations is the autocorrelation of the velocity. The reasoning also applies to other types of correlation functions; I am just focusing on this one to provide context.

There are several procedures to compute correlation functions from numerical data, but I did not find a synthetic review of them, so I am making my own. The examples will use the Langevin equation and the tools from the SciPy stack when appropriate.

We trace the appearance of the methods using textbooks on molecular simulation when available, or other references (articles, software documentation) when appropriate. If I missed a foundational reference, let me know in the comments or by email.

In 1987, Allen and Tildesley mention in their book Computer Simulation of Liquids [Allen1987] the direct algorithm for the autocorrelation $$C_{AA}(\tau = j\Delta t) = \frac{1}{N-j} \sum_i A_i A_{i+j}$$ where I use $A_i$ to denote the observable $A$ at time $i\Delta t$ ($\Delta t$ is the sampling interval). Typically, $\tau$ is called the lag time or simply the lag. The number of operations is $O(N_\textrm{cor}N)$, where $N$ is the number of items in the time series and $N_\textrm{cor}$ the number of correlation points that are computed. By storing the last $N_\textrm{cor}$ values of $A$, this algorithm is suitable for use during a simulation run. Allen and Tildesley then mention the Fast Fourier Transform (FFT) version of the algorithm, which is more efficient given its $N\log N$ scaling. The FFT algorithm is based on the convolution theorem: performing the convolution (of the time-reversed data) in frequency space is a multiplication and is much faster than the direct algorithm. The signal has to be zero-padded to avoid circular correlations due to the finiteness of the data. The requirements of the FFT method in terms of storage and the number of points needed for $C_{AA}$ influence which algorithm gives the fastest result.
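For concreteness, here is a minimal sketch in plain NumPy (not taken from any of the references) of the direct algorithm in its online form, keeping only the last $N_\textrm{cor}$ values of $A$ in memory:

import numpy as np

def online_autocorrelation(samples, n_cor):
    # Direct O(N_cor * N) autocorrelation using a circular buffer of the last
    # n_cor values, so it could run during a simulation. Normalization by
    # (N - j) is applied at the end.
    history = np.zeros(n_cor)
    acc = np.zeros(n_cor)
    counts = np.zeros(n_cor)
    for i, a in enumerate(samples):
        history[i % n_cor] = a
        for j in range(min(i + 1, n_cor)):
            acc[j] += a * history[(i - j) % n_cor]
            counts[j] += 1
    return acc / counts

# Example: centered white noise is essentially uncorrelated beyond lag 0
c = online_autocorrelation(np.random.random(10000) - 0.5, n_cor=16)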

Frenkel and Smit in their book Understanding Molecular Simulation [Frenkel2002] introduce what they call an "Order-n algorithm to measure correlations". The principle is to store the last $N_\textrm{cor}$ values for $A$ with a sampling interval $\Delta t$ and also the last $N_\textrm{cor}$ values for $A$ with a sampling interval $l \Delta t$ where $l$ is the block size, and recursively store the data with an interval $l$, $l^2$, $l^3$, etc. In their algorithm, the data is also averaged over the corresponding time interval and the observable is thus coarse-grained during the procedure.

A variation on this blocking scheme is used by Colberg and Höfling [Colberg2011], where no averaging is performed. Ramírez, Sukumaran, Vorselaars and Likhtman [Ramirez2010] propose a more flexible blocking scheme in which the block length and the duration of averaging can be set independently. They provide an estimate of the systematic and statistical errors induced by the choice of these parameters. This is the multiple tau correlator.

The "multiple-tau" correlator has since then been implemented in two popular Molecular Dynamics package:

The direct and FFT algorithms are available in NumPy and SciPy respectively.

In the field of molecular simulation, the FFT method was neglected for a long time but was put forward by the authors of the nMOLDYN software suite to analyze Molecular Dynamics simulation [Kneller1995].

The direct algorithm and the implementation in NumPy

The discrete-time correlation $c_j$ from a numerical time series $s_i$ is defined as

$$ c_j = \sum_i s_{i} s_{i+j}$$

where $j$ is the lag in discrete time steps, $s_i$ is the time series and the sum runs over all available indices. The values in $s$ represent a sampling with a time interval $\Delta t$ and we use interchangeably $s_i$ and $s(i\Delta t)$.

Note that this definition omits the normalization. What one is interested in is the normalized correlation function

$$ c'_j = \frac{1}{N-j} \sum_i s_{i} s_{i+j}$$

NumPy provides a function numpy.correlate that computes explicitly the correlation of a scalar time series. For small sets of data it is sufficiently fast as it is actually calling a compiled routine. The time it takes grows quadratically as a function of the input size, which makes it unsuitable for many practical applications.

Note that the routine computes the un-normalized correlation function, and that forgetting to normalize the result, or normalizing it with $N$ instead of $N-j$, will give incorrect results. numpy.correlate (with argument mode='full') returns an array of length $2N-1$ that contains the negative times as well as the positive times. For an autocorrelation, half of the array can be discarded.
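As a short illustration of that normalization for a scalar series s (random data here, purely for the sake of the example):

import numpy as np

s = np.random.random(4096) - 0.5
N = len(s)

full = np.correlate(s, s, mode='full')    # length 2N-1: lags -(N-1), ..., N-1
c = full[N-1:]                            # keep the non-negative lags 0, ..., N-1
c_normalized = c / (N - np.arange(N))     # divide lag j by N - j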

Below, we test the CPU time scaling with respect to the input data size and compare it with the expected $O(N^2)$ scaling, as the method computes all possible $N$ values for the correlation.

In [2]:
# imports assumed to come from an earlier setup cell (not shown)
import time
import numpy as np
import scipy.signal
import matplotlib.pyplot as plt

n_data = 8*8**np.arange(6)
time_data = []
for n in n_data:
    data = np.random.random(n)
    t0 = time.time()
    cor = np.correlate(data, data, mode='full')
    time_data.append(time.time()-t0)
time_data = np.array(time_data)
In [3]:
plt.plot(n_data, time_data, marker='o')
plt.plot(n_data, n_data**2*time_data[-1]/(n_data[-1]**2))
plt.loglog()
plt.title('Performance of the direct algorithm')
plt.xlabel('Size N of time series')
plt.ylabel('CPU time')
Out[3]:
<matplotlib.text.Text at 0x7f1285affa90>

SciPy's Fourier transform implementation

The SciPy package provides the routine scipy.signal.fftconvolve for FFT-based convolutions. As for NumPy's correlate routine, it outputs negative and positive time correlations, in the un-normalized form.

SciPy relies on the standard FFTPACK library to perform FFT operations.

Below, we test the CPU time scaling with respect to the input data size and compare it with the expected $O(N\log N)$ scaling. The maximum data length here is already an order of magnitude larger than for the direct algorithm, whose CPU time would already be prohibitive at that size.

In [4]:
n_data = []
time_data = []
for i in range(8):
    n = 4*8**i
    data = np.random.random(n)
    t0 = time.time()
    cor = scipy.signal.fftconvolve(data, data[::-1], mode='full')
    n_data.append(n)
    time_data.append(time.time()-t0)
n_data = np.array(n_data)
time_data = np.array(time_data)
In [5]:
plt.plot(n_data, time_data, marker='o')
plt.plot(n_data, n_data*np.log(n_data)*time_data[-1]/(n_data[-1]*np.log(n_data[-1])))
plt.loglog()
plt.xlabel('Size N of time series')
plt.ylabel('CPU time')
Out[5]:
<matplotlib.text.Text at 0x7f12856fc2e8>

Comparison of the direct and the FFT approach

It is important to note, as mentioned in [Kneller1995], that the direct and FFT algorithms give the same result, up to rounding errors. This is shown below by plotting the difference of the two results.

In [6]:
n = 2**14
sample_data = np.random.random(n)

direct_correlation = np.correlate(sample_data, sample_data, mode='full')
fft_correlation = scipy.signal.fftconvolve(sample_data, sample_data[::-1])

plt.plot(direct_correlation-fft_correlation)
Out[6]:
[<matplotlib.lines.Line2D at 0x7f1285828518>]

Blocking scheme for correlations

Both schemes mentioned above require storage that scales as $O(N)$. This includes the storage of the simulation data and the storage in RAM of the results (at least during the computation; it can be truncated afterwards if needed).

An alternative strategy is to store blocks of correlations, each successive block representing a coarser version of the data and correlation. The principle behind the blocking schemes is to use a decreasing time resolution to store correlation information for longer times. As the variations in the correlation typically decay with time, it makes sense to store less and less detail.

There are nice diagrams in the references cited above but for the sake of completeness, I will try my own diagram here.

  • The black full circle in the signal ("s") is taken at discrete time $i=41$ and fills the $b=0$ block at position $41 \mod 6 = 5$.
  • The next point, the empty circle, fills the block $b=0$ at position $42 \mod 6 = 0$. As $42 \mod l = 0$ (for $l=6$), the empty circle is also copied to the block $b=1$, at position $(42/l) \mod l = 1$.

Diagram of the blocking scheme

In the following,

  • $l$ is the length of the blocks
  • $B$ is the total number of blocks

The signal blocks contain samples of the signal, limited in time and at different timescales.

The procedure to fill the signal blocks is quite simple:

  1. Store the current value of the signal, $s_i$ into the 0-th signal block, at the position $i\mod l$.
  2. Once every $l$ steps (that is when $i\mod l=0$), store the signal also in the next block, at position $i/l$
  3. Apply this procedure recursively up to block $B-1$, dividing $i$ by $l$ at every step

Example of the application of step 3: store the signal in block $b$ when $i\mod l^b=0$, that is every $l$ steps for block 1, every $l^2$ steps for block 2, etc. The sampling times are $0$, $1$, $2$, etc. for block 0, then $0$, $l$, $2l$, etc. for block 1, and so on.

The procedure to compute the correlation blocks is to compute the correlation of every newly added data point with the $l-1$ other values in the same block and with itself (for lag 0). This computation is carried out at the same time that the signal blocks are filled, else past data would have been discarded. This algorithm can thus be executed online, while the simulation is running, or while reading the data file in a single pass.

I do not review the different averaging schemes, as they do not change much to the understanding of the blocking schemes. For this, see [Ramirez2010].

The implementation of a non-averaging blocking scheme (comparable to those in the papers cited above) is provided below in plain Python.

In [7]:
def mylog_corr(data, B=3, l=32):
    cor = np.zeros((B, l))
    val = np.zeros((B, l))
    count = np.zeros(B)
    idx = np.zeros(B)

    for i, new_data in enumerate(data):
        for b in range(B):
            if i % (l**b) !=0:   # only enter block if i modulo l^b == 0
                break
            normed_idx = (i // l**b) % l
            val[b, normed_idx] = new_data  # fill value block
            # correlate block b
            # wait for current block to have been filled at least once
            if i > l**(b+1):          
                for j in range(l):
                    cor[b, j] += new_data*val[b, (normed_idx - j) % l]
                count[b] += 1
    return count, cor

Example with Langevin dynamics

Having an overview of the available correlators, I now present a use case with the Langevin equation. I generate data using a first-order Euler scheme (for brevity, this should be avoided in actual studies) and apply the FFT and the blocking scheme.

The theoretical result $C(\tau) = T e^{-\gamma \tau}$ is also plotted for reference.

In [8]:
N = 400000

x, v = 0, 0
dt = 0.01
interval = 10
T = 2
gamma = 0.1
x_data = []
v_data = []
for i in range(N):
    for j in range(interval):
        x += v*dt
        v += np.random.normal(0, np.sqrt(2*gamma*T*dt)) - gamma*v*dt
    x_data.append(x)
    v_data.append(v)
x_data = np.array(x_data)
v_data = np.array(v_data)
In [9]:
B = 5 # number of blocks
l = 8 # length of the blocks
c, cor = mylog_corr(v_data, B=B, l=l)
In [10]:
for b in range(B):
    t = dt*interval*np.arange(l)*l**b
    plt.plot(t[1:], cor[b,1:]/c[b], color='g', marker='o', markersize=10, lw=2)
    
fft_cor = scipy.signal.fftconvolve(v_data, v_data[::-1])[N-1:]
fft_cor /= (N - np.arange(N))
t = dt*interval*np.arange(N)
plt.plot(t, fft_cor, 'k-', lw=2)

plt.plot(t, T*np.exp(-gamma*t))

plt.xlabel(r'lag $\tau$')
plt.ylabel(r'$C_{vv}(\tau)$')
plt.xscale('log')

Summary

The choice of a correlator will depend on the situation at hand. Allen and Tildesley already mention the tradeoffs between disk and RAM memory requirements and CPU time. The table below reviews the main practical properties of the correlators. "Online" means that the algorithm can be used during a simulation run. "Typical cost" is the "Big-O" number of operations. "Accuracy of data" is given in terms of the number of sampling points for the correlation value $c_j$.

  • Direct: typical cost $N^2$ or $N_\textrm{cor}N$; storage $N_\textrm{cor}$; online use for small $N_\textrm{cor}$; accuracy: $N-j$ points for $c_j$.
  • FFT: typical cost $N\log N$; storage $O(N)$; no online use; accuracy: $N-j$ points for $c_j$.
  • Blocking: typical cost $N\ B$; storage $N\ B$; online use: yes; accuracy: $N/l^b$ points in block $b$.

A distinct advantage of the direct and FFT methods is that they are readily available. If your data is small (up to $N\approx 10^2 - 10^3$) the direct method is a good choice; beyond that, the FFT method will outperform it significantly.

An upcoming version of the SciPy library will even provide the choice of method as a keyword argument to scipy.signal.correlate, making it even more accessible.
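That keyword looks like the following in SciPy versions that ship it (older releases will reject the argument); the data here is arbitrary:

import numpy as np
from scipy.signal import correlate

s = np.random.random(4096) - 0.5
c_fft = correlate(s, s, mode='full', method='fft')        # FFT-based
c_direct = correlate(s, s, mode='full', method='direct')  # direct summation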

Now, for the evaluation of correlation functions in Molecular Dynamics simulations, the blocking scheme is the only practical solution for very long simulations in which both short- and long-time behaviours are of interest. It is soon to arrive in RMPCDMD!

References

  • [Allen1987] M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids (Clarendon Press, 1987).
  • [Frenkel2002] D. Frenkel and B. Smit, Understanding molecular simulation: From algorithms to applications (Academic Press, 2002).
  • [Ramirez2010] J. Ramírez, S. K. Sukumaran, B. Vorselaars and A. E. Likhtman, Efficient on the fly calculation of time correlation functions in computer simulations, J. Chem. Phys. 133 154103 (2010).
  • [Colberg2011] P. H. Colberg and Felix Höfling, Highly accelerated simulations of glassy dynamics using GPUs: Caveats on limited floating-point precision, Comp. Phys. Comm. 182, 1120 (2011).
  • [Kneller1995] Kneller, Keiner, Kneller and Schiller, nMOLDYN: A program package for a neutron scattering oriented analysis of Molecular Dynamics simulations, Comp. Phys. Comm. 91, 191 (1995).

by Pierre de Buyl at September 08, 2016 10:00 AM

September 07, 2016

Continuum Analytics news

Continuum Analytics Teams Up with Intel for Python Distribution Powered by Anaconda

Thursday, September 8, 2016

Offers speed, agility and an optimized Python experience for data scientists
 
AUSTIN, TEXAS—September 8, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, is pleased to announce a technical collaboration with Intel resulting in the Intel® Distribution for Python powered by Anaconda. Intel Distribution for Python powered by Anaconda was recently announced by Intel and will be delivered as part of Intel® Parallel Studio XE 2017 software development suite. With a common distribution for the Open Data Science community that increases Python and R performance up to 100X, Intel has empowered enterprises to build a new generation of intelligent applications that drive immediate business value. Combining the power of the Intel® Math Kernel Library (MKL) and Anaconda’s Open Data Science platform allows organizations to build the high performance analytic modeling and visualization applications required to compete in today’s data-driven economies. 
 
By combining two de facto standards, Intel MKL and Anaconda, into a single performance-boosted Python and R distribution, enterprises can meet and exceed performance targets for next-generation data science applications. The platform includes packages and technology that are accessible for beginner Python developers, however powerful enough to tackle data science projects for big data. Anaconda offers support for advanced analytics, numerical computing, just-in-time compilation, profiling, parallelism, interactive visualization, collaboration and other analytic needs.
 
“While Python has been widely used by data scientists as an easy-to-use programming language, it was often at the expense of performance,” said Mike Lee, technical, enterprise and cloud compute segment manager, developer Products Division at Intel Corporation. “The Intel Distribution for Python powered by Anaconda, provides multiple methods and techniques to accelerate and scale Python applications to achieve near native code performance.”
 
With the out-of-box distribution, Python applications immediately realize gains and can be tuned to optimize performance using the Intel® VTune™ Amplifier performance profiler. Python workloads can take advantage of multi-core Intel architectures and clusters using parallel thread scheduling and efficient communication with Intel MPI and Anaconda Scale through optimized Intel® Performance Libraries and Anaconda packages.
 
“Our focus on delivering high performance data science deployments to enterprise customers was a catalyst for the collaboration with Intel who is powering the smart and connected digital world,” said Michele Chambers, EVP of Anaconda & CMO at Continuum Analytics. “Today’s announcement of Intel’s Python distribution based on Anaconda, illustrates both companies’ commitment to empowering Open Data Science through a common distribution that makes it easy to move intelligent applications from sandboxes to production environments.”

The Intel Distribution for Python powered by Anaconda is designed for everyone from seasoned high-performance developers to data scientists looking to speed up workflows and deliver an easy-to-install, performance-optimized Python experience to meet enterprise needs. The collaboration enables users to accelerate Python performance on modern Intel architectures, adding simplicity and speed to applications through Intel’s performance libraries. This distribution makes it easy to install packages using conda and pip and access individual Intel-optimized packages hosted on Anaconda Cloud through conda.
 
Features include:

  • Anaconda Distribution that has been downloaded over 3M times and is the de facto standard Python distribution for Microsoft Azure ML and Cloudera Hadoop
  • Intel Math Kernel performance-accelerated Python computation packages like NumPy, SciPy, scikit-learn
  • Anaconda Scale, which makes it easy to parallelize workloads using Directed Acyclic Graphs (DAGs)

Intel Distribution for Python powered by Anaconda is delivered as part of the Intel Parallel Studio XE 2017. The new distribution is available for free and includes forum support. For more information, please visit https://software.intel.com/en-us/intel-distribution-for-python. For additional information about Anaconda, please visit: https://www.continuum.io/anaconda-overview and the Continuum Analytics’ Partner Program, visit https://www.continuum.io/partners. Also, check out this blog by Intel’s Mike Lee: https://www.continuum.io/blog/developer-blog/democratization-compute-intel.
 
About Intel
Intel (NASDAQ: INTC) expands the boundaries of technology to make the most amazing experiences possible. Information about Intel can be found at newsroom.intel.com and intel.com.

Intel, the Intel logo, Core, and Ultrabook are trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
 
About Continuum Analytics
Continuum Analytics is the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world.
 
With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries––financial services, government, health & life sciences, technology, retail & CPG, oil & gas––to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations.
 
Our community loves Anaconda because it empowers the entire data science team––data scientists, developers, DevOps, data engineers and business analysts––to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.
 
To learn more about Continuum Analytics, visit www.continuum.io.
 
###
 
Media Contact:
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

by swebster at September 07, 2016 06:04 PM

September 03, 2016

Randy Olson

Python 2.7 still reigns supreme in pip installs

The Python 2 vs. Python 3 divide has long been a thorn in the Python community’s side. On one hand, Python package developers face the challenge of supporting two incompatible versions of Python, which is time that could be better

by Randy Olson at September 03, 2016 11:04 PM

September 01, 2016

Continuum Analytics news

What’s Old and New with Conda Build

Thursday, September 1, 2016
Michael Sarahan
Continuum Analytics

Conda build 2.0 has just been released. This marks an important evolution towards much greater test coverage and a stable API. With this release, it’s a great time to revisit some of conda-build’s features and best practices.

Quick Recap of Conda Build

Fundamentally, conda build is a tool to help you package your software so that it is easy to distribute using the conda package manager. Conda build takes input in the form of yaml files and shell/batch scripts (“recipes”), and outputs conda packages. Conda build also includes utilities for quickly generating recipes from external repositories, such as PyPI, CPAN, or CRAN.

During each build process, conda build goes through 4 phases: rendering, building, post-processing/packaging, and testing.

  • Rendering takes your input meta.yaml file, fills in any jinja2 templates, and applies selectors. The end result is a python object of the MetaData class, as defined in metadata.py. Source code for your package may be downloaded during rendering if it is necessary to provide information for rendering the recipe (for example, if the version is obtained from source, rather than provided in meta.yaml).
  • The build step creates the build environment (also called the build “prefix”), and runs the build.sh (Linux/Mac) or bld.bat (Windows) scripts.
  • Post-processing looks at which files in the build prefix are new - ones that were not there when the build prefix was created. These are the files that are packaged up into the .tar.bz2 file. Other inspection tasks, such as detecting files containing prefixes that need replacement at install time, are also done in the post-processing phase.
  • Finally, the test phase creates a test environment, installs the created package, and runs any tests that you specify, either in meta.yaml, or in run_test.bat (Windows), run_test.sh (Linux, Mac), or run_test.py (all platforms).

Meta.yaml

Meta.yaml is the core of any conda recipe. It describes the package’s name, version, source location, and build/test environment specifications. Full documentation on meta.yaml is at http://conda.pydata.org/docs/building/meta-yaml.html.

Let’s step through the options available to you. We’ll mention Jinja2 templating and selectors a few times in here. If you’re not familiar with these, just ignore them for now. These are described in much greater detail at the end of the article.

Software Sources

Conda build will happily obtain source code from local filesystems, http/https URLs, git repositories, mercurial repositories, and subversion repositories. Syntax for each of these is described at http://conda.pydata.org/docs/building/meta-yaml.html#source-section.

Presently, Jinja2 template variables are populated only for git and mercurial repositories. These are described at http://conda.pydata.org/docs/building/environment-vars.html. Future work will add Jinja2 template variables for the remaining version control systems.

As a general guideline, use tarballs (http/https URLs) with hashes (SHA preferably) where available. Version control system (VCS) tags can be moved to other commits, and your packages are less guaranteed to be repeatable. Failing this, using VCS hash values is also highly repeatable. Finally, with tarballs, it is better to paste a hash provided by your download site than it is to compute it yourself.  If the download site does not provide one, you can compute a hash with openssl. Openssl is a requirement of miniconda so it is already available in every conda environment.

openssl dgst -sha256 <path to file>

 

Build Options

The “build” section of meta.yaml includes options that change some build-related options in conda build. Here you can skip certain platforms, control prefix replacement, exclude the recipe from being packaged, add entry points, and more.

Requirements

In the requirements section, you define conda packages that should be installed before build, and before running your package. It is important to list your requirements for build here, because conda build does not allow you to download requirements using pip. This restriction ensures that builds are easier to reproduce. If you are missing dependencies and pip tries to install them, you will see a traceback.

When you need a particular version of something, you can apply version constraints to your specification. This is often called “pinning.” There are 3 kinds of pinning: exact, “globbing,” and boolean logic. Each pinning is an additional string after the package specification in meta.yaml. For example:

requirements:
    build:
        - python   2.7.12

 

For exact pinning, you specify the exact version you want. This should be used sparingly, as it can quickly make your package over-constrained and hard to install. Globbing uses the * character to allow any sub-version to be installed. For example, with semantic versioning, to allow bug fix releases, one could specify a version such as 1.2.* - no major or minor releases allowed. Not all packages use semantic versioning, though. Finally, boolean expressions of versions are valid. To allow a range of versions, you can use pinnings such as >=1.6.21,<1.7.

There are some packages that need to be defined in a special way. For example, packages that compile with NumPy’s C API need the same version of NumPy at runtime that was used at build time. If your package uses NumPy via Cython, or if any part of your extension code includes numpy.h, then this probably applies to you. The special syntax for NumPy is:

requirements: 
  build:
    - numpy x.x
  run:
    - numpy x.x 

 

There is a lot of discussion around extending this to other packages, because it is common with compiled code to have build time versions determine runtime compatibility. This discussion is active at https://github.com/conda/conda-build/issues/1142 and is slated for the next major conda-build release.

Build strings—that little bit of text in your output package name, like np110py27—are determined by default by the contents of your run requirements. You can change the build string manually in meta.yaml, but doing so disables conda build’s automatic generation.

Test

Testing occurs by default automatically after building the package. If the tests fail, the package is moved into the “broken” folder, rather than the normal output folder for your platform.

Tests have been confusing for many people for some time. If your package did not include the test files, it was difficult to figure out how to get your tests to run. Conda build 2.0 adds a new key to the test section, “source_files,” that accepts a list of files and/or folders that will be copied from your source folder into your test folder at test time. These specifications are done with Python’s glob, so any glob pattern will work.

test:
  source_files:
    - tests
    - some_important_test_file.txt
    - data/*.h5

 

Selectors

Selectors are used to limit part of your meta.yaml file. Selectors exist for Python version, platform, and architecture. Selectors are parsed and applied after jinja2 templates, so you may use jinja2 templates for more dynamic selectors. The full list of available selectors is at http://conda.pydata.org/docs/building/meta-yaml.html#preprocessing-selectors.

Jinja Templating

Templates are not a new feature, but they are not always well understood. Templates are placeholders that are dynamically filled with content when your recipe is loaded by conda build. They are heavily used at conda-forge, where they make updating recipes easier:

  {% set version="1.0.0" %}

  package:
    name: my_test_package
    version: {{ version }}

  source:
    url: http://some.url/package-{{ version }}.tar.gz

Using templates this way means that you only have to change the version once, and it applies to multiple places. Jinja templates also support running Python code to do interesting things, such as getting versions from a setup.py file:

    {% set data = load_setup_py_data() %}

    package:
      name: conda-build-test-source-setup-py-data
      version: {{ data.get('version') }}

    # source will be downloaded prior to filling in jinja templates
    # Example assumes that this folder has setup.py in it
    source:
      path_url: ../

 

The Python code that is actually reading the setup.py file (load_setup_py_data) is part of conda build (jinja_context.py). Presently, we do not have an extension scheme. That will be part of future work, so that users can customize their recipes with their own Python functions.

Binary Prefix Length

A somewhat esoteric aspect of relocatability is that binaries on Linux and Mac have prefix strings embedded in them that tell the binary where to go look for shared libraries. At build time, conda build detects these prefixes, and makes a note of where they are. At install time, conda uses that list to replace those prefixes with the appropriate prefix for the new environment that it is installing into. Historically, the length of these embedded prefixes has been 80 characters. Conda build 2.0 increases this length to 255 characters. Unfortunately, to fully take advantage of this change, all packages that would be installed into an environment need to have been built by conda build 2.0 to have the longer prefix. In practice, this means rebuilding many of the lower-level dependencies. To aid in this effort, conda build has added a tool:

   conda inspect prefix-lengths <package path> [more packages] [--min-prefix-length <value, default 255>]

 

More concretely:

   conda inspect prefix-lengths ~/miniconda2/pkgs/*.tar.bz2

 

This is presently not relevant to Windows, though conda build does now record binary prefixes on Windows, especially for pip-created entry point executables, so that they can function correctly. These entry point executables consist of a program, the prefix, and the entry point script all rolled into a single executable. The prefix length does not matter, because the binary can simply be recreated with any arbitrary prefix by concatenating the pieces together.

Conda Build API

Finally, the other large feature of conda build 2.0 has been the creation of a public API. This is a promise to our users that the interfaces will not change without a bump to the major version number. It is also an opportunity to divide the command line interface into smaller, more testable chunks. The CLI will still be available and users will now have the API as a different, more guaranteed-stable option. The full API is at https://github.com/conda/conda-build/blob/master/conda_build/api.py.

A quick mapping of legacy CLI commands to interesting api functions is the following:

  • conda build → api.build
  • conda build --output → api.get_output_file_path
  • conda render → api.output_yaml
  • conda sign → api.sign, api.verify, api.keygen, api.import_sign_key
  • conda skeleton → api.skeletonize; api.list_skeletons
  • conda develop → api.develop
  • conda inspect → api.test_installable; api.inspect_linkages; api.inspect_objects; api.inspect_prefix_length
  • conda index → api.update_index
  • conda metapackage → api.create_metapackage
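As a minimal sketch of the API in use (the recipe path is a placeholder, and the exact signatures are defined in api.py, linked above):

from conda_build import api

recipe_path = 'path/to/my-recipe'             # directory containing meta.yaml

print(api.get_output_file_path(recipe_path))  # where the built package will land
api.build(recipe_path)                        # render, build, package and test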

Implementation Details of Potential Interest

  • Non-global Config: conda build 1.x used a global instance of the conda_build.config.Config class. This has been replaced by passing a local Config instance across all system calls. This allows for more direct customization of api calls, and obviates the need to create ArgParse namespace objects to interact with conda-build.

  • Build id and Build folder: conda build 1.x stored environments with other conda environments, and stored the build “work” folder and test work (test_tmp) folder in the conda-bld folder (by default). Conda-build 2.0 assigns a build id to each build, consisting of the recipe name joined with the number of milliseconds since the epoch. While a name collision is theoretically possible here, it should be unlikely. Both the environments and the work folders have moved into folders named with the build id. Each build is thus self-contained, and multiple builds can run at once (in separate processes). The monotonically increasing build ids facilitate reuse of source with the “--dirty” build option.

by swebster at September 01, 2016 05:20 PM

Introducing GeoViews

Thursday, September 1, 2016
Jim Bednar
Continuum Analytics

Philipp Rudiger
Continuum Analytics


GeoViews is a new Python library that makes it easy to explore and visualize geographical, meteorological, oceanographic, weather, climate, and other real-world data. GeoViews was developed by Continuum Analytics, in collaboration with the Met Office. GeoViews is completely open source, available under a BSD license freely for both commercial and non-commercial use, and can be obtained as described at the Github site.

GeoViews is built on the HoloViews library for building flexible visualizations of multidimensional data. GeoViews adds a family of geographic plot types, transformations, and primitives based primarily on the Cartopy library, plotted using either the Matplotlib or Bokeh packages. GeoViews objects are just like HoloViews objects, except that they have an associated geographic projection based on cartopy.crs. For instance, you can overlay temperature data with coastlines using simple expressions like gv.Image(temperature)*gv.feature.coastline, and easily embed these in complex, multi-figure layouts and animations using both GeoViews and HoloViews primitives, while always being able to access the raw data underlying any plot.

This post shows you how GeoViews makes it simple to use point, path, shape, and multidimensional gridded data on both geographic and non-geographic coordinate systems.

import numpy as np
import xarray as xr
import pandas as pd
import holoviews as hv
import geoviews as gv
import iris
import cartopy

from cartopy import crs
from cartopy import feature as cf
from geoviews import feature as gf

hv.notebook_extension('bokeh','matplotlib')
%output backend='matplotlib'
%opts Feature [projection=crs.Robinson()]

Built-in geographic features

GeoViews provides a library of basic features based on cartopy.feature that are useful as backgrounds for your data, to situate it in a geographic context. Like all HoloViews Elements (objects that display themselves), these GeoElements can easily be laid out alongside each other using '+' or overlaid together using '*':

gf.coastline + gf.ocean + gf.ocean*gf.land*gf.coastline

Other Cartopy features not included by default can be used similarly by explicitly wrapping them in a gv.Feature GeoViews Element object:

%%opts Feature.Lines (facecolor='none' edgecolor='gray')
graticules = gv.Feature(cf.NaturalEarthFeature(category='physical', name='graticules_30',scale='110m'), group='Lines')
graticules

The '*' operator used above is a shorthand for hv.Overlay, which can be used to show the full set of feature primitives provided:

%%output size=450
features = hv.Overlay([gf.ocean, gf.land, graticules, gf.rivers, gf.lakes, gf.borders, gf.coastline])
features

Projections

GeoViews allows incoming data to be specified in any coordinate system supported by Cartopy's crs module. This data is then transformed for display in another coordinate system, called the Projection. For instance, the features above are displayed in the Robinson projection, which was declared at the start of the notebook. Some of the other available projections include:

projections = [crs.RotatedPole, crs.Mercator, crs.LambertCylindrical, crs.Geostationary, 
               crs.Gnomonic, crs.PlateCarree, crs.Mollweide, crs.Miller, 
               crs.LambertConformal, crs.AlbersEqualArea, crs.Orthographic, crs.Robinson]

When using matplotlib, any of the available coordinate systems from cartopy.crs can be used as output projections, and we can use hv.Layout (what '+' is shorthand for) to show each of them:

hv.Layout([features.relabel(group=p.__name__)(plot=dict(projection=p()))
           for p in projections]).display('all').cols(3)

The Bokeh backend currently only supports a single output projection type, Web Mercator, but as long as you can use that projection, it offers full interactivity, including panning and zooming to see detail (after selecting tools using the menu at the right of the plot):

%%output backend='bokeh'
%%opts Overlay [width=600 height=500 xaxis=None yaxis=None] Feature.Lines (line_color='gray' line_width=0.5)
features

Tile Sources

As you can see if you zoom in closely to the above plot, the shapes and outlines are limited in resolution, due to the need to have relatively small files that can easily be downloaded to your local machine. To provide more detail when needed for zoomed-in plots, geographic data is often divided up into separate tiles that can be downloaded individually and then combined to cover the required area. GeoViews lets you use any tile provider supported by Matplotlib (via cartopy) or Bokeh, which lets you add image or map data underneath any other plot. For instance, different sets of tiles at an appropriate resolution will be selected for this plot, depending on the extent selected:

%%output dpi=200
url = 'https://map1c.vis.earthdata.nasa.gov/wmts-geo/wmts.cgi'
gv.WMTS(url, layer='VIIRS_CityLights_2012', crs=crs.PlateCarree(), extents=(0, -60, 360, 80))

Tile servers are particularly useful with the Bokeh backend, because the data required as you zoom in isn't requested until you actually do the zooming, which allows a single plot to cover the full range of zoom levels provided by the tile server.

%%output backend='bokeh'
%%opts WMTS [width=450 height=250 xaxis=None yaxis=None]

from bokeh.models import WMTSTileSource
from bokeh.tile_providers import STAMEN_TONER

tiles = {'OpenMap': WMTSTileSource(url='http://c.tile.openstreetmap.org/{Z}/{X}/{Y}.png'),
         'ESRI': WMTSTileSource(url='https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg'),
         'Wikipedia': WMTSTileSource(url='https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png'),
         'Stamen Toner': STAMEN_TONER}

hv.NdLayout({name: gv.WMTS(wmts, extents=(0, -90, 360, 90), crs=crs.PlateCarree())
            for name, wmts in tiles.items()}, kdims=['Source']).cols(2)

If you select the "wheel zoom" tool in the Bokeh tools menu at the upper right of the above figure, you can use your scroll wheel to zoom into all of these plots at once, comparing the level of detail available at any location for each of these tile providers. Any WMTS tile provider that accepts URLs with an x and y location and a zoom level should work with Bokeh; you can find more at openstreetmap.org.

Point data

Bokeh, Matplotlib, and GeoViews are mainly intended for plotting data, not just maps, and so the above tile sources and cartopy features are typically in the background of the actual data being plotted. When there is a data layer, the extent of the data will determine the extent of the plot, and so extent will not need to be provided explicitly as in the previous examples.

The simplest kind of data to situate geographically is point data: longitude and latitude coordinates for locations on the Earth's surface. GeoViews makes it simple to overlay such plots onto Cartopy features, tile sources, or other geographic data. For instance, let's load a dataset of all the major cities in the world with their population counts over time:

cities = pd.read_csv('./assets/cities.csv', encoding="ISO-8859-1")
population = gv.Dataset(cities, kdims=['City', 'Country', 'Year'])
cities.tail()
City Country Latitude Longitude Year Population
10025 Valencia Venezuela (Bolivarian Republic of) 10.17 -68.00 2050.0 2266000.0
10026 Al-Hudaydah Yemen 14.79 42.94 2050.0 1854000.0
10027 Sana'a' Yemen 15.36 44.20 2050.0 4382000.0
10028 Ta'izz Yemen 13.57 44.01 2050.0 1743000.0
10029 Lusaka Zambia -15.42 28.17 2050.0 2047000.0

Now we can convert this text-based dataset to a set of visible points mapped by the latitude and longitude, and containing the population, country, and city name as values. The longitudes and latitudes in the dataframe are supplied in simple Plate Carree coordinates, which we will need to declare explicitly, since each value is just a number with no inherently associated units. The .to conversion interface lets us do this succinctly, giving us points that are instantly visualizable either on their own or in a geographic context:

cities = population.to(gv.Points, kdims=['Longitude', 'Latitude'],
                    vdims=['Population', 'City', 'Country'], crs=crs.PlateCarree())
%%output backend='bokeh'
%%opts Overlay [width=600 height=300 xaxis=None yaxis=None] 
%%opts Points (size=0.005 cmap='inferno') [tools=['hover'] color_index=2]
gv.WMTS(WMTSTileSource(url='https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png')) * cities

Note that since we did not assign the Year dimension to the points key or value dimensions, it is automatically assigned to a HoloMap, rendering the data as an animation using a slider widget. Because this is a Bokeh plot, you can also zoom into any location to see more geographic detail, which will be requested dynamically from the tile provider (try it!). The matplotlib version of the same plot, using the Cartopy Ocean feature to provide context, provides a similar widget to explore the Year dimension not mapped onto the display:

%%output size=200
%%opts Feature [projection=crs.PlateCarree()]
%%opts Points (s=0.000007 cmap='inferno') [color_index=2]
gf.ocean * cities[::4]

However, the matplotlib version doesn't provide any interactivity within the plot; the output is just a series of PNG images encoded into the web page, with one image selected for display at any given time using the Year widget.

Shapes

Points are zero-dimensional objects with just a location. It is also important to be able to work with one-dimensional paths (such as borders and roads) and two-dimensional areas (such as land masses and regions). GeoViews provides special GeoElements for paths (gv.Path) and polygons (gv.Polygons). The GeoElement types are extensions of the basic HoloViews Elements hv.Path and hv.Polygons that add support for declaring geographic coordinate systems via the crs parameter and support for choosing the display coordinate system via the projection parameter. Like their HoloViews equivalents, gv.Path and gv.Polygons accept lists of NumPy arrays or Pandas dataframes, which is good for working with low-level data.
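
As a small illustrative sketch (the coordinates below are invented, and the same imports as the rest of this post plus NumPy as np are assumed), a path built from a single NumPy array might look like:

lons = np.linspace(130, 140, 50)                    # made-up longitudes
lats = -25 + 2*np.sin(np.linspace(0, 3, 50))        # made-up latitudes
route = gv.Path([np.column_stack([lons, lats])], kdims=['Longitude', 'Latitude'],
                crs=crs.PlateCarree())
route * gf.coastline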

In practice, the higher-level GeoElements gv.Shape (which wraps around a shapely shape object) and gv.Feature (which wraps around a Cartopy Feature object) are more convenient, because they make it simple to work with large collections of shapes. For instance, the various features like gv.ocean and gv.coastline introduced above are gv.Feature types, based on cartopy.feature.OCEAN and cartopy.feature.COASTLINE, respectively.

We can easily access the individual shapely objects underlying these features if we want to work with them separately. Here we will get the geometry corresponding to the Australian continent and display it using shapely's inbuilt SVG repr (not yet a HoloViews plot, just a bare SVG displayed by Jupyter directly):

land_geoms = list(gf.land.data.geometries())
land_geoms[21]

Instead of letting shapely render it as an SVG, we can now wrap it in the gv.Shape object and let matplotlib or bokeh render it, alone or with other GeoViews or HoloViews objects:

%%opts Points (color="black")
%%output dpi=120
australia = gv.Shape(land_geoms[21], crs=crs.PlateCarree())

australia * hv.Points([(133.870,-23.700)]) * hv.Text(133.870,-21.5, 'Alice Springs')

The above plot uses HoloViews elements (notice the hv. prefix), which carry no information about coordinate systems, and so the plot works properly only because the data happens to be in PlateCarree coordinates (bare longitude and latitude values). You can use other projections safely as long as you specify the coordinate system for the Text and Points objects explicitly, which requires using GeoViews elements (gv. prefix):

%%opts Points (color="black")
pc=crs.PlateCarree()
australia(plot=dict(projection=crs.Mollweide(central_longitude=133.87))) * \
gv.Points([(133.870,-23.700)],crs=pc) * gv.Text(133.870,-21.5, 'Alice Springs',crs=pc)

You can see why the crs parameter is important if you change the above cell to omit ,crs=pc from gv.Points and gv.Text; neither the dot nor the text label will then be in the correct location, because they won't be transformed to match the Mollweide projection used for the rest of the plot.

Multiple shapes can be combined into an NdOverlay object either explicitly:

%output dpi=120 size=150
%%opts NdOverlay [aspect=2]
hv.NdOverlay({i: gv.Shape(s, crs=crs.PlateCarree()) for i, s in enumerate(land_geoms)})

or by loading a collection of shapes from a shapefile, such as this collection of UK electoral district boundaries:

%%opts NdOverlay [aspect=0.75]
shapefile='./assets/boundaries/boundaries.shp'
gv.Shape.from_shapefile(shapefile, crs=crs.PlateCarree())

One common use for shapefiles is creating choropleth maps, where each part of the geometry is assigned a value that will be used to color it. Constructing a choropleth by combining shapes one by one takes a lot of effort and is error prone, but it is straightforward when using a shapefile that assigns standardized codes to each shape. For instance, the shapefile for the above UK plot assigns a well-defined geographic code to each electoral district's MultiPolygon shapely object:

shapes = cartopy.io.shapereader.Reader(shapefile)
list(shapes.records())[0]
<Record: <shapely.geometry.multipolygon.MultiPolygon object at 0x11786a0d0>, {'code': 'E07000007'}, <fields>>

To make a choropleth map, we just need a dataset with values indexed using these same codes, such as this dataset of the 2016 EU Referendum result in the UK:

referendum = pd.read_csv('./assets/referendum.csv')
referendum = hv.Dataset(referendum)
referendum.data.head()
leaveVoteshare regionName turnout name code
0 4.100000 Gibraltar 83.500000 Gibraltar BS0005003
1 69.599998 North East 65.500000 Hartlepool E06000001
2 65.500000 North East 64.900002 Middlesbrough E06000002
3 66.199997 North East 70.199997 Redcar and Cleveland E06000003
4 61.700001 North East 71.000000 Stockton-on-Tees E06000004

To make it simpler to match up the data with the shape files, you can use the .from_records method of the gv.Shape object to build a gv.Shape overlay that automatically merges the data and the shapes to show the percentage of each electoral district who voted to leave the EU:

%%opts NdOverlay [aspect=0.75] Shape (cmap='viridis')
gv.Shape.from_records(shapes.records(), referendum, on='code', value='leaveVoteshare',
                     index=['name', 'regionName'], crs=crs.PlateCarree())

As usual, the matplotlib output is static, but the Bokeh version of the same data is interactive, allowing both zooming and panning within the geographic area and revealing additional data such as the county name and numerical values when hovering over each shape:

%%output backend='bokeh'
%%opts Shape (cmap='viridis') [xaxis=None yaxis=None tools=['hover'] width=400 height=500]
gv.Shape.from_records(shapes.records(), referendum, on='code', value='leaveVoteshare',
                     index='name', crs=crs.PlateCarree(), group='EU Referendum')

As you can see, essentially the same code as was needed for the static Matplotlib version now provides a fully interactive view of this dataset.

For the remaining sections, let's set some default parameters:

%opts Image [colorbar=True] Curve [xrotation=60] Feature [projection=crs.PlateCarree()]
hv.Dimension.type_formatters[np.datetime64] = '%Y-%m-%d'
%output dpi=100 size=100

In this blog post we use only a limited number of frames and small plot sizes, to avoid bloating the web page with too much data, but when working on a live server you can append widgets='live' to the %output line above. In live mode, plots are rendered dynamically using Python based on user interaction, which allows agile exploration of large, multidimensional parameter spaces without having to precompute a fixed set of plots.
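
For example, on a live server the output line above might instead read:

%output dpi=100 size=100 widgets='live'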

Gridded data

In addition to point, path, and shape data, GeoViews is designed to make full use of multidimensional gridded (raster) datasets, such as those produced by satellite sensing, systematic land-based surveys, and climate simulations. This data is often stored in netCDF files that can be read into Python with the xarray and Iris libraries. HoloViews and GeoViews can use data from either library in all of its objects, along with NumPy arrays, Pandas data frames, and Python dictionaries. In each case, the data can be left stored in its original, native format, wrapped in a HoloViews or GeoViews object that provides instant interactive visualizations.

To get started, let's load a dataset originally taken from iris-sample-data, containing surface temperature data indexed by 'longitude', 'latitude', and 'time':

xr_dataset = gv.Dataset(xr.open_dataset('./sample-data/ensemble.nc'), crs=crs.PlateCarree(), 
                        kdims=['latitude','longitude','time'],
                        vdims=['surface_temperature'])
xr_dataset
:Dataset   [latitude,longitude,time]   (surface_temperature)

Here there is one "value dimension", i.e. surface temperature, whose value can be obtained for any combination of the three "key dimensions" (coordinates) longitude, latitude, and time.
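
For instance, the declared dimensions can be inspected directly on the object, which should list the three key dimensions and the single value dimension declared above:

print(xr_dataset.kdims)
print(xr_dataset.vdims)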

We can quickly build an interactive tool for exploring how this data changes over time:

surf_temp = xr_dataset.to(gv.Image, ['longitude', 'latitude']) * gf.coastline
surf_temp[::2]

Here the slider for 'time' was generated automatically, because we instructed HoloViews to lay out two of the key dimensions as x and y coordinates in an Image when we called .to(), with the value dimension 'surface_temperature' mapping onto the color of the image pixels by default, but we did not specify what should be done with the remaining 'time' key dimension. HoloViews is designed to make everything visualizable somehow, so it automatically generates a slider to cover this "extra" dimension, allowing you to explore how the surface_temperature values change over time. In a static page like this blog post each frame is embedded into the page, whereas in a live Jupyter notebook it is trivial to explore large datasets and render each frame dynamically.

You could instead tell HoloViews to lay out the remaining dimension spatially (as faceted plots), in which case the slider would disappear because there is no remaining dimension to explore. As an example, let's take a subset of the frames and lay them out spatially:

surf_temp[::2].layout()

Normalization

By default, HoloViews will normalize items displayed together as frames in a slider or animation, applying a single colormap across all items of the same type sharing the same dimensions, so that differences are clear. In this particular dataset, the range changes relatively little, so that even if we turn off such normalization in layouts (or in animation frames using {+framewise}) the results are similar:

%%opts Image {+axiswise}
surf_temp[::2].layout()

Here you can see that each frame has a different range in the color bar, but it's a subtle effect. If we really want to highlight changes over a certain range of interest, we can set explicit normalization limits. For this data, let's find the maximum temperature in the dataset, and use it to set a normalization range by using the redim method:

max_surface_temp = xr_dataset.range('surface_temperature')[1]
print(max_surface_temp)
xr_dataset.redim(surface_temperature=dict(range=(300, max_surface_temp))).\
  to(gv.Image,['longitude', 'latitude'])[::2] * gf.coastline(style=dict(edgecolor='white')) 
317.331787109

Now we can see a clear cooling effect over time, as the yellow and white areas close to the top of the normalization range (317K) vanish in the Americas and in Africa. Values outside this range are clipped to the ends of the color map.

Non-Image views of gridded data

gv.Image Elements are a common way to view gridded data, but the .to() conversion interface supports other types as well, such as filled or unfilled contours and points:

%%output size=100 dpi=100
%%opts Points [color_index=2 size_index=None] (cmap='jet')
hv.Layout([xr_dataset.to(el,['longitude', 'latitude'])[::5, 0:50, -30:40] * gf.coastline
           for el in [gv.FilledContours, gv.LineContours, gv.Points]]).cols(3)

Non-geographic views of gridded data

So far we have focused entirely on geographic views of gridded data, plotting the data on a projection involving longitude and latitude. However, the .to() conversion interface is completely general, allowing us to slice and dice the data in any way we like. To illustrate this, let's load an expanded version of the above surface temperature dataset that adds an additional 'realization' dimension.

kdims = ['realization', 'longitude', 'latitude', 'time']

xr_ensembles = xr.open_dataset('./sample-data/ensembles.nc')
dataset = gv.Dataset(xr_ensembles, kdims=kdims, vdims=['surface_temperature'], crs=crs.PlateCarree())
dataset
:Dataset   [realization,longitude,latitude,time]   (surface_temperature)

The realization is effectively a certain set of modelling parameters that leads to different predicted values for the temperatures at given times. We can see this clearly if we map the data onto a temperature versus time plot:

%%output backend='bokeh'
%%opts Curve [xrotation=25] NdOverlay [width=600 height=400 legend_position='left']
sliced = dataset.select(latitude=(0, 5), longitude=(0,10))
sliced.to(hv.Curve, 'time').overlay('realization')

Here there is no geographic organization to the visualization, because we selected non-geographic coordinates to display. Just as before, the key dimensions not selected for display have become sliders, but in this case the leftover dimensions are longitude and latitude. (Realization would also be left over and thus generate a slider, if it hadn't been mapped onto an overlay above.)

Because this is a static web page, we selected only a small portion of the data to be available in the above plot, i.e. all data points with latitude in the range (0, 5) and longitude in the range (0, 10). If this code were running on a live Python server, one could instead access all the data dynamically:

hv.Layout([dataset.to(hv.Curve, 'time', dynamic=True).overlay('realization')])

We can also make non-geographic 2D plots, for instance as a HeatMap over time and realization, again at a specified longitude and latitude:

%%opts HeatMap [show_values=False colorbar=True]
sliced.to(hv.HeatMap, ['realization', 'time'])

In general, any HoloViews Element type (of which there are many!) can be used for non-geographic dimensions selected in this way, while any GeoViews GeoElement type can be used for geographic data.

Reducing and aggregating gridded data

So far all the conversions shown have incorporated each of the available coordinate dimensions, either explicitly as dimensions in the plot, or implicitly as sliders for the leftover dimensions. However, instead of revealing all the data individually in this way, we often want to see the spread of values along one or more dimensions, pooling all the other dimensions together.

A simple example of this is a box plot where we might want to see the spread of surface_temperature on each day, pooled across all latitude and longitude coordinates. To pool across particular dimensions, we can explicitly declare the "map" dimensions, which are the key dimensions of the HoloMap container rather than those of the Elements contained in the HoloMap. By declaring an empty list of mdims, we can tell the conversion interface '.to()' to pool across all dimensions except the particular key dimension(s) supplied, in this case the 'time' (plot A) and 'realization' (plot B):

%%opts BoxWhisker [xrotation=25 bgcolor='w']
hv.Layout([dataset.to.box(d, mdims=[]) for d in ['time', 'realization']])

This approach also gives us access to other statistical plot types. For instance, with the seaborn library installed, we can use the Distribution Element, which visualizes the data as a kernel density estimate. In this way we can visualize how the distribution of surface temperature values varies over time and across the model realizations. We do this by omitting 'latitude' and 'longitude' from the list of mdims, generating a lower-dimensional view into the data, where a temperature distribution is shown for every 'realization' and 'time' using a GridSpace:

%opts GridSpace [shared_xaxis=True fig_size=150] 
%opts Distribution [bgcolor='w' show_grid=False xticks=[220, 300]]
import seaborn

dataset.to.distribution(mdims=['realization', 'time']).grid()

Selecting a particular coordinate

To examine one particular coordinate, we can select it, cast the data to Curves, reindex the data to drop the now-constant latitude and longitude dimensions, and overlay the remaining 'realization' dimension:

%%opts NdOverlay [xrotation=25 aspect=1.5 legend_position='right' legend_cols=2] Curve (color=Palette('Set1'))
dataset.select(latitude=0, longitude=0).to(hv.Curve, ['time']).reindex().overlay()

Aggregating coordinates

Another option is to aggregate over certain dimensions, so that we can get an idea of distributions of temperatures across all latitudes and longitudes. Here we compute the mean temperature and standard deviation by latitude and longitude, casting the resulting collapsed view to a Spread Element:

%%output backend='bokeh'
lat_agg = dataset.aggregate('latitude', np.mean, np.std)
lon_agg = dataset.aggregate('longitude', np.mean, np.std)
hv.Spread(lat_agg) * hv.Curve(lat_agg) + hv.Spread(lon_agg) * hv.Curve(lon_agg)

As you can see, with GeoViews and HoloViews it is very simple to select precisely which aspects of complex, multidimensional datasets you want to focus on. See holoviews.org and geo.holoviews.org to get started!

by ryanwh at September 01, 2016 02:48 PM

August 31, 2016

Paul Ivanov

Jupyter's Gravity

I'm switching jobs.

For the past two years I've been working with the great team at Disqus as a member of the backend and data teams. Before that, I spent a half-dozen years mostly not working on my thesis at UC Berkeley but instead contributing to the scientific Python ecosystem, especially matplotlib, IPython, and the IPython notebook, which is now called Jupyter. So when Bloomberg reached out to me with a compelling position to work on those open-source projects again from their SF office, such a tremendous opportunity was hard to pass up. You could say Jupyter has a large gravitational pull that's hard to escape, but you'd be a huge nerd. ;)

I have a lot to catch up on, but I'm really excited and looking forward to contributing on these fronts again!

by Paul Ivanov at August 31, 2016 07:00 AM

August 25, 2016

Continuum Analytics news

Celebrating U.S. Women's Equality Day with Women in Tech

Thursday, August 25, 2016

August 26 is recognized as Women's Equality Day in the United States, celebrating the addition of the 19th Amendment to the Constitution in 1920, which granted women the right to vote. This amendment was the culmination of an immense movement in women's rights, dating all the way back to the first women's rights convention in Seneca Falls, New York, in 1848. 

To commemorate this day, we decided to reach out to influential, successful and all around superstar women in technology to ask them one question: 

If women were never granted the right to vote, how do you think the landscape of women in STEM would be different?

Katy Huff, @katyhuff 

"If women were never granted the right to vote, I think it's fair to say that other important movements on the front lines of women's rights would not have followed either. Without that basic recognition of equality -- the ability to participate in democracy -- would we have ever seen Title VII of the Civil Rights Act (1964) or Title IX of the Education Amendments (1972)? Surely not. And without them, women could legally be discriminated against when seeking an education and then again later when seeking employment. There wouldn't merely be a minority of women in tech (as is currently the case) - there would be a super-minority. If there were any women at all able to compete for these lucrative jobs, that tiny minority could legally be paid less than their colleagues and looked upon as second class citizens without any more voice in the workplace than in their own democracy."

Renee M. P. Teate, @BecomingDataSci

"If women were never granted the right to vote in the U.S., the landscape of women in STEM would be very different, because the landscape of our entire country would be different. Voting is a basic right in a democracy, and it is important to allow citizens of all races, sexes/genders, religions, wealth statuses, and backgrounds to participate in electing our leaders, and therefore shaping the laws of our country. When anyone is excluded from participating, they are not represented and can be more easily marginalized or treated unfairly under the law.

The 19th amendment gave women not only a vote and a voice, but "full legal status" as citizens. That definitely impacts our roles in the workplace and in STEM, because if the law doesn't treat you as a whole and valued participant, you can't expect peers or managers to, either. Additionally, if the law doesn't offer equal protection to everyone, discrimination would run (even more) rampant and there might be no legal recourse for incidents such as sexual harassment in the workplace.

A celebration of women is important within STEM fields, because it wasn't long ago that women were not seen as able to be qualified for many careers in STEM, including roles hired by public/governmental organizations like NASA that are funded by taxpayers and report to our elected officials. Even today, there are many prejudices against women, including beliefs by some that women are inferior at performing jobs such as computer programming and scientific research. There are also institutional biases in both our educational system and the workplace that we still need to work on. When women succeed despite these additional barriers (not to mention negative comments by unsupportive people and other detractors), that is worth celebrating.

Though there are still many issues relating to bias against women and people of color in STEM, without the basic right to vote we would be even further behind on the quest for equality in the U.S. than we are today."

Carol Willing, @WillingCarol

"From the 19th amendment ratification to now, several generations of women have made their contributions to technical fields. These women celebrated successes, failures, disappointments, hopes, and dreams.

Sometimes, as a person in tech, I wonder if my actions make a difference on others. Is it worth the subtle putdowns, assumptions about my ability, and, at times, overt bullying to continue working as an engineer and software developer? Truthfully, sometimes the answer is no, but most days my feeling is “YES! I have a right to study and work on technical problems that I find fascinating." My daughter, my son, and you have that right too.

Almost a decade ago, I watched the movie “Iron Jawed Angels” with my middle school daughter, her friend, and a friend of mine who taught middle school history. The movie was tough to watch. We were struck by the sacrifice made by suffragettes, Alice Paul and Lucy Burns, amid the brutal abuse from others who did not want women to vote. A powerful reminder that we can’t control the actions of others, but we can stand up for ourselves and our right to be engineers, developers, managers, and researchers in technical fields. Your presence in tech and your contributions make a difference to humanity now and tomorrow."

Jasmine Sandhu, @sandhujasmine

"Its a numbers game, if more people have an opportunity to contribute to a field, you have a lot more talent, many more ideas and that many more people working on solutions and new ideas.

The "Science" in STEM is key - an informed citizenry that asks for evidence when confronted with the many pseudoscientific claims that we navigate in everday life is critical. It is important for all of us to learn the scientific method and see its relevance in day to day life, so we 'ask for evidence' when people around us make claims about our diet, about our health, our civic discourse, our politics. Similarly, I wish I had learned statistics since childhood. It is an idea with which we should be very comfortable. Randomness is a part of our daily lives and being able to make decisions and take risks based less on how we feel about things and be able to analyze critically the options would be wonderful. Of course, education has a far greater impact in our lives than simply the demographic that we represent in a field. I'm still struck by the pseudoscience books aimed at little girls (astrology) and the scientific books targetting the boys (astronomy) - of course, this is an anecdotal example, but in the US we still hear about girls losing interest in science and math in middle school. Hard to believe this is the case in the 21st century.

Living in a place like Seattle in the 21st century has enabled opportunities for me that don't exist for a lot of women in the world. I work remotely in a technical field which gives me freedom to structure my day to care for my daughter, live close to my family which is my support structure, and earn well enough to provide for my daughter and I. STEM fields offer yet more opportunities for all people, including women."

We loved hearing the perspectives of these women in STEM. If you'd like to share your response, please respond in the comments below, or tweet us @ContinuumIO!

We've also created a special Anaconda graphic to celebrate, which you can see below. If you're currently at PyData Chicago, find the NumFOCUS table to grab a sticker! 

Happy Women's Equality Day!

-Team Anaconda

by swebster at August 25, 2016 08:42 PM

Succeeding in the New World Order of Data

Thursday, August 25, 2016
Travis Oliphant
Chief Executive Officer & Co-Founder
Continuum Analytics

"If you want to understand function, study structure."

Sage advice from Francis Crick, who revolutionized genetics with his Nobel Prize winning co-discovery of the structure of DNA — launching more than six decades of fruitful research.

Crick was referring to biology, but today's companies competing in the Big Data space should heed his advice. With change at a pace this intense, understanding and optimizing one’s data science infrastructure — and therefore functionality — makes all the difference.

But, what’s the best way to do that?

Fortunately, there's an ideal solution for evolving in a rapidly-changing context while generating competitive insights from today's deluge of data.

That solution is an emerging movement called Open Data Science, which uses open source software to drive cutting-edge analytics that go far beyond what traditional proprietary data software can provide.

Shoring up Your Infrastructure

Open Data Science draws its power from four fundamental principles: accessibility, innovation, interoperability and transparency. These ensure source code that's accessible to the whole team — free from licensing restrictions or vendor release schedules — and that works seamlessly with other tools.

Because open source libraries are free, the barrier to entry is very low, allowing teams to dive in and freely experiment without the concerns of a massive financial commitment up front, which encourages innovation.

Although transitioning to a new analytics infrastructure is never trivial, the community spirit of open source software and Open Data Science's commitment to interoperability makes it quite manageable.

Anaconda, for example, provides over 720 well-tested Python libraries for the demands of today's data science, all available from a single install. Business analysts can be brought on board with Anaconda Fusion, providing access to data analysis functions in Python within the familiar Excel interface.

With connectors to other languages, integration of legacy code, HPC and parallel computing, as well as visualizations easily deployed to the web, there’s no limit to what can be achieved with Open Data Science. 

Navigating Potential Pitfalls

With traditional solutions, unforeseen limits can bring the train to a screeching halt.

I know of a large government project that convened many experts to creatively solve problems using data. The agency had invested in a many-node compute cluster with attached GPUs. But when the experts arrived, the software that had been installed was so limited that less than a third of them could actually use it.

Organizations cannot simply buy the latest monolithic tech from vendors and expect data science to just happen. The software must enable data scientists and play to their strengths, not just cater to the needs of IT operations.

Unlike proprietary offerings, Open Data Science has evolved along with the Big Data revolution — and, to a significant extent, driven it. Its toolset is designed with compatibilities that drive progress.

Setting up Your Scaffolding

Making the shift to an Open Data Science infrastructure is more than just choosing software and databases. It must also include people.

Companies should provision the time and resources necessary to set up new organizational structures and provide budgets to enable these groups to work effectively.  A pilot data-exploration team, a center of excellence or an emerging technology team are all examples of models that enable organizations to begin to uncover the opportunity in their data.  As the organization grows, individual roles may change or new ones may emerge.

Details of which toolsets to use will need to be hammered out. Many developers are already familiar with common Open Data Science applications, such as data notebooks like Jupyter, while others may require more of a learning curve to implement.

Choices such as programming languages will vary by developers' preferences and particular needs. Python is commonly used, and for good reason. It is, by far, the dominant language for scientific computing, and it integrates beautifully with Open Data Science.

Finally, well-managed migration is critical to success. Open Data Science allows for a number of options — from "co-existence" of Open Data Science with current infrastructure to piecemeal, or even full migration, all depending on a company's tolerance for risk or willingness to commit. Legacy code can also be retained and integrated with Open Data Science wrappers, allowing old but debugged and stable code-bases to serve new duty in a modern analytics environment.

Taking Data Science to a New Level

When genetics boomed as a science in the 1950s, new insights were always on the way. But, to get the ball rolling, biologists needed to understand DNA's structure — and exploit that understanding. Francis Crick and others began the process, and society continues to benefit.

Data Science is similarly poised on the cusp of an astounding future. Those organizations that understand their analytics infrastructure will excel in that new world, with Open Data Science as the instrument for success.

by swebster at August 25, 2016 04:31 PM

Jake Vanderplas

Conda: Myths and Misconceptions

I've spent much of the last decade using Python for my research, teaching Python tools to other scientists and developers, and developing Python tools for efficient data manipulation, scientific and statistical computation, and visualization. The Python-for-data landscape has changed immensely since I first installed NumPy and SciPy via a flickering CRT display. Among the new developments since those early days, the one with perhaps the broadest impact on my daily work has been the introduction of conda, the open-source cross-platform package manager first released in 2012.

In the four years since its initial release, many words have been spilt introducing conda and espousing its merits, but one thing I have consistently noticed is the number of misconceptions that seem to remain in the (often fervent) discussions surrounding this tool. I hope in this post to do a small part in putting these myths and misconceptions to rest.

I've tried to be as succinct as I can, but if you want to skim this article and get the gist of the discussion, you can read each heading along with the bold summary just below it.

Myth #1: Conda is a distribution, not a package manager

Reality: Conda is a package manager; Anaconda is a distribution. Although Conda is packaged with Anaconda, the two are distinct entities with distinct goals.

A software distribution is a pre-built and pre-configured collection of packages that can be installed and used on a system. A package manager is a tool that automates the process of installing, updating, and removing packages. Conda, with its "conda install", "conda update", and "conda remove" sub-commands, falls squarely under the second definition: it is a package manager.

Perhaps the confusion here comes from the fact that Conda is tightly coupled to two software distributions: Anaconda and Miniconda. Anaconda is a full distribution of the central software in the PyData ecosystem, and includes Python itself along with binaries for several hundred third-party open-source projects. Miniconda is essentially an installer for an empty conda environment, containing only Conda and its dependencies, so that you can install what you need from scratch.
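
As a rough sketch of that from-scratch workflow (the environment name and packages below are arbitrary examples):

$ conda create --name my-analysis python=3.5 numpy pandas
$ source activate my-analysis
$ conda install matplotlib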

But make no mistake: Conda is as distinct from Anaconda/Miniconda as is Python itself, and (if you wish) can be installed without ever touching Anaconda/Miniconda. For more on each of these, see the conda FAQ.

Myth #2: Conda is a Python package manager

Reality: Conda is a general-purpose package management system, designed to build and manage software of any type from any language. As such, it also works well with Python packages.

Because conda arose from within the Python (more specifically PyData) community, many mistakenly assume that it is fundamentally a Python package manager. This is not the case: conda is designed to manage packages and dependencies within any software stack. In this sense, it's less like pip, and more like a cross-platform version of apt or yum.

If you use conda, you are already probably taking advantage of many non-Python packages; the following command will list the ones in your environment:

$ conda search --canonical  | grep -v 'py\d\d'

On my system, there are 350 results: these are packages within my Conda/Python environment that are fundamentally unmanageable by Python-only tools like pip & virtualenv.

Myth #3: Conda and pip are direct competitors

Reality: Conda and pip serve different purposes, and only directly compete in a small subset of tasks: namely installing Python packages in isolated environments.

Pip, which stands for Pip Installs Packages, is Python's officially-sanctioned package manager, and is most commonly used to install packages published on the Python Package Index (PyPI). Both pip and PyPI are governed and supported by the Python Packaging Authority (PyPA).

In short, pip is a general-purpose manager for Python packages; conda is a language-agnostic cross-platform environment manager. For the user, the most salient distinction is probably this: pip installs python packages within any environment; conda installs any package within conda environments. If all you are doing is installing Python packages within an isolated environment, conda and pip+virtualenv are mostly interchangeable, modulo some difference in dependency handling and package availability. By isolated environment I mean a conda-env or virtualenv, in which you can install packages without modifying your system Python installation.
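
To make that overlap concrete, here is a hedged sketch of the two workflows side by side (environment and package names are arbitrary):

# pip: Python packages installed into a virtualenv built on an existing Python
$ virtualenv my-env
$ source my-env/bin/activate
$ pip install requests

# conda: packages (Python or otherwise) installed into a conda environment
$ conda create --name my-env requests
$ source activate my-env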

Even setting aside Myth #2, if we focus on just installation of Python packages, conda and pip serve different audiences and different purposes. If you want to, say, manage Python packages within an existing system Python installation, conda can't help you: by design, it can only install packages within conda environments. If you want to, say, work with the many Python packages which rely on external dependencies (NumPy, SciPy, and Matplotlib are common examples), while tracking those dependencies in a meaningful way, pip can't help you: by design, it manages Python packages and only Python packages.

Conda and pip are not competitors, but rather tools focused on different groups of users and patterns of use.

Myth #4: Creating conda in the first place was irresponsible & divisive

Reality: Conda's creators pushed Python's standard packaging to its limits for over a decade, and only created a second tool when it was clear it was the only reasonable way forward.

According to the Zen of Python, when doing anything in Python "There should be one – and preferably only one – obvious way to do it." So why would the creators of conda muddy the field by introducing a new way to install Python packages? Why didn't they contribute back to the Python community and improve pip to overcome its deficiencies?

As it turns out, that is exactly what they did. Prior to 2012, the developers of the PyData/SciPy ecosystem went to great lengths to work within the constraints of the package management solutions developed by the Python community. As far back as 2001, the NumPy project forked distutils in an attempt to make it handle the complex requirements of a NumPy distribution. They bundled a large portion of NETLIB into a single monolithic Python package (you might know this as SciPy), in effect creating a distribution-as-python-package to circumvent the fact that Python's distribution tools cannot manage these extra-Python dependencies in any meaningful way. An entire generation of scientific Python users spent countless hours struggling with the installation hell created by this exercise of forcing a square peg into a round hole – and those were just ones lucky enough to be using Linux. If you were on Windows, forget about it. To read some of the details about these pain-points and how they led to Conda, I'd suggest Travis Oliphant's 2013 blog post on the topic.

But why didn't Conda's creators just talk to the Python packaging folks and figure out these challenges together? As it turns out, they did.

The genesis of Conda came after Guido van Rossum was invited to speak at the inaugural PyData meetup in 2012; in a Q&A on the subject of packaging difficulties, he told us that when it comes to packaging, "it really sounds like your needs are so unusual compared to the larger Python community that you're just better off building your own" (See video of this discussion). Even while following this nugget of advice from the BDFL, the PyData community continued dialog and collaboration with core Python developers on the topic: one more public example of this was the invitation of CPython core developer Nick Coghlan to keynote at SciPy 2014 (See video here). He gave an excellent talk which specifically discusses pip and conda in the context of the "unsolved problem" of software distribution, and mentions the value of having multiple means of distribution tailored to the needs of specific users.

Far from insinuating that Conda is divisive, Nick and others at the Python Packaging Authority officially recognize conda as one of many important redistributors of Python code, and are working hard to better enable such tools to work seamlessly with the Python Package Index.

Myth #5: conda doesn't work with virtualenv, so it's useless for my workflow

Reality: You actually can install (some) conda packages within a virtualenv, but better is to use Conda's own environment manager: it is fully-compatible with pip and has several advantages over virtualenv.

virtualenv/venv are utilities that allow users to create isolated Python environments that work with pip. Conda has its own built-in environment manager that works seamlessly with both conda and pip, and in fact has several advantages over virtualenv/venv:

  • conda environments integrate management of different Python versions, including installation and updating of Python itself. Virtualenvs must be created upon an existing, externally managed Python executable.
  • conda environments can track non-python dependencies; for example seamlessly managing dependencies and parallel versions of essential tools like LAPACK or OpenSSL
  • Rather than environments built on symlinks – which break the isolation of the virtualenv and can be flimsy at times for non-Python dependencies – conda-envs are true isolated environments within a single executable path.
  • While virtualenvs are not compatible with conda packages, conda environments are entirely compatible with pip packages. First conda install pip, and then you can pip install any available package within that environment. You can even explicitly list pip packages in conda environment files, meaning the full software stack is entirely reproducible from a single environment metadata file.
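
Expanding on that last point, here is a minimal sketch of mixing the two (the PyPI-only package name is hypothetical):

$ conda create --name mixed-env python=3.5 numpy pip
$ source activate mixed-env
$ pip install some-pypi-only-package   # hypothetical PyPI-only dependency
$ conda env export > environment.yml   # the exported file records both the conda and the pip packages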

That said, if you would like to use conda within your virtualenv, it is possible:

$ virtualenv test_conda

$ source test_conda/bin/activate

$ pip install conda

$ conda install numpy

This installs conda's MKL-enabled NumPy package within your virtualenv. I wouldn't recommend this: I can't find documentation for this feature, and the result seems to be fairly brittle – for example, trying to conda update python within the virtualenv fails in a very ungraceful and unrecoverable manner, seemingly related to the symlinks that underlie virtualenv's architecture. This appears not to be some fundamental incompatibility between conda and virtualenv, but rather related to some subtle inconsistencies in the build process, and thus is potentially fixable (see conda Issue 1367 and anaconda Issue 498, for example).

If you want to avoid these difficulties, a better idea would be to pip install conda and then create a new conda environment in which to install conda packages. For someone accustomed to pip/virtualenv/venv command syntax who wants to try conda, the conda docs include a translation table between conda and pip/virtualenv commands.

Myth #6: Now that pip uses wheels, conda is no longer necessary

Reality: wheels address just one of the many challenges that prompted the development of conda, and wheels have weaknesses that Conda's binaries address.

One difficulty which drove the creation of Conda was the fact that pip could distribute only source code, not pre-compiled binary distributions, an issue that was particularly challenging for users building extension-heavy modules like NumPy and SciPy. After Conda had solved this problem in its own way, pip itself added support for wheels, a binary format designed to address this difficulty within pip. With this issue addressed within the common tool, shouldn't Conda early-adopters now flock back to pip?

Not necessarily. Distribution of cross-platform binaries was only one of the many problems solved within conda. Compiled binaries spotlight the other essential piece of conda: the ability to meaningfully track non-Python dependencies. Because pip's dependency tracking is limited to Python packages, the main way of doing this within wheels is to bundle released versions of dependencies with the Python package binary, which makes updating such dependencies painful (recent security updates to OpenSSL come to mind). Additionally, conda includes a true dependency resolver, a component which pip currently lacks.

For scientific users, conda also allows things like linking builds to optimized linear algebra libraries, as Continuum does with its freely-provided MKL-enabled NumPy/SciPy. Conda can even distribute non-Python build requirements, such as gcc, which greatly streamlines the process of building other packages on top of the pre-compiled binaries it distributes. If you try to do this using pip's wheels, you better hope that your system has compilers and settings compatible with those used to originally build the wheel in question.
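
For example, a compiler can be pulled into an environment like any other package (a hedged sketch; the exact package name varies by platform and channel):

$ conda install gcc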

Myth #7: conda is not open source; it is tied to a for-profit company who could start charging for the service whenever they want

Reality: conda (the package manager and build system) is 100% open-source, and Anaconda (the distribution) is nearly there as well.

In the open source world, there is (sometimes quite rightly) a fundamental distrust of for-profit entities, and the fact that Anaconda was created by Continuum Analytics and is a free component of a larger enterprise product causes some to worry.

Let's set aside the fact that Continuum is, in my opinion, one of the few companies really doing open software the right way (a topic for another time). Ignoring that, the fact is that Conda itself – the package manager that provides the utilities to build, distribute, install, update, and manage software in a cross-platform manner – is 100% open-source, available on GitHub and BSD-Licensed. Even for Anaconda (the distribution), the EULA is simply a standard BSD license, and the toolchain used to create Anaconda is also 100% open-source. In short, there is no need to worry about intellectual property issues when using Conda.

If the Anaconda/Miniconda distributions still worry you, rest assured: you don't need to install Anaconda or Miniconda to get conda, though those are convenient avenues to its use. As we saw above, you can "pip install conda" to install it via PyPI without ever touching Continuum's website.

Myth #8: But Conda packages themselves are closed-source, right?

Reality: though conda's default channel is not yet entirely open, there is a community-led effort (Conda-Forge) to make conda packaging & distribution entirely open.

Historically, the package build process for the default conda channel has not been as open as it could be, and the process of getting a build updated has mostly relied on knowing someone at Continuum. Rumor is that this was largely because the original conda package creation process was not as well-defined and streamlined as it is today.

But this is changing. Continuum is making the effort to open their package recipes, and I've been told that only a few dozen of the 500+ packages remain to be ported. These few recipes are the only remaining piece of the Anaconda distribution that are not entirely open.

If that's not enough, there is a new community-led – not Continuum affiliated – project, introduced in early 2016, called conda-forge that contains tools for the creation of community-driven builds for any package. Packages are maintained in the open via github, with binaries automatically built using free CI tools like TravisCI for Mac OSX builds, AppVeyor for Windows builds, and CircleCI for Linux builds. All the metadata for each package lives in a Github repository, and package updates are accomplished through merging a Github pull request (here is an example of what a package update looks like in conda-forge).

Conda-forge is entirely community-founded and community-led, and while conda-forge is probably not yet mature enough to completely replace the default conda channel, Continuum's founders have publicly stated that this is a direction they would support. You can read more about the promise of conda-forge in Wes McKinney's recent blog post, conda-forge and PyData's CentOS moment.

Myth #9: OK, but if Continuum Analytics folds, conda won't work anymore right?

Reality: nothing about Conda inherently ties it to Continuum Analytics; the company serves the community by providing free hosting of build artifacts. All software distributions need to be hosted by somebody, even PyPI.

It's true that even conda-forge publishes its package builds to http://anaconda.org/, a website owned and maintained by Continuum Analytics. But there is nothing in Conda that requires this site. In fact, the creation of Custom Channels in conda is well-documented, and there would be nothing to stop someone from building and hosting their own private distribution using Conda as a package manager (conda index is the relevant command). Given the openness of conda recipes and build systems on conda-forge, it would not be all that hard to mirror all of conda-forge on your own server if you have reason to do so.
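
A very rough sketch of what self-hosting could look like (the directory layout and package filename below are made up; conda index is the relevant command, as noted above):

$ mkdir -p /srv/my-channel/linux-64
$ cp mypackage-1.0-py35_0.tar.bz2 /srv/my-channel/linux-64/   # a locally built package (hypothetical)
$ conda index /srv/my-channel/linux-64
$ conda install -c file:///srv/my-channel mypackage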

If you're still worried about Continuum Analytics – a for-profit company – serving the community by hosting conda packages, you should probably be equally worried about Rackspace – a for-profit company – serving the community by hosting the Python Package Index. In both cases, a for-profit company is integral to the current manifestation of the community's package management system. But in neither case would the demise of that company threaten the underlying architecture of the build & distribution system, which is entirely free and open source. If either Rackspace or Continuum were to disappear, the community would simply have to find another host and/or financial sponsor for the open distribution it relies on.

Myth #10: Everybody should abandon (conda | pip) and use (pip | conda) instead!

Reality: pip and conda serve different needs, and we should be focused less on how they compete and more on how they work together.

As mentioned in Myth #2, Conda and pip are different projects with different intended audiences: pip installs python packages within any environment; conda installs any package within conda environments. Given the lofty ideals raised in the Zen of Python, one might hope that pip and conda could somehow be combined, so that there would be one and only one obvious way of installing packages.

But this will never happen. The goals of the two projects are just too different. Unless the pip project is broadly re-scoped, it will never be able to meaningfully install and track all the non-Python packages that conda does: the architecture is Python-specific and (rightly) Python-focused. Pip, along with PyPI, aims to be a flexible publication & distribution platform and manager for Python packages, and it does phenomenally well at that.

Likewise, unless the conda package is broadly re-scoped, it will never make sense for it to replace pip/PyPI as a general publishing & distribution platform for Python code. At its very core, conda concerns itself with the type of detailed dependency tracking that is required for robustly running a complex multi-language software stack across multiple platforms. Every installation artifact in conda's repositories is tied to an exact dependency chain: by design, it wouldn't allow you to, say, substitute Jython for Python in a given package. You could certainly use conda to build a Jython software stack, but each package would require a new Jython-specific installation artifact – that is what is required to maintain the strict dependency chain that conda users rely on. Pip is much more flexible here, but one cost of that is its inability to precisely define and resolve dependencies as conda does.

Finally, the focus on pip vs. conda entirely misses the broad swath of purpose-designed redistributors of Python code. From platform-specific package managers like apt, yum, macports, and homebrew, to cross-platform tools like bento, buildout, hashdist, and spack, there are a wide range of specific packaging solutions aimed at installing Python (and other) packages for particular users. It would be more fruitful for us to view these, as the Python Packaging Authority does, not as competitors to pip/PyPI, but as downstream tools that can take advantage of the heroic efforts of all those who have developed and maintained pip, PyPI, and associated toolchain.

Where to Go from Here?

So it seems we're left with two packaging solutions which are distinct, but yet have broad overlap for many Python users (i.e. when installing Python packages in isolated environments). So where should the community go from here? I think the main thing we can do is make sure the projects (1) work together as well as possible, and (2) learn from each other's successes.

Conda

As mentioned above, conda already has a fully open toolchain, and is on a steady trend toward fully open packages (but is not entirely there just yet). An obvious direction is to push forward on community development and maintenance of the conda stack via conda-forge, perhaps eventually using it to replace conda's current default channel.

As we push forward on this, I believe the conda and conda-forge community could benefit from imitating the clear and open governance model of the Python Packaging Authority. For example, PyPA has an open governance model with explicit goals, a clear roadmap for new developments and features, well-defined channels of communication and discussion, and community oversight of the full pip/PyPI system from the ground up.

With conda and conda-forge, on the other hand, the code (and soon all recipes) is open, but the model for governance and control of the system is far less explicit. Given the importance of conda, particularly in the PyData community, it would benefit everyone to clarify this somehow – perhaps under the umbrella of the NumFOCUS organization.

That being said, folks involved with conda-forge have told me that this is currently being addressed by the core team, including generation of governing documents, a code of conduct, and framework for enhancement proposals.

PyPI/pip

While the Python Package Index seems to have its governance in order, there are aspects of conda/conda-forge that I think would benefit it. For example, currently most Python packages can be loaded to conda-forge with just a few steps:

  1. Post a public code release somewhere on the web (on github, bitbucket, PyPI, etc.)
  2. Create a recipe/metadata file that points to this code and lists dependencies
  3. Open a pull request on conda-forge/staged-recipes

And that's it. Once the pull request is merged, the binary builds on Windows, OSX, and Linux are automatically created and loaded to the conda-forge channel. Additionally, managing and updating the package takes place transparently via github, where package updates can be reviewed by collaborators and tested by CI systems before they go live.
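
For instance, step 3 above is a fairly ordinary GitHub workflow; a hedged sketch, with a hypothetical package name and <your-fork> standing in for your own fork of conda-forge/staged-recipes:

$ git clone https://github.com/<your-fork>/staged-recipes
$ cd staged-recipes
$ mkdir recipes/mypackage
$ cp /path/to/meta.yaml recipes/mypackage/        # the recipe/metadata file written in step 2
$ git checkout -b add-mypackage
$ git add recipes/mypackage
$ git commit -m "Add mypackage recipe"
$ git push origin add-mypackage                   # then open the pull request against conda-forge/staged-recipes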

I find this process far preferable to the (by comparison relatively opaque and manual) process of publishing to PyPI, which is mostly done by a single user working in private at a local terminal. Perhaps PyPI could take advantage of conda-forge's existing build system, creating an option to automatically build multi-platform wheels and source distributions and push them to PyPI in a single transparent command. It is definitely a possibility.

Postscript: Which Tool Should I Use?

I hope I've convinced you that conda and pip both have a role to play within the Python community. With that behind us, which should you use if you're starting out? The answer depends on what you want to do:

If you have an existing system Python installation and you want to install packages in or on it, use pip+virtualenv. For example, perhaps you used apt or another system package manager to install Python, along with some packages linked to system tools that are not (yet) easily installable via conda or pip. Pip+virtualenv will allow you to install new Python packages and build environments on top of that existing distribution, and you should be able to rely on your system package manager for any difficult-to-install dependencies.

If you want to flexibly manage a multi-language software stack and don't mind using an isolated environment, use conda. Conda's multi-language dependency management and cross-platform binary installations can do things in this situation that pip cannot do. A huge benefit is that for most packages, the result will be immediately compatible with multiple operating systems.

If you want to install Python packages within an Isolated environment, pip+virtualenv and conda+conda-env are mostly interchangeable. This is the overlap region where both tools shine in their own way. That being said, I tend to prefer conda in this situation: Conda's uniform, cross-platform, full-stack management of multiple parallel Python environments with robust dependency management has proven to be an incredible time-saver in my research, my teaching, and my software development work. Additionally, I find that my needs and the needs of my colleagues more often stray into areas of conda's strengths (management of non-Python tools and dependencies) than into areas of pip's strengths (environment-agnostic Python package management).

As an example, years ago I spent nearly a quarter with a colleague trying to install the complicated (non-Python) software stack that powers the megaman package, which we were developing together. The result of all our efforts was a single non-reproducible working stack on a single machine. Then conda-forge was introduced. We went through the process again, this time creating a conda recipe, from which a conda-forge feedstock was built. We now have a cross-platform solution that will install a working version of the package and its dependencies with a single command, in seconds, on nearly any computer. If there is a way to build and distribute software with that kind of dependency graph seamlessly with pip+PyPI, I haven't seen it.


If you've read this far, I hope you've found this discussion useful. My own desire is that we as a community can continue to rally around both these tools, improving them for the benefit of current and future users. Python packaging has improved immensely in the last decade, and I'm excited to see where it will go from here.

Thanks to Filipe Fernandez, Aaron Meurer, Bryan van de Ven, and Phil Elson for helpful feedback on early drafts of this post. As always, any mistakes are my own.

by Jake Vanderplas at August 25, 2016 04:00 PM

Matthew Rocklin

Supporting Users in Open Source

What are the social expectations of open source developers to help users understand their projects? What are the social expectations of users when asking for help?

As part of developing Dask, an open source library with growing adoption, I directly interact with users over GitHub issues for bug reports, StackOverflow for usage questions, and a mailing list and live Gitter chat for community conversation. Dask is blessed with awesome users. These are researchers doing very cool work of high impact and with novel use cases. They report bugs and usage questions with such skill that it’s clear that they are Veteran Users of open source projects.

Veteran Users are Heroes

It’s not easy being a veteran user. It takes a lot of time to distill a bug down to a reproducible example, or a question into an MCVE (a minimal, complete, and verifiable example), or to read all of the documentation to make sure that a conceptual question definitely isn’t answered in the docs. And yet this effort really shines through and it’s incredibly valuable to making open source software better. These distilled reports are arguably more important than fixing the actual bug or writing the actual documentation.

Bugs occur in the wild, in code that is half related to the developer’s library (like Pandas or Dask) and half related to the user’s application. The veteran user works hard to pull away all of their code and data, creating a gem of an example that is trivial to understand and run anywhere that still shows off the problem.

This way the veteran user can show up with their problem to the development team and say “here is something that you will quickly understand to be a problem.” On the developer side this is incredibly valuable. They learn of a relevant bug and immediately understand what’s going on, without having to download someone else’s data or understand their domain. This switches from merely convenient to strictly necessary when the developers deal with 10+ such reports a day.
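As a concrete illustration (my own hypothetical sketch, not drawn from any real Dask or Pandas issue), a distilled report usually has three ingredients: tiny inline data, the few lines that trigger the surprising behavior, and version information so the developers can reproduce it:

import pandas as pd

# Hypothetical distilled report: no private data, no application code, just the
# smallest complete script that shows what was observed versus what was expected.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, None, 2.0]})

result = df.groupby("group")["value"].mean()
print(result)          # the observed output goes in the report verbatim
print(pd.__version__)  # version information helps maintainers reproduce it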

Novice Users need help too

However there are a lot of novice users out there. We have all been novice users once, and even if we are veterans today we are probably still novices at something else. Knowing what to do and how to ask for help is hard. Having the guts to walk into a chat room where people will quickly see that you’re a novice is even harder. It’s like using public transit in a deeply foreign language. Respect is warranted here.

I categorize novice users into two groups:

  1. Experienced technical novices, who are very experienced in their field and technical things generally, but who don’t yet have a thorough understanding of open source culture and how to ask questions smoothly. They’re entirely capable of behaving like a veteran user if pointed in the right directions.
  2. Novice technical novices, who don’t yet have the ability to distill their problems into the digestible nuggets that open source developers expect.

In the first case of technically experienced novices, I’ve found that being direct works surprisingly well. I used to be apologetic in asking people to submit MCVEs. Today I’m more blunt but surprisingly I find that this group doesn’t seem to mind. I suspect that this group is accustomed to operating in situations where other people’s time is very costly.

The second case, novice novice users, is more challenging for individual developers to handle one-by-one, both because novices are more common, and because solving their problems often requires more time commitment. Instead open source communities often depend on broadcast and crowd-sourced solutions, like documentation, StackOverflow, or meetups and user groups. For example, in Dask we strongly point people towards StackOverflow in order to build up a knowledge-base of question-answer pairs. Pandas has done this well; almost every Pandas question you Google leads to a StackOverflow post, handling 90% of the traffic and improving the lives of thousands. Many projects simply don’t have the human capital to hand-hold individuals through using the library.

In a few projects there are enough generous and experienced users that they’re able to field questions from individual users. SymPy is a good example here. I learned open source programming within SymPy. Their community was broad enough that they were able to hold my hand as I learned Git, testing, communication practices and all of the other soft skills that we need to be effective in writing great software. The support structure of SymPy is something that I’ve never experienced anywhere else.

My Apologies

I’ve found myself becoming increasingly impolite when people ask me for certain kinds of extended help with their code. I’ve been trying to track down why this is and I think that it comes from a mismatch of social contracts.

Large parts of technical society have an (entirely reasonable) belief that open source developers are available to answer questions about how to use their projects. This was probably true in popular culture, where the stereotypical image of an open source developer was someone working out of their basement long into the night on things that relatively few enthusiasts bothered with. They were happy to engage and had the free time in which to do it.

In some ways things have changed a lot. We now have paid professionals building software that is used by thousands or millions of users. These professionals easily charge consulting fees of hundreds of dollars per hour for exactly the kind of assistance that people show up expecting for free under the previous model. These developers have to answer for how they spend their time when they’re at work, and when they’re not at work they now have families and kids that deserve just as much attention as their open source users.

Both of these cultures, the creative do-it-yourself basement culture and the more corporate culture, are important to the wonderful surge we’ve seen in open source software. How do we balance them? Should developers, like doctors or lawyers, perform pro bono work as part of their profession? Should grants specifically include paid time for community engagement and outreach? Should users, as part of receiving help, feel an obligation to improve documentation or stick around and help others?

Solutions?

I’m not sure what to do here. I feel an obligation to remain connected with users from a broad set of applications, even those that companies or grants haven’t decided to fund. However, at the same time, I don’t know how to say "I’m sorry, I simply don’t have the time to help you with your problem" in a way that feels at all compassionate.

I think that people should still ask questions. I think that we need to foster an environment in which developers can say “Sorry. Busy.” more easily. I think that we as a community need better resources to teach novice users to become veteran users.

One positive approach is to honor veteran users, and through this public praise to encourage other users to “up their game”, much as developers do today with coding skills. There are thousands of blogposts about how to develop code well, and people strive tirelessly to improve themselves. My hope is that by attaching the language of skill, like the term “veteran”, to user behaviors we can create an environment where people are proud of how cleanly they can raise issues and how clearly they can describe questions for documentation. Doing this well is critical for a project’s success and requires substantial effort and personal investment.

August 25, 2016 12:00 AM

August 23, 2016

Enthought

Webinar: Introducing the NEW Python Integration Toolkit for LabVIEW

A recording of the webinar is available.

LabVIEW is a software platform made by National Instruments, used widely in industries such as semiconductors, telecommunications, aerospace, manufacturing, electronics, and automotive for test and measurement applications. In August 2016, Enthought released the Python Integration Toolkit for LabVIEW, which is a “bridge” between the LabVIEW and Python environments.

In this webinar, we’ll demonstrate:

  1. How the new Python Integration Toolkit for LabVIEW from Enthought seamlessly brings the power of the Python ecosystem of scientific and engineering tools to LabVIEW
  2. Examples of how you can extend LabVIEW with Python, including using Python for signal and image processing, cloud computing, web dashboards, machine learning, and more

Python Integration Toolkit for LabVIEW

Quickly and efficiently access scientific and engineering tools for signal processing, machine learning, image and array processing, web and cloud connectivity, and much more. With only minimal coding on the Python side, this extraordinarily simple interface provides access to all of Python’s capabilities.
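The toolkit’s own calling conventions are not shown here, but the Python side of such a bridge is typically just an ordinary module. A hypothetical sketch of the kind of function LabVIEW code might call for signal processing (the function and its name are illustrative, not part of the toolkit’s API):

import numpy as np

def dominant_frequency(samples, sample_rate_hz):
    # Return the strongest non-DC frequency component of a measured signal.
    samples = np.asarray(samples, dtype=float)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(samples.size, d=1.0 / sample_rate_hz)
    idx = int(np.argmax(spectrum[1:])) + 1  # skip the DC bin
    return float(freqs[idx])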

Try it with your data, free for 30 days

Download a free 30 day trial of the Python Integration Toolkit for LabVIEW from the National Instruments LabVIEW Tools Network.

How LabVIEW users can benefit from Python:

  • High-level, general purpose programming language ideally suited to the needs of engineers, scientists, and analysts
  • Huge, international user base representing industries such as aerospace, automotive, manufacturing, military and defense, research and development, biotechnology, geoscience, electronics, and many more
  • Tens of thousands of available packages, ranging from advanced 3D visualization frameworks to nonlinear equation solvers
  • Simple, beginner-friendly syntax and fast learning curve

by admin at August 23, 2016 02:44 PM

August 21, 2016

Titus Brown

Lessons on doing science from my father, Gerry Brown

(This is an invited chapter for a memorial book about my father. You can also read my remembrances from the day after he passed away.)

Dr. Gerald E. Brown was a well known nuclear physicist and astrophysicist who worked at Stony Brook University from 1968 until his death in 2013. He was internationally active in physics research from the late 1950s onwards, ran an active research group at Stony Brook until 2009, and supervised nearly a hundred PhD students during his life. He was also my father.

It's hard to write about someone who is owned, in part, by so many people. I came along late in my father's life (he was 48 when I was born), and so I didn't know him that well as an adult. However, growing up with a senior professor as a father had a huge impact on my scientific career, which I can recognize even more clearly now that I'm a professor myself.

Gerry (as I called him) didn't spend much time directly teaching his children about his work. When I was invited to write something for his memorial book, it was suggested that I write about what he had taught me about being a scientist. I found myself stymied, because to the best of my recollection we had never talked much about the practice of science. When I mentioned this to my oldest brother, Hans, we shared a good laugh -- he had exactly the same experience with our father, 20 years before me!

Most of what Gerry taught me was taught by example. Below are some of the examples that I remember most clearly, and of which I'm the most proud. While I don't know if either of my children will become scientists, if they do, I hope they take these to heart -- I can think of few better scientific legacies to pass on to them from my father.

Publishing work that is interesting (but perhaps not correct) can make for a fine career.

My father was very proud of his publishing record, but not because he was always (or even frequently) right. In fact, several people told me that he was somewhat notorious for having a 1-in-10 "hit rate" -- he would come up with many crazy ideas, of which only about 1 in 10 would be worth pursuing. However, that 1 in 10 was enough for him to have a long and successful career. That this drove some people nuts was merely an added bonus in his view.

Gerry was also fond of publishing controversial work. Several times he told me he was proudest of the papers that caused the most discussion and collected the most rebuttals. He wryly noted that these papers often gathered many citations, even when they turned out to be incorrect.

The best collaborations are both personal and professional friendships.

The last twenty-five years of Gerry's life were dominated by a collaboration with Hans Bethe on astrophysics, and they traveled to Pasadena every January until the early 2000s to work at the California Institute of Technology. During this month they lived in the same apartment, worked together closely, and met with a wide range of people on campus to explore scientific ideas; they also went on long hikes in the mountains above Pasadena (chronicled by Chris Adami in "Three Weeks with Hans Bethe"). These close interactions not only fueled his research for the remainder of the year, but emanated from a deep personal friendship. It was clear that, to Gerry, there was little distinction between personal and professional in his research life.

Science is done by people, and people need to be supported.

Gerry was incredibly proud of his mentoring record, and did his best to support his students, postdocs, and junior colleagues both professionally and personally. He devoted the weeks around Christmas each year to writing recommendation letters for junior colleagues. He spent years working to successfully nominate colleagues to the National Academy of Sciences. He supported junior faculty with significant amounts of his time and sometimes by forgoing his own salary to boost theirs. While he never stated it explicitly, he considered most ideas somewhat ephemeral, and thought that his real legacy -- and the legacy most worth having -- lay in the students and colleagues who would continue after him.

Always treat the administrative staff well.

Gerry was fond of pointing out that the secretaries and administrative staff had more practical power than most faculty, and that it was worth staying on their good side. This was less a statement of calculated intent and more an observation that many students, postdocs, and faculty treated non-scientists with less respect than they deserved. He always took the time to interact with them on a personal level, and certainly seemed to be well liked for it. I've been told by several colleagues who worked with Gerry that this was a lesson that they took to heart in their own interactions with staff, and it has also served me well.

Hard work is more important than brilliance.

One of Gerry's favorite quotes was "Success is 99% perspiration, 1% inspiration", a statement attributed to Thomas Edison. According to Gerry, he simply wasn't as smart as many of his colleagues, but he made up for it by working very hard. I have no idea how modest he was being -- he was not always known for modesty -- but he certainly worked very hard, spending 10-14 hours a day writing in his home office, thinking in the garden, or meeting with colleagues at work. While I try for more balance in my work and life myself, he demonstrated to me that sometimes grueling hard work is a necessity when tackling tricky problems: for example, my Earthshine publications came after a tremendously unpleasant summer working on some incredibly messy and very tedious analysis code, but without the resulting analysis we wouldn't have been able to advance the project (which continues today, almost two decades later).

Experiments should talk to theory, and vice versa.

Steve Koonin once explained to me that Gerry was a phenomenologist -- a theorist who worked well with experimental data -- and that this specialty was fairly rare because it required communicating effectively across subdisciplines. Gerry wasn't attracted to deep theoretical work and complex calculations, and in any case liked to talk to experimentalists too much to be a good theorist -- for example, some of our most frequent dinner guests when I was growing up were Peter Braun-Munzinger and Johanna Stachel, both experimentalists. So he chose to work at the interface of theory and experiment, where he could develop and refine his intuition based on competing world views emanating from the theorists (who sought clean mathematical solutions) and experimentalists (who had real data that needed to be reconciled with theory). I have tried to pursue a similar strategy in computational biology.

Computers and mathematical models are tools, but the real insight comes from intuition.

Apart from some early experience with punch cards at Yale in the 1950s, Gerry avoided computers and computational models completely in his own research (although his students, postdocs and collaborators used them, of course). I am told that his theoretical models were often relatively simple approximations, and he himself often said that his work with Hans Bethe proceeded by choosing the right approximation for the problem at hand -- something at which Bethe excelled. Their choice of approximation was guided by intuition about the physical nature of the problem as much as by mathematical insight, and they could often use a few lines of the right equations to reach results similar to complex computational and mathematical models. This search for simple models and the utility of physical intuition in his research characterized many of our conversations, even when I became more mathematically trained.

Teaching is largely about conveying intuition.

Once a year, Gerry would load up a backpack with mason jars full of thousands of pennies, and bring them into his Statistical Mechanics class. This was needed for one of his favorite exercises -- a hands-on demonstration of the Law of Large Numbers and the Central Limit Theorem, which lie at the heart of thermodynamics and statistical mechanics. He would have students flip 100 coins and record the average, and then do it again and again, and have the class plot the distributions of results. The feedback he got was that this was a very good way of viscerally communicating the basics of statistical mechanics to students, because it built their intuition about how averages really worked. This approach has carried through to my own teaching and training efforts, where I always try to integrate hands-on practice with more theoretical discussion.
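The spirit of that demonstration is easy to reproduce in a few lines of Python; a minimal simulation (my sketch, not Gerry's exercise):

import numpy as np

# Each "student" flips 100 coins and records the fraction of heads; repeating
# this many times traces out the bell curve predicted by the Central Limit Theorem.
rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=(10000, 100))  # 10,000 trials of 100 coin flips
averages = flips.mean(axis=1)                  # fraction of heads in each trial

print(averages.mean())  # close to 0.5
print(averages.std())   # close to sqrt(0.25 / 100) = 0.05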

Benign neglect is a good default for mentoring.

Gerry was perhaps overly fond of the concept of "benign neglect" in parenting, in that much of my early upbringing was at the hands of my mother with only occasional input from him. However, in his oft-stated experience (and now mine as well), leaving smart graduate students and postdocs to their own devices most of the time was far better than trying to actively manage (or interfere in) their research for them. I think of it this way: if I tell my students what to do and I'm wrong (which is likely, research being research), then they either do it (and I suffer for having misdirected them) or they don't do it (and then I get upset at them for ignoring me). But if I don't tell my students what to do, then they usually figure out something better for themselves, or else get stuck and then come to me to discuss it. The latter two outcomes are much better from a mentoring perspective than the former two.

Students need to figure it out for themselves

One of the most embarrassing (in retrospect) interactions I had with my father was during a long car ride where he tried to convince me that when x was a negative number, -x was positive. At the time, I didn't agree with this at all, which was probably because I was a stubborn 7-year-old. While it took me a few more years to understand this concept, by the time I was a math major I did have the concept down! Regardless, in this, and many other interactions around science, he never browbeat me about it or got upset at my stupidity or stubbornness. I believe this carried through to his interactions with his students. In fact, the only time I ever heard him express exasperation was with colleagues who were acting badly.

A small nudge at the right moment is sometimes all that is needed.

A pivotal moment in my life came when Gerry introduced me to Mark Galassi, a physics graduate student who also was the systems administrator for the UNIX systems in the Institute for Theoretical Physics at Stony Brook; Mark found out I was interested in computers and gave me access to the computer system. This was one of the defining moments in my research life, as my research is entirely computational! Similarly, when I took a year off from college, my father put me in touch with Steve Koonin, who needed a systems administrator for a new project; I ended up working with the Earthshine project, which was a core part of my research for several years. And when I was trying to decide what grad schools to apply to, Gerry suggested I ask Hans Bethe and Steve Koonin what they thought was the most promising area of research for the future -- their unequivocal answer was "biology!" This drove me to apply to biology graduate schools, get a PhD in biology, and ultimately led to my current faculty position. In all these cases, I now recognize the application of a light touch at the right moment, rather than the heavy-handed guidance that he must have desperately wanted to give at times.

Conclusions

There are many more personal stories that could be told about Gerry Brown, including his (several, and hilarious) interactions with the East German secret police during the cold war, his (quite bad) jokes, his (quite good) cooking, and his (enthusiastic) ballroom dancing, but I will save those for another time. I hope that his friends and colleagues will see him in the examples above, and will remember him fondly.

Acknowledgements

I thank Chris Adami, Erich Schwarz, Tracy Teal, and my mother, Elizabeth Brown, for their comments on drafts of this article.

by C. Titus Brown at August 21, 2016 10:00 PM

August 19, 2016

Continuum Analytics news

Dask for Institutions

Tuesday, August 16, 2016
Matthew Rocklin
Continuum Analytics

Introduction

Institutions use software differently than individuals. Over the last few months I’ve had dozens of conversations about using Dask within larger organizations like universities, research labs, private companies, and non-profit learning systems. This post provides a very coarse summary of those conversations and extracts common questions. I’ll then try to answer those questions.

Note: some of this post will be necessarily vague at points. Some companies prefer privacy. All details here are either in public Dask issues or have come up with enough institutions (say at least five) that I’m comfortable listing the problem here.

Common story

Institution X, a university/research lab/company/… has many scientists/analysts/modelers who develop models and analyze data with Python, the PyData stack like NumPy/Pandas/SKLearn, and a large amount of custom code. These models/data sometimes grow to be large enough to need a moderately large amount of parallel computing.

Fortunately, Institution X has an in-house cluster acquired for exactly this purpose of accelerating modeling and analysis of large computations and datasets. Users can submit jobs to the cluster using a job scheduler like SGE/LSF/Mesos/Other.

However the cluster is still under-utilized and the users are still asking for help with parallel computing. Either users aren’t comfortable using the SGE/LSF/Mesos/Other interface, it doesn’t support sufficiently complex/dynamic workloads, or the interaction times aren’t good enough for the interactive use that users appreciate.

There was an internal effort to build a more complex/interactive/Pythonic system on top of SGE/LSF/Mesos/Other but it’s not particularly mature and definitely isn’t something that Institution X wants to pursue. It turned out to be a harder problem than expected to design/build/maintain such a system in-house. They’d love to find an open source solution that was well featured and maintained by a community.

The Dask.distributed scheduler looks like it’s 90% of the system that Institution X needs. However there are a few open questions:

  • How do we integrate dask.distributed with the SGE/LSF/Mesos/Other job scheduler?
  • How can we grow and shrink the cluster dynamically based on use?
  • How do users manage software environments on the workers?
  • How secure is the distributed scheduler?
  • Dask is resilient to worker failure, how about scheduler failure?
  • What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?
  • How do we handle multiple concurrent users and priorities?
  • How does this compare with Spark?

So for the rest of this post I’m going to answer these questions. As usual, few of the answers will be of the form “Yes, Dask can solve all of your problems.” These are open questions, not the questions that were easy to answer. We’ll get into what’s possible today and how we might solve these problems in the future.

How do we integrate dask.distributed with SGE/LSF/Mesos/Other?

It’s not difficult to deploy dask.distributed at scale within an existing cluster using a tool like SGE/LSF/Mesos/Other. In many cases there is already a researcher within the institution doing this manually by running dask-scheduler on some static node in the cluster and launching dask-worker a few hundred times with their job scheduler and a small job script.

The goal now is to formalize this process for the particular version of SGE/LSF/Mesos/Other used within the institution while also developing and maintaining a standard Pythonic interface so that all of these tools can be maintained cheaply by Dask developers into the foreseeable future. In some cases Institution X is happy to pay for the development of a convenient “start dask on my job scheduler” tool, but they are less excited about paying to maintain it forever.

We want Python users to be able to say something like the following:

from dask.distributed import Executor, SGECluster

c = SGECluster(nworkers=200, **options)
e = Executor(c)

… and have this same interface be standardized across different job schedulers.
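For example, once the executor is connected, user code would look the same regardless of which job scheduler launched the workers. Continuing the hypothetical snippet above (SGECluster does not exist today; Executor.map and Executor.gather are the standard dask.distributed calls):

def simulate(seed):
    # stand-in for an ordinary model run
    return seed ** 2

futures = e.map(simulate, range(1000))  # fan out across the workers
results = e.gather(futures)             # collect results back to the client
print(sum(results))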

How can we grow and shrink the cluster dynamically based on use?

Alternatively, we could have a single dask.distributed deployment running 24/7 that scales itself up and down dynamically based on current load. Again, this is entirely possible today if you want to do it manually (you can add and remove workers on the fly) but we should add some signals to the scheduler like the following:

  • “I’m under duress, please add workers”
  • “I’ve been idling for a while, please reclaim workers”

and connect these signals to a manager that talks to the job scheduler. This removes an element of control from the users and places it in the hands of a policy that IT can tune to play more nicely with their other services on the same network.
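None of this exists yet. As a thought experiment only, the manager could be little more than a policy loop that translates those two signals into calls against the institution’s job scheduler; the three callables below are placeholders, not Dask or SGE/LSF APIs:

import time

def autoscale(current_load, add_workers, remove_workers,
              low=0.2, high=0.8, interval=30):
    # current_load() returns a utilization estimate in [0, 1];
    # add_workers(n) / remove_workers(n) talk to SGE/LSF/Mesos/Other.
    while True:
        load = current_load()
        if load > high:
            add_workers(10)      # "I'm under duress, please add workers"
        elif load < low:
            remove_workers(10)   # "I've been idling for a while, please reclaim workers"
        time.sleep(interval)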

How do users manage software environments on the workers?

Today Dask assumes that all users and workers share the exact same software environment. There are some small tools to send updated .py and .egg files to the workers but that’s it.
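One example of such a tool (a sketch, assuming the dask.distributed Executor API as I understand it) is upload_file, which ships a single file to every connected worker; the scheduler address and file name below are hypothetical:

from dask.distributed import Executor

e = Executor("scheduler-hostname:8786")  # hypothetical scheduler address
e.upload_file("my_analysis.py")          # send an updated .py file to every worker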

Generally Dask trusts that the full software environment will be handled by something else. This might be a network file system (NFS) mount on traditional cluster setups, or it might be handled by moving Docker or conda environments around with some other tool, like knit for YARN deployments, or something more custom. For example, Continuum sells proprietary software that does this.

Getting the standard software environment setup generally isn’t such a big deal for institutions. They typically have some system in place to handle this already. Where things become interesting is when users want to use drastically different environments from the system environment, like using Python 2 vs Python 3 or installing a bleeding-edge scikit-learn version. They may also want to change the software environment many times in a single session.

The best solution I can think of here is to pass around fully downloaded conda environments using the dask.distributed network (it’s good at moving large binary blobs throughout the network) and then teaching the dask-workers to bootstrap themselves within this environment. We should be able to tear everything down and restart things within a small number of seconds. This requires some work; first to make relocatable conda binaries (which is usually fine but is not always fool-proof due to links) and then to help the dask-workers learn to bootstrap themselves.

Somewhat related, Hussain Sultan of Capital One recently contributed a dask-submit command to run scripts on the cluster: http://distributed.readthedocs.io/en/latest/submitting-applications.html

How secure is the distributed scheduler?

Dask.distributed is incredibly insecure. It allows anyone with network access to the scheduler to execute arbitrary code in an unprotected environment. Data is sent in the clear. Any malicious actor can both steal your secrets and then cripple your cluster.

This is entirely the norm however. Security is usually handled by other services that manage computational frameworks like Dask.

For example we might rely on Docker to isolate workers from destroying their surrounding environment and rely on network access controls to protect data access.

Because Dask runs on Tornado, a serious networking library and web framework, there are some things we can do easily, like enabling SSL, authentication, etc. However I hesitate to jump into providing “just a little bit of security” without going all the way, for fear of providing a false sense of security. In short, I have no plans to work on this without a lot of encouragement. Even then I would strongly recommend that institutions couple Dask with tools intended for security. I believe that is common practice for distributed computational systems generally.

Dask is resilient to worker failure, how about scheduler failure?

Workers can come and go. Clients can come and go. The state in the scheduler is currently irreplaceable and no attempt is made to back it up. There are a few things you could imagine here:

  1. Backup state and recent events to some persistent storage so that state can be recovered in case of catastrophic loss
  2. Have a hot failover node that gets a copy of every action that the scheduler takes
  3. Have multiple peer schedulers operate simultaneously in a way that they can pick up slack from lost peers
  4. Have clients remember what they have submitted and resubmit when a scheduler comes back online

Option 4 is currently the most feasible and gets us most of the way there. However options 2 or 3 would probably be necessary if Dask were to ever run as critical infrastructure in a giant institution. We’re not there yet.

As of recent work spurred on by Stefan van der Walt at UC Berkeley/BIDS the scheduler can now die and come back and everyone will reconnect. The state for computations in flight is entirely lost but the computational infrastructure remains intact so that people can resubmit jobs without significant loss of service.

Dask has a bit of a harder time with this topic because it offers a persistent stateful interface. This problem is much easier for distributed database projects that run ephemeral queries off of persistent storage, return the results, and then clear out state.

What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?

The short answer is no. Other than number of cores and available RAM all workers are considered equal to each other (except when the user explicitly specifies otherwise).

However this problem and problems like it have come up a lot lately. Here are a few examples of similar cases:

  1. Multiple data centers geographically distributed around the country
  2. Multiple racks within a single data center
  3. Multiple workers that have GPUs that can move data between each other easily
  4. Multiple processes on a single machine

Having some notion of hierarchical worker group membership or inter-worker preferred relationships is probably inevitable long term. As with all distributed scheduling questions the hard part isn’t deciding that this is useful, or even coming up with a sensible design, but rather figuring out how to make decisions on the sensible design that are foolproof and operate in constant time. I don’t personally see a good approach here yet but expect one to arise as more high priority use cases come in.

How do we handle multiple concurrent users and priorities?

There are several sub-questions here:

  • Can multiple users use Dask on my cluster at the same time?

Yes, either by spinning up separate scheduler/worker sets or by sharing the same set.

  • If they’re sharing the same workers then won’t they clobber each other’s data?

This is very unlikely. Dask is careful about naming tasks, so it’s very unlikely that the two users will submit conflicting computations that compute to different values but occupy the same key in memory. However if they both submit computations that overlap somewhat then the scheduler will nicely avoid recomputation. This can be very nice when you have many people doing slightly different computations on the same hardware. This works in the same way that Git works.
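A rough way to see this naming behavior is with dask.delayed (a sketch; the exact key format is an internal detail, and pure=True is passed explicitly to request deterministic, content-based keys):

from dask import delayed

def load_and_clean(path):
    return path.upper()  # stand-in for real work

a = delayed(load_and_clean, pure=True)("data.csv")  # built by user 1
b = delayed(load_and_clean, pure=True)("data.csv")  # built by user 2

print(a.key == b.key)  # True: the same function and arguments hash to the same key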

  • If they’re sharing the same workers then won’t they clobber each other’s resources?

Yes, this is definitely possible. If you’re concerned about this then you should give everyone their own scheduler/workers (which is easy and standard practice). There is not currently much user management built into Dask.

How does this compare with Spark?

At an institutional level Spark seems to primarily target ETL + Database-like computations. While Dask modules like Dask.bag and Dask.dataframe can happily play in this space, this doesn’t seem to be the focus of recent conversations.

Recent conversations are almost entirely around supporting interactive custom parallelism (lots of small tasks with complex dependencies between them) rather than the big Map->Filter->Groupby->Join abstractions you often find in a database or Spark. That’s not to say that these operations aren’t hugely important; there is a lot of selection bias here. The people I talk to are people for whom Spark/Databases are clearly not an appropriate fit. They are tackling problems that are way more complex, more heterogeneous, and with a broader variety of users.

I usually describe this situation with an analogy comparing “Big data” systems to human transportation mechanisms in a city. Here we go:

  • A Database is like a train: it goes between a set of well defined points with great efficiency, speed, and predictability. These are popular and profitable routes that many people travel between (e.g. business analytics). You do have to get from home to the train station on your own (ETL), but once you’re in the database/train you’re quite comfortable.
  • Spark is like an automobile: it takes you door-to-door from your home to your destination with a single tool. While this may not be as fast as the train for the long-distance portion, it can be extremely convenient to do ETL, Database work, and some machine learning all from the comfort of a single system.
  • Dask is like an all-terrain-vehicle: it takes you out of town on rough ground that hasn’t been properly explored before. This is a good match for the Python community, which typically does a lot of exploration into new approaches. You can also drive your ATV around town and you’ll be just fine, but if you want to do thousands of SQL queries then you should probably invest in a proper database or in Spark.

Again, there is a lot of selection bias here: if what you want is a database, then you should probably get a database. Dask is not a database.

This is also wildly over-simplifying things. Databases like Oracle have lots of ETL and analytics tools, Spark is known to go off road, etc. I obviously have a bias towards Dask. You really should never trust an author of a project to give a fair and unbiased view of the capabilities of the tools in the surrounding landscape.

Conclusion

That’s a rough sketch of current conversations and open problems for “How Dask might evolve to support institutional use cases.” It’s really quite surprising just how prevalent this story is among the full spectrum from universities to hedge funds.

The problems listed above are by no means halting adoption. I’m not listing the 100 or so questions that are answered with “yes, that’s already supported quite well”. Right now I’m seeing Dask being adopted by individuals and small groups within various institutions. Those individuals and small groups are pushing that interest up the stack. It’s still several months before any 1000+ person organization adopts Dask as infrastructure, but the speed at which momentum is building is quite encouraging.

I’d also like to thank the several nameless people who exercise Dask on various infrastructures at various scales on interesting problems and have reported serious bugs. These people don’t show up on the GitHub issue tracker but their utility in flushing out bugs is invaluable.

As interest in Dask grows it’s interesting to see how it will evolve. Culturally Dask has managed to simultaneously cater to both the open science crowd as well as the private-sector crowd. The project gets both financial support and open source contributions from each side. So far there hasn’t been any conflict of interest (everyone is pushing in roughly the same direction) which has been a really fruitful experience for all involved I think.

This post was originally published by Matt Rocklin on his website, matthewrocklin.com 

by ryanwh at August 19, 2016 04:14 PM

Mining Data Science Treasures with Open Source

Wednesday, August 17, 2016
Travis Oliphant
Chief Executive Officer & Co-Founder
Continuum Analytics


Data Science is a goldmine of potential insights for any organization, but unearthing those insights can be resource-intensive, requiring systems and teams to work seamlessly and effectively as a unit.

Integrating resources isn’t easy. Traditionally, businesses chose vendors with all-in-one solutions to cover the task. This approach may seem convenient, but what freedoms must be sacrificed in order to achieve it?

This vendor relationship resembles the troubled history of coal mining towns in the Old West, where one company would own everything for sale in the town. Workers were paid with vouchers that could only be redeemed at company-owned shops.

The old folk tune "Sixteen Tons" stated it best: "I owe my soul to the company store."

With any monopoly, these vendors have no incentive to optimize products and services. There's only one option available — take it or leave it. But, just as some miners would leave these towns and make their way across the Wild West, many companies have chosen to forge their own trails with the freedom of Open Data Science — and they've never looked back.

Open Data Science: Providing Options

Innovation and flexibility are vital to the evolving field of Data Science, so any alternative to the locked-in vendor approach is attractive. Fortunately, Open Data Science provides the perfect ecosystem of options for true innovation.

Sometimes vendors provide innovation, such as with the infrastructure surrounding linear programming. This doesn’t mean they’re able to provide an out-of-the-box solution for all teams — adapting products to different businesses and industries requires work.

Most of the real innovation is emerging from the open source world. The tremendous popularity of Python and R, for example, bolsters innovation on all kinds of analytics approaches.

Given the wish to avoid a “mining town scenario” and the burgeoning innovation in Open Data Science, why are so many companies still reluctant to adopt it?

Companies Should Not Hesitate to Embrace Open Source

There are several reasons companies balk at Open Data Science solutions:

  • Licensing. Open source has many licenses: Apache, BSD, GPL, MIT, etc. This wide array of choices can produce analysis paralysis. In some cases, such as GPL, there is a requirement to make source code available for redistribution.
  • Diffuse contact. Unlike with vendor products, open source doesn’t provide a single point of contact. It’s a non-hierarchical effort. Companies have to manage keeping software current, and this can feel overwhelming without a steady guide they can rely on.
  • Education. With rapid change, companies find it difficult to stay on top of the many acronyms, project names, and new techniques required with each piece of open source software.

Fortunately, these concerns are completely surmountable. Most licenses are appropriate for commercial applications, and many companies are finding open source organizations to act as a contact point within the Open Data Science world — with the added benefit of a built-in guide to the ever-changing landscape of open source, thereby also solving the problem of education.

The Best Approach for Starting an Open Data Science Initiative

There are several tactics organizations can use to effectively adopt Open Data Science.

For instance, education is crucial, starting with a serious training program. One focus here has to be on reproducibility. Team members should know how the latest graph was produced and how to generate the next iteration of it.

Much of this requires understanding the architecture of the applications one is using, so dependency management is important. Anything that makes the process transparent to the team will promote understanding and engagement.

Flexible governance models are also valuable, allowing intelligent assessment of open source pros and cons. For example, it shouldn’t be difficult to create an effective policy on what sort of open source licenses work best.

Finally, committing resources to successful adaptation and change management should be central to any Open Data Science arrangement. This will almost always require coding to integrate solutions into one’s workflow. But this effort also supports retention: companies that shun open source risk losing developers who seek the cutting edge.

Tackle the Future Before You're Left Behind

Unlike the old mining monopolies, competition in Data Science is a rapidly-changing world of many participants. Companies that do not commit resources to education, understanding, governance and change management risk getting left behind as newer companies commit fully to open source.

Well-managed open source analytics environments, like Anaconda, provide a compatible and updated suite of modern analytics programs. Combine this with a steady guide to traverse the changing landscape of Open Data Science, and the Data Science gold mine becomes ripe for the taking.

by ryanwh at August 19, 2016 04:12 PM

August 17, 2016

Matthieu Brucher

Building Boost.Python with a custom Python3

I’ve started working on porting some Python libraries to Python 3, but I had to use an old Visual Studio (2012), for which there is no official Python 3 build. In the end, I tried following this tutorial. The issue with the tutorial is that it has you download the externals by hand; it is actually simpler to call get_externals.bat from the PCBuild folder.

Be aware that the Visual Studio solution is a little bit flawed: pylauncher is built in Win32 mode in Release instead of x64, which has an impact on deployment.

Once this was done, I had to deploy the build to a proper location so that it is self-contained. I drew heavily on another tutorial by the same author, only adding 64-bit support in this gist.

With that done, it was time to build Boost.Python! To start, just build bjam the usual way; don’t add Python options on the command line, as this will utterly fail in Boost.Build. Then add the following line to user-config.jam (with the proper folders):

using python : 3.4 : D:/Tools/Python-3.4.5/_INSTALL/python.exe : D:/Tools/Python-3.4.5/_INSTALL/include : D:/Tools/Python-3.4.5/_INSTALL/libs ;

The following commands should then build the release and debug modes:

.\b2 --with-python --layout=versioned toolset=msvc-11.0 link=shared stage address-model=64
.\b2 --with-python --layout=versioned toolset=msvc-11.0 link=shared stage address-model=64 python-debugging=on

by Matt at August 17, 2016 07:56 AM

August 16, 2016

Titus Brown

What I did on my summer vacation: Microbial Diversity 2016

I just left Woods Hole, MA, where I spent the last 6 and a half weeks taking the Microbial Diversity course as a student. It was fun, exhausting, stimulating, and life changing!

The course had three components: a lecture series, in which world-class microbiologists gave 2-3 hrs of talks each morning, every Monday-Saturday of each week; a lab practical component, in which we learned to do environmental enrichments and isolations, use microscopes and mass specs and everything else; and a miniproject, a self-directed project that took place the last 2-3 weeks of the course, and for which we could make use of all of the equipment. The days typically went from 9am to 10pm, with a break every Sunday.

I think it's fair to say I've been microbiologized pretty thoroughly :).

The course directors were Dianne Newman and Jared Leadbetter, two Caltech professors that I knew of old (from my own days at Caltech). The main topic of the course was environmental microbiology and biogeochemistry, although we did cover some biomedical topics along with bioinformatics and genomics. The lectures were excellent, and there's only two or three for which I didn't take copious notes (or live-tweet under my student account, @tituslearnz). The other 18 students were amazing - most were grad students, with one postdoc and two other faculty. The course TAs were grad students, postdocs, and staff from the directors' labs as well as other labs. My just-former employee Jessica Mizzi had signed on as course coordinator, too, and it was great to see her basically running the course on a daily basis. Everyone was fantastic, and all in all, it was an incredibly intense and in-depth experience.

Why Microbial Diversity?

I applied to take the course for several reasons. The main one is this: even before I got involved in metagenomics at Michigan State, I'd been moonlighting on microbial genomics for quite a while. I have a paper on E. coli genomics, another paper on syntrophic interactions between methane oxidizing and sulfur reducing bacteria, a third paper on Shewanella oneidensis (with Dianne), and a fourth paper on bone-eating worm symbionts. While I do have a molecular biology background from my grad work in molecular developmental biology, these papers made me painfully aware that I knew essentially no microbial physiology. Yet I now work with metagenomics data regularly, and I'm starting to get into oceanography and marine biogeochemical cycling... I needed to know more about microbes!

On top of wanting to know more, I was also in a good position to take 6 weeks "off" to learn some new skills. I received tenure with my move to UC Davis, and with the Moore Foundation DDD Investigator Award, I have some breathing room on funding. Moreover, I have four new postdocs coming on board (I'll be up to 6 by September!) but they're not here yet, and most of the people in my lab were gone for some or all of the summer. And, finally, my wife had been invited to Woods Hole to lecture in both Microbial Diversity and the STAMPS course, so she would be there for three weeks.

I also expected it to be fun! I took the Embryology course back in 2005, and it was one of the best scientific experiences of my life, and it's one of the reasons I stuck with academia. The combination of intense intellectualism, the general Woods Hole gemisch of awesome scientists hanging out for the summer, and the beautiful Woods Hole weather and beaches, make it a summer mecca for science -- so why not go back?

What were the highlights?

The whole thing was awesome, frankly, although by week three I was pretty exhausted (and then it went on for another three weeks :).

One part of the experience that amazed me was how easy it was to find new microbes. We split into groups of 4-5 students for the first three weeks, and each group did about 15 enrichments and isolations from the environment, using different media and growth conditions. (Some of these also formed the basis for some miniprojects.) As soon as we got critters into pure culture, we would sequence their 16S regions and evaluate whether or not they were already known. From this, I personally isolated about 8 new species; some other students generated dozens of isolates.

My miniproject was super fun, too. (You can read my final report here and see my powerpoint presentation here.) I set out to enrich for sulfur oxidizing autotrophs but ended up finding some likely heterotrophs that grew anaerobically on acetate and nitrate. It was great fun to research media design, perform and "debug" enrichments, and throw machines and equipment at problems (I learned to use the anaerobic chamber, the gas chromatograph, HPLC, and ion chromatography, along with reacquainting myself with autoclaves and molecular biology). I ended up finding a microbe that probably does something relatively pedestrian (oxidizing acetate with nitrate as an electron acceptor), with the twist that I identified the microbe as a Shewanella species, and apparently Shewanella do not generally metabolize acetate anaerobically (although it has been seen before). I sent Jeff Gralnick some of my isolates and he'll look into it, and maybe we'll sequence and analyze the genomes as a follow-on.

I also got to play with Winogradsky columns! Kirsten G. and I put together some Winogradsky columns from environmental silt from Trunk River, a brackish marine ecosystem. We spent a lot of time poking them with microsensors and we're hoping to follow up on things with some metagenomics.

More importantly for my lab's research, I got even more thoroughly confused about the role and value of meta-omics than I had been before. There is an interesting gulf between how we know microbes work in the real world (in complex and rich ecosystems) and how we study them (frequently in isolates in the lab). I've been edging ever closer to working on ecosystem and molecular modeling along with data integration of many 'omic data types, and this course really helped me get a handle on what types of data people are generating and why. I also got a lot of insight into the different types of questions people want to answer, from questions of biogeochemical cycling to resilience and environmental adaptation to phylogeny. I am now better prepared to read papers across all of these areas and I'm hoping to delve deeper into these research areas in the next decade or so.

Also, Lisa Cohen from my lab visited the course with our MinION sequencer and we finally generated a bunch of real genomic data from two isolates! The genomes each resolved into two contigs, and we hope to publish them soon. You can read Lisa's extremely thorough blog post on the Nanopore sequencer here, and also see the tutorial she put together for dealing with nanopore data here.

A personal highlight: at the end of the course, the TAs gave me the Tenure Deadbeat award, for "the most disturbing degeneration from faculty to student and hopefully back." It's gonna go on my office wall.

Surely something wasn't great? Anything?

Well,

  • six weeks of cafeteria food was rough.

  • six weeks of dorm life was somewhat rough (ameliorated by an awesome roommate, Mike B. from Thea Whitman's lab).

  • there was no air conditioning (really, no airflow at all) in my dorm room. And it gets very hot and humid in Woods Hole at times. Doooooom.

    Luckily I was exhausted most of the time because otherwise getting to sleep would have been impossible instead of merely very difficult.

  • my family was in town for three weeks, but I couldn't spend much time with them, and I missed them when they weren't in town!

  • my e-mail inbox is now a complete and utter disaster.

...but all of these were more or less expected.

Interestingly, I ignored Twitter and much of my usual "net monitoring" for the summer and kind of enjoyed being in my little bubble. I doubt I'll be able to keep it up, though.

In sum

  • Awesome experience.
  • Learned much microbiology.
  • Need to learn more chemistry!

by C. Titus Brown at August 16, 2016 10:00 PM

Continuum Analytics news

Dask for Institutions

Posted Tuesday, August 16, 2016

Introduction

Institutions use software differently than individuals. Over the last few months I’ve had dozens of conversations about using Dask within larger organizations like universities, research labs, private companies, and non-profit learning systems. This post provides a very coarse summary of those conversations and extracts common questions. I’ll then try to answer those questions.

Note: some of this post will be necessarily vague at points. Some companies prefer privacy. All details here are either in public Dask issues or have come up with enough institutions (say at least five) that I’m comfortable listing the problem here.

Common story

Institution X, a university/research lab/company/… has many scientists/analysts/modelers who develop models and analyze data with Python, the PyData stack like NumPy/Pandas/SKLearn, and a large amount of custom code. These models/data sometimes grow to be large enough to need a moderately large amount of parallel computing.

Fortunately, Institution X has an in-house cluster acquired for exactly this purpose of accelerating modeling and analysis of large computations and datasets. Users can submit jobs to the cluster using a job scheduler like SGE/LSF/Mesos/Other.

However the cluster is still under-utilized and the users are still asking for help with parallel computing. Either users aren’t comfortable using the SGE/LSF/Mesos/Other interface, it doesn’t support sufficiently complex/dynamic workloads, or the interaction times aren’t good enough for the interactive use that users appreciate.

There was an internal effort to build a more complex/interactive/Pythonic system on top of SGE/LSF/Mesos/Other but it’s not particularly mature and definitely isn’t something that Institution X wants to pursue. It turned out to be a harder problem than expected to design/build/maintain such a system in-house. They’d love to find an open source solution that was well featured and maintained by a community.

The Dask.distributed scheduler looks like it’s 90% of the system that Institution X needs. However there are a few open questions:

  • How do we integrate dask.distributed with the SGE/LSF/Mesos/Other job scheduler?
  • How can we grow and shrink the cluster dynamically based on use?
  • How do users manage software environments on the workers?
  • How secure is the distributed scheduler?
  • Dask is resilient to worker failure, how about scheduler failure?
  • What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?
  • How do we handle multiple concurrent users and priorities?
  • How does this compare with Spark?

So for the rest of this post I’m going to answer these questions. As usual, few of the answers will be of the form “Yes, Dask can solve all of your problems.” These are open questions, not the questions that were easy to answer. We’ll get into what’s possible today and how we might solve these problems in the future.

How do we integrate dask.distributed with SGE/LSF/Mesos/Other?

It’s not difficult to deploy dask.distributed at scale within an existing cluster using a tool like SGE/LSF/Mesos/Other. In many cases there is already a researcher within the institution doing this manually by running dask-scheduler on some static node in the cluster and launching dask-worker a few hundred times with their job scheduler and a small job script.

The goal now is to formalize this process for the particular version of SGE/LSF/Mesos/Other used within the institution, while also developing and maintaining a standard Pythonic interface so that all of these tools can be maintained cheaply by Dask developers into the foreseeable future. In some cases Institution X is happy to pay for the development of a convenient “start dask on my job scheduler” tool, but they are less excited about paying to maintain it forever.

We want Python users to be able to say something like the following:

from dask.distributed import Executor, SGECluster

c = SGECluster(nworkers=200, **options)
e = Executor(c)

… and have this same interface be standardized across different job schedulers.

How can we grow and shrink the cluster dynamically based on use?

Alternatively, we could have a single dask.distributed deployment running 24/7 that scales itself up and down dynamically based on current load. Again, this is entirely possible today if you want to do it manually (you can add and remove workers on the fly) but we should add some signals to the scheduler like the following:

  • “I’m under duress, please add workers”
  • “I’ve been idling for a while, please reclaim workers”

and connect these signals to a manager that talks to the job scheduler. This removes an element of control from the users and places it in the hands of a policy that IT can tune to play more nicely with their other services on the same network.
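
As a rough illustration of that manager, here is a minimal Python sketch of such a policy loop. Nothing in it is a real Dask API: get_scheduler_load, request_workers and release_workers are hypothetical stand-ins for the scheduler signals and the job-scheduler bridge described above.

import time

def autoscale(get_scheduler_load, request_workers, release_workers,
              min_workers=1, max_workers=200, interval=30):
    # Hypothetical policy loop: poll a load signal from the scheduler and ask a
    # job-scheduler bridge (SGE/LSF/Mesos/Other) to add or reclaim workers.
    while True:
        pending_tasks, n_workers = get_scheduler_load()
        if pending_tasks > 10 * n_workers and n_workers < max_workers:
            request_workers(n_workers)       # "I'm under duress, please add workers"
        elif pending_tasks == 0 and n_workers > min_workers:
            release_workers(n_workers // 2)  # "I've been idling for a while, please reclaim workers"
        time.sleep(interval)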

How do users manage software environments on the workers?

Today Dask assumes that all users and workers share the exact same software environment. There are some small tools to send updated .py and .egg files to the workers but that’s it.
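
One of those small tools is the executor's upload_file method. Here is a minimal sketch of how it is used; the scheduler address and the module name are placeholders, and it assumes a scheduler and workers are already running.

from dask.distributed import Executor

e = Executor('scheduler-address:8786')   # placeholder address
e.upload_file('my_analysis_code.py')     # ship this one file to every current worker

import my_analysis_code                  # now importable locally...
future = e.submit(my_analysis_code.process, 'some-input')   # ...and on the workers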

Generally Dask trusts that the full software environment will be handled by something else. This might be a network file system (NFS) mount on traditional cluster setups, or it might be handled by moving docker or conda environments around by some other tool like knit for YARN deployments or something more custom. For example Continuum sells proprietary software that does this.

Getting the standard software environment setup generally isn’t such a big deal for institutions. They typically have some system in place to handle this already. Where things become interesting is when users want to use drastically different environments from the system environment, like using Python 2 vs Python 3 or installing a bleeding-edge scikit-learn version. They may also want to change the software environment many times in a single session.

The best solution I can think of here is to pass around fully downloaded conda environments using the dask.distributed network (it’s good at moving large binary blobs throughout the network) and then teaching the dask-workers to bootstrap themselves within this environment. We should be able to tear everything down and restart things within a small number of seconds. This requires some work; first to make relocatable conda binaries (which is usually fine but is not always fool-proof due to links) and then to help the dask-workers learn to bootstrap themselves.

Somewhat related, Hussain Sultan of Capital One recently contributed a dask-submit command to run scripts on the cluster: http://distributed.readthedocs.io/en/latest/submitting-applications.html

How secure is the distributed scheduler?

Dask.distributed is incredibly insecure. It allows anyone with network access to the scheduler to execute arbitrary code in an unprotected environment. Data is sent in the clear. Any malicious actor can both steal your secrets and then cripple your cluster.

This is entirely the norm however. Security is usually handled by other services that manage computational frameworks like Dask.

For example we might rely on Docker to isolate workers from destroying their surrounding environment and rely on network access controls to protect data access.

Because Dask runs on Tornado, a serious networking library and web framework, there are some things we can do easily like enabling SSL, authentication, etc. However I hesitate to jump into providing “just a little bit of security” without going all the way for fear of providing a false sense of security. In short, I have no plans to work on this without a lot of encouragement. Even then I would strongly recommend that institutions couple Dask with tools intended for security. I believe that is common practice for distributed computational systems generally.

Dask is resilient to worker failure, how about scheduler failure?

Workers can come and go. Clients can come and go. The state in the scheduler is currently irreplaceable and no attempt is made to back it up. There are a few things you could imagine here:

  1. Backup state and recent events to some persistent storage so that state can be recovered in case of catastrophic loss
  2. Have a hot failover node that gets a copy of every action that the scheduler takes
  3. Have multiple peer schedulers operate simultaneously in a way that they can pick up slack from lost peers
  4. Have clients remember what they have submitted and resubmit when a scheduler comes back online

Option 4 is currently the most feasible and gets us most of the way there. However options 2 or 3 would probably be necessary if Dask were to ever run as critical infrastructure in a giant institution. We’re not there yet.

As of recent work spurred on by Stefan van der Walt at UC Berkeley/BIDS the scheduler can now die and come back and everyone will reconnect. The state for computations in flight is entirely lost but the computational infrastructure remains intact so that people can resubmit jobs without significant loss of service.

Dask has a bit of a harder time with this topic because it offers a persistent stateful interface. This problem is much easier for distributed database projects that run ephemeral queries off of persistent storage, return the results, and then clear out state.

What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?

The short answer is no. Other than number of cores and available RAM all workers are considered equal to each other (except when the user explicitly specifies otherwise).
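
The “explicitly specifies otherwise” escape hatch presumably refers to worker restrictions: submit accepts a workers= argument that pins a task to particular workers. A small sketch, with placeholder addresses and a made-up task:

from dask.distributed import Executor

e = Executor('scheduler-address:8786')   # placeholder address

def train(data):
    # made-up task for illustration
    return len(data)

# Restrict this task to specific workers (say, the ones with GPUs attached).
future = e.submit(train, 'some-data',
                  workers=['gpu-node-1:8789', 'gpu-node-2:8789'])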

However this problem and problems like it have come up a lot lately. Here are a few examples of similar cases:

  1. Multiple data centers geographically distributed around the country
  2. Multiple racks within a single data center
  3. Multiple workers that have GPUs that can move data between each other easily
  4. Multiple processes on a single machine

Having some notion of hierarchical worker group membership or inter-worker preferred relationships is probably inevitable long term. As with all distributed scheduling questions the hard part isn’t deciding that this is useful, or even coming up with a sensible design, but rather figuring out how to make decisions on the sensible design that are foolproof and operate in constant time. I don’t personally see a good approach here yet but expect one to arise as more high priority use cases come in.

How do we handle multiple concurrent users and priorities?

There are several sub-questions here:

  • Can multiple users use Dask on my cluster at the same time?

Yes, either by spinning up separate scheduler/worker sets or by sharing the same set.

  • If they’re sharing the same workers then won’t they clobber each other’s data?

This is very unlikely. Dask is careful about naming tasks, so it’s very unlikely that the two users will submit conflicting computations that compute to different values but occupy the same key in memory. However if they both submit computations that overlap somewhat then the scheduler will nicely avoid recomputation. This can be very nice when you have many people doing slightly different computations on the same hardware. This works in the same way that Git works.

  • If they’re sharing the same workers then won’t they clobber each other’s resources?

Yes, this is definitely possible. If you’re concerned about this then you should give everyone their own scheduler/workers (which is easy and standard practice). There is not currently much user management built into Dask.
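
Returning to the question about clobbering data, here is a tiny sketch of the mechanism. Keys for pure tasks are derived from a deterministic hash of the function and its arguments (dask.base.tokenize), so two users submitting the same computation target the same key rather than overwriting each other; the 'add-' prefix below just mimics how such keys typically look.

from operator import add
from dask.base import tokenize

key_user_a = 'add-' + tokenize(add, 1, 2)
key_user_b = 'add-' + tokenize(add, 1, 2)
print(key_user_a == key_user_b)   # True: same function and arguments, same key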

How does this compare with Spark?

At an institutional level Spark seems to primarily target ETL + Database-like computations. While Dask modules like Dask.bag and Dask.dataframe can happily play in this space this doesn’t seem to be the focus of recent conversations.

Recent conversations are almost entirely around supporting interactive custom parallelism (lots of small tasks with complex dependencies between them) rather than the big Map->Filter->Groupby->Join abstractions you often find in a database or Spark. That’s not to say that these operations aren’t hugely important; there is a lot of selection bias here. The people I talk to are people for whom Spark/Databases are clearly not an appropriate fit. They are tackling problems that are way more complex, more heterogeneous, and with a broader variety of users.

I usually describe this situation with an analogy comparing “Big data” systems to human transportation mechanisms in a city. Here we go:

  • A Database is like a train: it goes between a set of well defined points with great efficiency, speed, and predictability. These are popular and profitable routes that many people travel between (e.g. business analytics). You do have to get from home to the train station on your own (ETL), but once you’re in the database/train you’re quite comfortable.
  • Spark is like an automobile: it takes you door-to-door from your home to your destination with a single tool. While this may not be as fast as the train for the long-distance portion, it can be extremely convenient to do ETL, Database work, and some machine learning all from the comfort of a single system.
  • Dask is like an all-terrain-vehicle: it takes you out of town on rough ground that hasn’t been properly explored before. This is a good match for the Python community, which typically does a lot of exploration into new approaches. You can also drive your ATV around town and you’ll be just fine, but if you want to do thousands of SQL queries then you should probably invest in a proper database or in Spark.

Again, there is a lot of selection bias here; if what you want is a database then you should probably get a database. Dask is not a database.

This is also wildly over-simplifying things. Databases like Oracle have lots of ETL and analytics tools, Spark is known to go off road, etc. I obviously have a bias towards Dask. You really should never trust an author of a project to give a fair and unbiased view of the capabilities of the tools in the surrounding landscape.

Conclusion

That’s a rough sketch of current conversations and open problems for “How Dask might evolve to support institutional use cases.” It’s really quite surprising just how prevalent this story is among the full spectrum from universities to hedge funds.

The problems listed above are by no means halting adoption. I’m not listing the 100 or so questions that are answered with “yes, that’s already supported quite well”. Right now I’m seeing Dask being adopted by individuals and small groups within various institutions. Those individuals and small groups are pushing that interest up the stack. It’s still several months before any 1000+ person organization adopts Dask as infrastructure, but the speed at which momentum is building is quite encouraging.

I’d also like to thank the several nameless people who exercise Dask on various infrastructures at various scales on interesting problems and have reported serious bugs. These people don’t show up on the GitHub issue tracker but their utility in flushing out bugs is invaluable.

As interest in Dask grows it’s interesting to see how it will evolve. Culturally Dask has managed to simultaneously cater to both the open science crowd as well as the private-sector crowd. The project gets both financial support and open source contributions from each side. So far there hasn’t been any conflict of interest (everyone is pushing in roughly the same direction) which has been a really fruitful experience for all involved I think.

This post was originally published by Matt Rocklin on his website, matthewrocklin.com.

by swebster at August 16, 2016 02:51 PM

August 12, 2016

William Stein

Jupyter: "take the domain name down immediately"

The Jupyter notebook is an open source BSD-licensed browser-based code execution environment, inspired by my early work on the Sage Notebook (which we launched in 2007), which was in turn inspired heavily by Mathematica notebooks and Google docs. Jupyter used to be called IPython.

SageMathCloud is an open source web-based environment for using Sage worksheets, terminals, LaTeX documents, course management, and Jupyter notebooks. I've put much hard work into making it so that multiple people can simultaneously edit Jupyter notebooks in SageMathCloud, and the history of all changes are recorded and browsable via a slider.

Many people have written to me asking for there to be a modified version of SageMathCloud, which is oriented around Jupyter notebooks instead of Sage worksheets. So the default file type is Jupyter notebooks, the default kernel doesn't involve the extra heft of Sage, etc., and the domain name involves Jupyter instead of "sagemath". Some people are dissuaded from using SageMathCloud for Jupyter notebooks because of the "SageMath" name.

Dozens of web applications (including SageMathCloud) use the word "Jupyter" in various places. However, I was unsure about using "jupyter" in a domain name. I found this github issue and requested clarification 6 weeks ago. We've had some back and forth, but they recently made it clear that it would be at least a month until any decision would be considered, since they are too busy with other things. In the meantime, I rented jupytercloud.com, which has a nice ring to it, as the planet Jupiter has clouds. Yesterday, I made jupytercloud.com point to cloud.sagemath.com to see what it would "feel like" and Tim Clemans started experimenting with customizing the page based on the domain name that the client sees. I did not mention jupytercloud.com publicly anywhere, and there were no links to it.

Today I received this message:

    William,

    I'm writing this representing the Jupyter project leadership
    and steering council. It has recently come to the Jupyter
    Steering Council's attention that the domain jupytercloud.com
    points to SageMathCloud. Do you own that domain? If so,
    we ask that you take the domain name down immediately, as
    it uses the Jupyter name.

I of course immediately complied. It is well within their rights to dictate how their name is used, and I am obsessive about scrupulously doing everything I can to respect people's intellectual property; with Sage we have put huge amounts of effort into honoring both the letter and spirit of copyright statements on open source software.

I'm writing this because it's unclear to me what people really want, and I have no idea what to do here.

1. Do you want something built on the same technology as SageMathCloud, but much more focused on Jupyter notebooks?

2. Does the name of the site matter to you?

3. What model should the Jupyter project use for their trademark? Something like Python? Like Git? Like Linux? Like Firefox? Like the email program PINE? Something else entirely?

4. Should I be worried about using Jupyter at all anywhere? E.g., in this blog post? As the default notebook for the SageMath project?

I appreciate any feedback.

Hacker News Discussion

UPDATE (Aug 12, 2016): The official decision is that I cannot use the domain jupytercloud.com.   They did say I can use jupyter.sagemath.com or sagemath.com/jupyter.   Needless to say, I'm disappointed, but I fully respect their (very foolish, IMHO) decision.


by William Stein (noreply@blogger.com) at August 12, 2016 10:46 AM

August 09, 2016

Continuum Analytics news

Anaconda Build Migration to Conda-Forge

Posted Wednesday, August 10, 2016

Summary

In a prior blog post, we shared with the community that we’re deprecating support for the Anaconda Build system on Anaconda Cloud. To help navigate the transition, we’re providing some additional guidance on suggested next steps.

What is Continuum’s Recommended Approach? 

The Anaconda Cloud team recommends using the community-based conda-forge for open source projects. Conda-forge leverages Github to create high-quality recipes, and Circle-CI, Travis-CI, and Appveyor to produce conda packages on Linux, MacOS and Windows with a strong focus on automation and reproducibility.

Why conda-forge?

Here are some of the benefits we see from using conda-forge: 

  1. Free: Conda-forge is a solidly supported community effort to offer free conda package builds. 
  2. Easier for your users: By building your package with conda-forge, you will be distributing your package in the conda-forge channel. This makes it easier for your users to access your package via the next-most-common place to get packages after the Anaconda defaults channel.
  3. Easier for you: You no longer have to set up extra CI or servers just for building conda packages. Instead, you can take advantage of an existing set of build machinery through the simplicity of creating a Pull Request in Github.
  4. Cross-Platform: If you were using Anaconda Build’s free public queue, only Linux builds were available. With conda-forge, you’ll get support for building packages not just on Linux, but also Windows and MacOS, opening up huge new audiences to your package.
  5. Better feedback: With many more people looking at your recipe, you’ll likely get additional feedback to help improve your package.  
  6. Evolve with best practices: As best practices around package building evolve, these standards will automatically be applied to your recipe(s), so you will be growing and improving in step with the community. 

To read more about conda-forge, see this guest blog post.

What is the High-Level Migration Approach? 

If you were using Anaconda Build’s public Queue (Linux worker) and you had a working recipe previously, the migration process should be straightforward for you:

  1. Grab a copy of conda-forge's staged recipes repo
  2. Copy over your existing recipe into the staged-recipes/recipes folder
  3. Take a look at the example recipe in the staged-recipes/recipes folder, and adapt your recipe to the suggested style there.
  4. Submit a PR for your new recipe against conda-forge’s staged-recipes repo.
  5. After review of your PR, a conda-forge moderator will merge in the PR.  
  6. Automated scripts will create a new feedstock repo and set up permissions for you on that repo.

Migration Tasks

This section walks through in more detail what the migration tasks are. It is a slightly expanded version of what you’ll find in the conda-forge Staged Recipes Repo readme.

Previously with Anaconda Build, you had…

  1. conda.recipe/: containing at least meta.yaml, and maybe build/test scripts and resources (patches)
  2. .binstar.yml: no longer used going forward

To get started with conda-forge...

  1. Fork https://github.com/conda-forge/staged-recipes
    • git clone https://github.com/<USERNAME>/staged-recipes
    • cp -r <old repo>/conda.recipe staged-recipes/recipes/<PKG-NAME>
    • Edit your recipe, based on example recipe in staged-recipes/recipes
    • git add recipes/<PKG-NAME>
    • git commit -m "Add <PKG-NAME> recipe"
    • git push origin
  2. Create a Pull Request to conda-forge/staged-recipes from your new branch
    • The conda-forge linter bot will post a message about any potential issues or style suggestions it has on your recipe.
    • CI scripts will automatically build against Linux, MacOS, and Windows.
      • You can skip platforms by adding selector lines to your meta.yaml, such as:
        build:
          skip: True  # [win]
  3. Reviewers will suggest improvements to your recipe.
  4. When all builds succeed, it’s ready to be reviewed by a community moderator for merge.
  5. The moderator will review and merge your new staged-recipes.  From here, automated scripts create a new repo (called a feedstock) just for your package, with the github id’s from the recipe-maintainers section added as maintainers. 
  6. As mentioned above, as best practices evolve, automated processes will periodically submit PR’s to your feedstock.  It is up to you and any other administrators to review and merge these PRs.

Example Packages

The following list links to a few prototypical scenarios you may find helpful as a reference: 

  • You can turn off a specific platform with build: skip: true  # [<platform>], for example on Windows.
  • You can use a PyPI release upstream with a published sha256 checksum.
  • If you use post/pre-link scripts, make sure they don’t produce a lot of output; write to .messages.txt instead.
  • If you are building packages with compiled components (pure C libraries, Python packages with compiled extensions, etc.), use the toolchain package.
  • If you need custom Linux packages that are not available as conda packages, specify them in yum_requirements.txt.

What to be Aware of 

We’re really excited about conda-forge and supportive of its efforts. However, these are some things to keep in mind as you explore if it’s a fit for your project: 

  • If you were using Anaconda-Build as your Continuous Integration service, you probably do have to set up a new workflow; the astropy ci-helpers are very helpful. Conda-forge is not your project’s CI!
  • The upshot of this is you should only be building in conda-forge when you are ready to do a release. If you do try to use conda-forge as CI, you will likely get frustrated because your builds will take longer than you want and it won’t feel like a CI system.
  • When you submit a new package, you (and possibly the people you recruit) are signing up to be a maintainer of this package. The community expectation is that you will continue to update your package with respect to the upstream dependencies. The implied message is that conda-forge is not a great place to add hobby projects that you have no intention of maintaining.
  • If your recipe doesn’t meet the community quality guidelines, you are likely to get community feedback with the expectation that you’ll make changes.

Alternatives to be Aware of

Creating reproducible, automated, cross-platform builds is easier today than in the past thanks to virtualization and containerization, but it is still challenging due to proprietary licenses and compiler toolchains. We’ve got some examples of conda build environments that can make it easier to build your own packages virtually, with Vagrant and with Docker.

If you are thinking about how to tackle this problem at scale for private/commercial projects, or internally behind the firewall, please contact us to find out how we can help. 

Getting Started and Getting Involved

  1. General info: Read more about conda-forge: https://conda-forge.github.io/ 
  2. Create your recipe: start with the readme here: https://github.com/conda-forge/staged-recipes
  3. For general Getting Started issues, try the FAQ on the staged-recipes wiki.
  4. Getting help: Non-package-specific issues (best practices, general guidance and so on) are best raised on the conda-forge.github.io issue tracker. There is also a gitter chat channel: https://gitter.im/conda-forge/conda-forge.github.io.
  5. News: To find out how the community is evolving, stay tuned to the meeting minutes on hackpad, and feel free to join meetings there. Advance notice is important, though - meetings are often hosted on Google Hangouts, which has a 10-person cap. We have alternate avenues, but need to set them up.

What’s Your Experience?

If you’ve been using conda-forge or completed a migration, let us know what your experience has been in the comments section below, or on Twitter @ContinuumIO

by swebster at August 09, 2016 07:29 PM

August 08, 2016

Continuum Analytics news

The Citizen Data Scientist Strikes Back

Posted Monday, August 8, 2016

How Open Data Science is Giving Jedi Superpowers to the World

As the Data Science galaxy starts to form, novel concepts will inevitably emerge—ones that we cannot predict from today’s limited vantage point.
 
One of the latest of these is the “citizen data scientist.” This person is leveraging the latest data science technologies without a formal background in math or computer science. The growing community of citizen data scientists––engineers, analysts, scientists, economists and many others––outnumbers the mythical unicorn data scientists and, if equipped with the right technology and solutions, will substantially contribute new discoveries, new revenue, new innovations, new policies and more that will drive value in the marketplace through innovation, productivity and efficiency.
 
Unlike previous technology periods, where only specialized gurus could engage with innovation, Open Data Science empowers anyone through a wealth of highly accessible and innovative open source technology. As a result, a new generation of Jedis is being armed with greater intelligence to change the world from the near-infinite applications of data science today.

Open Data Science Puts the Power of Intelligence in the Citizens’ Hands

In the past, data and analytics were sequestered into a back corner for highly specialized teams to mine and unearth nuggets that led to epiphanies from mounds of data. Not only is data liberated today, but now data science is also being liberated. The lynchpin to data science liberation is making data science accessible via familiar and consumable applications. Now, any curious analyst, engineer or anyone using Open Data Science has the power to make informed decisions with real evidence and, most importantly, intelligent power that leverages not just predictive analytics, but machine learning and, increasingly, deep learning. Armed with this saber of intelligence, every day tasks in business and life will change. 
 
Take, for example, the TaxBrain project. TaxBrain is a striking example of the power of Open Data Science. Using open source resources like Anaconda, the Open Source Policy Center (OSPC) launched a publicly-available web application that allows anyone to explore how changes to tax policies affect government revenues and taxpayers. This browser-based software allows journalists, consumers, academics, economists––all citizen data scientists––to understand the implications of tax policy changes.
 
With Open Data Science underlying the creation of TaxBrain, anyone can run highly-sophisticated mathematical models of the complex U.S. tax code without having to know the details or code underlying it.
 
That’s truly empowering—and the range of potential data science applications doesn’t end there.

Making Data Intuitive with Modern Interactive Visualizations

Open Data Science is ushering in a golden age of consumability through interactive visualizations that engage and activate citizen data scientists. Our minds are not optimized for holding more than around seven numbers, let alone the tens of millions that are part and parcel to displaying Big Data. However, the human visual system is much better at wrangling complexity. The more these visualization methods progress, the more intuition—from anyone, not just experts—can be brought to bear on emerging challenges with visualizing Big Data. 
 
High-tech and brain-optimized visual interactions were foreshadowed in the 2002 film "Minority Report" where the hero, portrayed by Tom Cruise, uses a holographic display to rapidly manipulate data by flicking through the images with his hands, physically interacting with and manipulating the data as he experiences it. This interface allowed him to quickly analyze complex data as intuitive, beautiful visualizations.
 
Similarly, Open Data Science currently offers beautiful visualizations that allow citizen data scientists to engage with massive data sets. Datashader and Bokeh - delivered with Anaconda, the leading Open Data Science platform - provide intuitive visualizations using novel approaches to represent data—all based on the functionality of the human visual system. Analysts can see the rich texture of complex data without being overwhelmed by what they see, which makes data science more accessible. 

Open Data Science Allows Us to Focus on Essentials

We are approaching a future where complexity and uncertainty are finally embedded into intelligent applications that everyone can use without having to understand the underlying math and computer science. By arming citizen data scientists with these powerful intelligent applications, we empower everyone with a virtual supercomputer at their fingertips that closes the gap between the status quo predictive analytics and the power of human cognition and “intuition.” As we gain confidence in these intelligent apps, we will embed the intelligence into devices and applications that will automatically adapt to ever changing conditions to determine the best, or set of best, courses of action. 

Take the budgeting process of a company, for example. Various projects, hurdle rates, costs, staffing and more can be analyzed to recommend the best use of corporate resources––cash, people, time––eliminating the political jockeying, while achieving conflicting objectives such as maximizing ROI, while minimizing costs. The actual execution results are used to feed into future budgeting so that the new budgets adapt based on the execution capabilities of the organization. Streamlining decisional overhead allows citizen data scientists to focus on what matters.

Through the interoperability of Open Data Science, separate analysis components will work together like perfectly interlocking Lego pieces. That’s a multiplier of value, limited only by the imagination of visionary citizen data scientists.

Building the Empire with Open Data Science

Open Data Science is ushering in the era of the citizen data scientist. It’s an exhilarating time for organizations, ripe with innovative opportunities. But, managing the transition can seem daunting.
 
How can organizations embrace Open Data Science solutions, while avoiding a quagmire of technical, process and legal issues? These points are addressed in a recent webinar, The Journey to Open Data Science, with content freely available.
 
Exciting changes are coming, brought to us by the citizen data scientists of the future.

 

by swebster at August 08, 2016 06:17 PM

August 05, 2016

Spyder

Qt Charts (PyQt5.QtChart) as (Py)Qwt replacement

I've just discovered an interesting alternative to PyQwt for plotting curves efficiently: Qt Charts, a module of Qt which has existed since at least 2012, according to the PyQtChart changelog (PyQtChart is a set of Python bindings for the Qt Charts library).

Here is a simple example which is as efficient as PyQwt (and much more efficient than PythonQwt):
(the associated Python code is here)
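
For readers who cannot follow the links, here is a minimal sketch of what such a Qt Charts curve plot looks like from Python. This is my own illustrative example rather than the linked code, and it assumes PyQt5 with the PyQtChart bindings installed.

import math
import sys

from PyQt5.QtWidgets import QApplication, QMainWindow
from PyQt5.QtChart import QChart, QChartView, QLineSeries

app = QApplication(sys.argv)

# Build a curve point by point.
series = QLineSeries()
for i in range(1000):
    x = i * 0.01
    series.append(x, math.sin(x))

chart = QChart()
chart.addSeries(series)
chart.createDefaultAxes()
chart.setTitle("sin(x) with PyQt5.QtChart")

window = QMainWindow()
window.setCentralWidget(QChartView(chart))
window.resize(600, 400)
window.show()
sys.exit(app.exec_())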



by Pierre (noreply@blogger.com) at August 05, 2016 01:59 AM

August 01, 2016

Continuum Analytics news

Powering Down the Anaconda Build CI System on Anaconda Cloud

Posted Monday, August 1, 2016

**UPDATES [Aug 10, 2016]: 

  • For our on-premises customers, the need may exist for in-house build/packaging and C.I. capabilities, and we will prioritize these capabilities in our road-map in accordance with customer demand.
  • We’ve added a blog post here to help people assess if conda-forge is a good alternative and migrate to it if so.

**end updates

We are reaching out to our committed base of Anaconda Cloud users to let you know we’ve made the decision to deprecate the Anaconda Build service within Anaconda Cloud as of September 1, 2016. Please note that Anaconda Build is not to be confused with conda-build, which will continue to be supported and maintained as a vital community project. The remaining functionality within Anaconda Cloud will continue to be available to the Anaconda user community to host, share and collaborate via conda packages, environments and notebooks.

Anaconda Build is a continuous integration service that is part of the Anaconda platform (specifically, Anaconda Cloud and Anaconda Repository). Anaconda Build helps conda users and data scientists automatically build and upload cross-platform conda packages as part of their project’s development and release workflow.

Anaconda Build was initially launched [in effect as a beta product] to support internal needs of the Anaconda platform teams and to explore the combination of a continuous integration service with the flexibility of conda packages and conda-build. While we were developing and supporting Anaconda Build, it was gratifying to see significant community effort around conda-forge to solve issues around cross-platform package-build workflows with conda and the formation of a community-driven collection of conda recipes to serve the open data science community. The resulting conda package builds from conda-forge are hosted on Anaconda Cloud.

We feel that the Anaconda community efforts around conda-forge (along with Travis CI, CircleCI, and Appveyor) are in a better position to deliver reliable package building services compared to our internal efforts with Anaconda Build. For both Anaconda community users and Anaconda platform subscribers, alternative solutions for automated conda package builds can be developed using the previously mentioned services.

The documentation for Anaconda Build will continue to be available in the Anaconda Cloud documentation for a short period of time after Anaconda Build is deprecated but will be removed in the near future.

If you have any questions about the deprecation of Anaconda Build or how to transition your conda package build workflow to conda-forge or other continuous integration services, please post on the Anaconda Cloud issue tracker on Github.

-The Anaconda Platform Product Team

by swebster at August 01, 2016 07:51 PM

July 28, 2016

Continuum Analytics news

Dask and scikit-learn: a 3-Part Tutorial

Posted Thursday, July 28, 2016

Dask core contributor Jim Crist has put together a series of posts discussing some recent experiments combining Dask and scikit-learn on his blog, Marginally Stable. From these experiments, a small library has been built up, and can be found here.

The tutorial spans three posts, which cover model parallelism, data parallelism and combining the two with a real-life dataset.

Part I: Dask & scikit-learn: Model Parallelism

In this post we'll look instead at model parallelism (using the same data across different models), and dive into a daskified implementation of GridSearchCV.
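
To give a flavour of the model-parallel idea, here is an illustrative sketch using plain dask.delayed with scikit-learn; it is not the dask-learn GridSearchCV API from Jim's post, just the underlying pattern of fitting several models over the same data in parallel.

from dask import delayed, compute
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

@delayed
def fit_and_score(C):
    # Fit one model per parameter setting; each fit is an independent task.
    model = LogisticRegression(C=C).fit(X, y)
    return C, model.score(X, y)

# Build the task graph lazily, then run all of the fits in parallel.
results = compute(*[fit_and_score(C) for C in (0.01, 0.1, 1.0, 10.0)])
print(max(results, key=lambda r: r[1]))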

Part II: Dask & scikit-learn: Data Parallelism

In the last post we discussed model-parallelism — fitting several models across the same data. In this post we'll look into simple patterns for data-parallelism, which will allow fitting a single model on larger datasets.

Part III: Dask & scikit-learn: Putting it All Together

In this post we'll combine the above concepts together to do distributed learning and grid search on a real dataset; namely the airline dataset. This contains information on every flight in the USA between 1987 and 2008.

Keep up with Jim and his blog by following him on Twitter, @jiminy_crist

 

by swebster at July 28, 2016 05:34 PM

July 25, 2016

Continuum Analytics news

Analyzing and Visualizing Big Data Interactively on Your Laptop: Datashading the 2010 US Census

Posted Tuesday, July 26, 2016

The 2010 Census collected a variety of demographic information for all the more than 300 million people in the USA. A subset of this has been pre-processed by the Cooper Center at the University of Virginia, who produced an online map of the population density and the racial/ethnic makeup of the USA. Each dot in this map corresponds to a specific person counted in the census, located approximately at their residence. (To protect privacy, the precise locations have been randomized at the block level, so that the racial category can only be determined to within a rough geographic precision.)

Using Datashader on Big Data

The Cooper Center website delivers pre-rendered image tiles to your browser, which is fast, but limited to the plotting choices they made. What if you want to look at the data a different way - filter it, combine it with other data or manipulate it further? You could certainly re-do the steps they did, using their Python source code, but that would be a big project. Just running the code takes "dozens of hours" and adapting it for new uses requires significant programming and domain expertise. However, the new Python datashader library from Continuum Analytics makes it fast and fully interactive to do these kinds of analyses dynamically, using simple code that is easy to adapt to new uses. The steps below show that using datashader makes it quite practical to ask and answer questions about big data interactively, even on your laptop.

Load Data and Set Up

First, let's load the 2010 Census data into a pandas dataframe:

import pandas as pd

%%time
df = pd.read_hdf('data/census.h5', 'census')
df.race = df.race.astype('category')

     CPU times: user 13.9 s, sys: 35.7 s, total: 49.6 s

     Wall time: 1min 7s

df.tail()

           meterswest  metersnorth race
306674999  -8922890.0    2958501.2    h
306675000  -8922863.0    2958476.2    h
306675001  -8922887.0    2958355.5    h
306675002  -8922890.0    2958316.0    h
306675003  -8922939.0    2958243.8    h

Loading the data from the HDF5-format file takes a minute, as you can see, which is by far the most time-consuming step. The output of .tail() shows that there are more than 300 million datapoints (one per person), each with a location in Web Mercator coordinates, and that the race/ethnicity for each datapoint has been encoded as a single character ('w' for white, 'b' for black, 'a' for Asian, 'h' for Hispanic and 'o' for other, typically Native American).

Let's define some geographic ranges to look at later and a default plot size.

USA = ((-13884029, -7453304), (2698291, 6455972))
LakeMichigan = ((-10206131, -9348029), (4975642, 5477059))
Chicago = (( -9828281, -9717659), (5096658, 5161298))
Chinatown = (( -9759210, -9754583), (5137122, 5139825))

NewYorkCity = (( -8280656, -8175066), (4940514, 4998954))
LosAngeles = ((-13195052, -13114944), (3979242, 4023720))
Houston = ((-10692703, -10539441), (3432521, 3517616))
Austin = ((-10898752, -10855820), (3525750, 3550837))
NewOrleans = ((-10059963, -10006348), (3480787, 3510555))
Atlanta = (( -9448349, -9354773), (3955797, 4007753))

x_range,y_range = USA

plot_width = int(900)
plot_height = int(plot_width*7.0/12)

Population Density

For our first examples, let's ignore the race data, focusing on population density alone.

Datashader works by aggregating an arbitrarily large set of data points (millions, for a pandas dataframe, or billions+ for a dask dataframe) into a fixed-size buffer that's the shape of your final image. Each of the datapoints is assigned to one bin in this buffer, and each of these bins will later become one pixel. In this case, we'll aggregate all the datapoints from people in the continental USA into a grid containing the population density per pixel:

import datashader as ds
import datashader.transfer_functions as tf
from datashader.colors import Greys9, Hot, colormap_select as cm
def bg(img): return tf.set_background(img,"black")

%%time
cvs = ds.Canvas(plot_width, plot_height, *USA)
agg = cvs.points(df, 'meterswest', 'metersnorth')

     CPU times: user 3.97 s, sys: 12.2 ms, total: 3.98 s
     Wall time: 3.98 s

Computing this aggregate grid will take some CPU power (4-8 seconds on this MacBook Pro), because datashader has to iterate through the hundreds of millions of points in this dataset, one by one. But once the agg array has been computed, subsequent processing will now be nearly instantaneous, because there are far fewer pixels on a screen than points in the original database.

The aggregate grid now contains a count of the number of people in each location. We can visualize this data by mapping these counts into a grayscale value, ranging from black (a count of zero) to white (maximum count for this dataset). If we do this colormapping linearly, we can very quickly and clearly see...

%%time
bg(tf.interpolate(agg, cmap = cm(Greys9), how='linear'))
 

     CPU times: user 25.6 ms, sys: 4.77 ms, total: 30.4 ms
     Wall time: 29.8 ms

...almost nothing. The amount of detail visible is highly dependent on your monitor and its display settings, but it is unlikely that you will be able to make much out of this plot on most displays. If you know what to look for, you can see hotspots (high population densities) in New York City, Los Angeles, Chicago and a few other places. For feeding 300 million points in, we're getting almost nothing back in terms of visualization.

The first thing we can do is prevent "undersampling." In the plot above, there is no way to distinguish between pixels that are part of the background and those that have low but nonzero counts; both are mapped to black or nearly black on a linear scale. Instead, let's map all values that are not background to a dimly visible gray, leaving the highest-density values at white - let's discard the first 25% of the gray colormap and linearly interpolate the population densities over the remaining range:

bg(tf.interpolate(agg, cmap = cm(Greys9,0.25), how='linear'))

The above plot at least reveals that data has been measured only within the political boundaries of the continental United States and that many areas in the mountainous West are so poorly populated that many pixels contained not even a single person (in datashader images, the background color is shown for pixels that have no data at all, using the alpha channel of a PNG image, while the specified colormap is shown for pixels that do have data). Some additional population centers are now visible, at least on some monitors. But, mainly what the above plot indicates is that population in the USA is extremely non-uniformly distributed, with hotspots in a few regions, and nearly all other pixels having much, much lower (but nonzero) values. Again, that's not much information to be getting out of 300 million datapoints!

The problem is that of the available intensity scale in this gray colormap, nearly all pixels are colored the same low-end gray value, with only a few urban areas using any other colors. Thus, both of the above plots convey very little information. Because the data are clearly distributed so non-uniformly, let's instead try a nonlinear mapping from population counts into the colormap. A logarithmic mapping is often a good choice for real-world data that spans multiple orders of magnitude:

bg(tf.interpolate(agg, cmap = cm(Greys9,0.2), how='log'))

Suddenly, we can see an amazing amount of structure! There are clearly meaningful patterns at nearly every location, ranging from the geographic variations in the mountainous West, to the densely spaced urban centers in New England and the many towns stretched out along roadsides in the midwest (especially those leading to Denver, the hot spot towards the right of the Rocky Mountains).

Clearly, we can now see much more of what's going on in this dataset, thanks to the logarithmic mapping. Yet, the choice of 'log' was purely arbitrary, and one could easily imagine that other nonlinear functions would show other interesting patterns. Instead of blindly searching through the space of all such functions, we can step back and notice that the main effect of the log transform has been to reveal local patterns at all population densities -- small towns show up clearly even if they are just slightly more dense than their immediate, rural neighbors, yet large cities with high population density also show up well against the surrounding suburban regions, even if those regions are more dense than the small towns on an absolute scale.

With this idea of showing relative differences across a large range of data values in mind, let's try the image-processing technique called histogram equalization. Given a set of raw counts, we can map these into a range for display such that every available color on the screen represents about the same number of samples in the original dataset. The result is similar to that from the log transform, but is now non-parametric -- it will equalize any linearly or nonlinearly distributed data, regardless of the distribution:

bg(tf.interpolate(agg, cmap = cm(Greys9,0.2), how='eq_hist'))

Effectively, this transformation converts the data from raw magnitudes, which can easily span a much greater range than the dynamic range visible to the eye, to a rank-order or percentile representation, which reveals density differences at all ranges but obscures the absolute magnitudes involved. In this representation, you can clearly see the effects of geography (rivers, coastlines and mountains) on the population density, as well as history (denser near the longest-populated areas) and even infrastructure (with many small towns located at crossroads).
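
To make the rank-order idea concrete, here is a tiny numpy illustration of the concept (a toy sketch of my own, ignoring ties; it is not datashader's actual eq_hist implementation):

import numpy as np

counts = np.array([0, 1, 2, 5, 1000, 1000000])  # hypothetical pixel counts
ranks = counts.argsort().argsort()               # rank of each count
equalized = ranks / float(ranks.max())           # spread the ranks evenly over [0, 1]
print(equalized)                                 # 0.0, 0.2, 0.4, 0.6, 0.8, 1.0

The huge outlier ends up only one step above the next-largest count instead of dominating the scale, which is exactly the behavior that makes local structure visible at every density.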

Given the very different results from the different types of plot, a good practice when visualizing any dataset with datashader is to look at both the linear and the histogram-equalized versions of the data; the linear version preserves the magnitudes but obscures the distribution, while the histogram-equalized version reveals the distribution while preserving only the order of the magnitudes, not their actual values. If both plots are similar, then the data is distributed nearly uniformly across the interval. But, much more commonly, the distribution will be highly nonlinear, and the linear plot will reveal only the envelope of the data - the lowest and the highest values. In such cases, the histogram-equalized plot will reveal much more of the structure of the data, because it maps the local patterns in the data into perceptible color differences on the screen, which is why eq_hist is the default colormapping.

Because we are only plotting a single dimension, we can use the colors of the display to effectively reach a higher dynamic range, mapping ranges of data values into different color ranges. Here, we'll use a colormap with colors interpolated between the named colors printed below:

print(cm(Hot,0.2))
bg(tf.interpolate(agg, cmap = cm(Hot,0.2)))

     ['darkred', 'red', 'orangered', 'darkorange', 'orange', 'gold', 'yellow', 'white']

Such a representation can provide additional detail in each range, while still accurately conveying the overall distribution.

Because we can control the colormap, we can use it to address very specific questions about the data itself. For instance, after histogram equalization, data should be uniformly distributed across the visible colormap. Thus, if we want to highlight, for example, the top 1% of pixels (by population density), we can use a colormap divided into 100 ranges and simply change the top one to a different color:

import numpy as np
grays2 = cm([(i,i,i) for i in np.linspace(0,255,99)]) + ["red"]
bg(tf.interpolate(agg, cmap = grays2))

The above plot now conveys nearly all the information available in the original linear plot - that only a few pixels have the very highest population densities - while also conveying the structure of the data at all population density ranges via histogram equalization.

Categorical Data (Race)

Since we've got the racial/ethnic category for every pixel, we can use color to indicate the category value, instead of just extending dynamic range or highlighting percentiles, as shown above. To do this, we first need to set up a color key for each category label:

color_key = {'w':'aqua', 'b':'lime', 'a':'red', 'h':'fuchsia', 'o':'yellow' }

We can now aggregate the counts per race into grids, using ds.count_cat, instead of just a single grid with the total counts (which is what happens with the default aggregate reducer ds.count). We then generate an image by colorizing each pixel using the aggregate information from each category for that pixel's location:

def create_image(x_range, y_range, w=plot_width, h=plot_height, spread=0):
     cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
     agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
     img = tf.colorize(agg, color_key, how='eq_hist')
     if spread: img = tf.spread(img,px=spread)

     return tf.set_background(img,"black")

The result shows that the USA is overwhelmingly white, apart from some predominantly Hispanic regions along the Southern border, some regions with high densities of blacks in the Southeast and a few isolated areas of category "other" in the West (primarily Native American reservation areas).

 

create_image(*USA)

Interestingly, the racial makeup has some sharp boundaries around urban centers, as we can see if we zoom in:

create_image(*LakeMichigan)

With sufficient zoom, it becomes clear that Chicago (like most large US cities) has both a wide diversity of racial groups, and profound geographic segregation:

create_image(*Chicago)

Eventually, we can zoom in far enough to see individual datapoints. Here we can see that the Chinatown region of Chicago has, as expected, very high numbers of Asian residents, and that other nearby regions (separated by features like roads and highways) have other races, varying in how uniformly segregated they are:

create_image(*Chinatown,spread=plot_width//400)

Note that we've used the tf.spread function here to enlarge each point to cover multiple pixels so that each point is clearly visible.

Other Cities, for Comparison

Different cities have very different racial makeup, but they all appear highly segregated:

create_image(*NewYorkCity)

create_image(*LosAngeles)

create_image(*Houston)

create_image(*Atlanta)

create_image(*NewOrleans)

create_image(*Austin)

Analyzing Racial Data Through Visualization

The racial data and population densities are visible in the original Cooper Center map tiles, but because we aren't just working with static images here, we can look at any aspect of the data we like, with results coming back in a few seconds, rather than days. For instance, if we switch back to the full USA and then select only the black population, we can see that blacks predominantly reside in urban areas, except in the South and the East Coast:

cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height)
agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))

bg(tf.interpolate(agg.sel(race='b'), cmap=cm(Greys9,0.25), how='eq_hist'))

(Compare to the all-race eq_hist plot at the start of this post.)

Or we can show only those pixels where there is at least one resident from each of the racial categories - white, black, Asian and Hispanic - which mainly highlights urban areas (compare to the first racial map shown for the USA above):

agg2 = agg.where((agg.sel(race=['w', 'b', 'a', 'h']) > 0).all(dim='race')).fillna(0)
bg(tf.colorize(agg2, color_key, how='eq_hist'))

In the above plot, the colors still show the racial makeup of each pixel, but the pixels have been filtered so that only those with at least one datapoint from every race are shown.

We can also look at all pixels where there are more black than white datapoints, which highlights predominantly black neighborhoods of large urban areas across most of the USA, but also some rural areas and small towns in the South:

bg(tf.colorize(agg.where(agg.sel(race='w') < agg.sel(race='b')).fillna(0), color_key, how='eq_hist'))

Here the colors still show the predominant race in each pixel, which is black for many of them, though in Southern California it looks like there are several large neighborhoods where blacks outnumber whites but both are outnumbered by Hispanics.

Notice how each of these queries takes only a line or so of code, thanks to the xarray multidimensional array library that makes it simple to do operations on the aggregated data. Anything that can be derived from the aggregates is visible in milliseconds, not the days of computing time that would have been required using previous approaches. Even calculations that require reaggregating the data only take seconds to run, thanks to the optimized Numba and dask libraries used by datashader.

Using datashader, it is now practical to try out your own hypotheses and questions, whether for the USA or for your own region. You can try posing questions that are independent of the number of datapoints in each pixel, since that varies so much geographically, by normalizing the aggregated data in various ways. Now that the data has been aggregated but not yet rendered to the screen, there is an infinite range of queries you can pose!
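
For example, one possible normalization (a sketch of my own, not part of the original analysis) is to plot the fraction of each pixel's population that falls into a single category, rather than the raw count:

# Fraction of each pixel's residents who are Hispanic. Pixels with zero
# population give 0/0 = NaN, which we fill with 0 before colormapping.
total = agg.sum(dim='race')
frac_h = (agg.sel(race='h') / total).fillna(0)
bg(tf.interpolate(frac_h, cmap=cm(Greys9,0.2), how='eq_hist'))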

Interactive Bokeh Plots Overlaid with Map Data

The above plots all show static images on their own. datashader can also be combined with plotting libraries, in order to add axes and legends, to support zooming and panning (crucial for a large dataset like this one!), and/or to combine datashader output with other data sources, such as map tiles. To start, we can define a Bokeh plot that shows satellite imagery from ArcGIS:

import bokeh.plotting as bp
from bokeh.models.tiles import WMTSTileSource

bp.output_notebook()

def base_plot(tools='pan,wheel_zoom,reset',webgl=False):
     p = bp.figure(tools=tools,
         plot_width=int(900), plot_height=int(500),
         x_range=x_range, y_range=y_range, outline_line_color=None,
         min_border=0, min_border_left=0, min_border_right=0,
         min_border_top=0, min_border_bottom=0, webgl=webgl)

     p.axis.visible = False
     p.xgrid.grid_line_color = None
     p.ygrid.grid_line_color = None

     return p

p = base_plot()

url="http://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.png"
tile_renderer = p.add_tile(WMTSTileSource(url=url))
tile_renderer.alpha=1.0

We can then add an interactive plot that uses a callback to a datashader pipeline. In this pipeline, we'll use the tf.dynspread function to automatically increase the plotted size of each datapoint, once you've zoomed in so far that datapoints no longer have nearby neighbors:

from datashader.bokeh_ext import InteractiveImage

def image_callback(x_range, y_range, w, h):
     cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
     agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
     img = tf.colorize(agg, color_key, 'log')
     return tf.dynspread(img,threshold=0.75, max_px=8)

InteractiveImage(p, image_callback)

The above image will just be a static screenshot, but in a running Jupyter notebook you will be able to zoom and pan interactively, selecting any region of the map to study in more detail. Each time you zoom or pan, the entire datashader pipeline will be re-run, which will take a few seconds. At present, datashader does not use caching, tiling, partitioning or any of the other optimization techniques that would provide even more responsive plots, and, as the library matures, you can expect to see further improvements over time. But, the library is already fast enough to provide interactive plotting of all but the very largest of datasets, allowing you to change any aspect of your plot "on-the-fly" as you interact with the data.

To learn more about datashader, check out our extensive set of tutorial notebooks, then install it using conda install -c bokeh datashader and start trying out the Jupyter notebooks from github yourself! You can also watch my datashader talk from SciPy 2016 on YouTube. 

by swebster at July 25, 2016 03:12 PM

July 22, 2016

William Stein

DataDog's pricing: don't make the same mistake I made

I stupidly made a mistake recently by choosing to use DataDog for monitoring the infrastructure for my startup (SageMathCloud).

I got bit by their pricing UI design that looks similar to many other sites, but is different in a way that caused me to spend far more money than I expected.

I'm writing this post so that you won't make the same mistake I did. As a product, DataDog is of course a lot of hard work to create, and they can try to charge whatever they want. However, my problem is that what they were going to charge was presented in a way that I found confusing and misleading.

I wanted to see some nice web-based data about my new autoscaled Kubernetes cluster, so I looked around at options. DataDog looked like a new and awesomely-priced service for seeing live logging. And when I looked (not carefully enough) at the pricing, it looked like only $15/month to monitor a bunch of machines. I'm naive about the cost of cloud monitoring -- I've been using Stackdriver on Google Cloud Platform for years, which is completely free (for now, though that will change), and I've also used self-hosted open solutions, and some quite nice solutions I've written myself. So my expectations were way out of whack.

Ever busy, I signed up for the "$15/month plan":


One of the people on my team spent a little time and installed datadog on all the VMs in our cluster, and also made DataDog automatically start running on any nodes in our Kubernetes cluster. That's a lot of machines.

Today I got the first monthly bill, which is for the month that just happened. The cost was $639.19 USD charged to my credit card. I was really confused for a while, wondering if I had bought a year subscription.



After a while I realized that the cost is per host! When I looked at the pricing page the first time, I had just seen in big letters "$15", and "$18 month-to-month" and "up to 500 hosts". I completely missed the "Per Host" line, because I was so naive that I didn't think the price could possibly be that high.

I tried immediately to delete my credit card and cancel my plan, but the "Remove Card" button is greyed out, and it says you can "modify your subscription by contacting us at success@datadoghq.com":



So I wrote to success@datadoghq.com:

Dear Datadog,

Everybody on my team was completely mislead by your
horrible pricing description.

Please cancel the subscription for wstein immediately
and remove my credit card from your system.

This is the first time I've wasted this much money
by being misled by a website in my life.

I'm also very unhappy that I can't delete my credit
card or cancel my subscription via your website. It's
like one more stripe API call to remove the credit card
(I know -- I implemented this same feature for my site).


And they responded:

Thanks for reaching out. If you'd like to cancel your
Datadog subscription, you're able to do so by going into
the platform under 'Plan and Usage' and choose the option
downgrade to 'Lite', that will insure your credit card
will not be charged in the future. Please be sure to
reduce your host count down to the (5) allowed under
the 'Lite' plan - those are the maximum allowed for
the free plan.

Also, please note you'll be charged for the hosts
monitored through this month. Please take a look at
our billing FAQ.


They were right -- I was able to uninstall the daemons, downgrade to Lite, remove my card, etc. all through the website without manual intervention.

When people have been confused with billing for my site, I have apologized, immediately refunded their money, and opened a ticket to make the UI clearer.  DataDog didn't do any of that.

I wish DataDog would at least clearly state that when you use their service you are potentially on the hook for an arbitrarily large charge for any month. Yes, if they had made that clear, they wouldn't have had me as a customer, so they are not incentivized to do so.

A fool and their money are soon parted. I hope this post reduces the chances you'll be a fool like me. If you choose to use DataDog (and their monitoring tools are very impressive), I hope you'll be aware of the cost.


ADDED:

On Hacker News somebody asked: "How could their pricing page be clearer? It says per host in fairly large letters underneath it. I'm asking because I will be designing a similar page soon (that's also billed per host) and I'd like to avoid the same mistakes."  My answer:

[EDIT: This pricing page by the top poster in this thread is way better than I suggest below -- https://www.serverdensity.com/pricing/]

1. VERY clearly state that when you sign up for the service, you are on the hook for up to $18*500 = $9000 + tax in charges for any month. Even Google Compute Engine (and Amazon) don't create such a trap, and they have a clear, explicit quota increase process.
2. Instead of "HUGE $15" newline "(small light) per host", put "HUGE $18 per host" all on the same line. It would easily fit. I don't even know how the $15/host datadog discount could ever really work, given that the number of hosts might constantly change and there is no prepayment.
3. Inform users clearly in the UI at any time how much they are going to owe for that month (so far), rather than surprising them at the end. Again, Google Cloud Platform has a very clear running total in their billing section, and any time you create a new VM it gives the exact amount that VM will cost per month.
4. If one works with a team, number 3 is especially important. The reason I had monitors on 50+ machines is that another person working on the project, who never looked at the pricing or anything, just thought, "hey, I'll just set this up everywhere." He had no idea there was a per-machine fee.

by William Stein (noreply@blogger.com) at July 22, 2016 02:17 PM

July 13, 2016

Continuum Analytics news

The Gordon and Betty Moore Foundation Grant for Numba and Dask

Posted Thursday, July 14, 2016

I am thrilled to announce that the Gordon and Betty Moore Foundation has provided a significant grant in order to help move Numba and Dask to version 1.0 and graduate them into robust community-supported projects. 

Numba and Dask are two projects that have grown out of our intense foundational desire at Continuum to improve the state of large-scale data analytics, quantitative computing, advanced analytics and machine learning. Our fundamental purpose at Continuum is to empower people to solve the world’s greatest challenges. We are on a mission to help people discover, analyze and collaborate by connecting their curiosity and experience with any data.    

One part of helping great people do even more with their computing power is to ensure that modern hardware is completely accessible and utilizable to those with deep knowledge in other areas besides programming. For many years, Python has been simplifying the connection between computers and the minds of those with deep knowledge in areas such as statistics, science, business, medicine, mathematics and engineering. Numba and Dask strengthen this connection even further so that modern hardware with multiple parallel computing units can be fully utilized with Python code. 

Numba enables scaling up on modern hardware, including computers with GPUs and extreme multi-core CPUs, by compiling a subset of Python syntax to machine code that can run in parallel. Dask enables Python code to take full advantage of both multi-core CPUs and data that does not fit in memory by defining a directed graph of tasks that work on blocks of data and using the wealth of libraries in the PyData stack. Dask also now works well on a cluster of machines with data stored in a distributed file-system, such as Hadoop's HDFS. Together, Numba and Dask can be used to more easily build solutions that take full advantage of modern hardware, such as machine-learning algorithms, image-processing on clusters of GPUs or automatic visualization of billions of data-points with datashader.

Peter Wang and I started Continuum with a desire to bring next-generation array-computing to PyData. We have broadened that initial desire to empowering entire data science teams with the Anaconda platform, while providing full application solutions to data-centric companies and institutions. It is extremely rewarding to see that Numba and Dask are now delivering on our initial dream to bring next-generation array-computing to the Python ecosystem in a way that takes full advantage of modern hardware.    

This award from the Moore Foundation will make it even easier for Numba and Dask to allow Python to be used for large scale computing. With Numba and Dask, users will be able to build high performance applications with large data sets. The grant will also enable our Community Innovation team at Continuum to ensure that these technologies can be used by other open source projects in the PyData ecosystem. This will help scientists, engineers and others interested in improving the world achieve their goals even faster.

Continuum has been an active contributor to the Python data science ecosystem since Peter and I founded the company in early 2012. Anaconda, the leading Open Data Science platform, is now the most popular Python distribution available. Continuum has also conceived and developed several new additions to this ecosystem, making them freely available to the open source community, while continuing to support the foundational projects that have made the ecosystem possible.

The Gordon and Betty Moore Foundation fosters pathbreaking scientific discovery, environmental conservation, patient care improvements and preservation of the special character of the Bay Area. The Numba and Dask projects are funded by the Gordon and Betty Moore Foundation through Grant GBMF5423 to Continuum Analytics (Grant Agreement #5423).

We are honored to receive this grant and look forward to working with The Moore Foundation. 

To hear more about Numba and Dask, check out our related SciPy sessions in Austin, TX this week:

  • Thursday, July 14th at 10:30am: “Dask: Parallel and Distributed Computing” by Matthew Rocklin & Jim Crist of Continuum Analytics
  • Friday, July 15th at 11:00am: “Scaling Up and Out: Programming GPU Clusters with Numba and Dask” by Stan Seibert & Siu Kwan Lam of Continuum Analytics
  • Friday, July 15th at 2:30pm: “Datashader: Revealing the Structure of Genuinely Big Data” by James Bednar & Jim Crist of Continuum Analytics

by swebster at July 13, 2016 04:49 PM

Automate your README: conda kapsel Beta 1

Posted Wednesday, July 13, 2016

TL;DR: New beta conda feature allows data scientists and others to describe project runtime requirements in a single file called kapsel.yml. Using kapsel.yml, conda will automatically reproduce prerequisites on any machine and then run the project. 

Data scientists working with Python often create a project directory containing related analysis, notebook files, data-cleaning scripts, Bokeh visualizations, and so on. For a colleague who wants to replicate your project, or even for the original creator a few months later, it can be tricky to run all this code exactly as it was run the first time. 

Most code relies on some specific setup before it’s run -- such as installing certain versions of packages, downloading data files, starting up database servers, configuring passwords, or configuring parameters to a model. 

You can write a long README file to manually record all these steps and hope that you got it right. Or, you could use conda kapsel. This new beta conda feature allows data scientists to list their setup and runtime requirements in a single file called kapsel.yml. Conda reads this file and performs all these steps automatically. With conda kapsel, your project just works for anyone you share it with.

Sharing your project with others

When you’ve shared your project directory (including a kapsel.yml) and a colleague types conda kapsel run in that directory, conda automatically creates a dedicated environment, puts the correct packages in it, downloads any needed data files, starts needed services, prompts the user for missing configuration values, and runs the right command from your project.

As with all things conda, there’s an emphasis on ease-of-use. It would be clunky to first manually set up a project, and then separately configure project requirements for automated setup. 

With the conda kapsel command, you set up and configure the project at the same time. For example, if you type conda kapsel add-packages bokeh=0.12, you’ll get Bokeh 0.12 in your project's environment, and automatically record a requirement for Bokeh 0.12 in your kapsel.yml. This means there’s no extra work to make your project reproducible. Conda keeps track of your project setup for you, automatically making any project directory into a runnable, reproducible “conda kapsel.”

There’s nothing data-science-specific about conda kapsel; it’s a general-purpose feature, just like conda’s core package management features. But we believe conda kapsel’s simple approach to reproducibility will appeal to data scientists.

Try out conda kapsel

To understand conda kapsel, we recommend going through the tutorial. It’s a quick way to see what it is and learn how to use it. The tutorial includes installation instructions.

Where to send feedback

If you want to talk interactively about conda kapsel, give us some quick feedback, or get help with any questions you run into, join our chat room on Gitter. We would love to hear from you!

If you find a bug or have a suggestion, filing a GitHub issue is another great way to let us know.

If you want to have a look at the code, conda kapsel is on GitHub.

Next steps for conda kapsel

This is a first beta, so we expect conda kapsel to continue to evolve. Future directions will depend on the feedback you give us, but some of the ideas we have in mind:

  • Support for automating additional setup steps: What’s in your README that could be automated? Let us know!
  • Extensibility: We’d like to support both third-party plugins, and custom setup scripts embedded in projects.
  • UX refinement: We believe the tool can be even more intuitive and we’re currently exploring some changes to eliminate points of confusion early users have encountered. (We’d love to hear your experiences with the tutorial, especially if you found anything clunky or confusing.)

For the time being, the conda kapsel API and command line syntax are subject to change in future releases. A project created with the current “beta” version of conda kapsel may always need to be run with that version of conda kapsel and not conda kapsel 1.0. When we think things are solid, we’ll switch from “beta” to “1.0” and you’ll be able to rely on long-term interface stability.

We hope you find conda kapsel useful!

by swebster at July 13, 2016 02:11 PM

July 12, 2016

Matthieu Brucher

Book review: Team Geek

Sometimes I forget that I have to work with teams, whether they are virtual or physical. And although I have started working on understanding the culture map, I still need to understand how to work efficiently in a team. Enter this book.

Content and opinions

Divided into six chapters, the book moves from a self-centered point of view to the most general one, taking in team members and users, around a principle summed up as HRT (humility, respect, and trust). First of all, the book spends a chapter on geniuses. Actually, it's not really about geniuses, but about people who think they are geniuses and spend time in their cave working on something, only to emerge ten weeks later and share their wonderful (crappy) code. Here, the focus is on visibility and communication: we all make mistakes (let's move on, as Leonard Hofstadter would say), so we need to put our work in front of the rest of the team as early as possible.

To achieve this, you need a team culture, a place where people can communicate. There are several levels to this, different ways to achieve it, and probably a good balance to strike between all the elements, as explained in the second chapter of the book. With this, you also need a good team leader (chapter 3) who will nurture the team culture. Strangely, the book seems to advocate that technical people become team leaders, which is something I find difficult. Actually, the book helped me understand the good aspects of this, and from the different teams I have seen around me, it seems to be a pattern that has merit; with some help learning delegation and trust, it could be an interesting future for technical people (instead of having bad technical people take management positions and fighting them because, let's face it, they don't understand a thing :p).

The fourth chapter is about dealing with poisonous people. One of the poisons is exactly what I showed in the last paragraph: resentment and bitterness! A team is a team with its captain; we are all in the same boat. We can't badmouth the people we work with (as hard as that is!). The fifth chapter is more about the maze above you (the fourth was more about dealing with the maze below): how to work with a good manager, and how to deal with a bad one. Sometimes it's just about communication; sometimes it's not, so what should you do?

Finally, the other member of the team is the end user. As the customer ultimately pays the bill, they have to be on board and feel respected and trusted (as much as the team is; it's a balance!). There are not many chapters about users in software engineering books, as it's a hard topic. This final chapter gives good advice on the subject.

Conclusion

Lots of the advice is obvious, and some of it is explained in other books as well, but the fact that all the topics relevant to computer scientists are covered in one book makes it an excellent introduction for people starting to work in a team.

by Matt at July 12, 2016 07:33 AM

July 11, 2016

Continuum Analytics news

Anaconda 4.1 Released

Posted Monday, July 11, 2016

We are happy to announce that Anaconda 4.1 has been released. Anaconda is the leading open data science platform powered by Python.

The highlights of this release are:

  • Addition of Jupyter Notebook Extensions

  • Windows installation - silent mode fixes & now compatible with SCCM (System Center Configuration Manager)

  • The conda recipes used to build (with conda-build) the vast majority of the packages in the Anaconda installer have been published at https://github.com/ContinuumIO/anaconda-recipes

Updates:

  • Python 2.7.12, 3.4.5, 3.5.2

  • numpy 1.11.1

  • scipy 0.17.1

  • pandas 0.18.1

  • MKL 11.3.3

  • Navigator updated from 1.1 to 1.2; in particular, it no longer installs a desktop shortcut on OS X

  • Over 80 other packages, see changelog and package list

To update to Anaconda 4.1, use conda update conda followed by conda update anaconda.

by swebster at July 11, 2016 05:42 PM

July 06, 2016

Continuum Analytics news

The Journey to Open Data Science Is Not as Hard as You Think

Posted Wednesday, July 6, 2016

Businesses are in a constant struggle to stay relevant in the market and change is rarely easy — especially when it involves technological overhaul.

Think about the world’s switch from the horse and buggy to automobiles: It revolutionized the world, but it was hardly a smooth transition. At the turn of the 20th century, North America was a loose web of muddy dirt roads trampled by 24 million horses. It took a half-century of slow progress before tires and brakes replaced hooves and reins.

Just as driverless cars hint at a new era of automobiles, the writing’s on the wall for modern analytics: Companies will need to embrace the world’s inevitable slide toward Open Data Science. Fortunately, just as headlights now illuminate our highways, there is a light to guide companies through the transformation.

The Muddy Road to New Technologies

No matter the company or the technological shift, transitions can be challenging for multiple reasons.

One reason is the inevitable skills gap in the labor market when new technology comes along. Particularly in highly specialized fields like data science, finding skilled employees with enterprise experience is difficult. The right hires can mean the difference between success and failure.

Another issue stems from company leaders’ insufficient understanding of existing technologies — both what they can and cannot do. Applications that use machine and deep learning require new software, but companies often mistakenly believe their existing systems are capable of handling the load. This issue is compounded by fragile, cryptic legacy code that can be a nightmare to repurpose.

Finally, these two problems combine to form a third: a lack of understanding about how to train people to implement and deploy new technology. Ultimately, this culminates in floundering and wasted resources across an entire organization.

Luckily, it does not have to be this way.

Open Data Science Paves a New Path

Fortunately, Open Data Science is the guiding light to help companies switch to modern data science easily. Here’s how such an initiative breaks down transitional barriers:

  • No skills gap: Open Data Science is founded on Python and R — both hot languages in universities and in the marketplace. This opens up a massive pool of available talent and a worldwide base of excited programmers and users.
  • No tech stagnation: Open Data Science applications connect via APIs to nearly any data source. In terms of programming, there's an open source counterpart to nearly any proprietary software on the market. Open Data Science applications such as Anaconda allow for easy interoperability between systems, which is central to the movement.
  • No floundering: Open Data Science bridges old and new technologies to make training and deployment a breeze. One such example is Anaconda Fusion, which offers business analysts command of powerful Python Open Data Science libraries through a familiar Excel interface.

A Guided Pathway to Open Data Science

Of course, just knowing that Open Data Science speeds the transition isn’t enough. A well-trained guide is equally vital for leading companies down the best path to adoption.

The first step is a change management assessment. How will executive, operational and data science teams quickly get up to speed on why Open Data Science is critical to their business? What are the first steps? This process can seem daunting when attempted alone. But this is where consultants from the Open Data Science community can provide the focus and knowledge necessary to quickly convert a muddy back road into the Autobahn.

No matter the business and its existing technologies, any change management plan should include a few key points. First, there should be a method for integration of legacy code (which Anaconda makes easier with packages for melding Python with C or Fortran code). Intelligent migration of data is also important, as is training team members on new systems or recruiting new talent.

While the business world leaves little room for fumbles, like delayed adoption of new technologies or poor change management, Open Data Science can prevent your company’s data science initiatives from meeting a dead end. Open Data Science software provides more than low-cost applications, interoperability, transparency and access — it also brings community know-how to guide change management and facilitate an analytics overhaul in any company at any scale.

Thanks to Open Data Science, the road to data science superiority is now paved with gold.

 

by swebster at July 06, 2016 02:10 PM

July 05, 2016

Thomas Wiecki

Bayesian Deep Learning Part II: Bridging PyMC3 and Lasagne to build a Hierarchical Neural Network

(c) 2016 by Thomas Wiecki

Recently, I blogged about Bayesian Deep Learning with PyMC3, where I built a simple hand-coded Bayesian Neural Network and fit it on a toy data set. Today, we will build a more interesting model using Lasagne, a flexible Theano library for constructing various types of Neural Networks. As you may know, PyMC3 also relies on Theano, so it should be possible to build the Artificial Neural Network (ANN) in Lasagne while placing Bayesian priors on our parameters and then using variational inference (ADVI) in PyMC3 to estimate the model. To my delight, it is not only possible but also very straightforward.

Below, I will first show how to bridge PyMC3 and Lasagne to build a dense 2-layer ANN. We'll then use mini-batch ADVI to fit the model on the MNIST handwritten digit data set. Then, we will follow up on another idea expressed in my last blog post -- hierarchical ANNs. Finally, due to the power of Lasagne, we can just as easily build a Hierarchical Bayesian Convolutional ANN with max-pooling layers to achieve 98% accuracy on MNIST.

Most of the code used here is borrowed from the Lasagne tutorial.

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('white')
sns.set_context('talk')

import pymc3 as pm
import theano.tensor as T
import theano

from scipy.stats import mode, chisquare

from sklearn.metrics import confusion_matrix, accuracy_score

import lasagne

Data set: MNIST

We will be using the classic MNIST data set of handwritten digits. In contrast to my previous blog post, which was limited to a toy data set, MNIST is an actually challenging ML task (of course not quite as challenging as e.g. ImageNet) with a reasonable number of dimensions and data points.

In [2]:
import sys, os

def load_dataset():
    # We first define a download function, supporting both Python 2 and 3.
    if sys.version_info[0] == 2:
        from urllib import urlretrieve
    else:
        from urllib.request import urlretrieve

    def download(filename, source='http://yann.lecun.com/exdb/mnist/'):
        print("Downloading %s" % filename)
        urlretrieve(source + filename, filename)

    # We then define functions for loading MNIST images and labels.
    # For convenience, they also download the requested files if needed.
    import gzip

    def load_mnist_images(filename):
        if not os.path.exists(filename):
            download(filename)
        # Read the inputs in Yann LeCun's binary format.
        with gzip.open(filename, 'rb') as f:
            data = np.frombuffer(f.read(), np.uint8, offset=16)
        # The inputs are vectors now, we reshape them to monochrome 2D images,
        # following the shape convention: (examples, channels, rows, columns)
        data = data.reshape(-1, 1, 28, 28)
        # The inputs come as bytes, we convert them to float32 in range [0,1].
        # (Actually to range [0, 255/256], for compatibility to the version
        # provided at http://deeplearning.net/data/mnist/mnist.pkl.gz.)
        return data / np.float32(256)

    def load_mnist_labels(filename):
        if not os.path.exists(filename):
            download(filename)
        # Read the labels in Yann LeCun's binary format.
        with gzip.open(filename, 'rb') as f:
            data = np.frombuffer(f.read(), np.uint8, offset=8)
        # The labels are vectors of integers now, that's exactly what we want.
        return data

    # We can now download and read the training and test set images and labels.
    X_train = load_mnist_images('train-images-idx3-ubyte.gz')
    y_train = load_mnist_labels('train-labels-idx1-ubyte.gz')
    X_test = load_mnist_images('t10k-images-idx3-ubyte.gz')
    y_test = load_mnist_labels('t10k-labels-idx1-ubyte.gz')

    # We reserve the last 10000 training examples for validation.
    X_train, X_val = X_train[:-10000], X_train[-10000:]
    y_train, y_val = y_train[:-10000], y_train[-10000:]

    # We just return all the arrays in order, as expected in main().
    # (It doesn't matter how we do this as long as we can read them again.)
    return X_train, y_train, X_val, y_val, X_test, y_test

print("Loading data...")
X_train, y_train, X_val, y_val, X_test, y_test = load_dataset()
Loading data...
In [3]:
# Building a theano.shared variable with a subset of the data to make construction of the model faster.
# We will later switch that out, this is just a placeholder to get the dimensionality right.
input_var = theano.shared(X_train[:500, ...].astype(np.float64))
target_var = theano.shared(y_train[:500, ...].astype(np.float64))

Model specification

I imagined that it should be possible to bridge Lasagne and PyMC3 just because they both rely on Theano. However, it was unclear how difficult it was really going to be. Fortunately, a first experiment worked out very well, but there were some potential ways in which this could be made even easier. I opened a GitHub issue on Lasagne's repo, and a few days later PR695 was merged, which allowed for an even nicer integration of the two, as I show below. Long live OSS.

First, here is the Lasagne function to create an ANN with two fully connected hidden layers of 800 neurons each; this is pure Lasagne code taken almost directly from the tutorial. The trick comes in when creating the layers with lasagne.layers.DenseLayer, where we can pass in a function init which has to return a Theano expression to be used as the weight and bias matrices. This is where we will pass in our PyMC3-created priors, which are also just Theano expressions:

In [4]:
def build_ann(init):
    l_in = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                     input_var=input_var)

    # Add a fully-connected layer of 800 units, using the linear rectifier, and
    # initializing weights with Glorot's scheme (which is the default anyway):
    n_hid1 = 800
    l_hid1 = lasagne.layers.DenseLayer(
        l_in, num_units=n_hid1,
        nonlinearity=lasagne.nonlinearities.tanh,
        b=init,
        W=init
    )

    n_hid2 = 800
    # Another 800-unit layer:
    l_hid2 = lasagne.layers.DenseLayer(
        l_hid1, num_units=n_hid2,
        nonlinearity=lasagne.nonlinearities.tanh,
        b=init,
        W=init
    )

    # Finally, we'll add the fully-connected output layer, of 10 softmax units:
    l_out = lasagne.layers.DenseLayer(
        l_hid2, num_units=10,
        nonlinearity=lasagne.nonlinearities.softmax,
        b=init,
        W=init
    )
    
    prediction = lasagne.layers.get_output(l_out)
    
    # 10 discrete output classes -> pymc3 categorical distribution
    out = pm.Categorical('out', 
                         prediction,
                         observed=target_var)
    
    return out

Next is the callable that creates the weights for the ANN. Because PyMC3 requires every random variable to have a different name, we're creating a class instead, whose instances generate uniquely named priors.

The priors act as regularizers here, trying to keep the weights of the ANN small. This is mathematically equivalent to putting an L2 loss term that penalizes large weights into the objective function, as is commonly done.
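
To spell out that equivalence (a standard identity, included here for reference rather than taken from the original post): for a single weight $w$ with a $\mathcal{N}(0, \sigma^2)$ prior,

$$\log p(w) = -\frac{w^2}{2\sigma^2} + \text{const},$$

so maximizing the posterior amounts to minimizing the usual loss plus an L2 penalty $\lambda w^2$ with $\lambda = 1/(2\sigma^2)$; a smaller prior standard deviation means stronger regularization.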

In [5]:
class GaussWeights(object):
    def __init__(self):
        self.count = 0
    def __call__(self, shape):
        self.count += 1
        return pm.Normal('w%d' % self.count, mu=0, sd=.1, 
                         testval=np.random.normal(size=shape).astype(np.float64),
                         shape=shape)

If you compare what we have done so far to the previous blog post, it's apparent that using Lasagne is much more comfortable. We don't have to manually keep track of the shapes of the individual matrices, nor do we have to handle the underlying matrix math to make it all fit together.

Next are some functions to set up mini-batch ADVI; you can find more information in the prior blog post.

In [6]:
# Tensors and RV that will be using mini-batches
minibatch_tensors = [input_var, target_var]

# Generator that returns mini-batches in each iteration
def create_minibatch(data, batchsize=500):
    
    rng = np.random.RandomState(0)
    start_idx = 0
    while True:
        # Return random data samples of set size batchsize each iteration
        ixs = rng.randint(data.shape[0], size=batchsize)
        yield data[ixs]

minibatches = zip(
    create_minibatch(X_train, 500),
    create_minibatch(y_train, 500),
)

total_size = len(y_train)

def run_advi(likelihood, advi_iters=50000):
    # Train on train data
    input_var.set_value(X_train[:500, ...])
    target_var.set_value(y_train[:500, ...])
    
    v_params = pm.variational.advi_minibatch(
        n=advi_iters, minibatch_tensors=minibatch_tensors, 
        minibatch_RVs=[likelihood], minibatches=minibatches, 
        total_size=total_size, learning_rate=1e-2, epsilon=1.0
    )
    trace = pm.variational.sample_vp(v_params, draws=500)
    
    # Predict on test data
    input_var.set_value(X_test)
    target_var.set_value(y_test)
    
    ppc = pm.sample_ppc(trace, samples=100)
    y_pred = mode(ppc['out'], axis=0).mode[0, :]
    
    return v_params, trace, ppc, y_pred

Putting it all together

Let's run our ANN with mini-batch ADVI:

In [14]:
with pm.Model() as neural_network:
    likelihood = build_ann(GaussWeights())
    v_params, trace, ppc, y_pred = run_advi(likelihood)
Iteration 0 [0%]: ELBO = -126739832.76
Iteration 5000 [10%]: Average ELBO = -17180177.41
Iteration 10000 [20%]: Average ELBO = -304464.44
Iteration 15000 [30%]: Average ELBO = -146289.0
Iteration 20000 [40%]: Average ELBO = -121571.36
Iteration 25000 [50%]: Average ELBO = -112382.38
Iteration 30000 [60%]: Average ELBO = -108283.73
Iteration 35000 [70%]: Average ELBO = -106113.66
Iteration 40000 [80%]: Average ELBO = -104810.85
Iteration 45000 [90%]: Average ELBO = -104743.76
Finished [100%]: Average ELBO = -104222.88

Make sure everything converged:

In [17]:
plt.plot(v_params.elbo_vals[10000:])
sns.despine()
In [18]:
sns.heatmap(confusion_matrix(y_test, y_pred))
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21764cf910>
In [20]:
print('Accuracy on test data = {}%'.format(accuracy_score(y_test, y_pred) * 100))
Accuracy on test data = 91.83%

The performance is not incredibly high but hey, it seems to actually work.

Hierarchical Neural Network: Learning Regularization from data

The connection between the standard deviation of the weight prior and the strength of the L2 penalization term leads to an interesting idea. Above, we just fixed sd=0.1 for all layers, but maybe the first layer should have a different value than the second. And maybe 0.1 is too small or too large to begin with. In Bayesian modeling, it is quite common to just place hyperpriors in cases like this and learn the optimal regularization to apply from the data. This saves us from tuning that parameter in a costly hyperparameter optimization. For more information on hierarchical modeling, see my other blog post.

In [20]:
class GaussWeightsHierarchicalRegularization(object):
    def __init__(self):
        self.count = 0
    def __call__(self, shape):
        self.count += 1
        
        regularization = pm.HalfNormal('reg_hyper%d' % self.count, sd=1)
        
        return pm.Normal('w%d' % self.count, mu=0, sd=regularization, 
                         testval=np.random.normal(size=shape),
                         shape=shape)
In [ ]:
with pm.Model() as neural_network_hier:
    likelihood = build_ann(GaussWeightsHierarchicalRegularization())
    v_params, trace, ppc, y_pred = run_advi(likelihood)
In [22]:
print('Accuracy on test data = {}%'.format(accuracy_score(y_test, y_pred) * 100))
Accuracy on test data = 92.13%

We get a small but nice boost in accuracy. Let's look at the posteriors of our hyperparameters:

In [23]:
pm.traceplot(trace, varnames=['reg_hyper1', 'reg_hyper2', 'reg_hyper3', 'reg_hyper4', 'reg_hyper5', 'reg_hyper6']);

Interestingly, they are all pretty different, suggesting that it makes sense to change the amount of regularization that gets applied at each layer of the network.

Convolutional Neural Network

This is pretty nice, but everything so far would also have been pretty simple to implement directly in PyMC3, as I showed in my previous post. Where things get really interesting is that we can now build much more complex ANNs, like Convolutional Neural Nets:

In [9]:
def build_ann_conv(init):
    network = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                        input_var=input_var)

    network = lasagne.layers.Conv2DLayer(
            network, num_filters=32, filter_size=(5, 5),
            nonlinearity=lasagne.nonlinearities.tanh,
            W=init)

    # Max-pooling layer of factor 2 in both dimensions:
    network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))

    # Another convolution with 32 5x5 kernels, and another 2x2 pooling:
    network = lasagne.layers.Conv2DLayer(
        network, num_filters=32, filter_size=(5, 5),
        nonlinearity=lasagne.nonlinearities.tanh,
        W=init)
    
    network = lasagne.layers.MaxPool2DLayer(network, 
                                            pool_size=(2, 2))
    
    n_hid2 = 256
    network = lasagne.layers.DenseLayer(
        network, num_units=n_hid2,
        nonlinearity=lasagne.nonlinearities.tanh,
        b=init,
        W=init
    )

    # Finally, we'll add the fully-connected output layer, of 10 softmax units:
    network = lasagne.layers.DenseLayer(
        network, num_units=10,
        nonlinearity=lasagne.nonlinearities.softmax,
        b=init,
        W=init
    )
    
    prediction = lasagne.layers.get_output(network)
    
    return pm.Categorical('out', 
                   prediction,
                   observed=target_var)
In [10]:
with pm.Model() as neural_network_conv:
    likelihood = build_ann_conv(GaussWeights())
    v_params, trace, ppc, y_pred = run_advi(likelihood, advi_iters=50000)
Iteration 0 [0%]: ELBO = -17290585.29
Iteration 5000 [10%]: Average ELBO = -3750399.99
Iteration 10000 [20%]: Average ELBO = -40713.52
Iteration 15000 [30%]: Average ELBO = -22157.01
Iteration 20000 [40%]: Average ELBO = -21183.64
Iteration 25000 [50%]: Average ELBO = -20868.2
Iteration 30000 [60%]: Average ELBO = -20693.18
Iteration 35000 [70%]: Average ELBO = -20483.22
Iteration 40000 [80%]: Average ELBO = -20366.34
Iteration 45000 [90%]: Average ELBO = -20290.1
Finished [100%]: Average ELBO = -20334.15
In [13]:
print('Accuracy on test data = {}%'.format(accuracy_score(y_test, y_pred) * 100))
Accuracy on test data = 98.21%

Much higher accuracy -- nice. I also tried this with the hierarchical model but it achieved lower accuracy (95%), I assume due to overfitting.

Let's make more use of the fact that we're in a Bayesian framework and explore uncertainty in our predictions. As our predictions are categories, we can't simply compute the posterior predictive standard deviation. Instead, we compute the chi-square statistic, which tells us how uniform a sample is: the more uniform, the higher our uncertainty. I'm not quite sure if this is the best way to do this; leave a comment if there's a more established method that I don't know about.

In [14]:
miss_class = np.where(y_test != y_pred)[0]
corr_class = np.where(y_test == y_pred)[0]
In [15]:
preds = pd.DataFrame(ppc['out']).T
In [16]:
chis = preds.apply(lambda x: chisquare(x).statistic, axis='columns')
In [18]:
sns.distplot(chis.loc[miss_class].dropna(), label='Error')
sns.distplot(chis.loc[corr_class].dropna(), label='Correct')
plt.legend()
sns.despine()
plt.xlabel('Chi-Square statistic');