February 21, 2017

Matthieu Brucher

Playing with a Bela (1.bis): Compiling the latest Audio Toolkit on the Bela

A few months ago, I started playing with the Bela board. At the time, I had issues compiling Audio ToolKit with clang. Since then, and thanks to Travis-CI, I figured out what was going on. Unfortunately, the Beagle Board doesn’t have complete C++11 support, so I’ve added the remaining pieces, and you also need a newer Boost.

What not to do

I started by trying to compile a new Clang with libc++, but it seems that I need more than 8 GB on the SD card! So I’ll wait until I can get such a card to try again.

Then I tried to compile a full Boost 1.61 (because that’s what I use on Travis CI), but this froze the board…

What to do

So the only thing to do is to build just the Boost.Test (and Boost.System) libraries:

./b2 --with-test --with-system link=shared stage

and then point the Boost root folder in the Audio Toolkit CMake configuration to this Boost folder.
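For reference, here is a minimal sketch using the standard CMake FindBoost hint variables; the paths are placeholders and the exact variables honoured by the Audio ToolKit build files may differ:

cmake -DBOOST_ROOT=$HOME/boost_1_61_0 -DBOOST_LIBRARYDIR=$HOME/boost_1_61_0/stage/lib /path/to/AudioTK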

Conclusion

I’ll post a longer post on Clang update when it’s done, but meanwhile, I can already start playing with Audio ToolKit on the Bela!


by Matt at February 21, 2017 08:11 AM

February 20, 2017

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I’m blogging weekly(ish) about the work done on Dask and related projects during the previous week. This log covers work done between 2017-02-01 and 2017-02-20. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

  1. Profiling experiments with Dask-GLM
  2. Subsequent graph optimizations, both non-linear fusion and avoiding repeatedly creating new graphs
  3. Tensorflow and Keras experiments
  4. XGBoost experiments
  5. Dask tutorial refactor
  6. Google Cloud Storage support
  7. Cleanup of Dask + SKLearn project

Dask-GLM and iterative algorithms

Dask-GLM is currently just a bunch of solvers like Newton, Gradient Descent, BFGS, Proximal Gradient Descent, and ADMM. These are useful for solving problems like logistic regression, as well as several others. The mathematical side of this work is mostly done by Chris White and Hussain Sultan at Capital One.

We’ve also been using this project to see how Dask can scale out machine learning algorithms. To this end we ran a few benchmarks here: https://github.com/dask/dask-glm/issues/26. This just generates and solves some random problems, but at larger scales.

What we found is that some algorithms, like ADMM, perform beautifully, while for others, like gradient descent, scheduler overhead can become a substantial bottleneck at scale. This is mostly because the actual in-memory NumPy operations are so fast; any sluggishness on Dask’s part becomes very apparent. Here is a profile of gradient descent:

Notice all the white space. This is Dask figuring out what to do during different iterations. We’re now working to bring this down to make all of the colored parts of this graph squeeze together better. This will result in general overhead improvements throughout the project.

Graph Optimizations - Aggressive Fusion

We’re approaching this in two ways:

  1. More aggressively fuse tasks together so that there are fewer blocks for the scheduler to think about
  2. Avoid repeated work when generating very similar graphs

In the first case, Dask already does standard task fusion. For example, if you have the following tasks:

x = f(w)
y = g(x)
z = h(y)

Dask (along with every other compiler-like project since the 1980’s) already turns this into the following:

z = h(g(f(w)))

What’s tricky with a lot of these mathematical or optimization algorithms, though, is that they are mostly, but not entirely, linear. Consider the following example:

y = exp(x) - 1/x

Visualized as a node-link diagram, this graph looks like a diamond like the following:

         o  exp(x) - 1/x
        / \
exp(x) o   o   1/x
        \ /
         o  x

Graphs like this generally don’t get fused together because we could compute both exp(x) and 1/x in parallel. However when we’re bound by scheduling overhead and when we have plenty of parallel work to do, we’d prefer to fuse these into a single task, even though we lose some potential parallelism. There is a tradeoff here and we’d like to be able to exchange some parallelism (of which we have a lot) for less overhead.
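To make the shape of the problem concrete, here is a small, purely illustrative sketch (not the actual optimization from the PR below) of the diamond graph above, written in Dask’s low-level dict representation and evaluated with the simple reference scheduler:

import operator
import numpy as np
from dask.core import get  # Dask's simple reference scheduler

# The diamond-shaped graph for y = exp(x) - 1/x.
# Each value is either data or a (function, *arguments) task tuple.
dsk = {
    'x':   np.arange(1.0, 6.0),
    'exp': (np.exp, 'x'),
    'inv': (operator.truediv, 1.0, 'x'),
    'y':   (operator.sub, 'exp', 'inv'),
}

print(get(dsk, 'y'))  # computes exp(x) - 1/x elementwise

Fusing 'exp', 'inv', and 'y' into a single task would leave the scheduler with fewer tasks to track, at the cost of no longer being able to compute exp(x) and 1/x in parallel.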

PR here dask/dask #1979 by Erik Welch (Erik has written and maintained most of Dask’s graph optimizations).

Graph Optimizations - Structural Sharing

Additionally, we no longer make copies of graphs in dask.array. Every collection like a dask.array or dask.dataframe holds onto a Python dictionary holding all of the tasks that are needed to construct that array. When we perform an operation on a dask.array we get a new dask.array with a new dictionary pointing to a new graph. The new graph generally has all of the tasks of the old graph, plus a few more. As a result, we frequently make copies of the underlying task graph.

import dask.array as da
x = da.ones(5, chunks=5)
y = (x + 1)
assert set(y.dask).issuperset(x.dask)

Normally this doesn’t matter (copying graphs is usually cheap) but it can become very expensive for large arrays when you’re doing many mathematical operations.

Now we keep dask graphs in a custom mapping (dict-like object) that shares subgraphs with other arrays. As a result, we rarely make unnecessary copies and some algorithms incur far less overhead. Work done in dask/dask #1985.

TensorFlow and Keras experiments

Two weeks ago I gave a talk with Stan Seibert (Numba developer) on Deep Learning (Stan’s bit) and Dask (my bit). As part of that talk I decided to launch TensorFlow from Dask and feed it from a distributed Dask array. See this blogpost for more information.

That experiment was nice in that it showed how easy it is to deploy and interact with other distributed services from Dask. However, from a deep learning perspective it was immature. Fortunately, it succeeded in attracting the attention of other potential developers (the true goal of all blogposts) and now Brett Naul is using Dask to manage his GPU workloads with Keras. Brett contributed code to help Dask move around Keras models. He seems to particularly value Dask’s ability to manage resources to help him fully saturate the GPUs on his workstation.

XGBoost experiments

After deploying TensorFlow we asked what it would take to do the same for XGBoost, another very popular (though very different) machine learning library. The conversation for that is here: dmlc/xgboost #2032, with prototype code here: mrocklin/dask-xgboost. As with TensorFlow, the integration is relatively straightforward (if perhaps a bit simpler in this case). The challenge for me is that I have little concrete experience with the applications that these libraries were designed to solve. Feedback and collaboration from open source developers who use these libraries in production is welcome.

Dask tutorial refactor

The dask/dask-tutorial project on GitHub was originally written for PyData Seattle in July 2015 (roughly 19 months ago). Dask has evolved substantially since then, but this is still our only educational material. Fortunately, Martin Durant is doing a pretty serious rewrite, both correcting parts that are no longer modern API and adding new material on distributed computing and debugging.

Google Cloud Storage

Dask developers (mostly Martin) maintain libraries to help Python users connect to distributed file systems like HDFS (with hdfs3), S3 (with s3fs), and Azure Data Lake (with adlfs), which subsequently become usable from Dask. Martin has been working on support for Google Cloud Storage (with gcsfs), another small project that uses the same API.
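As a rough sketch of what this enables (the bucket and file pattern below are made up, and gcsfs must be installed so that Dask recognizes the protocol), reading data from Google Cloud Storage then looks just like reading it from S3 or local disk:

import dask.dataframe as dd

# 'gcs://my-bucket/...' is a hypothetical path; gcsfs handles authentication
# and file listing behind the scenes, much as s3fs does for 's3://' paths.
df = dd.read_csv('gcs://my-bucket/data/2017-*.csv')
print(df.head())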

Cleanup of Dask+SKLearn project

Last year Jim Crist published three great blogposts about using Dask with SKLearn. The result was a small library, dask-learn, that had a variety of features, some incredibly useful, like a cluster-ready Pipeline and GridSearchCV, others less so. Because of the experimental nature of this work we had labeled the library “not ready for use”, which drew some curious responses from potential users.

Jim is now busy dusting off the project, removing less-useful parts and generally reducing scope to strictly model-parallel algorithms.

February 20, 2017 12:00 AM

February 19, 2017

Titus Brown

Request for Compute Infrastructure to Support the Data Intensive Biology Summer Institute for Sequence Analysis at UC Davis

Note: we were just awarded this allocation on Jetstream for DIBSI. Huzzah!


Abstract:

Large datasets have become routine in biology. However, performing a computational analysis of a large dataset can be overwhelming, especially for novices. From June 18 to July 21, 2017 (30 days), the Lab for Data Intensive Biology will be running several different computational training events at the University of California, Davis for 100 people and 25 instructors. In addition, there will be a week-long instructor training in how to reuse our materials, and focused workshops, such as: GWAS for veterinary animals, shotgun environmental -omics, binder, non-model RNAseq, introduction to Python, and lesson development for undergraduates. The materials for the workshop were previously developed and tested by approximately 200 students on Amazon Web Services cloud compute services at Michigan State University's Kellogg Biological Station between 2010 and 2016, with support from the USDA and NIH. Materials are and will continue to be CC-BY, with scripts and associated code under BSD; the material will be adapted for Jetstream cloud usage and made available for future use.

Keywords: Sequencing, Bioinformatics, Training

Principal investigator: C. Titus Brown

Field of science: Genomics

Resource Justification:

We are requesting 100 m.medium instances (6 cores, 16 GB RAM, and 130 GB VM space each), one for each instructor and student, for 4 weeks. The total request is for 432,000 service units (6 cores * 24 hrs/day * 30 days * 100 people). To accommodate large data files, an additional 100 GB storage volume is requested for each person. Persistent storage beyond the duration of the training workshop is not necessary.

These calculations are based on our experience running the course on AWS cloud services since 2010, with approximately 200 students in total.

Syllabus:

http://ivory.idyll.org/dibsi/

http://angus.readthedocs.io/en/2016/

Resources: IU/TACC (Jetstream)

by Lisa Johnson Cohen at February 19, 2017 11:00 PM

February 16, 2017

Continuum Analytics news

Continuum Analytics to Speak at Galvanize New York

Friday, February 17, 2017

Chief Data Scientist and Co-founder Travis Oliphant to Discuss the Open Data Science Innovations That Will Change Our World 

NEW YORK, NY—February 17, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that Chief Data Scientist and Co-founder Travis Oliphant will be speaking at Galvanize New York. During his presentation, “Reaching the full-potential of a data-driven world in the Anaconda Community,” taking place on February 21 at 7:00 p.m. EST, Oliphant will discuss how the Anaconda platform is bringing together Python and other Open Data Science tools to bring about innovations that will change our world.

Oliphant will discuss how the rise of Python and data science has driven tremendous growth of the Open Data Science community. In addition, he will describe the open source technology developed at Continuum Analytics––including a preview of Anaconda Enterprise 5.0––and explain how attendees can participate in the growing business opportunities around the Anaconda ecosystem.

WHO: Travis Oliphant, chief data scientist and co-founder, Continuum Analytics
WHAT: “Reaching the full-potential of a data-driven world in the Anaconda Community” 
WHEN: February 21, 7:00 p.m. - 9:00 p.m. EST
WHERE: Galvanize New York - West Soho - 315 Hudson St. New York, NY 10013
REGISTER: HERE

Oliphant has a Ph.D. from the Mayo Clinic and B.S. and M.S. degrees in Mathematics and Electrical Engineering from Brigham Young University. Since 1997, he has worked extensively with Python for numerical and scientific programming, most notably as the primary developer of the NumPy package, and as a founding contributor of the SciPy package. He is also the author of the definitive Guide to NumPy. He has served as a director of the Python Software Foundation and as a director of NumFOCUS.

About Anaconda Powered by Continuum Analytics

Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io

###

Media Contact:
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

by swebster at February 16, 2017 10:17 PM

February 15, 2017

Enthought

Traits and TraitsUI: Reactive User Interfaces for Rapid Application Development in Python

The Enthought Tool Suite team is pleased to announce the release of Traits 4.6. Together with the release of TraitsUI 5.1 last year, these core packages of Enthought’s open-source rapid application development tools are now compatible with Python 3 as well as Python 2.7.  Long-time fans of Enthought’s open-source offerings will be happy to hear about the recent updates and modernization we’ve been working on, including the recent release of Mayavi 4.5 with Python 3 support, while newcomers to Python will be pleased that there is an easy way to get started with GUI programming which grows to allow you to build applications with sophisticated, interactive 2D and 3D visualizations.

A Brief Introduction to Traits and TraitsUI

Traits is a mature reactive programming library for Python that allows application code to respond to changes on Python objects, greatly simplifying the logic of an application.  TraitsUI is a tool for building desktop applications on top of the Qt or WxWidgets cross-platform GUI toolkits. Traits, together with TraitsUI, provides a programming model for Python that is similar in concept to modern and popular Javascript frameworks like React, Vue and Angular but targeting desktop applications rather than the browser.

Traits is also the core of Enthought’s open source 2D and 3D visualization libraries Chaco and Mayavi, drives the internal application logic of Enthought products like Canopy, Canopy Geoscience and Virtual Core, and Enthought’s consultants appreciate the way it facilitates the rapid development of desktop applications for our consulting clients. It is also used by several open-source scientific software projects such as the HyperSpy multidimensional data analysis library and the pi3Diamond application for controlling diamond nitrogen-vacancy quantum physics experiments, and in commercial projects such as the PyRX Virtual Screening software for computational drug discovery.

The open-source pi3Diamond application built with Traits, TraitsUI and Chaco by Swabian Instruments.

Traits is part of the Enthought Tool Suite of open source application development packages and is available to install through Enthought Canopy’s Package Manager (you can download Canopy here) or via Enthought’s new edm command line package and environment management tool. Running

edm install traits

at the command line will install Traits into your current environment.

Traits

The Traits library provides a new type of Python object which has an event stream associated with each attribute (or “trait”) of the object that tracks changes to the attribute. This means that you can decouple your application model much more cleanly: rather than an object having to know all the work which might need to be done when it changes its state, other parts of the application register the pieces of work that each of them needs when the state changes, and Traits automatically takes care of running that code. This results in simpler, more modular and loosely-coupled code that is easier to develop and maintain.

Traits also provides optional data validation and initialization that dramatically reduces the amount of boilerplate code that you need to write to set up objects into a working state and ensure that the state remains valid.  This makes it more likely that your code is correct and does what you expect, resulting in fewer subtle bugs and more immediate and useful errors when things do go wrong.

When you consider all the things that Traits does, it would be reasonable to expect that it may have some impact on performance, but the heart of Traits is written in C and knows more about the structure of the data it is working with than general Python code. This means that it can make some optimizations that the Python interpreter can’t, the net result of which is that code written with Traits is often faster than equivalent pure Python code.

Example: A To-Do List in Traits

To be more concrete, let’s look at writing some code to model a to-do list.  For this, we are going to have a “to-do item” which represents one task and a “to-do list” which keeps track of all the tasks and which ones still need to be done.

Each “to-do item” should have a text description and a boolean flag which indicates whether or not it has been done.  In standard Python you might write this something like:

class ToDoItem(object):
    def __init__(self, description='Something to do', completed=False):
        self.description = description
        self.completed = completed

But with Traits, this would look like:

from traits.api import Bool, HasTraits, Unicode

class ToDoItem(HasTraits):
    description = Unicode('Something to do')
    completed = Bool

You immediately notice that Traits is declarative – all we have to do is declare that the ToDoItem has attributes description and completed and Traits will set those up for us automatically with default values – no need to write an __init__ method unless you want to, and you can override the defaults by passing keyword arguments to the constructor:

>>> to_do = ToDoItem(description='Something else to do')
>>> print(to_do.description)
Something else to do
>>> print(to_do.completed)
False

Not only is this code simpler, but we’ve declared that the description attribute’s type is Unicode and the completed attribute’s type is Bool, which means that Traits will validate the type of new values set to these Traits:

>>> to_do.completed = 'yes'
TraitError: The 'completed' trait of a ToDoItem instance must be a boolean,
but a value of 'yes' <type 'str'> was specified.

Let’s move on to the second class, the “to-do list,” which tracks which items are completed. With standard Python classes, each ToDoItem would need to know the list to which it belonged and have a special method that handles changing the completed state, which at its simplest might look something like:

class ToDoItem(object):
    def __init__(self, to_do_list, description='', completed=False):
        self.to_do_list = to_do_list
        self.description = description
        self.completed = completed

    def update_completed(self, completed):
        self.completed = completed
        self.to_do_list.update()

And this would be even more complex if an item might be a member of multiple “to do list” instances. Or worse, some other class which doesn’t have an update() method, but still needs to know when a task has been completed.

Traits solves this problem by having each attribute being reactive: there is an associated stream of change events that interested code can subscribe to. You can use the on_trait_change method to hook up a function that reacts to changes:

>>> def observer(new_value):
...     print("Value changed to: {}".format(new_value))
...
>>> to_do.on_trait_change(observer, 'completed')
>>> to_do.completed = True
Value changed to: True
>>> to_do.completed = False
Value changed to: False

It would be easy to have the “to-do list” class set up update observers for each of its items. But setting up these listeners manually for everything that you want to listen to can get tedious. For example, we’d need to track when we add new items and remove old items so we could add and remove listeners as appropriate. Traits has a couple of mechanisms to automatically observe the streams of changes and avoid that sort of bookkeeping code. A class holding a list of our ToDoItems which automatically reacts to changes both in the list and in the completed state of each of these items might look something like this:

from traits.api import HasTraits, Instance, Int, List, Property, on_trait_change

class ToDoList(HasTraits):
    items = List(Instance(ToDoItem))
    remaining_items = List(Instance(ToDoItem))
    remaining = Property(Int, depends_on='remaining_items')

    @on_trait_change('items.completed')
    def update(self):
        self.remaining_items = [item for item in self.items
                                if not item.completed]

    def _get_remaining(self):
        return len(self.remaining_items)

The @on_trait_change decorator sets up an observer on the items list and the completed attribute of each of the objects in the list which calls the method whenever a change occurs, updating the value of the remaining_items list.

An alternative way of reacting is to have a Property, which is similar to a regular Python property, but which is lazily recomputed as needed when a dependency changes.  In this case the remaining property listens for when the remaining_items list changes and will be recomputed by the specially-named _get_remaining method when the value is next asked for.

>>> todo_list = ToDoList(items=[
...     ToDoItem(description='Unify relativity and quantum mechanics'),
...     ToDoItem(description='Prove Riemann Hypothesis')])
...
>>> print(todo_list.remaining)
2
>>> todo_list.items[0].completed = True
>>> print(todo_list.remaining)
1

Perhaps the most important fact about this is that we didn’t need to modify our original ToDoItem in any way to support the ToDoList functionality.  In fact we can have multiple ToDoLists sharing ToDoItems, or even have other objects which listen for changes to the ToDoItems, and everything still works with no further modifications. Each class can focus on what it needs to do without worrying about the internals of the other classes.
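For instance, continuing the interactive session above, two lists can share the same item and both stay up to date; this is a quick sketch using only the classes already defined:

>>> item = ToDoItem(description='Write blog post')
>>> list_a = ToDoList(items=[item])
>>> list_b = ToDoList(items=[item])
>>> (list_a.remaining, list_b.remaining)
(1, 1)
>>> item.completed = True   # one change, and both lists react
>>> (list_a.remaining, list_b.remaining)
(0, 0)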

Hopefully you can see how Traits allows you to do more with less code, making your applications and libraries simpler, more robust and flexible.  Traits has many more features than we can show in a simple example like this, but comprehensive documentation is available at http://docs.enthought.com/traits and, being BSD-licensed open-source, the Traits code is available at https://github.com/enthought/traits.

TraitsUI

One place where reactive frameworks really shine is in building user interfaces. When a user interacts with a GUI they change the state of the UI and a reactive system can use those changes to update the state of the business model appropriately. In fact all of the reactive Javascript frameworks mentioned earlier in the article come with integrated UI systems that make it very easy to describe a UI view declaratively with HTML and hook it up to a model in Javascript.  In the same way Traits comes with strong integration with TraitsUI, a desktop GUI-building library that allows you describe a UI view declaratively with Python and hook it up to a Traits model.

TraitsUI itself sits on top of either the wxPython, PyQt or PySide GUI library wrappers, and in principle could have other backends written for it if needed. Between the facilities that the Traits and TraitsUI libraries provide, it is possible to quickly build desktop applications with clear separation of concerns between UI and business logic.

TraitsUI uses the standard Model-View-Controller or Model-View-ViewModel patterns for building GUI applications, and it allows you to add complexity as needed. Often all that you require is a model class written in Traits and simple declarative view on that class, and TraitsUI will handle the rest for you.

Example: A To-Do List UI

Getting started with TraitsUI is simple. If you have TraitsUI and a compatible GUI toolkit installed in your working environment, such as by running the command-line:

edm install pyqt traitsui

then any Traits object has a default GUI available with no additional work:

>>> todo_item.configure_traits()

[Screenshot: traits-properties, the default TraitsUI dialog]

With a little more finesse we can improve the view. In TraitsUI you do this by creating a View for your HasTraits class:

from traitsui.api import HGroup, Item, VGroup, View

todo_item_view = View(
    VGroup(
        Item('description', style='custom', show_label=False),
        HGroup(Item('completed')),
    ),
    title='To Do',
    width=360, height=240,
    resizable=True,
)

Views are defined declaratively, and are independent of the model: we can have multiple Views for the same model class, or even have a View which works with several different model classes. You can even declare a default view as part of your class definition if you want. In any case, once you have a view you can use it by passing it as the view parameter:

>>> todo_item.configure_traits(view=todo_item_view)

This produces a fairly nice, if basic, UI to edit an object.

[Screenshot: traits-todo, the customized to-do item view]

If you run these examples within an interactive IPython terminal session (such as from the Canopy editor, or the IPython QtConsole), you’ll see that these user interfaces are hooked up “live” to the underlying Traits objects: when you type into the text field or toggle the “completed” checkbox, the values of attributes change automatically. Coupled with the ability to write Traits code that reacts to those changes you can write powerful applications with comparatively little code. For a complete example of a TraitsUI application, have a look at the full to-do list application on Github.

These examples only scratch the surface of what TraitsUI is capable of. With more work you can create UIs with complex views including tables and trees, or add menu bars and toolbars which can drive the application. And 2D and 3D plotting is available via the Chaco and Mayavi libraries. Full documentation for TraitsUI is available at http://docs.enthought.com/traitsui including a complete example of writing an image capture application using TraitsUI.

Enthought Tool Suite

The Enthought Tool Suite is a battle-tested rapid application development system. If you need to make your science or business code more accessible, it provides a toolkit that you can use to build applications for your users with a lot less difficulty than using low-level wxPython or Qt code.  And if you want to focus on what you do best, Enthought Consulting can help with scientist-developers to work on your project.

But best of all the Enthought Tool Suite is open source and licensed with the same BSD-style license as Python itself, so it is free to use, and if you love it and find you want to improve it, we welcome your participation and contributions! Join us at http://www.github.com/enthought.

Traits and TraitsUI Resources


by Corran Webster at February 15, 2017 03:01 PM

Continuum Analytics news

New Research eBook - Winning at Data Science: How Teamwork Leads to Victory

Wednesday, February 15, 2017
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

As I write this blog post, the entire Anaconda team (myself included) is recovering from an eye-opening, inspiring and downright incredible experience at our very first AnacondaCON event this past week. The JW Marriott Austin was brimming with hundreds of people looking to immerse themselves and learn more about Open Data Science in the enterprise. Being at the conference and chatting with customers, prospects, community members and others with an interest in the Open Data Science movement further validated that data science has emerged more prominently in the enterprise, but not as quickly as we’d like, and not as quickly as enterprise leaders should like. Now is an ideal time to look carefully at priorities.

As we’re catching our breath from the whirlwind that was AnacondaCON, we’re excited to reveal the findings of a study we have been working on for many months, providing answers to some of the questions we saw surface at the event: How are enterprises responding to the undiscovered value and advances in data science? How can they use data science to its full advantage? We asked company decision leaders (200, to be exact) and data scientists (500+) to help us understand the current beliefs and attitudes on data science.

Some highlights include:

  • While 96 percent of company execs say data science is critical to the success of their business and 73 percent rank it among their top three most valuable technologies, 22 percent aren’t making full use of the data available to them

  • A whopping 94 percent of enterprises are using open source for data science, but only 50 percent are using the results on the front lines of their business

  • Only 31 percent of execs are using data science daily, and less than half have implemented data science teams 

  • These hesitations to adopt are ultimately due to companies being satisfied with the status quo (38 percent), struggling to calculate ROI (27 percent) and budgetary restrictions (24 percent)

So, what’s missing? 

Collaboration. Our survey revealed that 69 percent of respondents associate “Open Data Science” with collaboration. No longer just a one-person job, teams are clearly what’s needed in order to capitalize on the volume of data—data science is a team sport. As we saw at both AnacondaCON and within our survey results, collaboration helps enterprises harness their data faster and extract more value to ultimately give people superpowers to change the world.

Download our full eBook Winning at Data Science: How Teamwork Leads to Victory and read our press release to learn more.

*This study was conducted by research firm Vanson Bourne, surveying 200 company executives and 500 data scientists at U.S. organizations. 

by swebster at February 15, 2017 01:48 PM

New Research Proves Increased Awareness in the Value of Open Data Science, but Enterprises are Slow to Respond

Wednesday, February 15, 2017

Data science is critical for success, but Continuum Analytics finds just 49 percent have data science teams in place
  
AUSTIN, Texas—February 15, 2017—New research announced today by Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, finds that 96 percent of data science and analytics decision makers agree that data science is critical to the success of their business, yet a whopping 22 percent are failing to make full use of the data available. These findings are included in Continuum Analytics’ new eBook, Winning at Data Science: How Teamwork Leads to Victory, based on the company’s inaugural study that explores the state of Open Data Science in the enterprise. Download the eBook here.

The research, conducted by independent research firm Vanson Bourne, surveyed 200 data science and analytics decision makers at U.S. organizations of all sizes and industries, to examine the state of Open Data Science in the enterprise. Continuum Analytics also surveyed more than 500 data scientists to uncover similarities and disparities between the two groups. Topics ranged from the value of data science, challenges around adoption and how data science is being utilized in the enterprise.

Key takeaways and findings from the research include:  

  • The benefits of data science in the enterprise are undisputed; 73 percent of respondents ranked it as one of the top three most valuable technologies they use. Conversely, findings show that a disparity exists between understanding the impact of data science and actually executing it in the enterprise––62 percent said data science is used at least on a weekly basis, but just 31 percent of that group are using it daily. 

  • When comparing the beliefs of executives/IT managers with data scientists, nearly all respondents from both groups agree on the critical impact of data science in the enterprise. However, a divide exists around where companies are in the data science lifecycle. Just 24 percent of data scientists feel their companies have reached the “teen” stage––developed enough to hold its own with room to mature––as opposed to the 40 percent of executives who feel confident they have arrived at this stage of development. 

  • Despite the benefits offered by data science, 22 percent of enterprise respondents report that their teams are failing to use the data to its potential. What’s more, 14 percent use data science very minimally or not at all, due to three primary adoption barriers: executive teams that are satisfied with the status quo (38 percent), a struggle to calculate ROI (27 percent) and budgetary restrictions (24 percent). 

While obstacles persist, an increasingly data-driven world calls for data science teams in the enterprise—it’s not a one person job. Though 89 percent of organizations have at least one data scientist, less than half have data science teams. Findings revealed that 69 percent of respondents associate Open Data Science with collaboration, proving that teamwork is essential to exploit the power of the data, requiring a combination of skills best tackled by a strong team. 

“Over 94 percent of the enterprises in the survey rely on open source for data science. Open Data Science is the Rosetta Stone to unlocking the value locked away in data, especially Big Data,” said Michele Chambers, EVP Anaconda Business Unit, Continuum Analytics. “Our research shows that data science is no longer just for competitive advantage; it needs to be infused into day-to-day operations to maximize the value of data. Data science is business and the best run businesses run Open Data Science.” 

For more information about the survey results, read the Anaconda blog post here. To view the full eBook, download here.

About Anaconda powered by Continuum Analytics

Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io.  

##

by swebster at February 15, 2017 01:44 PM

February 14, 2017

Titus Brown

My thoughts for "Imagining Tomorrow's University"

So I've been invited to Imagining Tomorrow's University, and they have this series of questions they'd like me to answer.

(Note that you can follow the conversation at #TomorrowsUni on Twitter.)


Conveniently I already answered many of these questions in my "What is Open Science?" blog post. I've copy/pasted from that for the first two answers.


Q: What is your two sentence definition of open science (or open research)?

A: Open science is the philosophical perspective that sharing is good and that barriers to sharing should be lowered as much as possible. The practice of open science is concerned with the details of how to lower or erase the technical, social, and cultural barriers to sharing.


Q: Why is open science important for transforming research and learning?

A: The potential value of open science should be immediately obvious: easier and faster access to ideas, methods, and data should drive science forward faster! But open science can also aid with reproducibility and replication, decrease the effects of economic inequality in the sciences by liberating ideas from subscription paywalls, and provide reusable materials for teaching and training.


Q: How can open science increase the societal impact of university research?

A: I have two answers.

First, if open science accelerates research progress, then that increases the societal impact intrinsically.

Second, serendipity will strike. Most of my "wins" from open science have been unexpected - people using our research products in ways I never could have predicted or intended. This is really only possible if those research products are made fully available.


Q: How is open science part of, and important for your own research, teaching, and service agendas?

A: I think it's philosophically central to my view of how research should work. In that sense, it's integral to our research agenda, and it increases the impacts of our research and teaching. For service, I'm not sure what to say, although I prefer to donate my time to open organizations.


Q: What are the important activities, structures, etc. that have supported you in pursuing open science?

A: If I had to pick one, it would be the Moore Foundation. Without question, the Moore Foundation Data Driven Discovery Investigator award (links here) validates my decision to do open science in the past, and in turn gives me the freedom to try new things in the future.

I think blogging and Twitter have been integral to my pursuit of open science, my development of perspectives, and my discovery of a community of thought and practice around open science.


Q: What are the major technical, organizational, social, or cultural challenges you face, particularly as related to openness and sharing within your university and academia?

A: While most scientists are supportive of open science in theory, in fact most scientists are leery of actually sharing things widely before publication. This is disappointing but understandable in light of the incentive systems in place.

At my Assistant Professor job, I received a lot of administrator pushback on the time I was expending on open science, and this even made its way into a tenure letter. That having been said, in publication and funding reviews, I've received nothing but positive comments, so I think that's more important than what my administrative chain says. My colleagues have been nothing but supportive (see above, "theory" vs "practice".)


Q: If you had a senior leadership role in a university, what would you do to promote change and improve your university?

A: I'm not convinced there's anything that can be done by a university leader. University leadership is largely irrelevant to the daily practice of research, teaching, and service, in my experience. (I think university leadership is very important in facilitating a good environment at their institution, so they're not useless at all; they just don't have anything to do with my research directly, and nor should they.)

I think we need community leaders to effect change, and by community leaders I mean research leaders (senior folk with strong research careers - members of the National Academy, Nobel laureates, etc.). These folk need to visibly and loudly abandon the broken "journal prestige" system, forcefully push back against university administration on matters of research evaluation and tenure, and be a loud presence on grant panels and editorial boards.

The other thing we need is more open science practice. I feel like too much time is spent talking about how wonderful open science would be if we could just mandate foo bar and baz, and not enough time is spent actually doing science. Conveniently, Bjorn Brembs has written up this problem in detail.


Q: What $10M or more, risky and potentially transformative, big idea research proposal would you be writing if you had the right open science resources, and institutional support?

A: What a coincidence! I happen to have written something up here, What about all those genes of unknown function?. But it would cost $50m. One particularly relevant bit:

More importantly, I'd insist on pre-publication sharing of all the data within a walled garden of all the grantees, together with regular meetings at which all the grad students and postdocs could mix to talk about how to make use of the data. (This is an approach that Sage Biosciences has been pioneering for biomedical research.) I'd probably also try to fund one or two groups to facilitate the data storage and analysis -- maybe at $250k a year or so? -- so that all of the technical details could be dealt with.

But, while this approach could have massive impact on biology, I can answer the question a different way, too: what would I do with $10m if it landed in my lap?

I'd probably try to build something like Manylabs. I was pretty inspired by the environment there during a recent visit, and I think it could translate into a slightly more academic setting easily. I envision an institute that combines open space for brainstorming, collaboration, and networking, with regular short-term training events (a la Software & Data Carpentry) and long-term data science fellows (a la the Moore/Sloan Data Science Environments) while providing grants for a bunch of sabbatical folk. I'd park it in downtown Davis (good coffee, beer, food, bicycling), fill it with interesting people, and stir well.

However, let's be honest -- $10m isn't enough to effect real change in our university system, and in any case my experience with big grants is you have to over-promise in order to get the necessary funding. (There are one or two exceptions to this, but it's a pretty good rule ;).

If you wanted me to effect interesting change on the university level, I'd need about $10m a year for 10 years to run an incubator institute as a proof of concept. And given $1 bn/10 years to spend, I think we could do something really interesting by building a decentralized university for teaching and research. Happy to chat...


I have more to say but maybe I'll save it for the post-event blogging :)

--titus

by C. Titus Brown at February 14, 2017 11:00 PM

Continuum Analytics news

AnacondaCON Recap: A Community Comes Together

Tuesday, February 14, 2017
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

Last week, more than 400 Open Data Science community members descended on the city of Austin to attend our inaugural AnacondaCON event. From data scientists to engineers to business analysts, community members shared best practices, fostered new connections and worked to accelerate both their personal and organization’s path to Open Data Science. One thing is absolute—by the end of the three days, everyone felt the unmistakable buzz and excitement around a growing market on the edge of transforming the world. 

In case you weren’t able to join us (or are just looking to relive the fun), below are some highlights from AnacondaCON 2017: 

Peter Wang, our CTO and co-founder, kicked off the event with a keynote about the ongoing need for better access to data, a theme that was echoed throughout a number of breakout sessions over the two days. 

Sessions varied between business and technology tracks, enabling everyone to find a topic that resonated with them. Industry leaders from organizations such as Forrester, Capital One, General Electric, Clover Health, Bloomberg and many others shared personal insights on working with Open Data Science. These speakers also discussed areas in which they hoped would be given attention by the community. 

We encouraged everyone to bring their data science teams, and many did—including our newest #AnacondaCREW members from Legoland! The theme of teamwork and collaboration permeated throughout the event. Since data science is a team sport—and we don’t want anyone working alone—we gave everyone their very own team…a team of Anaconda Legos to take home! 

After an insightful and inspiring first day, we ended the night with some southern hospitality. We changed gears, spending time getting to know each other and having some fun! With a spread of classic Texan BBQ, a mariachi band and Anaconda-themed airbrush tattoos, I think it’s safe to say no one got a lot of sleep that night. 

Day two was just as successful—between sessions, attendees engaged in lively discussions about different ways to capitalize on data science to drive new outcomes and impact organizations in new ways. The Artificial Intelligence panel and City of Boston presentation on data science in the public sector also helped spark exciting conversation among attendees. 

Travis Oliphant, chief data scientist and co-founder of Continuum Analytics, closed out the conference by sharing a look at the tremendous growth of the Anaconda community and his thoughts on how the community can continue to grow and thrive, pushing the Open Data Science movement forward. 

On behalf of our entire team, we want to thank everyone who attended and helped make this year’s event such a success. Below are a few of your own highlights from the event. Hope to see you again in 2018.

 

 

All photos courtesy of Casey Chapman-Ross Photography

by swebster at February 14, 2017 03:48 PM

Matthieu Brucher

Review of Intel Parallel Studio 2017: Advisor

Recently, I got access to the latest release of Parallel Studio, with an updated version of Advisor. Six years after my last review, let’s dive into it again!

First, lots of things have changed since the first release of Parallel Studio. Many of the dedicated tools were merged with bigger ones (Amplifier is now VTune, Composer is Debugger…) but still kept their nicer GUIs and workflows. They have also evolved quite a lot in a decade, focusing now on how to extract the maximum performance from Intel CPUs through vectorization and threading.

Context

Intel provides a large range of tools, from a compiler to a sampling profiler (based on hardware counters), from a debugger to tools analyzing program behavior. Of course, all these applications have a goal: selling more chips. As it’s not just about the chip, this is a fair fight: you need to be able to get the most out of your hardware.

Of course, I’ll use Audio Toolkit for the demo and the screenshots.

Presentation

Let’s start at the beginning with a new project. You will need to set up a new project (or open an old one), which leads you to the following screenshot.

Setting up project properties in Advisor

Survey hotspots analysis is basically what you require; under Survey trip count analysis, you may also want to tick Collect information about FLOPS if that’s the kind of analysis you are looking for. In future versions, this will be required for the roofline analysis, which is not yet commercially available.

Once the configuration is done, let’s run the 1. Survey target, which leads to the next screenshot.

Summary tab

I suggest saving snapshots (with the camera icon) after each run, as each run will actually overwrite e000, and bundling at least the source code with them.

Now it is possible to see the results:

Advisor results for an IIR filter

I guess it is time now for a quick demo on how we can decrypt such results and improve on them.

Demo

The first interesting bit is that it is indeed the IIR filter that takes most of the relevant time. Advisor only works on loops, but as audio processing is all about loops, everything is fine. Each loop has different annotations, and the ones in the IIR filter have the note “Compiler lacks sufficient information to vectorize the loop”. The issue here is that the Visual Studio compiler can’t vectorize this code properly, so let’s use the Intel compiler instead (in the CMake GUI, use -t “Intel C++ Compiler 17.0”).

Advisor results for an IIR filter (Intel compiler)

I added a pragma in the source code to force the vectorization, so the results are quite interesting. We get a good speed-up compared to the previous version (6.9 s down to 4.7 s), but the numbers are skewed because the order of the filter is odd, so there is an even number of coefficients for this loop (the FIR part of the filter), which works great for SSE2. Here only one loop is vectorized, which you can see because its icon is orange instead of blue.

If I push further and ask for AVX instructions, then we start seeing indications that a loop may be inefficient. In the following screenshot, I reordered the FIR loop so that we vectorize over the number of samples being processed rather than the number of coefficients (usually there are only a handful of coefficients but up to hundreds of samples, so far more opportunities for vectorization). As a result, that loop is not marked as inefficient. But the second one (the IIR part) is inefficient, as we can’t reorder that loop straight away.

Advisor results for an IIR filter (optimized Intel compiler)

Here, we see that Advisor tags all the calls to the loop as Remainder (or Vectorized Remainder), which is the part where the vectorized loop finishes (the start is Peel, before the samples are aligned, then Body, when the data is aligned and the full content of the register is used, and then Remainder, when the data is aligned but only the first part of the vector registers can be used). And the efficiency of this loop is poor, only 9%, compared to the 76% of the reordered loop.

Conclusion

This was a small tutorial on Advisor. I also added alignment to the arrays in a filter so that the Peel part would be reduced, along with other optimizations. I didn’t talk about the rest of the analytics Advisor provides, but you get the idea, and the fun of these tools is also in exploring them.

One final note: Advisor doesn’t like huge applications; it thrives on small applications (with a small number of loops), so try to extract your kernels with representative data.


by Matt at February 14, 2017 09:00 AM

February 11, 2017

Matthew Rocklin

Experiment with Dask and TensorFlow

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

This post briefly describes potential interactions between Dask and TensorFlow and then goes through a concrete example using them together for distributed training with a moderately complex architecture.

This post was written in haste, see disclaimers below.

Introduction

Dask and TensorFlow both provide distributed computing in Python. TensorFlow excels at deep learning applications while Dask is more generic. We can combine both together in a few applications:

  1. Simple data parallelism: hyper-parameter searches during training and predicting already-trained models against large datasets are both trivial to distribute with Dask as they would be trivial to distribute with any distributed computing system (Hadoop/Spark/Flink/etc..) We won’t discuss this topic much. It should be straightforward.
  2. Deployment: A common pain point with TensorFlow is that setup isn’t well automated. This plagues all distributed systems, especially those that are run on a wide variety of cluster managers (see cluster deployment blogpost for more information). Fortunately, if you already have a Dask cluster running it’s trivial to stand up a distributed TensorFlow network on top of it running within the same processes.
  3. Pre-processing: We pre-process data with dask.dataframe or dask.array, and then hand that data off to TensorFlow for training. If Dask and TensorFlow are co-located on the same processes then this movement is efficient. Working together we can build efficient and general use deep learning pipelines.

In this blogpost we look very briefly at the first case of simple parallelism. Then go into more depth on an experiment that uses Dask and TensorFlow in a more complex situation. We’ll find we can accomplish a fairly sophisticated workflow easily, both due to how sensible TensorFlow is to set up and how flexible Dask can be in advanced situations.

Motivation and Disclaimers

Distributed deep learning is fundamentally changing the way humanity solves some very hard computing problems like natural language translation, speech-to-text transcription, image recognition, etc.. However, distributed deep learning also suffers from public excitement, which may distort our image of its utility. Distributed deep learning is not always the correct choice for most problems. This is for two reasons:

  1. Focusing on single machine computation is often a better use of time. Model design, GPU hardware, etc. can have a more dramatic impact than scaling out. For newcomers to deep learning, watching online video lecture series may be a better use of time than reading this blogpost.
  2. Traditional machine learning techniques like logistic regression, and gradient boosted trees can be more effective than deep learning if you have finite data. They can also sometimes provide valuable interpretability results.

Regardless, there are some concrete take-aways, even if distributed deep learning is not relevant to your application:

  1. TensorFlow is straightforward to set up from Python
  2. Dask is sufficiently flexible out of the box to support complex settings and workflows
  3. We’ll see an example of a typical distributed learning approach that generalizes beyond deep learning.

Additionally the author does not claim expertise in deep learning and wrote this blogpost in haste.

Simple Parallelism

Most parallel computing is simple. We easily apply one function to lots of data, perhaps with slight variation. In the case of deep learning this can enable a couple of common workflows:

  1. Build many different models, train each on the same data, choose the best performing one. Using dask’s concurrent.futures interface, this looks something like the following:

    # Hyperparameter search
    client = Client('dask-scheduler-address:8786')
    scores = client.map(train_and_evaluate, hyper_param_list, data=data)
    best = client.submit(max, scores)
    best.result()
    
  2. Given an already-trained model, use it to predict outcomes on lots of data. Here we use a big data collection like dask.dataframe:

    # Distributed prediction
    
    df = dd.read_parquet('...')
    ... # do some preprocessing here
    df['outcome'] = df.map_partitions(predict)
    

These techniques are relatively straightforward if you have modest exposure to Dask and TensorFlow (or any other machine learning library like scikit-learn), so I’m going to ignore them for now and focus on more complex situations.

Interested readers may find this blogpost on TensorFlow and Spark of interest. It is a nice writeup that goes over these two techniques in more detail.

A Distributed TensorFlow Application

We’re going to replicate this TensorFlow example which uses multiple machines to train a model that fits in memory using parameter servers for coordination. Our TensorFlow network will have three different kinds of servers:

distributed TensorFlow training graph

  1. Workers: which will get updated parameters, consume training data, and use that data to generate updates to send back to the parameter servers
  2. Parameter Servers: which will hold onto model parameters, synchronizing with the workers as necessary
  3. Scorer: which will periodically test the current parameters against validation/test data and emit a current cross_entropy score to see how well the system is running.

This is a fairly typical approach when the model itself can fit on one machine, but we want to use multiple machines to accelerate training or because data volumes are too large.
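For context, this architecture corresponds to an ordinary TensorFlow 1.x cluster spec along the following lines; this is a hedged sketch with illustrative addresses, and in this post the spec is generated for us by start_tensorflow below rather than written by hand:

import tensorflow as tf

# One entry per TensorFlow role; job names are arbitrary strings.
cluster = tf.train.ClusterSpec({
    'worker': ['192.168.100.3:2223', '192.168.100.4:2224'],
    'ps':     ['192.168.100.1:2227'],
    'scorer': ['192.168.100.2:2222'],
})

# Each process then starts one server for its own role and task index.
server = tf.train.Server(cluster, job_name='worker', task_index=0)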

We’ll use TensorFlow to do all of the actual training and scoring. We’ll use Dask to do everything else. In particular, we’re about to do the following:

  1. Prepare data with dask.array
  2. Set up TensorFlow workers as long-running tasks
  3. Feed data from Dask to TensorFlow while scores remain poor
  4. Let TensorFlow handle training using its own network

Prepare Data with Dask.array

For this toy example we’re just going to use the mnist data that comes with TensorFlow. However, we’ll artificially inflate this data by concatenating it to itself many times across a cluster:

def get_mnist():
    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets('/tmp/mnist-data', one_hot=True)
    return mnist.train.images, mnist.train.labels

import dask.array as da
from dask import delayed
from dask.distributed import Client

c = Client('dask-scheduler-address:8786')  # connect to the running Dask cluster; used by c.persist and c.compute below

datasets = [delayed(get_mnist)() for i in range(20)]  # 20 versions of same dataset
images = [d[0] for d in datasets]
labels = [d[1] for d in datasets]

images = [da.from_delayed(im, shape=(55000, 784), dtype='float32') for im in images]
labels = [da.from_delayed(la, shape=(55000, 10), dtype='float32') for la in labels]

images = da.concatenate(images, axis=0)
labels = da.concatenate(labels, axis=0)

>>> images
dask.array<concate..., shape=(1100000, 784), dtype=float32, chunksize=(55000, 784)>

images, labels = c.persist([images, labels])  # persist data in memory

This gives us a moderately large distributed array of around a million tiny images. If we wanted to we could inspect or clean up this data using normal dask.array constructs:

im = images[1].compute().reshape((28, 28))
plt.imshow(im, cmap='gray')

[Image: mnist number 3]

im = images.mean(axis=0).compute().reshape((28, 28))
plt.imshow(im, cmap='gray')

[Image: mnist mean]

im = images.var(axis=0).compute().reshape((28, 28))
plt.imshow(im, cmap='gray')

[Image: mnist var]

This shows off how one can use Dask collections to clean up and provide pre-processing and feature generation on data in parallel before sending it to TensorFlow. In our simple case we won’t actually do any of this, but it’s useful in more real-world situations.
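For example, a per-pixel standardization step could be written with ordinary dask.array operations. This is illustrative only (we skip it in this example), and the variable name and epsilon are made up:

# Illustrative preprocessing we could have applied (we don't, in this example)
mean = images.mean(axis=0)
std = images.std(axis=0)
images_standardized = (images - mean) / (std + 1e-8)  # small epsilon avoids division by zero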

Finally, after doing our preprocessing on the distributed array of all of our data we’re going to collect images and labels together and batch them into smaller chunks. Again we use some dask.array constructs and dask.delayed when things get messy.

images = images.rechunk((10000, 784))
labels = labels.rechunk((10000, 10))

images = images.to_delayed().flatten().tolist()
labels = labels.to_delayed().flatten().tolist()
batches = [delayed([im, la]) for im, la in zip(images, labels)]

batches = c.compute(batches)

Now we have a few hundred pairs of NumPy arrays in distributed memory waiting to be sent to a TensorFlow worker.
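As a quick sanity check (a hedged sketch), each element of batches is a Future that resolves to an [images, labels] pair of in-memory NumPy arrays sized according to the rechunking above:

im, la = batches[0].result()
im.shape, la.shape    # ((10000, 784), (10000, 10)) given the rechunking above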

Setting up TensorFlow workers alongside Dask workers

Dask workers are just normal Python processes. TensorFlow can launch itself from a normal Python process. We’ve made a small function here that launches TensorFlow servers alongside Dask workers using Dask’s ability to run long-running tasks and maintain user-defined state. All together, this is about 80 lines of code (including comments and docstrings) and allows us to define our TensorFlow network on top of Dask as follows:

$ pip install git+https://github.com/mrocklin/dask-tensorflow

from dask.distributed import Client  # we already had this above
c = Client('dask-scheduler-address:8786')

from dask_tensorflow import start_tensorflow
tf_spec, dask_spec = start_tensorflow(c, ps=1, worker=4, scorer=1)

>>> tf_spec.as_dict()
{'ps': ['192.168.100.1:2227'],
 'scorer': ['192.168.100.2:2222'],
 'worker': ['192.168.100.3:2223',
            '192.168.100.4:2224',
            '192.168.100.5:2225',
            '192.168.100.6:2226']}

>>> dask_spec
{'ps': ['tcp://192.168.100.1:34471'],
 'scorer': ['tcp://192.168.100.2:40623'],
 'worker': ['tcp://192.168.100.3:33075',
            'tcp://192.168.100.4:37123',
            'tcp://192.168.100.5:32839',
            'tcp://192.168.100.6:36822']}

This starts three groups of TensorFlow servers in the Dask worker processes. TensorFlow will manage its own communication but co-exist right alongside Dask in the same machines and in the same shared memory spaces (note that in the specs above the IP addresses match but the ports differ).

This also sets up a normal Python queue along which Dask can safely send information to TensorFlow. This is how we’ll send those batches of training data between the two services.
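The snippet below is not the dask-tensorflow source, just a rough sketch of the idea: a long-running task executed on each Dask worker attaches a TensorFlow server and a plain Python queue to the worker object, where later tasks running on that same worker can find them. The function name here is hypothetical:

from queue import Queue

import tensorflow as tf
from distributed.worker_client import get_worker

def start_tensorflow_on_worker(cluster_dict, job_name, task_index):
    # Runs as a long-running task on one Dask worker
    worker = get_worker()
    cluster = tf.train.ClusterSpec(cluster_dict)
    worker.tensorflow_server = tf.train.Server(cluster,
                                               job_name=job_name,
                                               task_index=task_index)
    worker.tensorflow_queue = Queue()  # Dask tasks put training batches here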

Define TensorFlow Model and Distribute Roles

Now is the part of the blogpost where my expertise wanes. I’m just going to copy-paste-and-modify a canned example from the TensorFlow documentation. This is a simplistic model for this problem and it’s entirely possible that I’m making transcription errors. But still, it should get the point across. You can safely ignore most of this code. Dask stuff gets interesting again towards the bottom:

import math
import tempfile
import time
from queue import Empty

import tensorflow as tf
# local_client comes from the distributed release current at the time of writing
from distributed.worker_client import local_client

IMAGE_PIXELS = 28
hidden_units = 100
learning_rate = 0.01
sync_replicas = False
num_workers = len(dask_spec['worker'])
replicas_to_aggregate = len(dask_spec['worker'])

def model(server):
    worker_device = "/job:%s/task:%d" % (server.server_def.job_name,
                                         server.server_def.task_index)
    task_index = server.server_def.task_index
    is_chief = task_index == 0

    with tf.device(tf.train.replica_device_setter(
                      worker_device=worker_device,
                      ps_device="/job:ps/cpu:0",
                      cluster=tf_spec)):

        global_step = tf.Variable(0, name="global_step", trainable=False)

        # Variables of the hidden layer
        hid_w = tf.Variable(
            tf.truncated_normal(
                [IMAGE_PIXELS * IMAGE_PIXELS, hidden_units],
                stddev=1.0 / IMAGE_PIXELS),
            name="hid_w")
        hid_b = tf.Variable(tf.zeros([hidden_units]), name="hid_b")

        # Variables of the softmax layer
        sm_w = tf.Variable(
            tf.truncated_normal(
                [hidden_units, 10],
                stddev=1.0 / math.sqrt(hidden_units)),
            name="sm_w")
        sm_b = tf.Variable(tf.zeros([10]), name="sm_b")

        # Ops: located on the worker specified with task_index
        x = tf.placeholder(tf.float32, [None, IMAGE_PIXELS * IMAGE_PIXELS])
        y_ = tf.placeholder(tf.float32, [None, 10])

        hid_lin = tf.nn.xw_plus_b(x, hid_w, hid_b)
        hid = tf.nn.relu(hid_lin)

        y = tf.nn.softmax(tf.nn.xw_plus_b(hid, sm_w, sm_b))
        cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))

        opt = tf.train.AdamOptimizer(learning_rate)

        if sync_replicas:
            if replicas_to_aggregate is None:
                replicas_to_aggregate = num_workers
            else:
                replicas_to_aggregate = replicas_to_aggregate

            opt = tf.train.SyncReplicasOptimizer(
                      opt,
                      replicas_to_aggregate=replicas_to_aggregate,
                      total_num_replicas=num_workers,
                      name="mnist_sync_replicas")

        train_step = opt.minimize(cross_entropy, global_step=global_step)

        if sync_replicas:
            local_init_op = opt.local_step_init_op
            if is_chief:
                local_init_op = opt.chief_init_op

            ready_for_local_init_op = opt.ready_for_local_init_op

            # Initial token and chief queue runners required by the sync_replicas mode
            chief_queue_runner = opt.get_chief_queue_runner()
            sync_init_op = opt.get_init_tokens_op()

        init_op = tf.global_variables_initializer()
        train_dir = tempfile.mkdtemp()

        if sync_replicas:
          sv = tf.train.Supervisor(
              is_chief=is_chief,
              logdir=train_dir,
              init_op=init_op,
              local_init_op=local_init_op,
              ready_for_local_init_op=ready_for_local_init_op,
              recovery_wait_secs=1,
              global_step=global_step)
        else:
          sv = tf.train.Supervisor(
              is_chief=is_chief,
              logdir=train_dir,
              init_op=init_op,
              recovery_wait_secs=1,
              global_step=global_step)

        sess_config = tf.ConfigProto(
            allow_soft_placement=True,
            log_device_placement=False,
            device_filters=["/job:ps", "/job:worker/task:%d" % task_index])

        # The chief worker (task_index==0) session will prepare the session,
        # while the remaining workers will wait for the preparation to complete.
        if is_chief:
          print("Worker %d: Initializing session..." % task_index)
        else:
          print("Worker %d: Waiting for session to be initialized..." %
                task_index)

        sess = sv.prepare_or_wait_for_session(server.target, config=sess_config)

        if sync_replicas and is_chief:
          # Chief worker will start the chief queue runner and call the init op.
          sess.run(sync_init_op)
          sv.start_queue_runners(sess, [chief_queue_runner])

        return sess, x, y_, train_step, global_step, cross_entropy


def ps_task():
    with local_client() as c:
        c.worker.tensorflow_server.join()


def scoring_task():
    with local_client() as c:
        # Scores Channel
        scores = c.channel('scores', maxlen=10)

        # Make Model
        server = c.worker.tensorflow_server
        sess, x, y_, _, _, cross_entropy = model(server)

        # Testing Data
        from tensorflow.examples.tutorials.mnist import input_data
        mnist = input_data.read_data_sets('/tmp/mnist-data', one_hot=True)
        test_data = {x: mnist.validation.images,
                     y_: mnist.validation.labels}

        # Main Loop
        while True:
            score = sess.run(cross_entropy, feed_dict=test_data)
            scores.append(float(score))

            time.sleep(1)


def worker_task():
    with local_client() as c:
        scores = c.channel('scores')
        num_workers = replicas_to_aggregate = len(dask_spec['worker'])

        server = c.worker.tensorflow_server
        queue = c.worker.tensorflow_queue

        # Make model
        sess, x, y_, train_step, global_step, _ = model(c.worker.tensorflow_server)

        # Main loop
        while not scores.data or scores.data[-1] > 1000:
            try:
                batch = queue.get(timeout=0.5)
            except Empty:
                continue

            train_data = {x: batch[0],
                          y_: batch[1]}

            sess.run([train_step, global_step], feed_dict=train_data)

The last three functions defined here, ps_task, scoring_task, and worker_task, are the functions that we want to run on each of our three groups of TensorFlow server types. The parameter server task just starts a long-running task and passively joins the TensorFlow network:

def ps_task():
    with local_client() as c:
        c.worker.tensorflow_server.join()

The scorer task opens up an inter-worker channel of communication named “scores”, creates the TensorFlow model, then every second scores the current state of the model against validation data. It reports the score on the inter-worker channel:

def scoring_task():
    with local_client() as c:
        scores = c.channel('scores')  #  inter-worker channel

        # Make Model
        sess, x, y_, _, _, cross_entropy = model(c.worker.tensorflow_server)

        ...

        while True:
            score = sess.run(cross_entropy, feed_dict=test_data)
            scores.append(float(score))
            time.sleep(1)

The worker task makes the model, listens on the Dask-TensorFlow Queue for new training data, and continues training until the last reported score is good enough.

def worker_task():
    with local_client() as c:
        scores = c.channel('scores')

        queue = c.worker.tensorflow_queue

        # Make model
        sess, x, y_, train_step, global_step, _ = model(c.worker.tensorflow_server)

        while not scores.data or scores.data[-1] > 1000:
            batch = queue.get()

            train_data = {x: batch[0],
                          y_: batch[1]}

            sess.run([train_step, global_step], feed_dict=train_data)

We launch these tasks on the Dask workers that have the corresponding TensorFlow servers (see tf_spec and dask_spec above):

ps_tasks = [c.submit(ps_task, workers=worker)
            for worker in dask_spec['ps']]

worker_tasks = [c.submit(worker_task, workers=addr, pure=False)
                for addr in dask_spec['worker']]

scorer_task = c.submit(scoring_task, workers=dask_spec['scorer'][0])

This starts long-running tasks that just sit there, waiting for external stimulation:

[Image: long-running TensorFlow tasks]

Finally we construct a function to dump each of our batches of data from our Dask.array (from the very beginning of this post) into the Dask-TensorFlow queues on our workers. We make sure to only run these tasks where the Dask-worker has a corresponding TensorFlow training worker:

from distributed.worker_client import get_worker

def transfer_dask_to_tensorflow(batch):
    worker = get_worker()
    worker.tensorflow_queue.put(batch)

dump = c.map(transfer_dask_to_tensorflow, batches,
             workers=dask_spec['worker'], pure=False)

If we want to we can track progress in our local session by subscribing to the same inter-worker channel:

scores = c.channel('scores')
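The channel's .data attribute, used throughout the worker code above, is a deque-like collection of the reported scores, so (as a small sketch) we can peek at or plot the recent history locally:

list(scores.data)[-5:]   # the last few cross-entropy values reported by the scorer

import matplotlib.pyplot as plt
plt.plot(list(scores.data))
plt.xlabel('scoring iteration')
plt.ylabel('validation cross entropy')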

We can use this to repeatedly dump data into the workers until they converge.

from dask.distributed import wait

while scores.data[-1] > 1000:
    dump = c.map(transfer_dask_to_tensorflow, batches,
                 workers=dask_spec['worker'], pure=False)
    wait(dump)

Conclusion

We discussed a non-trivial way to use TensorFlow to accomplish distributed machine learning. We used Dask to support TensorFlow in a few ways:

  1. Trivially setup the TensorFlow network
  2. Prepare and clean data
  3. Coordinate progress and stopping criteria

We found it convenient that Dask and TensorFlow could play nicely with each other. Dask supported TensorFlow without getting in the way. The fact that both libraries play nicely within Python and the greater PyData stack (NumPy/Pandas) makes it trivial to move data between them without costly or complex tricks.

Additionally, we didn’t have to work to integrate these two systems. There is no need for a separate collaborative effort to integrate Dask and TensorFlow at a core level. Instead, they are designed in such a way so as to foster this type of interaction without special attention or effort.

This is also the first blogpost that I’ve written that, from a Dask perspective, uses some more complex features like long running tasks or publishing state between workers with channels. These more advanced features are invaluable when creating more complex/bespoke parallel computing systems, such as are often found within companies.

What we could have done better

From a deep learning perspective this example is both elementary and incomplete. It would have been nice to train on a dataset that was larger and more complex than MNIST. Also it would be nice to see the effects of training over time and the performance of using different numbers of workers. In defense of this blogpost I can only claim that Dask shouldn’t affect any of these scaling results, because TensorFlow is entirely in control at these stages and TensorFlow already has plenty of published scaling information.

Generally speaking though, this experiment was done in a weekend afternoon and the blogpost was written in a few hours shortly afterwards. If anyone is interested in performing and publishing about a more serious distributed deep learning experiment with TensorFlow and Dask I would be happy to support them on the Dask side. I think that there is plenty to learn here about best practices.

Acknowledgements

The following individuals contributed to the construction of this blogpost:

  • Stephan Hoyer contributed with conversations about how TensorFlow is used in practice and with concrete experience on deployment.
  • Will Warner and Erik Welch both provided valuable editing and language recommendations.

February 11, 2017 12:00 AM

February 09, 2017

Enthought

New Year, New Enthought Products!

We’ve had a number of major product development efforts underway over the last year, and we’re pleased to share a lot of new announcements for 2017:

A New Chapter for the Enthought Python Distribution (EPD):
Python 3 and Intel MKL 2017

In 2004, Enthought released the first “Python: Enthought Edition,” a Python package distribution tailored for a scientific and analytic audience. In 2008 this became the Enthought Python Distribution (EPD), a self-contained installer with the "enpkg" command-line tool to update and manage packages. Since then, over a million users have benefited from Enthought’s tested, pre-compiled set of Python packages, allowing them to focus on their science by eliminating the hassle of setting up tools.


Fast forward to 2017, and we now offer over 450 Python packages and a new era for the Enthought Python Distribution: access to all of the packages in the new EPD is completely free to all users, and it includes packages and runtimes for both Python 2 and Python 3 with some exciting new additions. Our ever-growing list of packages includes, for example, the 2017 release of the MKL (Math Kernel Library), the fruit of an ongoing collaboration with Intel.

The New Enthought Deployment Server:
Secure, Onsite Access to EPD and Private Packages


For those who are interested in having a private copy of the Enthought Python Distribution behind their firewall, as well as the ability to upload and manage internal private packages alongside it, we now offer the Enthought Deployment Server, an onsite version of the server we have been using for years to serve millions of Python packages to our users.

With a local Enthought Deployment Server, your private copy will periodically synchronize with our master repository, on a schedule of your choosing, to keep you up to date with the latest releases. You can also set up private package repositories and control access to them using your existing LDAP or Active Directory service in a way that suits your organization.  We can even give you access to the packages (and their historical versions) inside of air-gapped networks! See our webinar introducing the Enthought Deployment Server.

Command Line Access to the New EPD and Flat Environments
via the Enthought Deployment Manager (EDM)

In 2013, we expanded the original EPD to introduce Enthought Canopy, coupling an integrated analysis environment with additional features such as a graphical package manager, documentation browser, and other user-friendly tools together with the Enthought Python Distribution to provide even more features to help “make science and analysis easy.”

With its MATLAB-like experience, Canopy has enabled countless engineers, scientists and analysts to perform sophisticated analysis, build models, and create cutting-edge data science algorithms. The all-in-one analysis platform for Python has also been widely adopted in organizations who want to provide a single, unified platform that can be used by everyone from data analysts to software engineers.

But we heard from a number of you that you also still wanted the capability to have flat, standalone environments not coupled to any editor or graphical tool. And we listened!  

So last year, we finished building out our next-generation command-line tool that makes producing flat, standalone Python environments super easy.  We call it the Enthought Deployment Manager (or EDM for short), because it’s a tool to quickly deploy one or multiple Python environments with full control over package versions and runtime environments.

EDM is also a valuable tool for use cases such as command line deployment on local machines or servers, web application deployment on AWS using Ansible and Amazon CloudFormation, rapid environment setup on continuous integration systems such as Travis-CI, Appveyor, or Jenkins/TeamCity, and more.

Finally, a new state-of-the-art package dependency solver included in the tool guarantees the consistency of your environment, and if your workflow requires switching between different environments, its sandboxed architecture makes it a snap to switch contexts.  All of this has also been designed with a focus on providing robust backward compatibility to our customers over time.  Find out more about EDM here.

Enthought Canopy 2.0:
Python 3 packages and New EDM Back End Infrastructure

The new Enthought Python Distribution (EPD) and Enthought Deployment Manager (EDM) will also provide additional benefits for Canopy.  Canopy 2.0 is just around the corner, which will be the first version to include Python 3 packages from EPD.

In addition, we have re-worked Canopy’s graphical package manager to use EDM as its back end, to take advantage of both the consistency and stability of the environments EDM provides, as well as its new package dependency solver.  By itself, this will provide a big boost in stability for users (ever found yourself wrapped up in a tangle of inconsistent package versions?).  Alongside the conversion of Canopy’s back end infrastructure to EDM, we have also included a substantial number of stability improvements and bug fixes.

Canopy’s Graphical Debugger adds external IPython kernel debugging support

On the integrated analysis environment side of Canopy, the graphical debugger and variable browser, first introduced in 2015, has gotten some nifty new features, including the ability to connect to and debug an external IPython kernel, in addition to a number of stability improvements.  (Weren’t aware you could connect to an external process?  Look for the context menu in the IPython console, use it to connect to the IPython kernel running, say, a Jupyter notebook, and debug away!)

Canopy Data Import Tool adds CSV exports and input file templates

Also, we’ve continued to add new features to the Canopy Data Import Tool since its initial release in May of 2016. The Data Import Tool allows users to quickly and easily import CSVs and other structured text files into Pandas DataFrames through a graphical interface, manipulate the data, and create reusable Python scripts to speed future data wrangling.

The latest version of the tool (v. 1.0.9, shipping with Canopy 2.0) has some nice new features like CSV exporting, input file templates, and more. See Enthought’s blog for some great examples of how the Data Import Tool speeds data loading, wrangling and analysis.

What to Look Forward to in 2017

So where are we headed in 2017?  We have put a lot of effort into building a strong foundation with our core suite of products, and now we’re focused on continuing to deliver new value (our enterprise users in particular have a number of new features to look forward to).  First up, for example, you can look for expanded capabilities around Python environments, making it easy to manage multiple environments, or even standardize and distribute them in your organization.  With the tremendous advancements in our core products that took place in 2016, there are a lot of follow-on features we can deliver. Stay tuned for updates!

Have a specific feature you’d like to see in one of Enthought’s products? E-mail our product team at canopy.support@enthought.com and tell us about it!

The post New Year, New Enthought Products! appeared first on Enthought Blog.

by Tim Diller at February 09, 2017 09:49 PM

Travis Oliphant

NumFOCUS past and future.

NumFOCUS just finished its 5th year of operations, and I've lately been reflective on the early days and some of the struggles we went through to get the organization started.  It once was just an idea in a few community-minded developer's heads and now exists as an important non-profit Foundation for Open Data Science, democratic and reproducible discovery, and a champion for technical progress through diversity.

When Peter Wang and I started Continuum in early 2012, I had already started the ball rolling to create NumFOCUS.  I knew that we needed to create a non-profit that would provide leadership and be a focus of community activity outside of any one company.  I strongly believe that for open-source to thrive, full-time attention needs to be paid to it by many people.  This requires money.  With the tremendous interest in and explosion around the NumPy community, it was clear to me that this federation of loosely-coupled people needed some kind of organization that could be community-led and could be a rallying point for community activity and community-led financing.  NumFOCUS also has the potential to act as a source of community-based accountability, encouraging positively reinforcing behavior in the open-source communities it intersects with.

In late 2011, I started a new mailing list and invited anyone interested in discussing the idea of an independent community-run organization to the list.  Over 100 people responded and so I knew there was interest.    We debated on that list what to call the new concept for several weeks and Anthony Scopatz's name "NumFOCUS" stuck as the best alternative over several other names.   As an acronym, NumFOCUS could mean Numerical Foundation for Open Code and Usable Science.   I created a new mailing list, and then set about creating the legal organization called NumFOCUS and filing necessary paperwork.

In December of 2011, I coordinated with Fernando Perez, Perry Greenfield, John Hunter, and Jarrod Millman, who had all expressed some interest in the idea, and we incorporated in Texas (using LegalZoom) and became the first board of NumFOCUS.  We had a very simple set of bylaws and purposes, all centered around making Science more accessible.   We decided to meet every other week.   We all knew we were creating something that would last a long time.


In early 2012, I wanted to ensure NumFOCUS's success and knew that it needed a strong, full-time Executive Director to make that happen.  The problem was NumFOCUS didn't have a lot of money. A few of the board members had made donations, but Continuum, with its own limited means, was funding the majority of the costs for getting NumFOCUS started.   With the legal organization started, I created bank accounts and set up the ability for people to donate to NumFOCUS, with help from Anthony Scopatz, who was the first treasurer of NumFOCUS.

I had met Leah Silen through other community interactions in Austin back in 2007.  I knew her to be a very capable and committed person and thought she might be available.  I asked her if she would come aboard and be employed by Continuum but work full-time for NumFOCUS and the new board. She accepted and the organization of NumFOCUS began to improve immediately.

With her help, we transitioned the organization from LegalZoom's beginnings to register directly with the secretary of state in Texas and started the application process to become a 501(c)(3).   She also quickly became involved in organizing the PyData conferences, which Continuum initially spearheaded along with Julie Steele and Edd Wilder-James (at the time from O'Reilly).   In 2012, we had our first successful PyData conference at the Googleplex in Mountain View.  It was clear that PyData could be used as a mechanism to provide revenue for NumFOCUS (at least to support Leah and other administrative help).


We began working under that model through 2013 and 2014 with Continuum initially spending a lot of human resources and money organizing and running PyData with any proceeds going directly to NumFOCUS.   There were no proceeds in those years except enough to help pay for Leah's salary.   The rest of Leah's salary and PyData expenses came from Continuum which itself was still a small startup.

During these years of PyData growth in communities around the world, James Powell became a drumbeat of consistency and community engagement.  He has paid his own way to nearly every PyData event throughout the world.  He has acted as emcee, volunteer extraordinaire, and popular speaker with his clever implementations and explanations of the Python stack.


Andy Terrel had been a friend of NumFOCUS and a member of the community and active with the board from its beginning.  In 2014, while working at Continuum, he took over my board seat.  In that capacity, he worked hard to gain financial independence for NumFOCUS.  He was instrumental in moving PyData fully to NumFOCUS management. I was comfortable stepping back from the board and stepping down in my involvement around organizing and backing PyData from a financial perspective because I trusted Andy's leadership and non-profit management instincts. He, James Powell, Leah, and all the other local PyData meetups and organizations world-wide have done an impressive thing in self-organizing and growing the community. We should all be grateful for their efforts.


I am very proud of the work I did to help start NumFOCUS and PyData. I hope to remember it as one of the most useful things I've done professionally. I am very grateful for all the others who also helped to create NumFOCUS as well as PyData. So many have worked hard to ensure it can be a worldwide and community-governed organization to support Open Data Science for a long time to come. I'm proud of the funding and people-time that Continuum provided to get NumFOCUS and PyData started as well as the on-going support of NumFOCUS that Continuum and other industry partners continue to provide.

Now, as an adviser to the organization, I get to hear from time to time how things are going. I'm very impressed at the progress being made by the dedication of the current leadership behind Andy Terrel as President and Leah Silen as Executive Director and the rest of the current board.

If you use or appreciate any of the tools in the Open Data Science that NumFOCUS sponsors, I encourage you to join and/or make a supporting donation here:  http://www.numfocus.org/support-numfocus.html.  Help NumFOCUS continue its mission to support the tools and communities you rely on everyday.

by Travis Oliphant (noreply@blogger.com) at February 09, 2017 09:14 PM

Continuum Analytics news

The Dominion: An Open Data Science Film

Thursday, February 9, 2017
Michele Chambers
EVP Anaconda Business Unit & CMO
Continuum Analytics

The inaugural AnacondaCON event was full of surprises: personalized Legos, delicious Texan BBQ, Anaconda “swag”, and a preview of what might be the most dramatic, life-altering data science movie of all time: The Dominion.

The Dominion tells the story of a world similar to ours: one full of possibilities, but threatened by old machines and bad code. Leo, the main character, joins forces with Matrix-like heroes to save businesses from alien-infected software. Leo and his team take it upon themselves to fight the Dominion's agents, the Macros, who want to keep the world from open source innovation. 

In an effort to defeat the Macros, they board the rebel fleet flagship, The Anaconda, loaded with data science packages and payload to help the team drop into any environment to free people from the Macros’ hindering code. 

Leo and his team believe that the world is moving to Open Data Science, and that distributed cloud data is going to liberate humanity—all they have to do is work together to make it happen. 

The question is—do you? Are you ready for the future? Board The Anaconda to begin the journey with Leo, his team and us. The time is now. 

Couldn’t make it to AnacondaCON? There’s always next year; in the meantime, watch the full trailer below. 

by swebster at February 09, 2017 03:53 PM

February 08, 2017

Thomas Wiecki

Why hierarchical models are awesome, tricky, and Bayesian

(c) 2017 by Thomas Wiecki

Hierarchical models are underappreciated. Hierarchies exist in many data sets and modeling them appropriately adds a boat load of statistical power (the common metric of statistical power). I provided an introduction to hierarchical models in a previous blog post, "Best Of Both Worlds: Hierarchical Linear Regression in PyMC3", written with Danne Elbers. See also my interview with FastForwardLabs where I touch on these points.

Here I want to focus on a common but subtle problem when trying to estimate these models and how to solve it with a simple trick. Although I had been somewhat aware of this trick for quite some time, it only recently clicked for me. We will use the same hierarchical linear regression model on the Radon data set from the previous blog post, so if you are not familiar with it, I recommend starting there.

I will then use the intuitions we've built up to highlight a subtle point about expectations vs modes (i.e. the MAP). Several talks by Michael Betancourt have really expanded my thinking here.

In [31]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm 
import pandas as pd
import theano
import seaborn as sns

sns.set_style('whitegrid')
np.random.seed(123)

data = pd.read_csv('../data/radon.csv')
data['log_radon'] = data['log_radon'].astype(theano.config.floatX)
county_names = data.county.unique()
county_idx = data.county_code.values

n_counties = len(data.county.unique())

The intuitive specification

Usually, hierarchical models are specified in a centered way. In a regression model, individual slopes would be centered around a group mean with a certain group variance, which controls the shrinkage:

In [2]:
with pm.Model() as hierarchical_model_centered:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Intercept for each county, distributed around group mean mu_a
    # Above we just set mu and sd to a fixed value while here we
    # plug in a common group distribution for all a and b (which are
    # vectors of length n_counties).
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=n_counties)

    # Slope for each county, distributed around group mean mu_b
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=n_counties)

    # Model error
    eps = pm.HalfCauchy('eps', 5)
    
    # Linear regression
    radon_est = a[county_idx] + b[county_idx] * data.floor.values
    
    # Data likelihood
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)
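Written out as math, the centered specification above reads roughly as follows (with $j$ indexing counties, $j[i]$ the county of observation $i$, $x_i$ the floor indicator, and the $\sigma$'s denoting standard deviations, matching PyMC3's sd argument):

$$a_j \sim \mathcal{N}(\mu_a, \sigma_a), \qquad b_j \sim \mathcal{N}(\mu_b, \sigma_b), \qquad \mathrm{radon}_i \sim \mathcal{N}(a_{j[i]} + b_{j[i]} \, x_i, \; \epsilon)$$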
In [3]:
# Inference button (TM)!
with hierarchical_model_centered:
    hierarchical_centered_trace = pm.sample(draws=5000, tune=1000, njobs=4)[1000:]
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -1,090.4: 100%|██████████| 200000/200000 [00:37<00:00, 5322.37it/s]
Finished [100%]: Average ELBO = -1,090.4
100%|██████████| 5000/5000 [01:15<00:00, 65.96it/s] 
In [32]:
pm.traceplot(hierarchical_centered_trace);