July 18, 2017

Enthought

Webinar: Python for Scientists & Engineers: A Tour of Enthought’s Professional Training Course

When: Thursday, July 27, 2017, 11-11:45 AM CT (Live webcast, all registrants will receive a recording)

What:  A guided walkthrough and Q&A about Enthought’s technical training course Python for Scientists & Engineers with Enthought’s VP of Training Solutions, Dr. Michael Connell

Who Should Attend: individuals, team leaders, and learning & development coordinators who are looking to better understand the options to increase professional capabilities in Python for scientific and engineering applications

REGISTER  (if you can’t attend we’ll send all registrants a recording)


“Writing software is not my job…I just have to do it every day.”  
-21st Century Scientist or Engineer

Many scientists, engineers, and analysts today find themselves writing a lot of software in their day-to-day work even though that’s not their primary job and they were never formally trained for it. Of course, there is a lot more to writing software for scientific and analytic computing than just knowing which keyword to use and where to put the semicolon.

Software for science, engineering, and analysis has to solve the technical problem it was created to solve, of course, but it also has to be efficient, readable, maintainable, extensible, and usable by other people — including the original author six months later!

It has to be designed to prevent bugs and — because all reasonably complex software contains bugs — it should be designed so as to make the inevitable bugs quickly apparent, easy to diagnose, and easy to fix. In addition, such software often has to interface with legacy code libraries written in other languages like C or C++, and it may benefit from a graphical user interface to substantially streamline repeatable workflows and make the tools available to colleagues and other stakeholders who may not be comfortable working directly with the code for whatever reason.

Enthought’s Python for Scientists and Engineers is designed to accelerate the development of skill and confidence in addressing these kinds of technical challenges using some of Python’s core capabilities and tools, including:

  • The standard Python language
  • Core tools for science, engineering, and analysis, including NumPy (the fast array programming package), Matplotlib (for data visualization), and Pandas (for data analysis); and
  • Tools for crafting well-organized and robust code, debugging, profiling performance, interfacing with other languages like C and C++, and adding graphical user interfaces (GUIs) to your applications.

In this webinar, we give you the key information and insight you need to evaluate whether Enthought’s Python for Scientists and Engineers course is the right solution to take your technical skills to the next level, including:

  • Who will benefit most from the course
  • A guided tour through the course topics
  • What skills you’ll take away from the course and how the instructional design supports them
  • What the experience is like, and why it is different from other training alternatives (with a sneak peek at actual course materials)
  • What previous course attendees say about the course

REGISTER


Presenter: Dr. Michael Connell, VP, Enthought Training Solutions

Ed.D., Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT


Python for Scientists & Engineers Training: The Quick Start Approach to Turbocharging Your Work

If you are tired of running repeatable processes manually and want to (semi-) automate them to increase your throughput and decrease pilot error, or you want to spend less time debugging code and more time writing clean code in the first place, or you are simply tired of using a multitude of tools and languages for different parts of a task and want to replace them with one comprehensive language, then Enthought’s Python for Scientists and Engineers is definitely for you!

This class has been particularly appealing to people who have been using other tools like MATLAB or even Excel for their computational work and want to start applying their skills using the Python toolset.  And it’s no wonder — Python has been identified as the most popular coding language for five years in a row for good reason.

One reason for its broad popularity is its efficiency and ease of use. Many people consider Python more fun to work in than other languages (and we agree!). Another reason for its popularity among scientists, engineers, and analysts in particular is Python’s support for rapid application development and its extensive (and growing) open source library of powerful tools for preparing, visualizing, analyzing, and modeling data, as well as for simulation.

Python is also an extraordinarily comprehensive toolset – it supports everything from interactive analysis to automation to software engineering to web app development within a single language and plays very well with other languages like C/C++ or FORTRAN so you can continue leveraging your existing code libraries written in those other languages.

Many organizations are moving to Python so they can consolidate all of their technical work streams under a single comprehensive toolset. In the first part of this class we’ll give you the fundamentals you need to switch from another language to Python and then we cover the core tools that will enable you to do in Python what you were doing with other tools, only faster and better!

Additional Resources

Upcoming Open Python for Scientists & Engineers Sessions:

Los Alamos, NM, Aug 14-18, 2017
Albuquerque, NM, Sept 11-15, 2017
Washington, DC, Sept 25-29, 2017
Los Alamos, NM, Oct 9-13, 2017
Cambridge, UK, Oct 16-20, 2017
San Diego, CA, Oct 30-Nov 3, 2017
Albuquerque, NM, Nov 13-17, 2017
Los Alamos, NM, Dec 4-8, 2017
Austin, TX, Dec 11-15, 2017

Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.

Learn More

Download Enthought’s Machine Learning with Python’s Scikit-Learn Cheat Sheets
Additional Webinars in the Training Series:

Python for Data Science: A Tour of Enthought’s Professional Technical Training Course

Python for Professionals: The Complete Guide to Enthought’s Technical Training Courses

An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop

Download Enthought’s Pandas Cheat Sheets

The post Webinar: Python for Scientists & Engineers: A Tour of Enthought’s Professional Training Course appeared first on Enthought Blog.

by cgodshall at July 18, 2017 10:20 PM

Pierre de Buyl

Developing a Cython library

For some time, I have used Cython to accelerate parts of Python programs. One stumbling block in going from Python/NumPy code to Cython code is the fact that one cannot access NumPy's random number generator from Cython without explicit declaration. Here, I give the steps to make a pip-installable 'cimportable' module, using the threefry random number generator as an example.

(Note: a draft of this article appeared online in June 2017 by mistake; this version is complete.)

The aim

The aim is that, starting with a Python code reading

import numpy as np

N = 100
x = 0
for i in range(N):
    x = x + np.random.normal()

one can end up with a very similar Cython code

cimport numpy as np

cdef int i, N
cdef double x
N = 100
x = 0
for i in range(N):
    x = x + np.random.normal()

The obvious benefit is that the same module is used for the random number generator (RNG), with a simple interface.

This is impossible with the current state of NumPy, even though there is work in that direction (ng-numpy-randomstate). This post is still relevant for other contexts where Cython is involved anyway.

The challenge

Building a c-importable module just depends on having a corresponding .pxd file available in the path. The idea behind .pxd files is that they contain C-level (or cdef level) declarations whereas the implementation goes in the .pyx file with the same basename.

A consequence of this is that Python-level (regular def) functions do not appear in the .pxd file but only in the .pyx file, and so cannot be cimported from another Cython file. They can, of course, be Python imported.

The challenge lies in a proper organization of these different parts and of a seamless packaging and installation via pip.

Organization of the module

The module is named threefry after the corresponding Threefry RNG from Random123. It contains my implementation of the RNG as a C library and a Cython wrapper around it.

I review below the steps that I found via the documentation and quite a lot of trial and error.

Enable cimporting

To enable the use of the Cython wrapper from other Cython code, it is necessary to write a .pxd file; see the documentation on Sharing Declarations. .pxd files can exist on their own, but in the present situation we will use one with the same base name as the .pyx file. This way, the .pxd file is automatically read by Cython when compiling the extension; it is as if its content were written in the .pyx file itself.

The .pxd file can only contain plain C, cdef, or cpdef declarations; pure Python declarations must go in the .pyx file.

Note: The .pxd file must be packaged with the final module, see below.

The file threefry.pxd contains the following declarations

from libc.stdint cimport uint64_t

cdef extern from "threefry.h":
    ...

cdef class rng:
    ...

meaning that the extension type threefry.rng will be accessible via a cimport from other modules. The implementation is stored in threefry.pyx.

With the aim of hiding the implementation details, I wrote a __init__.pxd file containing the following:

from threefry.threefry cimport rng

so that the user code looks like

cimport threefry
cdef threefry.rng r = threefry.rng(seed)

and I am free to refactor the code later if I wish to do so.

Compilation information

To cimport my module, there is one more critical step: providing the needed compiler flag for the C declarations, that is, the include path for threefry.h (which must be read when compiling user code).

For this purpose, I define a utility routine get_include that can be called from the user's setup.py file as:

from setuptools import setup, Extension
from Cython.Build import cythonize
import threefry

setup(
    ext_modules=cythonize(Extension('use_threefry', ["use_threefry.pyx"], include_dirs=[threefry.get_include()]))
)

Note: the argument include_dirs is given to Extension and not to cythonize.
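A get_include helper can be as small as returning the directory that contains the installed header. Here is a minimal sketch, assuming threefry.h ships inside the threefry package directory (the actual implementation in the repository may differ):

# threefry/__init__.py -- hypothetical sketch, not necessarily the author's code
import os

def get_include():
    """Return the directory containing threefry.h, for use as an Extension include path."""
    return os.path.dirname(os.path.abspath(__file__))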

Packaging

The .h and .pxd files must be added via the package_data argument to setup.
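For illustration, a minimal sketch of what the library's own setup.py could look like, assuming the layout described above (the actual file in the repository may differ):

from setuptools import setup, Extension
from Cython.Build import cythonize

setup(
    name='threefry',
    packages=['threefry'],
    ext_modules=cythonize(
        [Extension('threefry.threefry', ['threefry/threefry.pyx'],
                   include_dirs=['threefry'])]),
    # ship the declaration and header files so that user code can cimport the module
    package_data={'threefry': ['*.pxd', '*.h']},
)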

Wrapping up

In short, to make a cimport-able module

  1. Move the shared declarations to a .pxd file.
  2. The implementation goes in the .pyx file, that will be installed as a compiled module.
  3. The .pxd and .h files must be added to package_data.
  4. A convenient way to obtain the include directories must be added.

All of this can be found in my random number generator package https://github.com/pdebuyl/threefry

The algorithm is from Salmon et al.'s paper Parallel Random Numbers: As Easy as 1, 2, 3; their code is distributed at random123. I wrote about it earlier in a blog post.

by Pierre de Buyl at July 18, 2017 09:00 AM

Matthew Rocklin

Scikit-Image and Dask Performance

This weekend at the SciPy 2017 sprints I worked alongside Scikit-image developers to investigate parallelizing scikit-image with Dask.

Here is a notebook of our work.

July 18, 2017 12:00 AM

July 17, 2017

numfocus

NumFOCUS Projects at SciPy 2017

SciPy 2017 is a wrap! As you’d expect, we had lots of participation by NumFOCUS sponsored projects. Here’s a collection of links to talks given by our projects:   Tutorials Software Carpentry Scientific Python Course Tutorial by Maxim Belkin (Part 1) (Part 2) The Jupyter Interactive Widget Ecosystem Tutorial by Matt Craig, Sylvain Corlay, & Jason […]

by Gina Helfrich at July 17, 2017 10:21 PM

July 12, 2017

Enthought

Webinar: A Tour of Enthought’s Latest Enterprise Python Solutions

When: Thursday, July 20, 2017, 11-11:45 AM CT (Live webcast)

What: A comprehensive overview and live demonstration of Enthought’s latest tools for Python for the enterprise with Enthought’s Chief Technical & Engineering Officer, Didrik Pinte

Who Should Attend: Python users (or those supporting Python users) who are looking for a universal solution set that is reliable and “just works”; scientists, engineers, and data science teams trying to answer the question “how can I more easily build and deploy my applications”; organizations looking for an alternative to MATLAB that is cost-effective, robust, and powerful

REGISTER  (if you can’t attend we’ll send all registrants a recording)


For over 15 years, Enthought has been empowering scientists, engineers, analysts, and data scientists to create amazing new technologies, to make new discoveries, and to do so faster and more effectively than they dreamed possible. Along the way, hand in hand with our customers in aerospace, biotechnology, finance, oil and gas, manufacturing, national laboratories, and more, we’ve continued to “build the science tools we wished we had,” and share them with the world.

For 2017, we’re pleased to announce the release of several major new products and tools, specifically designed to make Python more powerful and accessible for users like you who are building the future of science, engineering, artificial intelligence, and data analysis.

WHAT YOU’LL SEE IN THE WEBINAR

In this webinar, Enthought’s Chief Technical & Engineering Officer will share a comprehensive overview and live demonstration of Enthought’s latest products and how they provide the foundation for scientific computing and artificial intelligence applications with Python, including:

We’ll also walk through  specific use cases so you can quickly see how Enthought’s Enterprise Python tools can impact your workflows and productivity.

REGISTER  (if you can’t attend we’ll send all registrants a recording)


Presenter: Didrik Pinte, Chief Technical & Engineering Officer, Enthought

 

 

 

Related Blogs:

Blog: Enthought Announces Canopy 2.1: A Major Milestone Release for the Python Analysis Environment and Package Distribution (June 2017)

Blog: Enthought Presents the Canopy Platform at the 2017 American Institute of Chemical Engineers (AIChE) Spring Meeting (April 2017)

Blog: New Year, New Enthought Products (Jan 2017)

Product pages:

The post Webinar: A Tour of Enthought’s Latest Enterprise Python Solutions appeared first on Enthought Blog.

by admin at July 12, 2017 03:27 PM

Continuum Analytics news

Continuum Analytics Named a 2017 Gartner Cool Vendor in Data Science and Machine Learning

Thursday, July 13, 2017

Data Science and AI platform, Anaconda, empowers leading businesses worldwide with solutions to transform data into intelligence 

AUSTIN, Texas—July 13, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading data science and AI platform powered by Python, today announced it has been included in the “Cool Vendors in Data Science and Machine Learning, 2017” report by Gartner, Inc.

“We believe the addition of machine learning to Gartner’s Hype Cycle for Emerging Technologies in 2016 highlights the growing importance of data science across the enterprise,” said Scott Collison, chief executive officer of Continuum Analytics. “Data science has shifted from ‘emerging’ to ‘established’ and we’re seeing this evolution first-hand as Anaconda’s active user base of four million continues to grow. We are enabling future innovations; solving some of the world’s biggest challenges and uncovering answers to questions that haven’t even been asked yet.” 

Continuum Analytics recently released its newest version, Anaconda 4.4, featuring a comprehensive platform for Python-centric data science with a single-click installer for Windows, Mac, Linux and Power8. Anaconda 4.4 is also designed to make it easy to work with both Python 2 and Python 3 code. 

Gartner is the world's leading information technology research and advisory company. You can find the full report on Gartner’s site: https://www.gartner.com/document/3706738

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

About Anaconda Powered by Continuum Analytics

Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads and 4 million unique users to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Learn more at continuum.io.

by swebster at July 12, 2017 02:18 PM

July 10, 2017

Continuum Analytics news

Anaconda Expert to Discuss Data Science Best Practices at MIT CDOIQ Symposium

Tuesday, July 11, 2017

CAMBRIDGE, MA—July 11, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that Solutions Architect Zach Carwile will speak at the 2017 MIT Chief Data Officer and Information Quality (MIT CDOIQ) Symposium on July 13 at 4:30pm EST. As CDOs and IQ Professionals take their place as the central players in the business of data, the MIT CDOIQ Symposium brings the brightest industry minds together to discuss and advance the big data landscape.

In his session, titled “Data Science is Just the First Step…,” Carwile will explore how organizations can empower their data science teams to build enterprise-grade data products for the analysts who drive business processes. Carwile will also discuss the benefits of containerization for data science projects and explain how establishing a robust deployment process maximizes the value and reach of data science investments.

WHO: Zach Carwile, solutions architect, Anaconda Powered By Continuum Analytics
WHAT: “Data Science is Just the First Step…”
WHEN: July 13, 4:30–5:10pm. EST
WHERE: Massachusetts Institute of Technology, Tang Building (Room E51), MIT East Campus, 70 Memorial Drive, Cambridge, MA, USA 02139
REGISTER: HERE

###

About Anaconda Powered by Continuum Analytics
Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io.

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

by swebster at July 10, 2017 06:23 PM

Paul Ivanov

SciPy 2017 tips

After missing it for a couple of years, I am happy to be back in Austin, TX for SciPy this week!

Always invigorating and exhilarating, Scientific Computing with Python (SciPy) has remained a top quality venue for getting together with fellow Pythonistas, especially the academically-bent variety.

As a graduate student eight years ago, I was fortunate enough to receive sponsorship and attend my first SciPy - SciPy 2009. This was the last time it was held at CalTech in Pasadena, CA.

The following year, in 2010, at the first SciPy held in its now usual spot in Austin, TX, each attendee got a bottle of delicious salsa!

SciPy2010 Salsa Stack

Here are some of my thoughts about attending this wonderful conference.

Conference Tips

bring a sweatshirt -- Yes, I know Austin's hot, but at the AT&T center, they don't mess around and crank the air conditioning all the way up to 11!

join the slack group -- This year, there's a Slack group for SciPy: the link to join is in a pair of emails with the titles "Getting the most out of SciPy2017" and "Getting the most out of SciPy2017-UPDATED", both from SciPy2017 Organizers. So far at the tutorials, Slack has served as a useful back channel for communicating repo URLs and specific commands to run, and for signaling questions without interrupting the speakers' flow.

engage with others during the breaks, lunch, etc -- There are lots of tool authors here and we love chatting with users (and helping you become contributors and authors yourselves). Not your first SciPy and feeling "in-your-element"? Make the effort to invite others into the conversations and lunch outings you're having with old friends - we're all here because we care about this stuff.

take introvert breaks (and be considerate of others who may be doing the same) - I'm an introvert. Though I enjoy interacting with others (one-on-one or in small groups is best for me), it takes a lot of energy and at some point, I run out of steam. That's when I go for a walk, stepping away from the commotion to just have some quiet time.

be kind to yourself (especially at the sprints) -- Between the tutorials, all of the talks, and the sprints that follow, there will be a flurry of activity. Conferences are already draining enough without trying to get any work done, just meeting a bunch of new people and taking in a lot of information. It took a lot of false starts for me to have productive work output at sprints, but the best thing I've learned about them is to just let go of trying to get a lot done. Instead, try to get something small and well defined done or just help others.

Stuff to do in Austin

The conference already has a great list of Things to do in Austin, as well as Restaurants, so I'll just mention a few of my personal favorites.

Barton Springs Pool. Take a nice dip in the cool waters, and grab a delicious bite from one of the food trucks at The Picnic food truck park.

Go see the bats. The Congress Ave bridge in Austin is home to the largest urban bat colony in the world. You can read more about this here, but the short version is that around sunset (8-9pm), large numbers of bats stream out from underneath the bridge to go feed on insects. Some days, they leave in waves (this Saturday there were two waves; the first was smaller, but many people left thinking that was the entire show).

I hope you enjoy SciPy2017!

by Paul Ivanov at July 10, 2017 07:00 AM

July 07, 2017

numfocus

FEniCS Conference 2017 in Review

Jack Hale of FEniCS Project was kind enough to share his summary of the recent 2017 FEniCS conference, for which NumFOCUS provided some travel funds. Read on below! The FEniCS Conference 2017 brought together 82 participants from around the world for a conference on the FEniCS Project, a NumFOCUS fiscally sponsored project. FEniCS is an open-source computing […]

by Gina Helfrich at July 07, 2017 06:10 PM

July 05, 2017

numfocus

Belinda Weaver joins Carpentries as Community Development Lead

Belinda Weaver was recently hired by the Carpentries as their new Community Development Lead. We are delighted to welcome her to the NumFOCUS family! Here, Belinda introduces herself to the community and invites your participation and feedback on her work. I am very pleased to take up the role of Community Development Lead for Software […]

by Gina Helfrich at July 05, 2017 10:05 PM

Thomas Wiecki

What's new in PyMC3 3.1

We recently released PyMC3 3.1 after the first stable 3.0 release in January 2017. You can update either via pip install pymc3 or via conda install -c conda-forge pymc3.

A lot is happening in PyMC3-land. One thing I am particularly proud of is the developer community we have built. We now have around 10 active core contributors from the US, Germany, Russia, Japan and Switzerland. Specifically, since 3.0, Adrian Seyboldt, Junpeng Lao and Hannes Bathke have joined the team. Moreover, we have 3 Google Summer of Code students: Maxime Kochurov, who is working on Variational Inference; Bill Engels, who is working on Gaussian Processes; and Bhargav Srinivasa, who is implementing Riemannian HMC.

Moreover, PyMC3 is seeing increased adoption in academia, as well as in industry.

Here, I want to highlight some of the new features of PyMC3 3.1.

Discourse forum + better docs

To facilitate the community building process and give users a place to ask questions, we have launched a discourse forum: http://discourse.pymc.io. Bug reports should still go onto the GitHub issue tracker, but for all PyMC3 questions or modeling discussions, please use the discourse forum.

There are also some improvements to the documentation. Mainly, a quick-start to the general PyMC3 API, and a quick-start to the variational API.

Gaussian Processes

PyMC3 now has high-level support for GPs, which allow for very flexible non-linear curve-fitting (among other things). This work was mainly done by Bill Engels with help from Chris Fonnesbeck. Here, we highlight the basic API, but for more information see the full introduction.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.cm as cmap
cm = cmap.inferno

import numpy as np
import scipy as sp
import seaborn as sns

import theano
import theano.tensor as tt
import theano.tensor.nlinalg

import pymc3 as pm
In [3]:
np.random.seed(20090425)
n = 20
X = pm.floatX(np.sort(3*np.random.rand(n))[:,None])

# generate fake data from GP with white noise (with variance sigma2)
y = pm.floatX(
    np.array([ 1.36653628,  1.15196999,  0.82142869,  0.85243384,  0.63436304,
               0.14416139,  0.09454237,  0.32878065,  0.51946622,  0.58603513,
               0.46938673,  0.63876778,  0.48415033,  1.28011185,  1.52401102,
               1.38430047,  0.47455605, -0.21110139, -0.49443319, -0.25518805])
)
In [4]:
Z = pm.floatX(np.linspace(0, 3, 100)[:, None])

with pm.Model() as model:
    # priors on the covariance function hyperparameters and noise
    l = pm.Uniform('l', 0, 10)
    log_s2_f = pm.Uniform('log_s2_f', lower=-10, upper=5)
    log_s2_n = pm.Uniform('log_s2_n', lower=-10, upper=5)
    f_cov = tt.exp(log_s2_f) * pm.gp.cov.ExpQuad(1, l)

    # Instantiate GP
    y_obs = pm.gp.GP('y_obs', cov_func=f_cov, sigma=tt.exp(log_s2_n), 
                     observed={'X': X, 'Y': y})
    
    trace = pm.sample()
    
    # Draw samples from GP
    gp_samples = pm.gp.sample_gp(trace, y_obs, Z, samples=50, random_seed=42)
Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 27.649:   6%|▌         | 12091/200000 [00:09<02:15, 1386.23it/s]
Convergence archived at 12100
Interrupted at 12,100 [6%]: Average Loss = 9,348
100%|██████████| 1000/1000 [00:20<00:00, 49.74it/s]
100%|██████████| 50/50 [00:12<00:00,  3.93it/s]
In [5]:
fig, ax = plt.subplots(figsize=(9, 5))

[ax.plot(Z, x, color=cm(0.3), alpha=0.3) for x in gp_samples]
# overlay the observed data
ax.plot(X, y, 'ok', ms=10);
ax.set(xlabel="x", ylabel="f(x)", title="Posterior predictive distribution");

Improvements to NUTS

NUTS is now identical to Stan's implementation and also much, much faster. In addition, Adrian Seyboldt added higher-order integrators, which promise to be more efficient in higher dimensions, and sampler statistics that help identify problems with NUTS sampling.

In addition, we changed the default kwargs of pm.sample(). By default, the sampler is run for 500 iterations with tuning enabled (you can change this with the tune kwarg); these samples are then discarded from the returned trace. Moreover, if no arguments are specified, sample() will draw 500 samples in addition to the tuning samples. So for almost all models, just calling pm.sample() should be sufficient.

In [12]:
with pm.Model():
    mu1 = pm.Normal("mu1", mu=0, sd=1, shape=1000)
    trace = pm.sample(discard_tuned_samples=False) # do not remove tuned samples for the plot below
Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 7.279:  14%|█▍        | 28648/200000 [00:08<00:53, 3176.47it/s] 
Convergence archived at 28900
Interrupted at 28,900 [14%]: Average Loss = 8.9536
100%|██████████| 1000/1000 [00:03<00:00, 263.60it/s]

trace now has a bunch of extra parameters pertaining to statistics of the sampler:

In [13]:
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(trace['step_size_bar']); ax.set(xlabel='iteration', ylabel='step size');

Variational Inference

Maxim "Ferrine" Kochurov has made outstanding contributions to improve support for Variational Inference. Essentially, Ferrine has implemented Operator Variational Inference (OPVI), which is a framework to express many existing VI approaches in a modular fashion. He has also made it much easier to supply mini-batches. See here for a full overview of the capabilities.

Specifically, PyMC3 supports the following VI methods:

  • Auto-diff Variational Inference (ADVI) mean-field
  • ADVI full rank
  • Stein Variational Gradient Descent (SVGD)
  • Amortized SVGD

In addition, Ferrine is making great progress on adding Flows, which allow learning very flexible transformations of the VI approximation to capture more complex (i.e. non-normal) posterior distributions.

In [31]:
x = np.random.randn(10000)
x_mini = pm.Minibatch(x, batch_size=100)

with pm.Model():
    mu = pm.Normal('x', mu=0, sd=1)
    sd = pm.HalfNormal('sd', sd=1)
    obs = pm.Normal('obs', mu=mu, sd=sd, observed=x_mini)
    vi_est = pm.fit() # Run ADVI
    vi_trace = vi_est.sample() # sample from VI posterior
Average Loss = 149.38: 100%|██████████| 10000/10000 [00:01<00:00, 9014.13it/s]
Finished [100%]: Average Loss = 149.33
In [26]:
plt.plot(vi_est.hist)
plt.ylabel('ELBO'); plt.xlabel('iteration');
In [34]:
pm.traceplot(vi_trace);

As you can see, we have also added a new high-level API in the spirit of sample: pymc3.fit() with many configuration options:

In [36]:
help(pm.fit)
Help on function fit in module pymc3.variational.inference:

fit(n=10000, local_rv=None, method='advi', model=None, random_seed=None, start=None, inf_kwargs=None, **kwargs)
    Handy shortcut for using inference methods in functional way
    
    Parameters
    ----------
    n : `int`
        number of iterations
    local_rv : dict[var->tuple]
        mapping {model_variable -> local_variable (:math:`\mu`, :math:`\rho`)}
        Local Vars are used for Autoencoding Variational Bayes
        See (AEVB; Kingma and Welling, 2014) for details
    method : str or :class:`Inference`
        string name is case insensitive in {'advi', 'fullrank_advi', 'advi->fullrank_advi', 'svgd', 'asvgd'}
    model : :class:`pymc3.Model`
        PyMC3 model for inference
    random_seed : None or int
        leave None to use package global RandomStream or other
        valid value to create instance specific one
    inf_kwargs : dict
        additional kwargs passed to :class:`Inference`
    start : `Point`
        starting point for inference
    
    Other Parameters
    ----------------
    frac : `float`
        if method is 'advi->fullrank_advi' represents advi fraction when training
    kwargs : kwargs
        additional kwargs for :func:`Inference.fit`
    
    Returns
    -------
    :class:`Approximation`

SVGD, for example, is an algorithm that updates multiple particles and is thus well suited for multi-modal posteriors.

In [49]:
with pm.Model():
    pm.NormalMixture('m', 
                     mu=np.array([0., .5]), 
                     w=np.array([.4, .6]), 
                     sd=np.array([.1, .1]))
    vi_est = pm.fit(method='SVGD')
    vi_est = vi_est.sample(5000)
100%|██████████| 10000/10000 [00:24<00:00, 407.10it/s]
In [50]:
sns.distplot(vi_est['m'])
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f335208>

Cholesky factorization

There is a nice trick to covariance estimation using the Cholesky decomposition for increased efficiency and numerical stability. The MvNormal distribution now accepts a Cholesky-factored covariance matrix. In addition, the LKJ prior has been changed to provide the Cholesky covariance matrix. Thus, if you are estimating covariances, definitely use this much improved parameterization.

In [20]:
n_dim = 5
data = np.random.randn(100, n_dim)

with pm.Model() as model:
    # Note that we access the distribution for the standard
    # deviations, and do not create a new random variable.
    sd_dist = pm.HalfCauchy.dist(beta=2.5)
    packed_chol = pm.LKJCholeskyCov('chol_cov', eta=2, n=n_dim, 
                                    sd_dist=sd_dist)
    chol = pm.expand_packed_triangular(n_dim, packed_chol, lower=True)

    # Define a new MvNormal with the given covariance
    vals = pm.MvNormal('vals', mu=np.zeros(n_dim), 
                       chol=chol, shape=n_dim,
                       observed=data)
    trace = pm.sample()
Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 716.37:   8%|▊         | 15309/200000 [00:05<01:06, 2768.80it/s]
Convergence archived at 15400
Interrupted at 15,400 [7%]: Average Loss = 2,171.4
100%|██████████| 1000/1000 [00:06<00:00, 160.64it/s]

Live-trace to see sampling in real-time

This one is really cool: you can watch the trace evolve while it's sampling using pm.sample(live_plot=True). Contributed by David Brochart. See here for full docs.
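A minimal sketch of how this might be used (model and variable names are illustrative; an interactive matplotlib backend such as %matplotlib notebook is assumed so the plot can update in place):

%matplotlib notebook
import numpy as np
import pymc3 as pm

with pm.Model():
    mu = pm.Normal('mu', mu=0, sd=1)
    pm.Normal('obs', mu=mu, sd=1, observed=np.random.randn(100))
    trace = pm.sample(live_plot=True)  # the trace plot refreshes while sampling runs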

Better display of random variables

We now make use of the fancy display features of the Jupyter Notebook to provide a nicer view of RVs:

In [12]:
with pm.Model():
    μ = pm.Normal('μ', mu=0, sd=1)
    σ = pm.HalfNormal('σ', sd=1)
    γ = pm.Normal('γ', mu=μ, sd=σ, observed=np.random.randn(100))
In [13]:
γ
Out[13]:
$γ \sim \text{Normal}(\mathit{mu}=μ, \mathit{sd}=f(σ))$

You can create Greek letters in the Jupyter Notebook by typing the $\LaTeX$ command and hitting tab: \mu[tab].

GPU support (experimental)

While still experimental, we have made a lot of progress towards fully supporting float32 throughout. If you set floatX = float32 in your .theanorc and cast all your input data to float32 (e.g. by using pm.floatX() for automatic casting), you should get much faster inference. If you set the backend to use the GPU, you should get a nice speed-up on the right types of models. Please report any successes or failures in this regard.
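For reference, a minimal .theanorc sketch for this setup; the device value is an assumption that depends on your Theano version and GPU backend:

[global]
# use float32 everywhere; 'cuda' targets the newer GPU backend (assumption: Theano 0.9+), omit device to stay on CPU
floatX = float32
device = cuda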

Other useful packages

These are not part of PyMC3 3.1 but build on it and should be of interest to many users.

Bayesian Deep Learning with Gelato

Gelato bridges PyMC3 and Lasagne, a library to easily build Neural Networks similar to Keras. Building Bayesian convolution neural networks and estimating them using VI has never been simpler. See here for an example on MNIST.

Building hierarchical GLMs with Bambi

Bambi is a new package on top of PyMC3 (they also recently added a Stan backend) which allows creation of complex, hierarchical GLMs with very intuitive syntax, e.g.:

model.fit('rt ~ condition', random=['condition|subject', '1|stimulus'], samples=5000, chains=2).

Looking towards PyMC3 3.2

PyMC3 3.2 is already underway. Some of the features we are working on include:

On my own behalf: Patreon

I have recently created an account on Patreon where you can support me financially for writing blog posts. These allow me to devote more time to writing posts so if you find this blog useful, please consider supporting me.

Thanks specifically to Jonathan Ng for pledging.

by Thomas Wiecki at July 05, 2017 02:00 PM

Continuum Analytics news

Package Better with Conda Build 3

Friday, July 7, 2017
Michael Sarahan
Continuum Analytics

Handling version compatibility is one of the hardest challenges in building software. Until now, conda-build has provided helpful tools for constraining or pinning versions in recipes. The limitation of this capability was that it entailed editing a lot of recipes.

Conda-build 3 introduces a new scheme for controlling version constraints, which enhances behavior in two ways. First, you can now set versions in an external file and provide lists of versions for conda-build to loop over. Matrix builds are now much simpler and no longer require an external tool, such as conda-build-all. Second, there have been several new jinja2 functions added, which allow recipe authors to express their constraints relative to the versions of packages installed at build time. This dynamic expression greatly cuts down on the need for editing recipes.

Each of these developments has enabled interesting new capabilities for cross-compiling, as well as improved package compatibility through more intelligent constraints.

This document is intended as a quick overview of new features in conda-build 3. For more information, see the docs.

These demos use conda-build's python API to render and build recipes. That API currently does not have a docs page, but is pretty self explanatory. See the source at: https://github.com/conda/conda-build/blob/master/conda_build/api.py

This jupyter notebook itself is included in conda-build's tests folder. If you're interested in running this notebook yourself, see the tests/test-recipes/variants folder in a git checkout of the conda-build source. Tests are not included with conda packages of conda-build.

from conda_build import api 
import os 
from pprint import pprint

First, set up some helper functions that will output recipe contents in a nice-to-read way:

def print_yamls(recipe, **kwargs):
    yamls = [api.output_yaml(m[0])
             for m in api.render(recipe, verbose=False, permit_unsatisfiable_variants=True, **kwargs)]
    for yaml in yamls:
        print(yaml)
        print('-' * 50)
def print_outputs(recipe, **kwargs): 
    pprint(api.get_output_file_paths(recipe, verbose=False, **kwargs))

Most of the new functionality revolves around much more powerful use of jinja2 templates. The core idea is that there is now a separate configuration file that can be used to insert many different entries into your meta.yaml files.

!cat 01_basic_templating/meta.yaml

    package:
      name: abc
      version: 1.0

    requirements:
      build:
       - something {{ something }}
      run:
       - something {{ something }}

# The configuration is hierarchical - it can draw from many config files. One place they can live is alongside meta.yaml:
!cat 01_basic_templating/conda_build_config.yaml
 

    something:
      - 1.0
      - 2.0

 

Since we have one slot in meta.yaml, and two values for that one slot, we should end up with two output packages:

print_outputs('01_basic_templating/')
 
    Returning non-final recipe for abc-1.0-0; one or more dependencies was unsatisfiable:
    Build: something
    Host: None
 
    ['/Users/msarahan/miniconda3/conda-bld/osx-64/abc-1.0-h1332e90_0.tar.bz2',
    '/Users/msarahan/miniconda3/conda-bld/osx-64/abc-1.0-h9f70ef6_0.tar.bz2']

print_yamls('01_basic_templating/')
    package:
        name: abc
        version: '1.0'
    build:
        string: '0'
    requirements:
        build:
            - something 1.0
        run:
            - something 1.0
    extra:
        final: true
 
--------------------------------------------------
    package:
        name: abc
        version: '1.0'
    build:
        string: '0'
    requirements:
        build:
            - something 2.0
        run:
            - something 2.0
    extra:
        final: true
 
--------------------------------------------------

 

OK, that's fun already. But wait, there's more!

We saw a warning about "finalization." That's conda-build trying to figure out exactly what packages are going to be installed for the build process. This is all determined before the build. Doing so allows us to tell you the actual output filenames before you build anything. Conda-build will still render recipes if some dependencies are unavailable, but you obviously won't be able to actually build that recipe.

!cat 02_python_version/meta.yaml
 
    package:
      name: abc
      version: 1.0
 
    requirements:
        build:
            - python
        run:
            - python
 
!cat 02_python_version/conda_build_config.yaml
 
    python:
          - 2.7
          - 3.5
 
print_yamls('02_python_version/')
 
    package:
      name: abc
      version: '1.0'
 
    build:
        string: py27hef4ac7c_0
    requirements:
      build:
          - readline 6.2 2
          - tk 8.5.18 0
          - pip 9.0.1 py27_1
          - setuptools 27.2.0 py27_0
          - openssl 1.0.2l 0
          - sqlite 3.13.0 0
          - python 2.7.13 0
          - wheel 0.29.0 py27_0
          - zlib 1.2.8 3
      run:
          - python >=2.7,<2.8
 
    extra:
      final: true
 
--------------------------------------------------
    package:
        name: abc
        version: '1.0'
    build:
        string: py35h6785551_0
    requirements:
        build:
          - readline 6.2 2
          - setuptools 27.2.0 py35_0
          - tk 8.5.18 0
          - openssl 1.0.2l 0
          - sqlite 3.13.0 0
          - python 3.5.3 1
          - pip 9.0.1 py35_1
          - xz 5.2.2 1
          - wheel 0.29.0 py35_0
          - zlib 1.2.8 3
        run:
          - python >=3.5,<3.6
    extra:
        final: true
 
--------------------------------------------------

 

Here you see that we have many more dependencies than we specified, and we have much more detailed pinning. This is a finalized recipe. It represents exactly the state that would be present for building (at least on the current platform).

So, this new way to pass versions is very fun, but there's a lot of code out there that uses the older way of doing things—environment variables and CLI arguments. Those still work. They override any conda_build_config.yaml settings.

# Setting environment variables overrides the conda_build_config.yaml. This preserves older, well-established behavior.
os.environ["CONDA_PY"] = "3.4"
print_yamls('02_python_version/')
del os.environ['CONDA_PY']
 
    package:
        name: abc
        version: '1.0'
    build:
        string: py34h31af026_0
    requirements:
        build:
          - readline 6.2 2
          - python 3.4.5 0
          - setuptools 27.2.0 py34_0
          - tk 8.5.18 0
          - openssl 1.0.2l 0
          - sqlite 3.13.0 0
          - pip 9.0.1 py34_1
          - xz 5.2.2 1
          - wheel 0.29.0 py34_0
          - zlib 1.2.8 3
        run:
          - python >=3.4,<3.5
    extra:
        final: true
 
--------------------------------------------------

# Passing python as an argument (CLI or to the API) also overrides conda_build_config.yaml
print_yamls('02_python_version/', python="3.6")
 
 
    package:
        name: abc
        version: '1.0'
    build:
        string: py36hd0a5620_0
    requirements:
        build:
          - readline 6.2 2
          - wheel 0.29.0 py36_0
          - tk 8.5.18 0
          - python 3.6.1 2
          - openssl 1.0.2l 0
          - sqlite 3.13.0 0
          - pip 9.0.1 py36_1
          - xz 5.2.2 1
          - setuptools 27.2.0 py36_0
          - zlib 1.2.8 3
        run:
          - python >=3.6,<3.7
 
    extra:
        final: true
 
--------------------------------------------------

 

Wait a minute—what is that h7d013e7 gobbledygook in the build/string field?

Conda-build 3 aims to generalize pinning/constraints. Such constraints differentiate a package. For example, in the past, we have had things like py27np111 in filenames. This is the same idea, just generalized. Since we can't readily put every possible constraint into the filename, we have kept the old ones, but added the hash as a general solution.

There's more information about what goes into a hash at: https://conda.io/docs/building/variants.html#differentiating-packages-built-with-different-variants

Let's take a look at how to inspect the hash contents of a built package.

outputs = api.build('02_python_version/', python="3.6",
                    anaconda_upload=False)
pkg_file = outputs[0]
print(pkg_file)
 
    The following NEW packages will be INSTALLED:
        openssl: 1.0.2l-0
        pip: 9.0.1-py36_1
        python: 3.6.1-2
        readline: 6.2-2
        setuptools: 27.2.0-py36_0
        sqlite: 3.13.0-0
        tk: 8.5.18-0
        wheel: 0.29.0-py36_0
        xz: 5.2.2-1
        zlib: 1.2.8-3
 
    source tree in: /Users/msarahan/miniconda3/conda-bld/abc_1498787283909/work
    Attempting to finalize metadata for abc
 
    INFO:conda_build.metadata:Attempting to finalize metadata for abc
 
    BUILD START: ['abc-1.0-py36hd0a5620_0.tar.bz2']
    Packaging abc
 
    INFO:conda_build.build:Packaging abc
 
    The following NEW packages will be INSTALLED:
        openssl: 1.0.2l-0
        pip: 9.0.1-py36_1
        python: 3.6.1-2
        readline: 6.2-2
        setuptools: 27.2.0-py36_0
        sqlite: 3.13.0-0
        tk: 8.5.18-0
        wheel: 0.29.0-py36_0
        xz: 5.2.2-1
        zlib: 1.2.8-3
 
    Packaging abc-1.0-py36hd0a5620_0
 
    INFO:conda_build.build:Packaging abc-1.0-py36hd0a5620_0
 
    number of files: 0
    Fixing permissions
    Fixing permissions
    updating: abc-1.0-py36hd0a5620_0.tar.bz2
    Nothing to test for: /Users/msarahan/miniconda3/conda-bld/osx-64/abc-1.0-py36hd0a5620_0.tar.bz2
    # Automatic uploading is disabled
    # If you want to upload package(s) to anaconda.org later, type:
 
    anaconda upload /Users/msarahan/miniconda3/conda-bld/osx-64/abc-1.0-py36hd0a5620_0.tar.bz2
 
    # To have conda build upload to anaconda.org automatically, use
    # $ conda config --set anaconda_upload yes
 
    anaconda_upload is not set. Not uploading wheels: []
    /Users/msarahan/miniconda3/conda-bld/osx-64/abc-1.0-py36hd0a5620_0.tar.bz2

# Using command line here just to show you that this command exists.
!conda inspect hash-inputs ~/miniconda3/conda-bld/osx-64/abc-1.0-py36hd0a5620_0.tar.bz2
 
    Package abc-1.0-py36hd0a5620_0 does not include recipe. Full hash information is not reproducible.
    WARNING:conda_build.inspect:Package abc-1.0-py36hd0a5620_0 does not include recipe. Full hash information is not reproducible.
    {'abc-1.0-py36hd0a5620_0': {'files': [],
                                'recipe': {'requirements': {'build': ['openssl '
                                                                      '1.0.2l 0',
                                                                      'pip 9.0.1 '
                                                                      'py36_1',
                                                                      'python '
                                                                      '3.6.1 2',
                                                                      'readline '
                                                                      '6.2 2',
                                                                      'setuptools '
                                                                      '27.2.0 '
                                                                      'py36_0',
                                                                      'sqlite '
                                                                      '3.13.0 0',
                                                                      'tk 8.5.18 0',
                                                                      'wheel '
                                                                      '0.29.0 '
                                                                      'py36_0',
                                                                      'xz 5.2.2 1',
                                                                      'zlib 1.2.8 '
                                                                      '3'],
                                                            'run': ['python '
                                                                    '>=3.6,<3.7']}}}}

 

pin_run_as_build is a special extra key in the config file. It is a generalization of the x.x concept that existed for numpy since 2015. There's more information at: https://conda.io/docs/building/variants.html#customizing-compatibility

Each x indicates another level of pinning in the output recipe. Let's take a look at how we can control the relationship of these constraints. Before now, you could certainly accomplish pinning; it just took more work. Now you can define your pinning expressions and then change your target versions in just one config file.
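For reference, a hypothetical conda_build_config.yaml entry using pin_run_as_build might look like the following (the package name and pin level are illustrative, not taken from the recipes in this post):

    pin_run_as_build:
      libpng:
        max_pin: x.x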

!cat 05_compatible/meta.yaml
 
    package:
      name: compatible
      version: 1.0
 
    requirements:
      build:
        - libpng
      run:
        - {{ pin_compatible('libpng') }}

 

This is effectively saying "add a runtime libpng constraint that follows conda-build's default behavior, relative to the version of libpng that was used at build time."

pin_compatible is a new helper function available to you in meta.yaml. The default behavior is: exact version match lower bound ("x.x.x.x.x.x.x"), next major version upper bound ("x").

print_yamls('05_compatible/')
 
    package:
      name: compatible
      version: '1.0'
    build:
      string: h3d53989_0
    requirements:
      build:
        - libpng 1.6.27 0
        - zlib 1.2.8 3
      run:
        - libpng >=1.6.27,<2
    extra:
      final: true
 
--------------------------------------------------

 

These constraints are completely customizable with pinning expressions:

!cat 06_compatible_custom/meta.yaml
 
    package:
      name: compatible
      version: 1.0
 
    requirements:
      build:
        - libpng
      run:
        - {{ pin_compatible('libpng', max_pin='x.x') }}
 
print_yamls('06_compatible_custom/')
 
    package:
      name: compatible
      version: '1.0'
    build:
      string: ha6c6d66_0
    requirements:
      build:
        - libpng 1.6.27 0
        - zlib 1.2.8 3
      run:
        - libpng >=1.6.27,<1.7
    extra:
      final: true
 
--------------------------------------------------

 

Finally, you can also manually specify version bounds. These supersede any relative constraints.

!cat 07_compatible_custom_lower_upper/meta.yaml
 
    package:
      name: compatible
      version: 1.0
 
    requirements:
      build:
        - libpng
      run:
        - {{ pin_compatible('libpng', min_pin=None, upper_bound='5.0') }}
 
print_yamls('07_compatible_custom_lower_upper/')
 
    package:
      name: compatible
      version: '1.0'
    build:
      string: heb31dda_0
    requirements:
      build:
        - libpng 1.6.27 0
        - zlib 1.2.8 3
      run:
        - libpng <5.0
    extra:
      final: true
 
--------------------------------------------------

 

Much of the development of conda-build 3 has been inspired by improving the compiler toolchain situation. Conda-build 3 adds special support for more dynamic specification of compilers.

!cat 08_compiler/meta.yaml
 
    package:
      name: cross
      version: 1.0
 
    requirements:
      build:
        - {{ compiler('c') }}

 

By replacing any actual compiler with this jinja2 function, we're free to swap in different compilers based on the contents of the conda_build_config.yaml file (or other variant configuration). Rather than saying "I need gcc," we are saying "I need a C compiler."

By doing so, recipes are much more dynamic, and conda-build also helps to keep your recipes in line with respect to runtimes. We're also free to keep compilation and linking flags associated with specific "compiler" packages—allowing us to build against potentially multiple configurations (Release, Debug?). With cross compilers, we could also build for other platforms.

!cat 09_cross/meta.yaml
 
    package:
      name: cross
      version: 1.0
 
    requirements:
      build:
        - {{ compiler('c') }}

# But, by adding in a base compiler name, and target platforms, we can make a build matrix
# This is not magic, the compiler packages must already exist. Conda-build is only following a naming scheme.
!cat 09_cross/conda_build_config.yaml
 
 
    c_compiler:
      - gcc
    target_platform:
      - linux-64
      - linux-cos5-64
      - linux-aarch64
 
print_yamls('09_cross/')
 
    Returning non-final recipe for cross-1.0-0; one or more dependencies was unsatisfiable:
    Build: gcc_linux-64
    Host: None
    WARNING:conda_build.render:
    Returning non-final recipe for cross-1.0-0; one or more dependencies was unsatisfiable:
    Build: gcc_linux-64
    Host: None
    Returning non-final recipe for cross-1.0-0; one or more dependencies was unsatisfiable:
    Build: gcc_linux-cos5-64
    Host: None
    WARNING:conda_build.render:Returning non-final recipe for cross-1.0-0; one or more dependencies was unsatisfiable:
    Build: gcc_linux-cos5-64
    Host: None
    Returning non-final recipe for cross-1.0-0; one or more dependencies was unsatisfiable:
    Build: gcc_linux-aarch64
    Host: None
    WARNING:conda_build.render:Returning non-final recipe for cross-1.0-0; one or more dependencies was unsatisfiable:
    Build: gcc_linux-aarch64
    Host: None
 
 
    package:
      name: cross
      version: '1.0'
    build:
      string: '0'
    requirements:
      build:
        - gcc_linux-64
    extra:
      final: true
 
--------------------------------------------------
    package:
      name: cross
      version: '1.0'
    build:
      string: '0'
    requirements:
      build:
        - gcc_linux-cos5-64
    extra:
      final: true
 
--------------------------------------------------
    package:
      name: cross
      version: '1.0'
    build:
      string: '0'
    requirements:
      build:
        - gcc_linux-aarch64
    extra:
      final: true
 
--------------------------------------------------

 

Finally, it is frequently a problem to remember to add runtime dependencies. Sometimes the recipe author is not entirely familiar with the lower level code and has no idea about runtime dependencies. Other times, it's just a pain to keep versions of runtime dependencies in line. Conda-build 3 introduces a way of storing the required runtime dependencies on the package providing the dependency at build time.

For example, using g++ in a non-static configuration will require that the end-user have a sufficiently new libstdc++ runtime library available at runtime. Many people don't currently include this in their recipes. Sometimes the system libstdc++ is adequate, but often not. By imposing the downstream dependency, we can make sure that people don't forget the runtime dependency.

# First, a package that provides some library.
# When anyone uses this library, they need to include the appropriate runtime.
 
!cat 10_runtimes/uses_run_exports/meta.yaml
 
 
    package:
      name: package_has_run_exports
      version: 1.0
    build:
      run_exports:
        - {{ pin_compatible('bzip2') }}
    requirements:
      build:
        - bzip2

# This is the simple downstream package that uses the library provided in the previous recipe.
!cat 10_runtimes/consumes_exports/meta.yaml
 
    package:
      name: package_consuming_run_exports
      version: 1.0
    requirements:
      build:
        - package_has_run_exports

# Let's build the former package first.
api.build('10_runtimes/uses_run_exports', anaconda_upload=False)
 
 
    The following NEW packages will be INSTALLED:
        bzip2: 1.0.6-3
    source tree in: /Users/msarahan/miniconda3/conda-bld/package_has_run_exports_1498787302719/work
    Attempting to finalize metadata for package_has_run_exports
 
    INFO:conda_build.metadata:Attempting to finalize metadata for package_has_run_exports
 
    BUILD START: ['package_has_run_exports-1.0-hcc78ab3_0.tar.bz2']
    Packaging package_has_run_exports
 
    INFO:conda_build.build:Packaging package_has_run_exports
 
    The following NEW packages will be INSTALLED:
        bzip2: 1.0.6-3
    number of files: 0
    Fixing permissions
    Fixing permissions
    updating: package_has_run_exports-1.0-hcc78ab3_0.tar.bz2
    Nothing to test for: /Users/msarahan/miniconda3/conda-bld/osx-64/package_has_run_exports-1.0-hcc78ab3_0.tar.bz2
    # Automatic uploading is disabled
    # If you want to upload package(s) to anaconda.org later, type:
 
    anaconda upload /Users/msarahan/miniconda3/conda-bld/osx-64/package_has_run_exports-1.0-hcc78ab3_0.tar.bz2
     
    # To have conda build upload to anaconda.org automatically, use
    # $ conda config --set anaconda_upload yes
 
    anaconda_upload is not set. Not uploading wheels: []
 
 
    ['/Users/msarahan/miniconda3/conda-bld/osx-64/package_has_run_exports-1.0-hcc78ab3_0.tar.bz2']
 
 
print_yamls('10_runtimes/consumes_exports')
 
    package:
      name: package_consuming_run_exports
      version: '1.0'
    build:
      string: h8346d2f_0
    requirements:
      build:
        - package_has_run_exports 1.0 hcc78ab3_0
      run:
        - bzip2 >=1.0.6,<2
    extra:
      final: true
 
--------------------------------------------------

 

In the above recipe, note that bzip2 has been added as a runtime dependency, and is pinned according to conda-build's default pin_compatible scheme. This behavior can be overridden in recipes if necessary, but we hope it will prove useful.

by swebster at July 05, 2017 02:49 AM

July 04, 2017

Matthieu Brucher

On the physics of FTL in space opera books

Almost 2 years ago, I published an article on space operas. I just read the last few books in some series, and it’s quite interesting to see the differences in how faster-than-light travel works.

I won’t talk about Star Wars or Star Trek; I’m more into The Lost Fleet and similar series.

So for Jack Campbell’s Lost Fleet, there are two ways of traveling. The first is natural jumps between systems, due to some unknown physics. Then, on top of that, you have an alien technology that allows you to jump between arches. The jumps are interesting because you need several days to cross from one to the other.

In a similar way, you have Mike Shepherd’s Kris Longknife/Vicky Peterwald universe. Here, you have jumps between star systems that are instantaneous and alien-made. They don’t require an arch, so they are similar to Campbell’s first kind. And then there are other, hidden jumps that are also alien-made but not known by everyone. The interesting thing is that while Campbell’s treatment of speed makes sense, Shepherd’s does not as much: he talks about acceleration rather than speed itself, so you never know how fast the ships are actually going. On a side note, Shepherd makes lots of references to Campbell’s universe in his latest books.

I also read some of Evan Currie’s Odyssey One. Here there are no jumps: there is an FTL drive that lots of species have, except Earth, whose ships instead travel via tachyons. At some point in the past, people thought that tachyons could travel faster than light, so it makes some sense to use them for instantaneous travel and sonar-like scanning. The fact that the aliens are not using this technology makes me think that it is too dangerous, and that they evolved a better, safer technology for FTL, even if it is not instantaneous like the other.

The last one I’d like to talk about is Joshua Dalzelle’s Omega Force universe. Here we don’t have jumps, but we don’t have instantaneous travel either. If one day we do figure out FTL, I think this universe is the closest to what we will actually have.

All these universes have other differences too, of course, mainly in their target audience. Campbell is aimed at strategists, Shepherd at young adults, Currie at space explorers and Dalzelle at space-RPG fans. There are other series that I liked but where FTL is not that important (like Ancillary Justice), and they are still lots of fun.

by Matt at July 04, 2017 07:15 AM

July 03, 2017

Matthew Rocklin

Dask Benchmarks

This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

We measure the performance of Dask’s distributed scheduler for a variety of different workloads under increasing scales of both problem and cluster size. This helps to answer questions about dask’s scalability and also helps to educate readers on the sorts of computations that scale well.

We will vary our computations in a few ways to see how they stress performance. We consider the following:

  1. Computational and communication patterns like embarrassingly parallel, fully sequential, bulk communication, many-small communication, nearest neighbor, tree reductions, and dynamic graphs.
  2. Varying task duration ranging from very fast (microsecond) tasks, to 100ms and 1s long tasks. Faster tasks make it harder for the central scheduler to keep up with the workers.
  3. Varying cluster size from one two-core worker to 256 two-core workers and varying dataset size which we scale linearly with the number of workers. This means that we’re measuring weak scaling.
  4. Varying APIs between tasks, multidimensional arrays and dataframes all of which have cases in the above categories but depend on different in-memory computational systems like NumPy or Pandas.

We will start with benchmarks for straight tasks, which are the most flexible system and also the easiest to understand. This will help us to understand scaling limits on arrays and dataframes.

Note: we did not tune our benchmarks or configuration at all for these experiments. They are well below what is possible, but perhaps representative of what a beginning user might experience upon setting up a cluster without expertise or thinking about configuration.

A Note on Benchmarks and Bias

you can safely skip this section if you’re in a rush

This is a technical document, not a marketing piece. These benchmarks adhere to the principles laid out in this blogpost and attempt to avoid those pitfalls around developer bias. In particular the following are true:

  1. We decided on a set of benchmarks before we ran them on a cluster
  2. We did not improve the software or tweak the benchmarks after seeing the results. These were run on the current release of Dask in the wild that was put out weeks ago, not on a development branch.
  3. The computations were constructed naively, as a novice would write them. They were not tweaked for extra performance.
  4. The cluster was configured naively, without attention to scale or special parameters

We estimate that expert use would result in about a 5-10x scaling improvement over what we’ll see. We’ll detail how to improve scaling with expert methods at the bottom of the post.

All that being said the author of this blogpost is paid to write this software and so you probably shouldn’t trust him. We invite readers to explore things independently. All configuration, notebooks, plotting code, and data are available below:


Tasks

We start by benchmarking the task scheduling API. Dask’s task scheduling APIs are at the heart of the other “big data” APIs (like dataframes). We start with tasks because they’re the simplest and most raw representation of Dask. Mostly we’ll run the following functions on integers, but you could fill in any function here, like a pandas dataframe method or sklearn routine.

import time

def inc(x):
    return x + 1

def add(x, y):
    return x + y

def slowinc(x, delay=0.1):
    time.sleep(delay)
    return x + 1

def slowadd(x, y, delay=0.1):
    time.sleep(delay)
    return x + y

def slowsum(L, delay=0.1):
    time.sleep(delay)
    return sum(L)

Embarrassingly Parallel Tasks

We run the following code on our cluster and measure how long they take to complete:

futures = client.map(slowinc, range(4 * n), delay=1) # 1s delay
wait(futures)
futures = client.map(slowinc, range(100 * n_cores)) # 100ms delay
wait(futures)
futures = client.map(inc, range(n_cores * 200))     # fast
wait(futures)

We see that for fast tasks the system can process around 2000-3000 tasks per second. This is mostly bound by scheduler and client overhead. Adding more workers into the system doesn’t give us any more tasks per second. However if our tasks take any amount of time (like 100ms or 1s) then we see decent speedups.

If you switch to linear scales on the plots, you’ll see that as we get out to 512 cores we start to slow down by about a factor of two. I’m surprised to see this behavior (hooray benchmarks) because all of Dask’s scheduling decisions are independent of cluster size. My first guess is that the scheduler may be being swamped with administrative messages, but we’ll have to dig in a bit deeper here.

Tree Reduction

Not all computations are embarrassingly parallel. Many computations have dependencies between them. Consider a tree reduction, where we combine neighboring elements until there is only one left. This stresses task dependencies and small data movement.

from dask import delayed

L = range(2**7 * n)
while len(L) > 1:  # while there is more than one element left
    # add neighbors together
    L = [delayed(slowadd)(a, b) for a, b in zip(L[::2], L[1::2])]

L[0].compute()

We see similar scaling to the embarrassingly parallel case. Things proceed linearly until they get to around 3000 tasks per second, at which point they fall behind linear scaling. Dask doesn’t seem to mind dependencies, even custom situations like this one.

Nearest Neighbor

Nearest neighbor computations are common in data analysis when you need to share a bit of data between neighboring elements, as frequently occurs in timeseries computations on dataframes, overlapping image processing on arrays, or PDE computations.

L = range(20 * n)
L = client.map(slowadd, L[:-1], L[1:])
L = client.map(slowadd, L[:-1], L[1:])
wait(L)

Scaling is similar to the tree reduction case. Interesting dependency structures don’t incur significant overhead or scaling costs.

Sequential

We consider a computation that isn’t parallel at all, but is instead highly sequential. Increasing the number of workers shouldn’t help here (there is only one thing to do at a time) but this does demonstrate the extra stresses that arise from a large number of workers. Note that we have turned off task fusion for this, so here we’re measuring how many roundtrips can occur between the scheduler and worker every second.

x = 1

for i in range(100):
    x = delayed(inc)(x)

x.compute()

So we get something like 100 roundtrips per second, or around 10ms roundtrip latencies. It turns out that a decent chunk of this cost was due to an optimization; workers prefer to batch small messages for higher throughput. In this case that optimization hurts us. Still though, we’re about 2-4x faster than video frame-rate here (video runs at around 24Hz or 40ms between frames).

Client in the loop

Finally we consider a reduction that consumes whichever futures finish first and adds them together. This is an example of using client-side logic within the computation, which is often helpful in complex algorithms. This also scales a little bit better because there are fewer dependencies to track within the scheduler. The client takes on a bit of the load.

from dask.distributed import as_completed
futures = client.map(slowinc, range(n * 20))

pool = as_completed(futures)
batches = pool.batches()

while True:
    try:
        batch = next(batches)
        if len(batch) == 1:
            batch += next(batches)
    except StopIteration:
        break
    future = client.submit(slowsum, batch)
    pool.add(future)

Tasks: Complete

We show most of the plots from above for comparison.

Arrays

When we combine NumPy arrays with the task scheduling system above we get dask.array, a distributed multi-dimensional array. This section shows computations like the last section (maps, reductions, nearest-neighbor), but now these computations are motivated by actual data-oriented computations and involve real data movement.

Create Dataset

We make a square array with somewhat random data. This array scales with the number of cores. We cut it into uniform chunks of size 2000 by 2000.

N = int(5000 * math.sqrt(n_cores))
x = da.random.randint(0, 10000, size=(N, N), chunks=(2000, 2000))
x = x.persist()
wait(x)

Creating this array is embarrassingly parallel. There is an odd corner in the graph here that I’m not able to explain.

Elementwise Computation

We perform some numerical computation element-by-element on this array.

y = da.sin(x) ** 2 + da.cos(x) ** 2
y = y.persist()
wait(y)

This is also embarrassingly parallel. Each task here takes around 300ms (the time it takes to call this on a single 2000 by 2000 numpy array chunk).

Reductions

We reduce the array, here by computing its standard deviation; reductions like this are implemented as tree reductions.

x.std().compute()

Random Access

We get a single element from the array. This shouldn’t get any faster with more workers, but it may get slower depending on how much base-line load a worker adds to the scheduler.

x[1234, 4567].compute()

We get around 400-800 bytes per second, which translates to response times of 10-20ms, two to four times faster than video framerate. We see that performance does degrade once we have a hundred or so active connections.

Communication

We add the array to its transpose. This forces different chunks to move around the network so that they can add to each other. Roughly half of the array moves on the network.

y = x + x.T
y = y.persist()
wait(y)

The task structure of this computation is something like nearest-neighbors. It has a regular pattern with a small number of connections per task. It’s really more a test of the network hardware, which we see does not impose any additional scaling limitations (this looks like normal slightly-sub-linear scaling).

Rechunking

Sometimes communication is composed of many small transfers. For example if you have a time series of images so that each image is a chunk, you might want to rechunk the data so that all of the time values for each pixel are in a chunk instead. Doing this can be very challenging because every output chunk requires a little bit of data from every input chunk, resulting in potentially n-squared transfers.

y = x.rechunk((20000, 200)).persist()
wait(y)

y = y.rechunk((200, 20000)).persist()
wait(y)

This computation can be very hard. We see that dask does it more slowly than fast computations like reductions, but it still scales decently well up to hundreds of workers.

Nearest Neighbor

Dask.array includes the ability to overlap small bits of neighboring blocks to enable functions that require a bit of continuity like derivatives or spatial smoothing functions.

y = x.map_overlap(slowinc, depth=1, delay=0.1).persist()
wait(y)

Array Complete

DataFrames

We can combine Pandas Dataframes with Dask to obtain Dask dataframes, distributed tables. This section will be much like the last section on arrays but will instead focus on pandas-style computations.

Create Dataset

We make an array of random integers with ten columns and two million rows per core, cut into chunks of one million rows each. We turn this into a dataframe of integers:

x = da.random.randint(0, 10000, size=(N, 10), chunks=(1000000, 10))
df = dd.from_dask_array(x).persist()
wait(df)

Elementwise

We can perform 100ms tasks or try out a bunch of arithmetic.

y = df.map_partitions(slowinc, meta=df).persist()
wait(y)
y = (df[0] + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10).persist()
wait(y)

Random access

Similarly we can try random access with loc.

df.loc[123456].compute()

Reductions

We can try reductions along the full dataset or a single series:

df.std().compute()
df[0].std().compute()

Groupby aggregations like df.groupby(...).column.mean() operate very similarly to reductions, just with slightly more complexity.

df.groupby(0)[1].mean().compute()

Shuffles

However operations like df.groupby(...).apply(...) are much harder to accomplish because we actually need to construct the groups. This requires a full shuffle of all of the data, which can be quite expensive.

This is also the same operation that occurs when we sort or call set_index.

df.groupby(0).apply(len).compute()  # this would be faster as df.groupby(0).size()
y = df.set_index(1).persist()
wait(y)

This still performs decently and scales well out to a hundred or so workers.

Timeseries operations

Timeseries operations often require nearest neighbor computations. Here we look at rolling aggregations, but cumulative operations, resampling, and so on are all much the same.

y = df.rolling(5).mean().persist()
wait(y)

Dataframes: Complete

Analysis

Let’s start with a few main observations:

  1. The longer your individual tasks take, the better Dask (or any distributed system) will scale. As you increase the number of workers you should also endeavor to increase average task size, for example by increasing the in-memory size of your array chunks or dataframe partitions.
  2. The Dask scheduler + Client currently maxes out at around 3000 tasks per second. Another way to put this is that if our computations take 100ms then we can saturate about 300 cores, which is more or less what we observe here (see the quick check after this list).
  3. Adding dependencies is generally free in modest cases such as in a reduction or nearest-neighbor computation. It doesn’t matter what structure your dependencies take, as long as parallelism is still abundant.
  4. Adding more substantial dependencies, such as in array rechunking or dataframe shuffling, can be more costly, but Dask's collection algorithms (array, dataframe) are built to keep these operations scaling as the cluster grows.
  5. The scheduler seems to slow down at 256 workers, even for long task lengths. This suggests that we may have an overhead issue that needs to be resolved.
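
As a quick sanity check of observation 2, here is the back-of-the-envelope arithmetic in code form (numbers taken from the observations above):

# Rough check: with ~3000 tasks/s of scheduler throughput and 100 ms tasks,
# the number of cores we can keep busy is roughly throughput * task duration.
tasks_per_second = 3000
task_duration = 0.100                      # seconds
cores_saturated = tasks_per_second * task_duration
print(cores_saturated)                     # about 300 cores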

Expert Approach

So given our experience here, let’s now tweak settings to make Dask run well. We want to avoid two things:

  1. Lots of independent worker processes
  2. Lots of small tasks

So let's change some things:

  1. Bigger workers: Rather than have 256 two-core workers let's deploy 32 sixteen-core workers.
  2. Bigger chunks: Rather than have 2000 by 2000 numpy array chunks let's bump this up to 10,000 by 10,000.

    Rather than 1,000,000 row Pandas dataframe partitions let's bump this up to 10,000,000.

    These sizes are still well within comfortable memory limits. Each is about a gigabyte in our case. (A minimal sketch of these bigger-chunk datasets follows this list.)
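
A minimal sketch of what those bigger-chunk datasets look like, reusing the names (math, n_cores, da, dd) from the snippets above; deploying the 32 sixteen-core workers themselves happens outside Python (for example with dask-kubernetes, listed in the Tooling section below):

# Same square array as before, but with 10,000 x 10,000 chunks
# instead of 2,000 x 2,000 chunks.
N = int(5000 * math.sqrt(n_cores))
x = da.random.randint(0, 10000, size=(N, N), chunks=(10000, 10000)).persist()

# Same integer dataframe as before, but with 10,000,000-row partitions
# instead of 1,000,000-row partitions.
df = dd.from_dask_array(
    da.random.randint(0, 10000, size=(N, 10), chunks=(10000000, 10))
).persist()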

When we make these changes we find that all metrics improve at larger scales. Some notable improvements are included in a table below (sorry for not having pretty plots in this case).

Benchmark                                   Small    Big      Unit
Tasks: Embarrassingly parallel              3500     3800     tasks/s
Array: Elementwise sin(x)**2 + cos(x)**2    2400     6500     MB/s
DataFrames: Elementwise arithmetic          9600     66000    MB/s
Arrays: Rechunk                             4700     4800     MB/s
DataFrames: Set index                       1400     1000     MB/s

We see that for some operations we can get significant improvements (dask.dataframe is now churning through data at roughly 60 GB/s), while for other operations that are largely scheduler- or network-bound this doesn't strongly improve the situation (and sometimes hurts).

Still though, even with naive settings we’re routinely pushing through 10s of gigabytes a second on a modest cluster. These speeds are available for a very wide range of computations.

Final thoughts

Hopefully these notes help people to understand Dask’s scalability. Like all tools it has limits, but even under normal settings Dask should scale well out to a hundred workers or so. Once you reach this limit you might want to start taking other factors into consideration, especially threads-per-worker and block size, both of which can help push well into the thousands-of-cores range.

The included notebooks are self contained, with code to both run and time the computations as well as produce the Bokeh figures. I would love to see other people reproduce these benchmarks (or others!) on different hardware or with different settings.

Tooling

This blogpost made use of the following tools:

  1. Dask-kubernetes: for deploying clusters of varying sizes on Google compute engine
  2. Bokeh: for plotting (gallery)
  3. gcsfs: for storage on Google cloud storage

July 03, 2017 12:00 AM

June 30, 2017

Continuum Analytics news

Continuum Analytics Appoints Aaron Barfoot As Chief Financial Officer

Wednesday, June 28, 2017

AUSTIN, TEXAS—June 28, 2017 – Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced Aaron Barfoot as the company’s new chief financial officer (CFO). Barfoot, a cloud hosting industry veteran and former executive at ClearDATA and Rackspace, will oversee both finance and accounting operations as the company continues to experience rapid growth and reinforces its footprint as the leader in Open Data Science.

“Our company is going through a substantial growth period. 2016 marked more than 11 million downloads, an increase of more than eight million from the previous year, and we’re upwards of 25 million today with more than 4 million active users, and we're seeing similar growth in product revenue,” said Scott Collison, CEO. “It’s the perfect time to welcome Aaron to the team; his background in finance and capital structure for large companies will prove instrumental to Continuum Analytics’ continued success.”
 
Anaconda adoption has increased by 37 percent from 2016 to 2017 according to a recent KDNuggets Data Science Poll
 
“Continuum Analytics’ growth numbers point to an undeniable need for the critical insights data science  delivers. The data it uncovers significantly impacts critical business outcomes and has the power to change the course of a company,” said Barfoot. “As the data science market begins to push $140 billion, I’m thrilled to join the Anaconda team and contribute to the success of the company as it poises for continued growth.”  
 
Mr. Barfoot previously held the position of vice president of finance for Rackspace (RAX) during the company’s six-year period of significant growth and success in the marketplace. His strategic planning and initiatives helped grow revenue from $362 million in 2007 to $1.5 billion in 2013 while also driving better margins and improving capital returns. He played a key role in acquisitions, provided financial support during the company’s $188 million IPO, and developed several financial planning systems to maximize resources and staffing.
 
Aaron Barfoot holds a Bachelor of Science in Economics from Baylor University.
 
About Anaconda Powered by Continuum Analytics
Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io.
 
Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

by swebster at June 30, 2017 03:15 PM

June 29, 2017

Enthought

What’s New in the Canopy Data Import Tool Version 1.1

New features in the Canopy Data Import Tool Version 1.1:
Support for Pandas v0.20, Excel/CSV export capabilities, and more

We’re pleased to announce a significant new feature release of the Canopy Data Import Tool, version 1.1. The Data Import Tool allows users to quickly and easily import CSVs and other structured text files into Pandas DataFrames through a graphical interface, manipulate the data, and create reusable Python scripts to speed future data wrangling. Here are some of the notable updates in version 1.1:

1. Support for PyQt
The Data Import Tool now supports both PyQt and PySide backends. Python 3 support will also be available shortly.

2. Exporting DataFrames to csv/xlsx file formats
We understand that data exploration and manipulation are only one part of your data analysis process, which is why the Data Import Tool now provides a way for you to save the DataFrame as a CSV/XLSX file. This way, you can share processed data with your colleagues or feed this processed file to the next step in your data analysis pipeline.

3. Column Sort Indicators
In earlier versions of the Data Import Tool, it was not obvious that clicking on the right end of a column header sorted that column. With this release, we added sort indicators to every column, which can be clicked to sort the column in ascending or descending order. And because sorting on a single column is often not enough for complex data, we also made sorting in the Data Import Tool stable (i.e., sorting preserves any existing order in the DataFrame).

4. Support for Pandas versions 0.19.2 – 0.20.1
Version 1.1 of the Data Import Tool now supports 0.19.2 and 0.20.1 versions of the Pandas library.

5. Column Name Selection
If duplicate column names exist in the data file, Pandas automatically mangles them to create unique column names. This mangling can be buggy at times, especially if there is whitespace around the column names. The Data Import Tool corrects this behavior to give a consistent user experience. Until the last release, this was being done under the hood by the Tool. With this release, we changed the Tool’s behavior to explicitly point out what columns are being renamed and how.

6. Template Discovery
With this release, we updated how a Template file is chosen for a given input file. If multiple template files are discovered to be relevant, we choose the latest. We also sped up loading data from files if a relevant Template is used.

For those of you new to the Data Import Tool, a Template file contains all of the commands you executed on the raw data using the Data Import Tool. A Template file is created when a DataFrame is successfully imported into the IPython console in the Canopy Editor. Further, a unique Template file is created for every data file.

Using Template files, you can save your progress and when you later reload the data file into the Tool, the Tool will automatically discover and load the right Template for you, letting you start off from where you left things.

7. Increased Cell Copy and Data Loading Speeds
Copying cells has been sped up significantly. We also sped up loading data from large files (>70MB in size).


Using the Data Import Tool in Practice: a Machine Learning Use Case

In theory, we could look at the various machine learning models that can be used to solve our problems and jump right to training and testing the models.

However, in reality, a large amount of time is invested in data cleaning and preparation. More often than not, real-life data cannot simply be fed to a machine learning model directly; there could be missing values, or the data might need further processing to remove unnecessary details and to join columns to generate a clean and concise dataset.

That’s where the Data Import Tool comes in. The Pandas library has made data cleaning and processing much easier, and now the Data Import Tool makes it A LOT easier still. By letting you visually clean your dataset, whether that means removing, converting or joining columns, the Data Import Tool allows you to operate on the DataFrame visually and see the outcome of each operation immediately. Not only that, the Data Import Tool is stateful, meaning that every command can be reverted and changes can be undone.

To give you a real world example, let’s look at the training and test datasets from the Occupancy detection dataset. The dataset contains 8 columns of data, the first column contains index values, the second column contains DateTime values and the rest contain numerical values.

As soon as you try loading the dataset, you might get an error. This is because the dataset contains a header row with names for only 7 columns, while the rest of the dataset contains 8 columns of data, including the index column. Because of this, we have to skip the first row of data, which can be done from the Edit Command pane of the ReadData command.

After we set `Number of rows to skip` to `1` and click `Refresh Data`, we should see the DataFrame we expect from the raw data. You might notice that the Data Import tool automatically converted the second column of data into a `DateTime` column. The DIT infers the type of data in a column and automatically performs the necessary conversions. Similarly, the last column was converted into a Boolean column because it represents the Occupancy, with values 0/1.

As we can see from the raw data, the first column in the data contains index values. We can access the `SetIndex` command from the right-click menu on the `ID` column.

Alongside automatic conversions, the DIT generates the relevant Python/Pandas code, which can be saved from the `Save -> Save Code` sub menu item. The complete code generated when we loaded the training data set can be seen below:

# -*- coding: utf-8 -*-
import pandas as pd

# Pandas version check
from pkg_resources import parse_version
if parse_version(pd.__version__) != parse_version('0.19.2'):
    raise RuntimeError('Invalid pandas version')

from catalyst.pandas.convert import to_bool, to_datetime
from catalyst.pandas.headers import get_stripped_columns

# Read Data from datatest.txt
filename = 'occupancy_data/datatest.txt'
data_frame = pd.read_table(
    filename,
    delimiter=',', encoding='utf-8', skiprows=1,
    keep_default_na=False, na_values=['NA', 'N/A', 'nan', 'NaN', 'NULL', ''], comment=None,
    header=None, thousands=None, skipinitialspace=True,
    mangle_dupe_cols=True, quotechar='"',
    index_col=False
)

# Ensure stripping of columns
data_frame = get_stripped_columns(data_frame)

# Type conversion for the following columns: 1, 7
for column in ['7']:
    valid_bools = {0: False, 1: True, 'true': True, 'f': False, 't': True, 'false': False}
    data_frame[column] = to_bool(data_frame[column], valid_bools)
for column in ['1']:
    data_frame[column] = to_datetime(data_frame[column])

As you can see, the generated script shows how the training data can be loaded into a DataFrame using Pandas, how the relevant columns can be converted to Bool and DateTime type and how a column can be set as the Index of the DataFrame. We can trivially modify this script to perform the same operations on the other datasets by replacing the filename.
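
For example, a minimal sketch of that modification (the wrapper name load_occupancy and the training file name are ours, and the read_table arguments are abbreviated from the generated script above) might look like:

import pandas as pd

from catalyst.pandas.convert import to_bool, to_datetime
from catalyst.pandas.headers import get_stripped_columns


def load_occupancy(filename):
    # Same read, strip, and type-conversion steps as the generated script,
    # parameterized on the file name.
    data_frame = pd.read_table(
        filename,
        delimiter=',', encoding='utf-8', skiprows=1,
        keep_default_na=False,
        na_values=['NA', 'N/A', 'nan', 'NaN', 'NULL', ''],
        header=None, index_col=False, skipinitialspace=True,
    )
    data_frame = get_stripped_columns(data_frame)
    valid_bools = {0: False, 1: True, 'true': True, 'f': False,
                   't': True, 'false': False}
    data_frame['7'] = to_bool(data_frame['7'], valid_bools)
    data_frame['1'] = to_datetime(data_frame['1'])
    return data_frame


train = load_occupancy('occupancy_data/datatraining.txt')  # file name assumed
test = load_occupancy('occupancy_data/datatest.txt')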

Finally, not only does the Data Import Tool generate and autosave a Python/Pandas script for each of the commands applied, it also saves them into a nifty Template file. The Template file aids in reproducibility and speeds up the analysis process.

Once you have successfully cleaned the training data, every subsequent time you load it with the Data Import Tool, the Tool will automatically apply the commands and operations you previously ran.

Further, because the training and test datasets are similar and need the same data cleaning operations, once we have cleaned the training dataset, loading the test dataset will cause the Tool to recognize that it is similar to the training data and automatically apply the same operations.

The datasets are available at – https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+#


Ready to try the Canopy Data Import Tool?

Download Canopy (free) and click on the icon to start a free trial of the Data Import Tool today.

(NOTE: The automatic free trial is currently only available for Python 2.7 users. Python 3 users may request a free trial by emailing canopy.support@enthought.com. All paid Canopy subscribers have immediate access to the Data Import Tool for Python 2, and for Python 3 as soon as it is supported.)

 


We encourage you to update to the latest version of the Data Import Tool in Canopy’s Package Manager (search for the “catalyst” package) to make the most of these updates.

For a complete list of changes, please refer to the Release Notes for Version 1.1 of the Tool here. Refer to the Enthought Knowledge Base for Known Issues with the Tool.

Finally, if you would like to provide us feedback regarding the Data Import Tool, write to us at canopy.support@enthought.com.


Additional resources:

Related blogs:

Watch a 2-minute demo video to see how the Canopy Data Import Tool works:

See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging:

The post What’s New in the Canopy Data Import Tool Version 1.1 appeared first on Enthought Blog.

by admin at June 29, 2017 08:30 PM

June 28, 2017

Matthew Rocklin

Programmatic Bokeh Servers

This work is supported by Continuum Analytics

This was cross posted to the Bokeh blog here. Please consider referencing and sharing that post via social media instead of this one.

This blogpost shows how to start a very simple bokeh server application programmatically. For more complex examples, or for the more standard command line interface, see the Bokeh documentation.

Motivation

Many people know Bokeh as a tool for building web visualizations from languages like Python. However I find that Bokeh’s true value is in serving live-streaming, interactive visualizations that update with real-time data. I personally use Bokeh to serve real-time diagnostics for a distributed computing system. In this case I embed Bokeh directly into my library. I’ve found it incredibly useful and easy to deploy sophisticated and beautiful visualizations that help me understand the deep inner-workings of my system.

Most of the (excellent) documentation focuses on stand-alone applications using the Bokeh server

$ bokeh serve myapp.py

However, as a developer who wants to integrate Bokeh into my own application, starting up a separate process from the command line doesn't work for me. I also find that starting things from Python tends to be a bit simpler on my brain. So I thought I'd provide some examples of how to do this from within a Jupyter notebook.

Launch Bokeh Servers from a Notebook

The code below starts a Bokeh server running on port 5000 that provides a single route to / that serves a single figure with a line-plot. The imports are a bit wonky, but the amount of code necessary here is relatively small.

from bokeh.server.server import Server
from bokeh.application import Application
from bokeh.application.handlers.function import FunctionHandler
from bokeh.plotting import figure, ColumnDataSource

def make_document(doc):
    fig = figure(title='Line plot!', sizing_mode='scale_width')
    fig.line(x=[1, 2, 3], y=[1, 4, 9])

    doc.title = "Hello, world!"
    doc.add_root(fig)

apps = {'/': Application(FunctionHandler(make_document))}

server = Server(apps, port=5000)
server.start()

We make a function make_document which is called every time someone visits our website. This function can create plots, call functions, and generally do whatever it wants. Here we make a simple line plot and register that plot with the document with the doc.add_root(...) method.

This starts a Tornado web server and builds a new document whenever someone connects, much like a web framework such as Flask builds a new response for each request. In this case our web server piggybacks on the Jupyter notebook's own IOLoop. Because Bokeh is built on Tornado, it plays nicely with other async applications like Tornado or Asyncio.

Live Updates

I find that Bokeh’s real strength comes when you want to stream live data into the browser. Doing this by hand generally means serializing your data on the server, figuring out how web sockets work, sending the data to the client/browser and then updating plots in the browser.

Bokeh handles this by keeping a synchronized table of data on the client and the server, the ColumnDataSource. If you define plots around the column data source and then push more data into the source then Bokeh will handle the rest. Updating your plots in the browser just requires pushing more data into the column data source on the server.

In the example below every time someone connects to our server we make a new ColumnDataSource, make an update function that adds a new record into it, and set up a callback to call that function every 100ms. We then make a plot around that data source to render the data as colored circles.

Because this is a new Bokeh server we start this on a new port, though in practice if we had multiple pages we would just add them as multiple routes in the apps variable.

import random

def make_document(doc):
    source = ColumnDataSource({'x': [], 'y': [], 'color': []})

    def update():
        new = {'x': [random.random()],
               'y': [random.random()],
               'color': [random.choice(['red', 'blue', 'green'])]}
        source.stream(new)

    doc.add_periodic_callback(update, 100)

    fig = figure(title='Streaming Circle Plot!', sizing_mode='scale_width',
                 x_range=[0, 1], y_range=[0, 1])
    fig.circle(source=source, x='x', y='y', color='color', size=10)

    doc.title = "Now with live updating!"
    doc.add_root(fig)

apps = {'/': Application(FunctionHandler(make_document))}

server = Server(apps, port=5001)
server.start()
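
If we did want several pages on one server, a minimal sketch of the apps dictionary might look like this (the route names are illustrative, and the two handlers stand in for the two make_document functions above, renamed so they can coexist):

# Serve both examples from one server under different routes.
apps = {'/line': Application(FunctionHandler(make_line_document)),        # line-plot example
        '/stream': Application(FunctionHandler(make_streaming_document))} # live-updating example

server = Server(apps, port=5002)
server.start()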

By changing around the figures (or combining multiple figures, text, other visual elements, and so on) you have full freedom over the visual styling of your web service. By changing around the update function you can pull data from sensors, shove in more interesting data, and so on. This toy example is meant to provide the skeleton of a simple application; hopefully you can fill in details from your application.

Real example

Here is a simple example taken from Dask’s dashboard that maintains a streaming time series plot with the number of idle and saturated workers in a Dask cluster.

from time import time
from bokeh.models import DataRange1d, PanTool, ResetTool, WheelZoomTool

# `scheduler` and `**kwargs` come from the surrounding dashboard code.
def make_document(doc):
    source = ColumnDataSource({'time': [time(), time() + 1],
                               'idle': [0, 0.1],
                               'saturated': [0, 0.1]})

    x_range = DataRange1d(follow='end', follow_interval=20000, range_padding=0)

    fig = figure(title="Idle and Saturated Workers Over Time",
                 x_axis_type='datetime', y_range=[-0.1, len(scheduler.workers) + 0.1],
                 height=150, tools='', x_range=x_range, **kwargs)
    fig.line(source=source, x='time', y='idle', color='red')
    fig.line(source=source, x='time', y='saturated', color='green')
    fig.yaxis.minor_tick_line_color = None

    fig.add_tools(
        ResetTool(reset_size=False),
        PanTool(dimensions="width"),
        WheelZoomTool(dimensions="width")
    )

    doc.add_root(fig)

    def update():
        result = {'time': [time() * 1000],
                  'idle': [len(scheduler.idle)],
                  'saturated': [len(scheduler.saturated)]}
        source.stream(result, 10000)

    doc.add_periodic_callback(update, 100)

You can also have buttons, sliders, widgets, and so on. I rarely use these personally though so they don’t interest me as much.

Final Thoughts

I’ve found the Bokeh server to be incredibly helpful in my work and also very approachable once you understand how to set one up (as you now do). I hope that this post serves people well. This blogpost is available as a Jupyter notebook if you want to try it out yourself.

June 28, 2017 12:00 AM

Use Apache Parquet

This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

This is a tiny blogpost to encourage you to use Parquet instead of CSV for your dataframe computations. I’ll use Dask.dataframe here but Pandas would work just as well. I’ll also use my local laptop here, but Parquet is an excellent format to use on a cluster.

CSV is convenient, but slow

I have the NYC taxi cab dataset on my laptop stored as CSV

mrocklin@carbon:~/data/nyc/csv$ ls
yellow_tripdata_2015-01.csv  yellow_tripdata_2015-07.csv
yellow_tripdata_2015-02.csv  yellow_tripdata_2015-08.csv
yellow_tripdata_2015-03.csv  yellow_tripdata_2015-09.csv
yellow_tripdata_2015-04.csv  yellow_tripdata_2015-10.csv
yellow_tripdata_2015-05.csv  yellow_tripdata_2015-11.csv
yellow_tripdata_2015-06.csv  yellow_tripdata_2015-12.csv

This is a convenient format for humans because we can read it directly.

mrocklin@carbon:~/data/nyc/csv$ head yellow_tripdata_2015-01.csv
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-01-15 19:05:39,2015-01-15 19:23:42,1,1.59,-73.993896484375,40.750110626220703,1,N,-73.974784851074219,40.750617980957031,1,12,1,0.5,3.25,0,0.3,17.05
1,2015-01-10 20:33:38,2015-01-10 20:53:28,1,3.30,-74.00164794921875,40.7242431640625,1,N,-73.994415283203125,40.759109497070313,1,14.5,0.5,0.5,2,0,0.3,17.8
1,2015-01-10 20:33:38,2015-01-10 20:43:41,1,1.80,-73.963340759277344,40.802787780761719,1,N,-73.951820373535156,40.824413299560547,2,9.5,0.5,0.5,0,0,0.3,10.8
1,2015-01-10 20:33:39,2015-01-10 20:35:31,1,.50,-74.009086608886719,40.713817596435547,1,N,-74.004325866699219,40.719985961914063,2,3.5,0.5,0.5,0,0,0.3,4.8
1,2015-01-10 20:33:39,2015-01-10 20:52:58,1,3.00,-73.971176147460938,40.762428283691406,1,N,-74.004180908203125,40.742652893066406,2,15,0.5,0.5,0,0,0.3,16.3
1,2015-01-10 20:33:39,2015-01-10 20:53:52,1,9.00,-73.874374389648438,40.7740478515625,1,N,-73.986976623535156,40.758193969726563,1,27,0.5,0.5,6.7,5.33,0.3,40.33
1,2015-01-10 20:33:39,2015-01-10 20:58:31,1,2.20,-73.9832763671875,40.726009368896484,1,N,-73.992469787597656,40.7496337890625,2,14,0.5,0.5,0,0,0.3,15.3
1,2015-01-10 20:33:39,2015-01-10 20:42:20,3,.80,-74.002662658691406,40.734142303466797,1,N,-73.995010375976563,40.726325988769531,1,7,0.5,0.5,1.66,0,0.3,9.96
1,2015-01-10 20:33:39,2015-01-10 21:11:35,3,18.20,-73.783042907714844,40.644355773925781,2,N,-73.987594604492187,40.759357452392578,2,52,0,0.5,0,5.33,0.3,58.13

We can use tools like Pandas or Dask.dataframe to read in all of this data. Because the data is large-ish, I’ll use Dask.dataframe

mrocklin@carbon:~/data/nyc/csv$ du -hs .
22G .
In [1]: import dask.dataframe as dd

In [2]: %time df = dd.read_csv('yellow_tripdata_2015-*.csv')
CPU times: user 340 ms, sys: 12 ms, total: 352 ms
Wall time: 377 ms

In [3]: df.head()
Out[3]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         2  2015-01-15 19:05:39   2015-01-15 19:23:42                1
1         1  2015-01-10 20:33:38   2015-01-10 20:53:28                1
2         1  2015-01-10 20:33:38   2015-01-10 20:43:41                1
3         1  2015-01-10 20:33:39   2015-01-10 20:35:31                1
4         1  2015-01-10 20:33:39   2015-01-10 20:52:58                1

   trip_distance  pickup_longitude  pickup_latitude  RateCodeID  \
0           1.59        -73.993896        40.750111           1
1           3.30        -74.001648        40.724243           1
2           1.80        -73.963341        40.802788           1
3           0.50        -74.009087        40.713818           1
4           3.00        -73.971176        40.762428           1

  store_and_fwd_flag  dropoff_longitude  dropoff_latitude  payment_type \
0                  N         -73.974785         40.750618             1
1                  N         -73.994415         40.759109             1
2                  N         -73.951820         40.824413             2
3                  N         -74.004326         40.719986             2
4                  N         -74.004181         40.742653             2

   fare_amount  extra  mta_tax  tip_amount  tolls_amount  \
0         12.0    1.0      0.5        3.25           0.0
1         14.5    0.5      0.5        2.00           0.0
2          9.5    0.5      0.5        0.00           0.0
3          3.5    0.5      0.5        0.00           0.0
4         15.0    0.5      0.5        0.00           0.0

   improvement_surcharge  total_amount
0                    0.3         17.05
1                    0.3         17.80
2                    0.3         10.80
3                    0.3          4.80
4                    0.3         16.30

In [4]: from dask.diagnostics import ProgressBar

In [5]: ProgressBar().register()

In [6]: df.passenger_count.sum().compute()
[########################################] | 100% Completed |  3min 58.8s
Out[6]: 245566747

We were able to ask questions about this data (and learn that around 250 million passengers rode New York cabs in 2015) even though it is too large to fit into memory. This is because Dask is able to operate lazily from disk. It reads in the data on an as-needed basis and then forgets it when it no longer needs it. This takes a while (about four minutes) but does just work.

However, when we read this data many times from disk we start to become frustrated by this four minute cost. In Pandas we suffered this cost once as we moved data from disk to memory. On larger datasets when we don’t have enough RAM we suffer this cost many times.

Parquet is faster

Let's try this same process with Parquet. I happen to have the same exact data stored in Parquet format on my hard drive.

mrocklin@carbon:~/data/nyc$ du -hs nyc-2016.parquet/
17G nyc-2016.parquet/

It is stored as a bunch of individual files, but we don’t actually care about that. We’ll always refer to the directory as the dataset. These files are stored in binary format. We can’t read them as humans

mrocklin@carbon:~/data/nyc$ head nyc-2016.parquet/part.0.parquet
<a bunch of illegible bytes>

But computers are much more able to both read and navigate this data. Let's do the same experiment as before:

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_parquet('nyc-2016.parquet/')

In [3]: df.head()
Out[3]:
  tpep_pickup_datetime  VendorID tpep_dropoff_datetime  passenger_count  \
0  2015-01-01 00:00:00         2   2015-01-01 00:00:00                3
1  2015-01-01 00:00:00         2   2015-01-01 00:00:00                1
2  2015-01-01 00:00:00         1   2015-01-01 00:11:26                5
3  2015-01-01 00:00:01         1   2015-01-01 00:03:49                1
4  2015-01-01 00:00:03         2   2015-01-01 00:21:48                2

    trip_distance  pickup_longitude  pickup_latitude  RateCodeID  \
0           1.56        -74.001320        40.729057           1
1           1.68        -73.991547        40.750069           1
2           4.00        -73.971436        40.760201           1
3           0.80        -73.860847        40.757294           1
4           2.57        -73.969017        40.754269           1

  store_and_fwd_flag  dropoff_longitude  dropoff_latitude  payment_type  \
0                  N         -74.010208         40.719662             1
1                  N           0.000000          0.000000             2
2                  N         -73.921181         40.768269             2
3                  N         -73.868111         40.752285             2
4                  N         -73.994133         40.761600             2

   fare_amount  extra  mta_tax  tip_amount  tolls_amount  \
0          7.5    0.5      0.5         0.0           0.0
1         10.0    0.0      0.5         0.0           0.0
2         13.5    0.5      0.5         0.0           0.0
3          5.0    0.5      0.5         0.0           0.0
4         14.5    0.5      0.5         0.0           0.0

   improvement_surcharge  total_amount
0                    0.3           8.8
1                    0.3          10.8
2                    0.0          14.5
3                    0.0           6.3
4                    0.3          15.8

In [4]: from dask.diagnostics import ProgressBar

In [5]: ProgressBar().register()

In [6]: df.passenger_count.sum().compute()
[########################################] | 100% Completed |  2.8s
Out[6]: 245566747

Same values, but now our computation happens in three seconds, rather than four minutes. We’re cheating a little bit here (pulling out the passenger count column is especially easy for Parquet) but generally Parquet will be much faster than CSV. This lets us work from disk comfortably without worrying about how much memory we have.

Convert

So do yourself a favor and convert your data

In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('csv/yellow_tripdata_2015-*.csv')
In [3]: from dask.diagnostics import ProgressBar
In [4]: ProgressBar().register()
In [5]: df.to_parquet('yellow_tripdata.parquet')
[############                            ] | 30% Completed |  1min 54.7s

If you want to be more clever you can specify dtypes and compression when converting. This can give you significantly greater speedups, but just using the default settings will still be a large improvement.
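
A minimal sketch of what that might look like (the dtype and date-column choices here are illustrative, not taken from the post):

import dask.dataframe as dd

# Read with explicit dtypes and parsed dates, then write compressed Parquet.
df = dd.read_csv('csv/yellow_tripdata_2015-*.csv',
                 dtype={'VendorID': 'uint8', 'passenger_count': 'uint8'},
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])
df.to_parquet('yellow_tripdata.parquet', compression='snappy')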

Advantages

Parquet enables the following:

  1. Binary representation of data, allowing for speedy conversion of bytes-on-disk to bytes-in-memory
  2. Columnar storage, meaning that you can load in as few columns as you need without loading the entire dataset (see the example after this list)
  3. Row-chunked storage so that you can pull out data from a particular range without touching the others
  4. Per-chunk statistics so that you can find subsets quickly
  5. Compression
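
Point 2 is what made the passenger_count sum above so quick; you can also ask for a subset of columns explicitly:

# Load only the column we need rather than the entire table.
df = dd.read_parquet('nyc-2016.parquet/', columns=['passenger_count'])
df.passenger_count.sum().compute()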

Parquet Versions

There are two nice Python packages with support for the Parquet format:

  1. pyarrow: Python bindings for the Apache Arrow and Apache Parquet C++ libraries
  2. fastparquet: a direct NumPy + Numba implementation of the Parquet format

Both are good. Both can do most things. Each has separate strengths. The code above used fastparquet by default but you can change this in Dask with the engine='arrow' keyword if desired.
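
For example:

# Use the Arrow-based reader instead of the fastparquet default.
df = dd.read_parquet('nyc-2016.parquet/', engine='arrow')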

June 28, 2017 12:00 AM

June 27, 2017

numfocus

Why Is NumPy Only Now Getting Funded?

Recently we announced that NumPy, a foundational package for scientific computing with Python, had received its first-ever grant funding since the start of the project in 2006. Community interest in this announcement was massive, raising our website traffic by over 2600%. The predominant reaction was one of surprise—how could it be that NumPy, of all projects, had never […]

by Gina Helfrich at June 27, 2017 07:40 PM


Matthieu Brucher

I tried MOOCs and my conclusions are a mix of black and white

Recently, I took two online classes from two different providers. After a first trial more than a year ago, I decided to give MOOCs another try, and I have a few conclusions to draw from them.

As I’ve said, I tried a class on Coursera some time ago on audio signal processing. I was quite disappointed: the content was very basic and the tests were broken. Most of the tests are automated, and you can retake them several times.

This already shows that the quality of a course and its tests depends a lot on the time the teachers spend building these systems. In that course, you would upload a Python pickle and the system would check the results. The issue was that the data for those tests was different from the data we could use to try and debug our code, was less sensible, and had constraints one could not anticipate.

But I tried again, first on Coursera with a course on Neural Networks. I took the class with the certificate just before they updated it to use Python. I was still lucky that someone ported their Matlab code to Python! There were a few mishaps, some formulas that were not obvious from the videos, but the tests were doable, the skeleton code was good and robust and the teacher knew his turf.

Then I went looking for a class on OpenGL. I found one on edX and followed it. And there it was exactly the opposite. The tests were quite easy for the first three assignments, but then came the last one… One assignment had you implement lighting in a skeleton shader, with a few transforms that were presented in the video, and the next had you implement a full raytracer with a parser. That step is just too high for the average programmer, especially when two out of the four units were spent on matrix multiplication at great length!
The bigger problem, though, was the OpenGL content itself. It was 10 years out of date (OK, 8), still teaching pre-3 OpenGL; even in the last assignment, where you finally got to play with shaders, half of the code required pre-3 OpenGL. This is not acceptable, not when you have to pay $99 for a certificate that is basically useless.

MOOCs are targeted at professionals, not students (the latter are still learning in real classrooms, so there is little benefit for them). But the quality of the classes is still quite random. On the same platform, you can have something great like the neural networks course, or something needlessly hard (hard to pass because of poor auto-grading, not hard to follow) like the audio signal processing one.
Each platform has its own way of presenting classes, and they are more or less similar; you can find your way around either. The real issue is judging the quality of a course and of its grading beforehand. Nothing is more annoying than not learning anything relevant, or than failing a test because the online grading tool expected a vertical vector when you provided a horizontal one that passed locally.

by Matt at June 27, 2017 07:29 AM

June 24, 2017

June 22, 2017

Gaël Varoquaux

Scikit-learn Paris sprint 2017

Two weeks ago, we held a large international sprint on scikit-learn in Paris. It was incredibly productive and fun, as always. We are still busy merging in the work, but I think that now is a good time to try to summarize the sprint.

A massive workforce

We had a mix of core contributors and newcomers, which is a great combination: it enables us to be productive while also fostering the next generation of core developers. Present were:

  • Albert Thomas
  • Alexandre Abadie
  • Alexandre Gramfort
  • Andreas Mueller
  • Arthur Imbert
  • Aurélien Bellet
  • Bertrand Thirion
  • Denis Engemann
  • Elvis Dohmatob
  • Gael Varoquaux
  • Jan Margeta
  • Joan Massich
  • Joris Van den Bossche
  • Laurent Direr
  • Lemaitre Guillaume
  • Loic Esteve
  • Mohamed Maskani Filali
  • Nathalie Vauquier
  • Nicolas Cordier
  • Nicolas Goix
  • Olivier Grisel
  • Patricio Cerda
  • Paul Lagrée
  • Raghav RV
  • Roman Yurchak
  • Sebastien Treger
  • Sergei Lebedev
  • Thierry Guillemot
  • Thomas Moreau
  • Tom Dupré la Tour
  • Vlad Niculae
  • Manoj Kumar (could not come to Paris because of visa issues)

And many more people participated remotely, and I am fairly certain that I have forgotten some.

Support and hosting

Hosting: As the sprint extended through a French bank holiday and the weekend, we were hosted in a variety of venues:

  • La paillasse, a Paris bio-hacker space
  • Criteo, a French company doing worldwide ad-banner placement. The venue there was absolutely gorgeous, with a beautiful terrace over the roofs of Paris, and they even hosted a social event with free drinks one evening.

Guillaume Lemaître did most of the organization, and at Criteo Ibrahim Abubakari was our host. We were treated like kings during the whole stay, each host welcoming us as well as they could.

Financial support by France is IA: Beyond our hosts, we need to thank France is IA, which funded the sprint, covering some of the lunches, accommodation, and travel expenses to bring in our contributors from abroad (3000 euros for travel & accommodation, and 1000 euros for food and a venue during the weekend).

Some achievements during the sprint

It would be hard to list everything that we did during the sprint (have a look at the development changelog if you're curious). Here are some highlights:

  • Quantile transformer, to map the data distribution onto a uniform or a Gaussian distribution (PR, example); see the sketch after this list:

    [Before / After plots of the data distribution under the quantile transform]

  • Memory savings by avoiding the cast to float64 when X is given as float32: we are slowly making sure that, as much as possible, all models avoid using an internal float64 representation when the data is given as float32. This significantly reduces memory usage and can give speed-ups of up to a factor of two.

  • API tests on instances rather than classes, to facilitate testing packages in scikit-learn-contrib.

  • Many small API fixes to ensure better consistency of models, as well as cleaning up the codebase and making sure that examples display well under matplotlib 2.x.

  • Many bug fixes, including fixing corner cases in our average precision, which was dear to me (PR).
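For readers who want to try the new transformer, here is a minimal sketch using the class as it later shipped in scikit-learn's preprocessing module (sklearn.preprocessing.QuantileTransformer); the lognormal toy data is only an illustration.

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))          # heavily skewed toy data

# Map the empirical distribution onto a uniform or a Gaussian target
X_uniform = QuantileTransformer(output_distribution='uniform').fit_transform(X)
X_gauss = QuantileTransformer(output_distribution='normal').fit_transform(X)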

Work soon to be merged

  • ColumnTransformer (PR): from a pandas dataframe to a feature matrix, by applying different transformers to different columns (see the sketch after this list).
  • Fixing t-SNE (PR): our t-SNE implementation was extremely memory-inefficient, and on top of this had minor bugs. We are fixing it.
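As a rough preview of what ColumnTransformer enables, here is a sketch written against the API that eventually landed in sklearn.compose; the toy dataframe and column names are made up for illustration.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'age': [25, 32, 47],
    'income': [40000.0, 52000.0, 81000.0],
    'city': ['Paris', 'Lyon', 'Paris'],
})

# Apply a different transformer to each group of columns, then concatenate
preprocess = ColumnTransformer([
    ('numeric', StandardScaler(), ['age', 'income']),
    ('categorical', OneHotEncoder(), ['city']),
])

X = preprocess.fit_transform(df)   # ready-to-use feature matrix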

There is a lot more pending work that the sprint helped move forward. You can also glance at the monthly activity report on GitHub.

Joblib progress

Joblib, the parallel-computing engine used by scikit-learn, is getting extended to work in distributed settings, for instance using dask distributed as a backend. At the sprint, we made progress running a grid-search on Criteo’s Hadoop cluster.
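As an illustration of that usage pattern, here is a hedged sketch of a distributed grid-search, mirroring the joblib backend registration shown in the Dask 0.15 release notes further down this page; the scheduler address is a placeholder.

import distributed.joblib  # registers the dask.distributed joblib backend (API of that era)
from sklearn.externals.joblib import parallel_backend
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

digits = load_digits()
search = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'gamma': [1e-3, 1e-4]}, n_jobs=-1)

# Fan the fits out to the cluster instead of local processes
with parallel_backend('dask.distributed', scheduler_host='scheduler-address:8786'):
    search.fit(digits.data, digits.target)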

by Gaël Varoquaux at June 22, 2017 10:00 PM

June 21, 2017

Continuum Analytics news

It’s Getting Hot, Hot, Hot: Four Industries Turning Up The Data Science Heat

Wednesday, June 21, 2017
Christine Doig
Sr. Data Scientist, Product Manager

Summer 2017 has officially begun. As temperatures continue to rise, so does the use of data science across dozens of industries. In fact, IBM predicts the demand for data scientists will increase by 28 percent in just three short years, and our own survey recently revealed that 96 percent of company executives conclude data science is critical to business success. While it’s clear that health care providers, financial institutions and retail organizations are harnessing the growing power of data science, it’s time for more industries to turn up the data science heat. We take a peek below at some of the up and comers.  
 
Aviation and Aerospace
As data science continues to reach for the sky, it’s only fitting that the aviation industry is also on track to leverage this revolutionary technology. Airlines and passengers generate an abundance of data everyday, but are not currently harnessing the full potential of this information. Through advanced analytics and artificial intelligence driven by data science, fuel consumption, flight routes and air congestion could be optimized to improve the overall flight experience. What’s more, technology fueled by data science could help aviation proactively avoid some of the delays and inefficiencies that burden both staff and passengers—airlines just need to take a chance and fly with it! 

Cybersecurity   
In addition to aviation, cybersecurity has become an increasingly hot topic during the past few years. The global cost of handling cyberattacks is expected to rise from $400 billion in 2015 to $2.1 trillion by 2019, but implementing technology driven by data science can help secure business data and reduce these attacks. By focusing on the abnormalities, using all available data and automating whenever possible, companies will have a better chance at standing up to threatening attacks. Not to mention, artificial intelligence software is already being used to defend cyber infrastructure. 
  
Construction
While improving data security is essential, the construction industry is another space that should take advantage of data science tools to improve business outcomes. As an industry that has long resisted change, some companies are now turning to data science technology to manage large teams, improve efficiency in the building process and reduce project delivery time, ultimately increasing profit margins. By embracing data analytics and these new technologies, the construction industry will also have more room to successfully innovate. 
 
Ecology
From aviation to cybersecurity to construction, it’s clear that product-focused industries are on track to leverage data science. But what about the more natural side of things? One example suggests ecologists can learn more about ocean ecosystems through the use of technology driven by data science. Through coding and the use of other data science tools, these environmental scientists found they could conduct better, more effective oceanic research in significantly less time. Our hope is for other scientists to continue these methods and unearth more pivotal information about our planet. 
 
So there you have it. Four industries that are beginning to harness the power of data science to help transform business processes, drive innovation and ultimately change the world. Who will the next four be?

 

by swebster at June 21, 2017 05:56 PM

Enthought

Enthought Announces Canopy 2.1: A Major Milestone Release for the Python Analysis Environment and Package Distribution

Python 3 and multi-environment support, new state of the art package dependency solver, and over 450 packages now available free for all users

Enthought Canopy logoEnthought is pleased to announce the release of Canopy 2.1, a significant feature release that includes Python 3 and multi-environment support, a new state of the art package dependency solver, and access to over 450 pre-built and tested scientific and analytic Python packages completely free for all users. We highly recommend that all current Canopy users upgrade to this new release.

Ready to dive in? Download Canopy 2.1 here.


For those currently familiar with Canopy, in this blog we’ll review the major new features in this exciting milestone release, and for those of you looking for a tool to improve your workflow with Python, or perhaps new to Python from a language like MATLAB or R, we’ll take you through the key reasons that scientists, engineers, data scientists, and analysts use Canopy to enable their work in Python.

First, let’s talk about the latest and greatest in Canopy 2.1!

  1. Support for Python 3 user environments: Canopy can now be installed with a Python 3.5 user environment. Users can benefit from all the Canopy features already available for Python 2.7 (syntax checking, debugging, etc.) in the new Python 3 environments. Python 3.6 is also available (and will be the standard Python 3 in Canopy 2.2).
  2. All 450+ Python 2 and Python 3 packages are now completely free for all users: Technical support, full installers with all packages for offline or shared installation, and the premium analysis environment features (graphical debugger, variable browser, and Data Import Tool) remain subscriber-exclusive benefits. See subscription options here to take advantage of those benefits.
  3. Built-in, state-of-the-art dependency solver (EDM, or Enthought Deployment Manager): the new EDM back end (which replaces the previous enpkg) provides additional features for robust package compatibility. EDM integrates a specialized dependency solver which automatically ensures you have a consistent package set after installation, removal, or upgrade of any packages.
  4. Environment bundles, which allow users to easily share environments directly with co-workers, or across various deployment solutions (such as the Enthought Deployment Server, continuous integration processes like Travis-CI and Appveyor, cloud solutions like AWS or Google Compute Engine, or deployment tools like Ansible or Docker). EDM environment bundles not only allow the user to replicate the set of installed dependencies but also support persistence for constraint modifiers, the list of manually installed packages, and the runtime version and implementation.
  5. Multi-environment support: with the addition of Python 3 environments and the new EDM back end, Canopy now also supports managing multiple Python environments from the user interface. You can easily switch between Python 2.7 and 3.5, or between multiple 2.7 or 3.5 environments. This is ideal especially for those migrating legacy code to Python 3, as it allows you to test as you transfer and also provides access to historical snapshots or libraries that aren’t yet available in Python 3.

 


Why Canopy is the Python platform of choice for scientists and engineers

Since 2001, Enthought has focused on making the scientific Python stack accessible and easy to use for both enterprises and individuals. For example, Enthought released the first scientific Python distribution in 2004, added robust and corporate support for NumPy on 64-bit Windows in 2011, and released Canopy 1.0 in 2013.

Since then, with its MATLAB-like experience, Canopy has enabled countless engineers, scientists and analysts to perform sophisticated analysis, build models, and create cutting-edge data science algorithms. Canopy’s all-in-one package distribution and analysis environment for Python has also been widely adopted in organizations who want to provide a single, unified platform that can be used by everyone from data analysts to software engineers.

Here are five of the top reasons that people choose Canopy as their tool for enabling data analysis, data modelling, and data visualization with Python:

1. Canopy provides a complete, self-contained installer that gets you up and running with Python and a library of scientific and analytic tools – fast

Canopy has been designed to provide a fast installation experience which not only installs the Canopy analysis environment but also the Python version of your choice (e.g. 2.7 or 3.5) and a core set of curated Python packages. The installation process can be executed in your home directory and does not require administrative privileges.

In just minutes, you’ll have a fully working Python environment with the primary tools for doing your work pre-installed: Jupyter, Matplotlib, NumPy and SciPy (optimized with the latest MKL from Intel), Scikit-learn, and Pandas, plus instant access to over 450 additional pre-built and tested scientific and analytic packages to customize your toolset.

No command line, no complex multi-stage setups! (although if you do prefer a flat, standalone command line interface for package and environment management, we offer that too via the EDM tool)

2. Access to a curated, quality assured set of packages managed through Canopy’s intuitive graphical package manager

The scientific Python ecosystem is gigantic and vibrant. Enthought is continuously updating its Enthought Python Distribution package set to provide the most recent “Enthought approved” versions of packages, with rigorous testing and quality assessment by our experts in the Python packaging ecosystem before release.

Our users can’t afford to take chances with the stability of their software and applications, and using Canopy as their gateway to the Python ecosystem helps take the risk out of the “wild west” of open source software. With more than 450 tested, pre-built and approved packages available in the Enthought Python Distribution, users can easily access both the most current stable version as well as historical versions of the libraries in the scientific Python stack.

Consistent with our focus on ease-of-use, Canopy provides a graphical package manager to easily search, install and remove packages from the user environment. You can also easily roll back to earlier versions of a package. The underlying EDM back end takes care of complex dependency management when installing, updating, and removing packages to ensure nothing breaks in the process.

3. Canopy is designed to be extensible for the enterprise

Canopy not only provides a consistent Python toolset for all 3 major operating systems and support for a wide variety of use cases (from data science to data analysis to modelling and even application development), but it is also extensible with other tools.

Canopy can easily be integrated with other software tools in use at enterprises, such as with Excel via PyXLL or with LabVIEW from National Instruments using the Python Integration Toolkit for LabVIEW. The built-in Canopy Data Import Tool helps you automate your data ingestion steps and automatically import tabular data files such as CSVs into Pandas DataFrames.

But it doesn’t stop there. If an enterprise has Python embedded in another software application, Canopy can be directly connected to that application to provide coding and debugging capabilities. Canopy itself can even be customized or embedded to provide a sophisticated Python interface for your applications. Contact us to learn more about these options.

Finally, in addition to accessing the libraries in the Enthought Python Distribution from Canopy, users can use the same tools to share and deploy their own internal, private packages by adding the Enthought Deployment Server.  The Enthought Deployment Server also allows enterprises to have a private,  onsite copy of the full Enthought Python Distribution on their own approved servers and compliant with their existing security protocols.

 

5. Canopy’s straightforward analysis environment, specifically tailored to the needs and workflow of scientists, analysts, and engineers

Three integrated features of the Canopy analysis environment combine to create a powerful, yet streamlined platform: (1) a code editor, (2) an interactive graphical debugger with variable browser, and (3) an IPython window.

  • Canopy’s code editor comes with everything required to write analysis code, but without the burden of advanced development environments like PyCharm or Microsoft Visual Studio (although, if needed, other IDEs can be configured to use the Canopy Python environment). With syntax highlighting, Python code auto-completion, and error checking, users can quickly write, navigate, and execute Python code.
  • Canopy’s interactive graphical debugger with variable browser helps you quickly find and fix code errors, understand and investigate code and data, and write new code more quickly.
  • The integrated IPython window lets you quickly test code, experiment with ideas and see the results of code run directly from the editor. Canopy also includes pre-configured Jupyter Notebook access.

Finally, access to package documentation at your fingertips in Canopy is a great benefit for faster coding. Canopy not only integrates online documentation and examples for many of the most used packages for data visualization, numerical analysis, machine learning, and more, but also lets you easily extract and execute code from that documentation so you can get started quickly.

We’re very excited for this major release and all of the new capabilities that it will enable for both individuals and enterprises, and encourage you to download or update to the new Canopy 2.1 today.

Have feedback on your experience with Canopy?

We’d love to hear about it! Contact the product development team at canopy.support@enthought.com.


Additional Resources:

Blog: New Year, New Enthought Products (Jan 2017)

Blog: Enthought Presents the Canopy Platform at the 2017 American Institute of Chemical Engineers (AIChE) Spring Meeting

Product pages:

The post Enthought Announces Canopy 2.1: A Major Milestone Release for the Python Analysis Environment and Package Distribution appeared first on Enthought Blog.

by dpinte at June 21, 2017 04:30 PM

June 20, 2017

Matthieu Brucher

Announcement: Audio TK 2.1.0

ATK is updated to 2.1.0 with a major refactoring of the Python wrappers and extensive testing of them. New filters were also added to support more complex pipelines (mute/solo and circular buffers for real-time spectrum displays), and Audio ToolKit now provides a CMake configuration file for easier integration in CMake projects.

Thanks to Travis and AppVeyor, binaries for the releases are now uploaded to GitHub. On all platforms we compile static and shared libraries. On Linux, builds for gcc 5, gcc 6, clang 3.8 and clang 3.9 are generated; on OS X, XCode 7 and XCode 8 builds are available as universal binaries; and on Windows, 32-bit and 64-bit builds with dynamic or static runtime (no shared libraries in the static case) are also generated. Due to Travis's odd configuration on macOS, the binaries there don't yet support Python 2 or Python 3.

Download link: ATK 2.1.0

Changelog:
2.1.0
* Added a config file for CMake
* Rewrote the Python wrappers to use pybind11 instead of SWIG
* Added MuteSoloSumFilter to allow mute/solo operations on tracks with Python wrappers
* SumFilter can now sum multiple channels together
* Added fourth-order Linkwitz-Riley filters
* Added a new circular buffer (for FFT plugins, for instance)
* Added parameters for tube (inverter) filter definitions
* Added Python wrappers in Travis-CI builds
* Added a modified implementation of the Munro-Piazza triode function to remove some artefacts

2.0.2
* Fix ARM compilation

2.0.1
* Turn set/get into properties when possible (Python wrapper)
* Enhanced Tools API (Audio ToolKit book)
* Added a Feedback Delay Network (FDN) filter with a Hadamard mixing matrix, with Python wrappers
* Fixed MultipleUniversalFixedDelayLineFilter parameters

Buy Me a Coffee!

by Matt at June 20, 2017 07:04 AM

June 15, 2017

Matthew Rocklin

Dask Release 0.15.0

This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.15.0. This release contains performance and stability enhancements as well as some breaking changes. This blogpost outlines notable changes since the last release on May 5th.

As always you can conda install Dask:

conda install dask distributed

or pip install from PyPI

pip install dask[complete] --upgrade

Conda packages are available both on the defaults and conda-forge channels.

Full changelogs are available here:

Some notable changes follow.

NumPy ufuncs operate as Dask.array ufuncs

Thanks to recent changes in NumPy 1.13.0, NumPy ufuncs now operate as Dask.array ufuncs. Previously they would convert their arguments into Numpy arrays and then operate concretely.

import dask.array as da
import numpy as np

x = da.arange(10, chunks=(5,))

# Before
>>> np.negative(x)
array([ 0, -1, -2, -3, -4, -5, -6, -7, -8, -9])

# Now
>>> np.negative(x)
dask.array<negative, shape=(10,), dtype=int64, chunksize=(5,)>

To celebrate this change we’ve also improved support for more of the NumPy ufunc and reduction API, such as support for out parameters. This means that a non-trivial subset of the actual NumPy API works directly out of the box with dask.arrays. This makes it easier to write code that seamlessly works with either array type.
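For instance, plain NumPy reductions called on a dask.array now return lazy Dask results instead of pulling everything into memory (a small sketch, not taken from the release notes themselves):

import dask.array as da
import numpy as np

x = da.random.random((10000, 10000), chunks=(1000, 1000))

col_means = np.mean(x, axis=0)   # still a lazy dask.array
total = np.sum(x)                # same here

col_means.compute()
total.compute()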

Note: the ufunc feature requires that you update NumPy to 1.13.0 or later. Packages are available through PyPI and conda on the defaults and conda-forge channels.

Asynchronous Clients

The Dask.distributed API is capable of operating within a Tornado or Asyncio event loop, which can be useful when integrating with other concurrent systems like web servers or when building more advanced algorithms in machine learning and other fields. The API to do this used to be somewhat hidden, known only to a few, and relied on underscore-prefixed methods to signify that they were asynchronous.

# Before
client = Client(start=False)
await client._start()

future = client.submit(func, *args)
result = await client._gather(future)

These methods are still around, but the process of starting the client has changed and we now recommend using the fully public methods even in asynchronous situations (these used to block).

# Now
client = await Client(asynchronous=True)

future = client.submit(func, *args)
result = await client.gather(future)  # no longer use the underscore

You can also await futures directly:

result = await future

You can use yield instead of await if you prefer Python 2.

More information is available at https://distributed.readthedocs.org/en/latest/asynchronous.html.

Single-threaded scheduler moves from dask.async to dask.local

The single-machine scheduler used to live in the dask.async module. With async becoming a keyword since Python 3.5 we’re forced to rename this. You can now find the code in dask.local. This will particularly affect anyone who was using the single-threaded scheduler, previously known as dask.async.get_sync. The term dask.get can be used to reliably refer to the single-threaded base scheduler across versions.
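A small sketch of what that looks like with the get= keyword of that era (newer Dask versions spell this scheduler= instead):

import dask
import dask.bag as db

b = db.from_sequence(range(8)).map(lambda i: i ** 2)

# dask.get is the single-threaded scheduler (formerly dask.async.get_sync),
# handy for debugging with pdb or reading tracebacks directly
result = b.compute(get=dask.get)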

Retired the distributed.collections module

Early blogposts referred to functions like futures_to_dask_array which resided in the distributed.collections module. These have since been entirely replaced by better interactions between Futures and Delayed objects. This module has been removed entirely.

Always create new directories with the --local-directory flag

Dask workers create a directory where they can place temporary files. Typically this goes into your operating system’s temporary directory (/tmp on Linux and Mac).

Some users on network file systems specify this directory explicitly with the dask-worker ... --local-directory option, pointing to some other better place like a local SSD drive. Previously Dask would dump files into the provided directory. Now it will create a new subdirectory and place files there. This tends to be much more convenient for users on network file systems.

$ dask-worker scheduler-address:8786 --local-directory /scratch
$ ls /scratch
worker-1234/
$ ls /scratch/worker-1234/
user-script.py disk-storage/ ...

Bag.map no longer automatically expands tuples

Previously the map method would inspect functions and automatically expand tuples to fill arguments:

import dask.bag as db
b = db.from_sequence([(1, 10), (2, 20), (3, 30)])

>>> b.map(lambda x, y: x + y).compute()
[11, 22, 33]

While convenient, this behavior gave rise to corner cases and stopped us from being able to support multi-bag mapping functions. It has since been removed. As an advantage though, you can now map two co-partitioned bags together.

a = db.from_sequence([1, 2, 3])
b = db.from_sequence([10, 20, 30])

>>> db.map(lambda x, y: x + y, a, b).compute()
[11, 22, 33]

Styling

Clients and Futures have nicer HTML reprs that show up in the Jupyter notebook.

And the dashboard stays a decent width and has a new navigation bar with links to other dashboard pages. This template is now consistently applied to all dashboard pages.

Multi-client coordination

More primitives to help coordinate between multiple clients on the same cluster have been added. These include Queues and shared Variables for futures.
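A rough sketch of how these primitives can be used; the scheduler address and the queue/variable names are placeholders.

from dask.distributed import Client, Queue, Variable

client = Client('scheduler-address:8786')

# A named queue lets one client hand futures off to another
work = Queue('work-items')
future = client.submit(lambda x: x + 1, 10)
work.put(future)

# A named variable holds a single shared value (here, a future)
best = Variable('best-so-far')
best.set(future)

# A second client connected to the same scheduler can then call
# Queue('work-items').get() or Variable('best-so-far').get()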

Joblib performance through pre-scattering

When using Dask to power Joblib computations (such as occur in Scikit-Learn) with the joblib.parallel_backend context manager, you can now pre-scatter select data to all workers. This can significantly speed up some scikit-learn computations by reducing repeated data transfer.

import distributed.joblib
from sklearn.externals.joblib import parallel_backend

# Serialize the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
      search.fit(digits.data, digits.target)

Other Array Improvements

  • Filled out the dask.array.fft module
  • Added a basic dask.array.stats module with functions like chisquare
  • Support for the @ matrix multiply operator (see the sketch below)
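A brief sketch exercising two of these additions (array sizes and chunking are arbitrary):

import dask.array as da

x = da.random.random((1000, 1000), chunks=(500, 500))
y = da.random.random((1000, 1000), chunks=(500, 500))

z = x @ y                # the @ operator now maps onto Dask's blocked matrix multiply
z.compute()

signal = da.random.random(2 ** 16, chunks=2 ** 16)   # single chunk along the FFT axis
spectrum = da.fft.fft(signal)
spectrum.compute()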

General performance and stability

As usual, a number of bugs were identified and resolved and a number of performance optimizations were implemented. Thank you to all users and developers who continue to help identify and implement areas for improvement. Users should generally have a smoother experience.

Removed ZMQ networking backend

We have removed the experimental ZeroMQ networking backend. This was not particularly useful in practice. However it was very effective in serving as an example while we were making our network communication layer pluggable with different protocols.

The following related projects have also been released recently and may be worth updating:

  • NumPy 1.13.0
  • Pandas 0.20.2
  • Bokeh 0.12.6
  • Fastparquet 0.1.0
  • S3FS 0.1.1
  • Cloudpickle 0.3.1 (pip)
  • lz4 0.10.0 (pip)

Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.3 release on May 5th:

  • Antoine Pitrou
  • Elliott Sales de Andrade
  • Ghislain Antony Vaillant
  • John A Kirkham
  • Jim Crist
  • Joseph Crail
  • Juan Nunez-Iglesias
  • Julien Lhermitte
  • Martin Durant
  • Matthew Rocklin
  • Samantha Hughes
  • Tom Augspurger

The following people contributed to the dask/distributed repository since the 1.16.2 release on May 5th:

  • A. Jesse Jiryu Davis
  • Antoine Pitrou
  • Brett Naul
  • Eugene Van den Bulke
  • Fabian Keller
  • Jim Crist
  • Krisztián Szűcs
  • Matthew Rocklin
  • Simon Perkins
  • Thomas Arildsen
  • Viacheslav Ostroukh

June 15, 2017 12:00 AM

June 14, 2017

Continuum Analytics news

Here Comes The Data Science—And It’s All Right

Wednesday, June 14, 2017
Travis Oliphant
President, Chief Data Scientist & Co-Founder

Did you know that 94 percent of enterprises are using open source technologies for Data Science, and 96 percent of company executives say Data Science is critical to the success of their business?   We uncovered these statistics when surveying several hundred executives and data scientists to gain a better understanding of the state of Data Science in today’s organizations. Clearly, Data Science is becoming more popular by the minute—but, what tools and platforms are companies specifically using?
 
Many people start with Anaconda. As the leading and fastest-growing Open Data Science platform powered by Python, it has been downloaded more than 13 million times and had approximately two million active users in 2016—and this number is rapidly increasing!
 
Another proof of Data Science’s hockey stick growth curve is the 18th annual KDnuggets Software Poll, which shows Anaconda usage has increased by 37 percent from 2016—giving it the second highest growth rate in the poll. Furthermore, Anaconda earned a top 10 ranking for the industry’s most popular analytics/Data Science tools, with nearly a quarter of the KDnuggets’ survey respondents using the Open Data Science platform. In addition, Python surpassed R as the most popular Data Science language (52.6 percent of respondents use the language), a trend that will likely continue through the next few years, especially given its particular suitability to Artificial Intelligence and Machine Learning.

 
The growth numbers point to an undeniable market need for Data Science technologies that can, and are, delivering tremendous insights and impactful business outcomes. By using Python, Anaconda and other Data Science technologies, organizations across dozens of industries will be able to identify patterns, uncover crucial insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems—such as predicting the effects of public policy, curing rare genetic diseases and even discovering new planets. This is only the beginning of the power of Data Science.

 

by swebster at June 14, 2017 03:56 PM

June 13, 2017

numfocus

NumPy receives first ever funding, thanks to Moore Foundation

For the first time ever, NumPy—a core project for the Python scientific computing stack—has received grant funding. The proposal, “Improving NumPy for Better Data Science” will receive $645,020 from the Moore Foundation over 2 years, with the funding going to UC Berkeley Institute for Data Science. The principal investigator is Dr. Nathaniel Smith. NumFOCUS congratulates Nathaniel and all […]

by Gina Helfrich at June 13, 2017 02:00 PM

Matthieu Brucher

Audio ToolKit: RIAA correction curves

Vinyl has become trendy again, and as such, I’ve been asked to add some new filters in Audio ToolKit. Here is a small dive in RIAA land.

More than meets the eye

The RIAA playback filter compensates for the attenuated bass and boosted treble on a vinyl record (of course, the corresponding pre-mastering filter for vinyl is also now available in Audio ToolKit, on the develop branch). The filter is quite simple: it is a second-order low-pass filter with known knees. There are three time constants, t_1 = 75\cdot10^{-6}s, t_2 = 318\cdot10^{-6}s and t_3 = 3180\cdot10^{-6}s. These are then used in the following continuous-time transfer function:

H(s) = \dfrac{t_2}{t_1}\dfrac{(1+st_2)}{(1+st_1)(1+st_3)}

The trick to convert this into discrete time is to first prewarp these time constants with the following equation: t_d = \dfrac{1}{f_s \tan(\pi / (t_a f_s))}, with f_s the sampling frequency and t_a the original analog time constant. Only then do you get a matching curve:

RIAA correction curve

If you compare this with Wikipedia’s curve, you will see that it doesn’t match mine: the gain for lower frequencies differs, and at 1 kHz the gain is not 0 dB. And if you try to force 0 dB at 1 kHz, you end up with an even lower gain! Why? Good question. Yet these are the constants given by the association and the gain that is supposed to be used.

RIAA is not simple. It’s not even consistent! But at least it gives a good first correction curve for vinyl, one that can be refined by adding additional EQ filters.
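To make the construction concrete, here is a minimal Python/SciPy sketch (not Audio ToolKit code) that builds the analog prototype from the time constants above and discretizes it with a plain bilinear transform; note that scipy.signal.bilinear does not prewarp the time constants, which is exactly the kind of mismatch discussed above. The sampling rate is an assumption.

import numpy as np
from scipy import signal

fs = 44100.0                                  # assumed sampling rate
t1, t2, t3 = 75e-6, 318e-6, 3180e-6

# Analog prototype from above: H(s) = (t2/t1) (1 + s t2) / ((1 + s t1)(1 + s t3))
b_s = [t2 * t2 / t1, t2 / t1]                 # numerator, descending powers of s
a_s = [t1 * t3, t1 + t3, 1.0]                 # denominator, descending powers of s

# Plain bilinear transform (no prewarping), one source of deviation near Nyquist
b_z, a_z = signal.bilinear(b_s, a_s, fs)

w, h = signal.freqz(b_z, a_z, worN=2048)
freqs = w * fs / (2 * np.pi)                  # radians/sample -> Hz
gain_db = 20 * np.log10(np.abs(h))            # correction curve in dB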

Buy Me a Coffee!

by Matt at June 13, 2017 07:16 AM

June 09, 2017

numfocus

NumFOCUS adds Shogun Machine Learning Toolbox to Sponsored Projects

NumFOCUS is delighted to announce the addition of the Shogun Machine Learning Toolbox to our fiscally sponsored projects. Shogun’s mission is to make powerful machine learning tools available to everyone —researchers, engineers, students — anyone curious to experiment with machine learning to leverage data. The Shogun Machine Learning Toolbox provides efficient implementations of standard and state-of-the-art machine […]

by Gina Helfrich at June 09, 2017 03:00 PM

June 06, 2017

Continuum Analytics news

Using Anaconda to Embrace Python 3 And Support Python 2

Tuesday, June 6, 2017
Ian Stokes-Rees
Continuum Analytics

The data science community received a special delivery in December with the release of Python 3.6. At the time, I had a conversation with The New Stack to discuss what’s new with this release, and why we at Continuum see 2017 as the year that Python 3 is beginning to dominate the data science landscape, with major adoption in the enterprise space. I’m excited that today we’re extending that story with the most recent version 4.4 release of Anaconda, available for both Python 2.7 and Python 3.6. Anaconda downloads have skyrocketed in the past six months, with more than a million downloads per month. Besides delivering a comprehensive platform for Python-centric data science with a single-click installer for Windows, Mac, Linux, and Power8, Anaconda 4.4 is also designed to make it easy to work with both Python 2 and Python 3 code.  If you haven’t already started using Python 3, there’s no better time than today. 

Why Python 3.6?

When Python 3.6 launched, I was excited to see this tweet from legendary Python core developer and former Python Software Foundation board member, Raymond Hettinger, as it expressed my sentiments exactly:

 

 

To back up a minute: in 2016, Python 3.5 proved itself among the established Python community, and word got out that Python 3 was good to go. This made Python 3.6 the first version of the language that was delivered to a maturing base of users, poising it for a prime growth opportunity. Having just returned from PyCon 2017 in Portland, Oregon—the flagship event for the Python community—I can tell you that Python 2 is mostly a footnote, and the buzz is largely around Python 3. With just over three years until Python 2 support is officially eliminated by the Python Software Foundation, and with the scientific Python community publishing coordinated timelines for when Python 2 support will be suspended for many popular libraries, it is reasonable that there’s a widespread move to Python 3 happening now. This is great news because Python 3, and especially Python 3.6 (as Raymond alluded to in his tweet), offers many great new features and performance enhancements.

What’s New?

Of the 200 tools and libraries that Anaconda provides, there are dozens designed to help with co-development of Python 2 and Python 3 code. This means that many of the new Python 3 standard library capabilities are available through backported implementations that are bundled in Anaconda for Python 2. Enterprises that are still committed to maintaining their legacy Python 2 code-base can benefit from these features in advance of migrating to Python 3. Furthermore, Anaconda’s software sandboxing system means it is easy to run Python 2 and Python 3 in tandem on the same system, supporting a gradual migration and avoiding the need for any “big bang” cutover. For many, these capabilities alone are enough of a reason to use Anaconda. If you’re looking to migrate people and software to Python 3, then I’d recommend Lennart Regebro’s Python 3 Porting website and the shorter PSF HOWTO on Porting to Python 3 by Microsoft engineer Brett Cannon.

Anaconda 4.4 also ships with the Intel Math Kernel Libraries, providing a substantial performance boost for the 20+ Python libraries that are compiled to leverage these optimized routines on several generations of Intel processors. NumPy, SciPy, and Scikit-Learn, and all the libraries that build on them, benefit from this.

What’s Next?

We’re advising our clients to see Python 3.6 as a “reference release” that should be adopted for any new Python-based projects. 

Python 3.6 offers a more stable version of the language that enhances some core concurrency capabilities around a concept known as “coroutines.” These provide new language constructs for the creation of asynchronous, and possibly parallel, functions. This is an area that is obviously important in the world of multi-core CPUs, increasing demands for computational power, and the leveling off of peak processor speeds. Python 3 also offers fundamental improvements to the core data structures, in particular dictionaries, on which the language is practically built. Python dictionaries are often known as “hash-maps” in other languages.
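For readers new to coroutines, here is a small self-contained illustration using only the standard library (nothing Anaconda-specific):

import asyncio

async def fetch(delay, value):
    # "async def" declares a coroutine; "await" hands control back to the event loop
    await asyncio.sleep(delay)
    return value

async def main():
    # Run two coroutines concurrently on a single thread
    return await asyncio.gather(fetch(0.2, 1), fetch(0.1, 2))

loop = asyncio.get_event_loop()
print(loop.run_until_complete(main()))   # [1, 2]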

The piece I’m most excited about, however, is the introduction of type annotations that can be used to provide a degree of type checking. I believe this will take Python to a new level for enterprise adoption and introduce possibilities for interface design and program behavior that have been tricky until now. At Continuum, we’re looking forward to leveraging these in our Numba Just-In-Time compiler for Python.
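As a small illustration of what such annotations look like (a generic example, not tied to Numba):

from typing import List, Optional

def moving_average(values: List[float], window: int = 3) -> List[float]:
    """Simple moving average; the hints are checkable by tools like mypy."""
    result: List[float] = []          # variable annotations are new in Python 3.6
    for i in range(len(values) - window + 1):
        result.append(sum(values[i:i + window]) / window)
    return result

threshold: Optional[float] = None     # hints only; nothing is enforced at runtime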

Join the Conversation

If you’re not already using Anaconda I’d encourage you to download it today. If you’re already an Anaconda user then you can update to the latest version with:

conda update conda anaconda

If you’re using Anaconda Navigator, update the “anaconda” package. If you want to get started with Python 3, then I’d recommend an O’Reilly booklet written by a colleague of mine, David Mertz, that describes the benefits and how to migrate to Python 3. It is called Picking a Python Version: A Manifesto

I’d love to hear your thoughts and comments on Python 3 and Anaconda. Catch me on Twitter: @ijstokes.

by swebster at June 06, 2017 07:05 PM

Matthieu Brucher

Audio ToolKit: Mailing list for beta testers

Recently, I’ve struggled to release flawless plugins: there were some glitches in the last two plugins that could easily have been avoided with more testing.

So this is a call to people who are interested in my plugins: join the mailing list and share your ideas about the current plugins and the future ones, about what could be made better and how…

by Matt at June 06, 2017 07:59 AM

June 05, 2017

Enthought

SciPy 2017 Conference to Showcase Leading Edge Developments in Scientific Computing with Python

Renowned scientists, engineers and researchers from around the world to gather July 10-16, 2017 in Austin, TX to share and collaborate to advance scientific computing tools


AUSTIN, TX – June 6, 2017 –
Enthought, as Institutional Sponsor, today announced the SciPy 2017 Conference will be held July 10-16, 2017 in Austin, Texas. At this 16th annual installment of the conference, scientists, engineers, data scientists and researchers will participate in tutorials, talks and developer sprints designed to foster the continued rapid growth of the scientific Python ecosystem. This year’s attendees hail from over 25 countries and represent academia, government, national research laboratories, and industries such as aerospace, biotechnology, finance, oil and gas and more.

“Since 2001, the SciPy Conference has been a highly anticipated annual event for the scientific and analytic computing community,” states Dr. Eric Jones, CEO at Enthought and SciPy Conference co-founder. “Over the last 16 years we’ve witnessed Python emerge as the de facto open source programming language for science, engineering and analytics with widespread adoption in research and industry. The powerful tools and libraries the SciPy community has developed are used by millions of people to advance scientific inquest and innovation every day.”

Special topical themes for this year’s conference are “Artificial Intelligence and Machine Learning Applications” and the “Scientific Python (SciPy) Tool Stack.” Keynote speakers include:

  • Kathryn Huff, Assistant Professor in the Department of Nuclear, Plasma, and Radiological Engineering at the University of Illinois at Urbana-Champaign  
  • Sean Gulick, Research Professor at the Institute for Geophysics at the University of Texas at Austin
  • Gaël Varoquaux, faculty researcher in the Neurospin brain research institute at INRIA (French Institute for Research in Computer Science and Automation)

In addition to the special conference themes, there will also be over 100 talk and poster paper speakers/presenters covering eight mini-symposia tracks including: Astronomy; Biology, Biophysics, and Biostatistics; Computational Science and Numerical Techniques; Data Science; Earth, Ocean, and Geo Sciences; Materials Science and Engineering; Neuroscience; and Open Data and Reproducibility.

New for 2017 is a sold-out “Teen Track,” a two-day curriculum designed to inspire the scientists of tomorrow.  From July 10-11, high school students will learn more about the Python language and how developers solve real world scientific problems using Python and its scientific libraries.

Conference and tutorial registration is open at https://scipy2017.scipy.org.

About the SciPy Conference

SciPy 2017, the sixteenth annual Scientific Computing with Python conference, will be held July 10-16, 2017 in Austin, Texas. SciPy is a community dedicated to the advancement of scientific computing through open source Python software for mathematics, science and engineering. The annual SciPy Conference allows participants from all types of organizations to showcase their latest projects, learn from skilled users and developers and collaborate on code development. For more information or to register, visit https://scipy2017.scipy.org.

About Enthought

Enthought is a global leader in scientific and analytic software, consulting and training solutions serving a customer base comprised of some of the most respected names in the oil and gas, manufacturing, financial services, aerospace, military, government, biotechnology, consumer products and technology industries. The company was founded in 2001 and is headquartered in Austin, Texas, with additional offices in Cambridge, United Kingdom and Pune, India. For more information visit www.enthought.com and connect with Enthought on Twitter, LinkedIn, Google+, Facebook and YouTube.

 

 

The post SciPy 2017 Conference to Showcase Leading Edge Developments in Scientific Computing with Python appeared first on Enthought Blog.

by admin at June 05, 2017 06:18 PM

June 03, 2017

Paul Ivanov

June 1st, 2017

We had another biannual Jupyter team meeting this week, this time it was right nearby in Berkeley. Since I had read a poem at the last meeting, I was encouraged to keep that going and decided to make this a tradition. Here's the result, as delivered this past Friday, recorded by Fernando Pérez (thanks, Fernando!).

June 1st, 2017

We struggle -- with ourselves and with each other
we plan -- we code and write
the pieces and ideas t'wards what we think is right
but we may disagree -- about
the means, about the goals, about the
shoulders we should stand on --
where we should stand, what we should stretch toward
shrink from, avoid, embrace --
a sense of urgency - but this is not a race

There's much to learn, to do...

Ours not the only path, no one coerced you here
You chose this -- so did I and here we are --
still at the barricades and gaining ground
against the old closed world:
compute communication comes unshackled

by Paul Ivanov at June 03, 2017 07:00 AM

June 02, 2017

Randy Olson

TPOT Automated Machine Learning Competition

Can AutoML beat humans on Kaggle? Automated Machine Learning (AutoML) is poised to make a transformative impact on data science in 2017. At the University of Pennsylvania, we’ve been working hard to develop TPOT, a state-of-the-art open source AutoML tool

by Randy Olson at June 02, 2017 05:54 PM

June 01, 2017

William Stein

RethinkDB must relicense NOW

What is RethinkDB?

UPDATE:  Several months after I wrote this post, RethinkDB was relicensed.  For the CoCalc project, it was too late, and by then we had already switched to PostgreSQL


RethinkDB is an INCREDIBLE, high-quality, polished open source realtime database that is easy to deploy, shard, and replicate, and it supports a reactive client programming model, which is useful for collaborative web-based applications. Shockingly, the 7-year-old company that created RethinkDB has just shut down. I am the CEO of a company, SageMath, Inc., that uses RethinkDB very heavily, so I have a strong interest in RethinkDB surviving as an independent open source project.

Three Types of Open Source Projects

There are many types of open source projects. RethinkDB was the type of open source project where most of the work has been full-time, focused work done by employees of the RethinkDB company. RethinkDB is licensed under the AGPL, but the company promised to make the software available to customers under other licenses.

Academia: I started the SageMath open source math software project in 2005, which has over 500 contributors, and a relatively healthy volunteer ecosystem, with about a hundred contributors to each release, and many releases each year. These are mostly volunteer contributions by academics: usually grad students, postdocs, and math professors. They contribute because SageMath is directly relevant to their research, and they often contribute state of the art code that implements algorithms they have created or refined as part of their research. Sage is licensed under the GPL, and that license has worked extremely well for us. Academics sometimes even get significant grants from the NSF or the EU to support Sage development.

Companies: I also started the Cython compiler project in 2007, which has had dozens of contributors and is now the defacto standard for writing or wrapping fast code for use by Python. The developers of Cython mostly work at companies (e.g., Google) as a side project in their spare time. (Here's a message today about a new release from a Cython developer, who works at Google.) Cython is licensed under the Apache License.

What RethinkDB Will Become

RethinkDB will no longer be an open source project whose development is sponsored by a single company dedicated to the project. Will it be an academic project, a company-supported project, or dead?

A friend of mine at Oxford University surveyed his academic CS colleagues about RethinkDB, and they said they had zero interest in it. Indeed, from an academic research point of view, I agree that there is nothing interesting about RethinkDB. I myself am a college professor, and understand these people! Academic volunteer open source contributors are definitely not going to come to RethinkDB's rescue. The value in RethinkDB is not in the innovative new algorithms or ideas, but in the high quality carefully debugged implementations of standard algorithms (largely the work of bad ass German programmer Daniel Mewes). The RethinkDB devs had to carefully tune each parameter in those algorithms based on extensive automated testing, user feedback, the Jepsen tests, etc.

That leaves companies. Whether or not you like or agree with this, many companies will not touch AGPL licensed code:
"Google open source guru Chris DiBona says that the web giant continues to ban the lightning-rod AGPL open source license within the company because doing so "saves engineering time" and because most AGPL projects are of no use to the company."
This is just the way it is -- it's psychology and culture, so deal with it. In contrast, companies very frequently embrace open source code that is licensed under the Apache or BSD licenses, and they keep such projects alive. The extremely popular PostgreSQL database is licensed under an almost-BSD license. MySQL is freely licensed under the GPL, but there are good reasons why people buy a commercial MySQL license (from Oracle) for MySQL. Like RethinkDB, MongoDB is AGPL licensed, but they are happy to sell a different license to companies.

With RethinkDB today, the only option is the AGPL. This very strongly discourages use by the only group of users and developers that has any chance of keeping RethinkDB from dying. If this situation is not resolved as soon as possible, I am extremely afraid that it never will be resolved. Ever. If you care about RethinkDB, you should be afraid too. Ignoring the landscape and culture of volunteer open source projects is dangerous.

A Proposal

I don't know who can make the decision to relicense RethinkDB. I don't know what is going on with investors or who is in control. I am an outsider. Here is a proposal that might provide a way out today:

PROPOSAL: Dear RethinkDB, sell me an Apache (or BSD) license to the RethinkDB source code. Make this the last thing your company sells before it shuts down. Just do it.


Hacker News Discussion

by William Stein (noreply@blogger.com) at June 01, 2017 01:25 PM

May 30, 2017

numfocus

Python in Astronomy 2017

#pyastro17 brought together 54 participants from five continents for a wide-ranging workshop on Python in Astronomy.

by Gina Helfrich at May 30, 2017 06:01 PM

Matthieu Brucher

Announcement: ATKGuitarPreamp 1.0.0

I’m happy to announce the release of a model of the Vox AC30 preamplifier stage followed by a JCM800 tone stack, based on the Audio Toolkit. It is available on Windows and OS X (min. 10.9) in different formats.

ATKGuitarPreamp

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKGuitarPreamp.

The files, as well as the previous plugins and the source code, can be downloaded from SourceForge.

Buy Me a Coffee!

by Matt at May 30, 2017 07:51 AM

May 26, 2017

Enthought

Enthought at National Instruments’ NIWeek 2017: An Inside Look

This week I had the distinct privilege of representing Enthought at National Instruments‘ 23rd annual user conference, NIWeek 2017. National Instruments is a leader in test, measurement, and control solutions, and we share many common customers among our global scientific and engineering user base.

NIWeek kicked off on Monday with Alliance Day, where my colleague Andrew Collette and I went on stage to receive the LabVIEW Tools Network 2017 Product of the Year Award for Enthought’s Python Integration Toolkit, which provides a bridge between Python and LabVIEW, allowing you to create VI’s (virtual instruments) that make Python function and object method calls. Since its release last year, the Python Integration Toolkit has opened up access to a broad range of new capabilities for LabVIEW users,  by combining the best of Python with the best of LabVIEW. It was also inspiring to hear about the advances being made by other National Instruments partners. Congratulations to the award winners in other categories (Wineman Technology, Bloomy, and Moore Good Ideas)!

On Wednesday, Andrew gave a presentation titled “Building and Deploying Python-Powered LabVIEW Applications” to a standing-room only crowd.  He gave some background on the relative strengths of Python and LabVIEW (some of which is covered in our March 2017 webinar “Using Python and LabVIEW to Rapidly Solve Engineering Problems“) and then showcased some of the capabilities provided by the toolkit, such as plotting data acquisition results live to a web server using plotly, which is always a crowd-pleaser (you can learn more about that in the blog post “Using Plotly from LabVIEW via Python”).  Other demos included making use of the Python scikit-learn library for machine learning, (you can see Enthought’s CEO Eric Jones run that demo here, during the 2016 NIWeek keynotes.)

For a mechanical engineer like me, attending NIWeek is a bit like giving a kid a holiday in a candy shop.  There was much to admire on the expo floor, with all kinds of mechatronic gizmos and gadgets.  I was most interested by the lightning-fast video and image processing possible with NI’s FPGA systems, like the part sorting system shown below.  Really makes me want to play around with nifpga.

Another thing really gaining traction is the implementation of machine learning for a number of applications. I attended one talk titled “Deep Learning With LabVIEW and Acceleration on FPGAs” that demonstrated image classification using a neural network and talked about strategies to reduce the code size to get it to fit on an FPGA.

Finally, of course, I was really excited by all of the activity in the Industrial Internet of Things (IIoT), which is an area of core focus for Enthought.  We have been in the big data analytics game for a long time, and writing software for hard science is in our company DNA. But this year especially, starting with the AIChE 2017 Spring Meeting and now at NIWeek 2017, it has been really energizing to meet with industry leaders and see some of the amazing things that are being implemented in the IIoT.  National Instruments has been a leader in the test and measurement sector for a long time, and they have been pioneers in IIoT.  Now it is easy to download and install an interface to Amazon S3 for LabVIEW, and just like that, your sensor is now a connected sensor … and your data is ready for analysis in Enthought’s Canopy Data platform.

After immersion in NIWeek, I guess you could say, I’ve been “LabVIEWed”:

The post Enthought at National Instruments’ NIWeek 2017: An Inside Look appeared first on Enthought Blog.

by Tim Diller at May 26, 2017 08:44 PM

Continuum Analytics news

Let’s Talk PyCon 2017 - Thoughts from the Anaconda Team

Friday, May 26, 2017
Peter Wang
Chief Technology Officer & Co-Founder

We’re not even halfway through the year, but 2017 has already been filled to the brim with dynamic presentations and action-packed conferences. This past week, the Anaconda team was lucky enough to attend PyCon 2017 in Portland, OR - the largest annual gathering for the community that uses and develops Python. We came, we saw, we programmed, we networked, we spoke, we ate, we laughed, and we learned. Myself and some of our team members at the conference shared some details on their experiences - take a look and, if you attended, share your thoughts in the comment section below, or on Twitter @ContinuumIO

Did anything surprise you at PyCon? 

“I was surprised how many attendees were using Python for data. I missed last year's PyCon, and so comparing against PyCon 2015, there was a huge growth in the last two years. During Katy Huff's keynote, she asked how many people in the audience had degrees in science, and something like 40% of the people raised their hands. In the past, this was not the case - PyCon had a lot more "traditional" software developers.” - Peter Wang, CTO & co-founder, Anaconda

“Yes - how diverse the community is. Looking at the session topics provides an indicator about this, but having had somewhere between 60-80 interactions at the Anaconda booth, there was a huge range of discussions all the way from "Tell me more about data science" to "I've been using Anaconda for years and am a huge fan" or "conda saved my life.” I also saw a huge range of roles and backgrounds in attendees, from enterprise, government, military, and academia to students and independent consultants. It was great to see a number of large players here: Facebook/Instagram, LinkedIn, Microsoft, Google, and Intel were all highly visible, supporting the community.” - Stephen Kearns, Product Marketing Manager, Anaconda

“What really struck me this year was how heavy the science and data science angles were from speakers, topics, exhibitors, and attendees.  The Thursday and Friday morning keynotes were Science + Python (Jake Vanderplas and Katy Huff), then the Sunday closing keynote was about containers and Kubernetes (Kelsey Hightower).” - Ian Stokes-Rees, Computational Scientist, Anaconda 

What was the most popular topic people were buzzing about? Was this surprising to you? 

“There's definitely a good feeling about the transition to Python 3 really happening, which has been a point of angst in the Python community for several years. To me, the sense of closure around this was palpable, in that people could spend their emotional energy talking about other things and not griping about ‘Python 2 vs. 3.’” - Peter Wang

“The talks! So great to see how fast the videos for the talks were getting posted.” - Stephen Kearns 

Did you attend any talks? Did any of them stand out? 

“Jake Vanderplas presented a well-researched and well-structured talk on the Python visualization landscape. The keynotes were all excellent. I appreciated the Instagram folks sharing their Python 3 migration story with everyone.” - Peter Wang

“There were some at-capacity tutorials by me on “Data Science Apps with Anaconda,” showing off our new Anaconda Project deployment capability and “Accelerating your Python Data Science code with Dask and Numba.” - Ian Stokes-Rees

How was the buzz around Anaconda at PyCon? 

“Awesome - we exhausted our entire supply of Anaconda Crew T-Shirts by the end of the second day. A conference first!” - Ian Stokes-Rees 

“It was great, and very positive. Lots of people were very interested in our various open source projects, but we also got a lot of interest from attendees in our enterprise offerings: commercially-supported Anaconda, our premium training, and the Anaconda Enterprise Data Science platform. In previous years, there were not as many people who I would characterize as "potential customers,” and this was a very positive change for us. I also think that it is a sign that the PyCon attendee audience is also changing, to include more people from the data science and machine learning ecosystem.” - Peter Wang

“Anaconda had lots of partnership engagement opportunities at the show, specifically with Intel, Microsoft and ESRI. It was exciting to hear Intel talk about how they’re using Anaconda as the channel for delivering optimized high performance Python, and great to see Microsoft giving SQL Server demonstrations of server-side Python using Anaconda. Lastly, great to hear that ESRI is increasing its Python interfaces to ArcGIS and have started to make the ArcGIS Python package available as a conda package from Anaconda Cloud.” - Ian Stokes-Rees

 

by swebster at May 26, 2017 04:58 PM

Jake Vanderplas

Exposing Python 3.6's Private Dict Version


I just got home from my sixth PyCon, and it was wonderful as usual. If you weren't able to attend—or even if you were—you'll find a wealth of entertaining and informative talks on the PyCon 2017 YouTube channel.

Two of my favorites this year were a complementary pair of talks on Python dictionaries by two PyCon regulars: Raymond Hettinger's "Modern Python Dictionaries: A confluence of a dozen great ideas" and Brandon Rhodes' "The Dictionary Even Mightier" (a follow-up to his PyCon 2010 talk, "The Mighty Dictionary").

Raymond's is a fascinating dive into the guts of the CPython dict implementation, while Brandon's focuses more on recent improvements in the dict's user-facing API. One piece both mention is the addition in Python 3.6 of a private dictionary version to aid CPython optimization efforts. In Brandon's words:

"PEP509 added a private version number... every dictionary has a version number, and elsewhere in memory a master version counter. And when you go and change a dictionary the master counter is incremented from a million to a million and one, and that value a million and one is written into the version number of that dictionary. So what this means is that you can come back later and know if it's been modified, without reading maybe its hundreds of keys and values: you just look and see if the version has increased since the last time you were there."

He later went on to say,

"[The version number] is internal; I haven't seen an interface for users to get to it..."

which, of course, I saw as an implicit challenge. So let's expose it!
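As a hedged illustration of what "exposing it" could look like (a minimal sketch of the general idea, not necessarily the exact code from the full post), one way to peek at the version tag from pure Python is to mirror the dict's C struct with ctypes. The field layout below is an assumption based on a standard, non-debug CPython 3.6 build:

```python
import ctypes
import sys

# Assumption: a standard (non-debug) CPython 3.6 build, whose PyDictObject
# starts with PyObject_HEAD followed by ma_used, ma_version_tag (PEP 509),
# ma_keys, and ma_values.
assert sys.version_info[:2] == (3, 6), "layout sketched for CPython 3.6 only"

class PyDictObject(ctypes.Structure):
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),      # PyObject_HEAD
        ("ob_type", ctypes.c_void_p),
        ("ma_used", ctypes.c_ssize_t),        # number of items in the dict
        ("ma_version_tag", ctypes.c_uint64),  # the private version number
        ("ma_keys", ctypes.c_void_p),
        ("ma_values", ctypes.c_void_p),
    ]

def dict_version(d):
    """Return the private version tag of a dict by reading its C-level struct."""
    return PyDictObject.from_address(id(d)).ma_version_tag

d = {"spam": 1}
before = dict_version(d)
d["eggs"] = 2            # any mutation stamps the dict with a newer version
assert dict_version(d) > before
```

Because the version is bumped on every mutation, a cache only needs to compare a stored tag against the current one instead of re-reading all the keys and values, which is exactly the optimization use case PEP 509 targets.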

by Jake VanderPlas at May 26, 2017 04:00 PM

May 24, 2017

Filipe Saraiva

LaKademy 2017

LaKademy 2017 group photo

Some weeks ago we had the fifth edition of the KDE Latin-America summit, LaKademy. Since the first edition, the KDE community in Latin America has grown, and we now have several developers, translators, artists, promoters, and other people from the region involved in KDE activities.

This time LaKademy was held in Belo Horizonte, a nice city known for its amazing cachaça, cheese, homemade beers, cheese, hills, and, of course, cheese. The city is very cosmopolitan, with many options for activities and gastronomy, and the people are friendly. I would like to go back to Belo Horizonte, maybe on my next vacation.

LaKademy activities were held at CEFET, a technological education institute. During the days of LaKademy there were political demonstrations and a general strike in the country, a consequence of the current political crisis here in Brazil. Although I support the demonstrations, I was in Belo Horizonte for the event, so I focused on my tasks while, in my mind, I was side by side with the workers on the streets.

Like in past editions, I worked a lot on Cantor, the mathematical software I maintain. This time the main tasks were an extensive set of reviews: revising pending patches, going through the bug tracker to close very old (and invalid) reports, and cleaning up the task management workboard, especially to ping developers about old tasks without any comment in the last year.

There was some work to implement new features as well. I finished a backend refactoring to provide a recommended version of the programming language for each backend in Cantor. Because each programming language has its own planning and schedule, it is common for a language version not to be correctly supported by a Cantor backend (Sage, I am thinking of you). This feature presents a “recommended” version of the programming language supported by each Cantor backend, meaning that version has been tested and will work correctly with Cantor. It is more of a workaround to maintain the sanity of the developer while trying to support 11 different programming languages.

Another feature I worked on, though it is not finished yet, is an option to select among different LaTeX processors in Cantor. Currently there are several LaTeX processors available (like pdflatex, pdftex, luatex, xetex, …), some of them with additional features. This option will increase the versatility of Cantor and allow the use of modern processors and their features in the software.

In addition to these tasks, I fixed some bugs and helped Fernando Telles, my past SoK student, with some tasks in Cantor.

(Like in past editions)², at LaKademy 2017 I also worked on another set of tasks related to the management and promotion of KDE Brazil. I investigated how to bring back our unified feed of Brazilian blog posts, as in the old Planet KDE Português, used to send updates about KDE in Brazil to our social networks. Fred implemented the solution, so I updated this feed on our social networks, updated the contact e-mail used on those networks, and started a Bootstrap version of the LaKademy website (but the team is migrating to WordPress, so I think it will not be used). I also did a large review of the tasks on the KDE Brazil workboard, migrated last year from the TODO website. Besides all this, we had the promo meeting to discuss our actions in Latin America – all the tasks were documented on the workboard.

Of course, just as we worked intensely during those days, we also had a lot of fun between one push and another. LaKademy is also an opportunity to meet old friends and make new ones. It is amazing to see the KDE fellows again, and I invite newcomers to stay with us and come to the next LaKademy editions!

This year we had a problem that we must address in the next edition – all the participants were Brazilians. We need to think about how to integrate people from other Latin American countries into LaKademy. It would be bad if the event became only an Akademy-BR.

Filipe and Chicão

So I send my greetings to the community and commit myself to keep working to grow Latin America into an important player in the development and future of KDE.

by Filipe Saraiva at May 24, 2017 08:37 PM

Enthought

Enthought Receives 2017 Product of the Year Award From National Instruments LabVIEW Tools Network

Python Integration Toolkit for LabVIEW recognized for extending LabVIEW connectivity and bringing the power of Python to applications in Test, Measurement and the Industrial Internet of Things (IIoT)

AUSTIN, TX – May 24, 2017 – Enthought, a global leader in scientific and analytic computing solutions, was honored this week by National Instruments with the LabVIEW Tools Network Platform Connectivity 2017 Product of the Year Award for its Python Integration Toolkit for LabVIEW.

First released at NIWeek 2016, the Python Integration Toolkit enables fast, two-way communication between LabVIEW and Python. With seamless access to the Python ecosystem of tools, LabVIEW users are able to do more with their data than ever before. For example, using the Toolkit, a user can acquire data from test and measurement tools with LabVIEW, perform signal processing or apply machine learning algorithms in Python, display it in LabVIEW, then share results using a Python-enabled web dashboard.


Click to see the webinar “Using Python and LabVIEW to Rapidly Solve Engineering Problems” to learn more about adding capabilities such as machine learning by extending LabVIEW applications with Python.

“Python is ideally suited for scientists and engineers due to its simple, yet powerful syntax and the availability of an extensive array of open source tools contributed by a user community from industry and R&D,” said Dr. Tim Diller, Director, IIoT Solutions Group at Enthought. “The Python Integration Toolkit for LabVIEW unites the best elements of two major tools in the science and engineering world and we are honored to receive this award.”

Key benefits of the Python Integration Toolkit for LabVIEW from Enthought:

  • Enables fast, two-way communication between LabVIEW and Python (see the webinar “Introduction to the Python Integration Toolkit for LabVIEW” to learn more)

  • Provides LabVIEW users seamless access to tens of thousands of mature, well-tested scientific and analytic software packages in the Python ecosystem, including software for machine learning, signal processing, image processing and cloud connectivity
  • Speeds development time by providing access to robust, pre-developed Python tools
  • Provides a comprehensive out-of-the box solution that allows users to be up and running immediately

“Add-on software from our third-party developers is an integral part of the NI ecosystem, and we’re excited to recognize Enthought for its achievement with the Python Integration Toolkit for LabVIEW,” said Matthew Friedman, senior group manager of the LabVIEW Tools Network at NI.

The Python Integration Toolkit is available for download via the LabVIEW Tools Network, and also includes the Enthought Canopy analysis environment and Python distribution. Enthought’s training, support and consulting resources are also available to help LabVIEW users maximize their value in leveraging Python.

For more information on Enthought’s Python Integration Toolkit for LabVIEW, visit www.enthought.com/python-for-LabVIEW.

 

Additional Resources

Product Information

Python Integration Toolkit for LabVIEW product page

Download a free trial of the Python Integration Toolkit for LabVIEW

Webinars

Webinar: Using Python and LabVIEW to Rapidly Solve Engineering Problems | Enthought
April 2017

Webinar: Introducing the New Python Integration Toolkit for LabVIEW from Enthought
September 2016

About Enthought

Enthought is a global leader in scientific and analytic software, consulting, and training solutions serving a customer base comprised of some of the most respected names in the oil and gas, manufacturing, financial services, aerospace, military, government, biotechnology, consumer products and technology industries. The company was founded in 2001 and is headquartered in Austin, Texas, with additional offices in Cambridge, United Kingdom and Pune, India. For more information visit www.enthought.com and connect with Enthought on Twitter, LinkedIn, Google+, Facebook and YouTube.

About NI

Since 1976, NI (www.ni.com) has made it possible for engineers and scientists to solve the world’s greatest engineering challenges with powerful platform-based systems that accelerate productivity and drive rapid innovation. Customers from a wide variety of industries – from healthcare to automotive and from consumer electronics to particle physics – use NI’s integrated hardware and software platform to improve the world we live in.

About the LabVIEW Tools Network

The LabVIEW Tools Network is the NI app store equipping engineers and scientists with certified, third-party add-ons and apps to complete their systems. Developed by industry experts, these cutting-edge technologies expand the power of NI software and modular hardware. Each third-party product is reviewed to meet specific guidelines and ensure compatibility. With hundreds of products available, the LabVIEW Tools Network is part of a rich ecosystem extending the NI Platform to help customers positively impact our world. Learn more about the LabVIEW Tools Network at www.ni.com/labview-tools-network.

LabVIEW, National Instruments, NI and ni.com and NIWeek are trademarks of National Instruments. Enthought, Canopy and Python Integration Toolkit for LabVIEW are trademarks of Enthought, Inc.

Media Contact

Courtenay Godshall, VP, Marketing, +1.512.536.1057, cgodshall@enthought.com

The post Enthought Receives 2017 Product of the Year Award From National Instruments LabVIEW Tools Network appeared first on Enthought Blog.

by admin at May 24, 2017 01:42 PM

May 23, 2017

numfocus

Welcome Nancy Nguyen, the new NumFOCUS Events Coordinator!

NumFOCUS is pleased to announce Nancy Nguyen has been hired as our new Events Coordinator. Nancy has over five years of event management experience in the non-profit and higher education sectors. She graduated from The University of Texas at Austin in 2011 with a BA in History. Prior to joining NumFOCUS, Nancy worked in development and fundraising […]

by Gina Helfrich at May 23, 2017 04:48 PM

Matthieu Brucher

Announcement: ATKBassPreamp 1.0.0

I’m happy to announce the release of a model of the Fender Bassman preamplifier stage based on the Audio Toolkit. It is available on Windows and OS X (min. 10.9) in different formats.

ATKBassPreamp

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKBassPreamp.

Update: it seems that a bug in the editor window reset the knobs to their default values. This is fixed in 1.0.1.
Update: people have reported an issue with the dry/wet mix. This is fixed in 1.0.2.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.


by Matt at May 23, 2017 07:47 AM

May 22, 2017

numfocus

NumFOCUS Awards Small Development Grants to Projects

This spring the NumFOCUS Board of Directors awarded targeted small development grants to applicants from or approved by our sponsored and affiliated projects. In the wake of a successful 2016 end-of-year fundraising drive, NumFOCUS wanted to direct the donated funds to our projects in a way that would have impact and visibility to donors and […]

by Gina Helfrich at May 22, 2017 09:52 PM

May 17, 2017

numfocus

What is it like to chair a PyData conference?

Have you ever wondered what it’s like to be in charge of a PyData event? Vincent Warmerdam has collected some thoughts and reflections on his experience chairing this year’s PyData Amsterdam conference: This year I was the chair of PyData Amsterdam and I’d like to share some insights on what that was like. I was on the committee the […]

by Gina Helfrich at May 17, 2017 03:09 PM

May 16, 2017

Matthieu Brucher

Book review: OpenGL Data Visualization Cookbook

This review will actually be quite quick: I haven’t finished the book and I won’t finish it.

The book was published in August 2015 and is based on OpenGL 3. The authors sometimes say that you can use shaders to do better, but the fact is that if you want to run the code they propose, you need the backward-compatibility layer, if it's available. OpenGL 3 was published almost a decade ago; I can't understand why, in 2015, two authors decided that a new book on scientific visualization should use an API that was deprecated long ago. What a waste of time.

by Matt at May 16, 2017 07:06 AM

May 14, 2017

Titus Brown

How to analyze, integrate, and model large volumes of biological data - some thoughts

This blog post stems from notes I made for a 12 minute talk at the Oregon State Microbiome Initiative, which followed from some previous thinking about data integration on my part -- in particular, Physics ain't biology (and vice versa) and What to do with lots of (sequencing) data.

My talk slides from OSU are here if you're interested.

Thanks to Andy Cameron for his detailed pre-publication peer review - any mistakes remaining are of course mine, not his ;).


Note: During the events below, I was just a graduate student. So my perspective is probably pretty limited. But this is what I saw and remember!

My graduate work was in Eric Davidson's lab, where we studied early development in the sea urchin. Eric had always been very interested in gene expression, and over the preceding decade or two (1980s and onwards) had invested heavily in genomic technologies. This included lots of cDNA macroarrays and BAC libraries, as well as (eventually) the sea urchin genome project.

The sea urchin is a great system for studying early development! You can get literally billions of synchronously developing embryos by fertilizing all the eggs simultaneously; the developing embryo is crystal clear and large enough to be examined using a dissecting scope; sea urchins are available world-wide; early development is mostly invariant with respect to cell lineage (although that comes with a lot of caveats); and sea urchin embryos have been studied since the 1800s, so there was a lot of background literature on the embryology.

The challenge: data integration without guiding theory

What we were faced with in the '90s and '00s was a challenge provided by the scads of new molecular data provided by genomics: that of data integration. We had discovered plenty of genes (all the usual homologs of things known in mice and fruit flies and other animals), we had cell-type specific markers, we could measure individual gene expression fairly easily and accurately with qPCR, we had perturbations working with morpholino oligos, and we had reporter assays working quite well with CAT and GFP. And now, between BAC sequencing and cDNA libraries and (eventually) genome sequencing, we had tons of genomic and transcriptomic data available.

How could we make sense of all of this data? It's hard to convey the confusion and squishiness of a lot of this data to anyone who hasn't done hands-on biology research; I would just say that single experiments or even collections of many experiments rarely provided a definitive answer, and usually just led to new questions. This is not rare in science, of course, but it typically took 2-3 years to figure out what a specific transcription factor might be doing in early development, much less nail down its specific upstream and downstream connections. Scale that to the dozens or 100s of genes involved in early development and, well, it was a lot of people, a lot of confusion, and a lot of discussion.

The GRN wiring diagram

To make a longer story somewhat shorter:

Eric ended up leading an effort (together with Hamid Bolouri, Dave McClay, Andy Cameron, and others in the sea urchin community) to build a gene regulatory network that provided a foundation for data integration and exploration. You can see the result here:

http://sugp.caltech.edu/endomes/

This network at its core is essentially a map of the genomic connections between genes (transcriptional regulation of transcription factors, together with downstream connections mediated by specific binding sites and signaling interactions between cells, as well as whatever other information we had). Eric named this "the view from the genome." On top of this is layered several different "views from the nucleus", which charted the different regulatory states initiated by asymmetries such as the localization of beta cadherin to the vegetal pole of the egg, and the location of sperm entry into the egg.

At least when it started, the network served primarily as a map of the interactions - a somewhat opinionated interpretation of both published and unpublished data. Peter et al., 2012 showed that the network could be used for in silico perturbations, but I don't know how much has been followed up on. During my experiences with it, it mainly served as a communications medium and a point of reference for discussions about future experiments as well as an integrative guide to published work.

What was sort of stunning in hindsight is the extent to which this model became a touchpoint for our lab and (fairly quickly) the community that studied sea urchin early development. Eric presented the network one year at the annual Developmental Biology of the Sea Urchin meeting, and by the next meeting, 18 months later, I remember it showing up in a good portion of talks from other labs. (One of my favorite memories is someone from Dave McClay's lab - I think it was Cyndi Bradham - putting up a view of the GRN inverted to make signaling interactions the core focus, instead of transcriptional regulation; heresy in Eric's lab!)

In essence, the GRN became a community resource fairly quickly. It was provided in both image and interactive form (using BioTapestry), and people felt free to edit and modify the network for their own presentations. It readily enabled in silico thought experiments - "what happens if I knock out this gene? The model predicts this, and this, and this should be downstream, and this other gene should be unaffected" that quickly led to choosing impactful actual experiments. In part because of this, arguments about the effects of specific genes quickly converged to conversation about how to test the arguments (for some definition of "quickly" and "conversation" - sometimes discussions were quite, ahem, robust in Eric's lab and the larger community!)

The GRN also served to highlight the unknowns and the insufficiencies in the model. Eric and others spent a lot of time thinking through questions such as this: "we know that transcription of gene X is repressed by gene Y; but something must still activate gene X. What could it be?" Eventually we did "crazy" things like measure the transcriptional levels and spatial expression patterns of all ~1000 transcription factors found in the sea urchin genome, which could then be directly integrated into the model for further testing.

In short, the GRN was a pretty amazing way for the community of people interested in early development in the sea urchin to communicate about the details. Universal agreement wasn't the major outcome, although I think significant points about early development were settled in part through the model - communication was the outcome.

And, importantly, it served as a central meeting point for data analysis. More on this below.

Missed opportunities?

One of the major missed opportunities (in my view, obviously - feel free to disagree, the comment section is below :) was that we never turned the GRN into a model that was super easy for experimentalists to play with. It would have required significant software development effort to make it possible to do click-able gene knockdown followed by predicted phenotype readout -- but this hasn't been done yet; apparently it has been tough to find funding for this purpose. Had I stayed in the developmental biology game, I like to think I would have invested significant effort in this kind of approach.

I also don't feel like much time was invested in the community annotation and updating aspect of things. The official model was tightly controlled by a few people (in the traditional scientific "experts know best!" approach) and there was no particular attempt to involve the larger community in annotating or updating the model except through 1-1 conversations or formal publications. It's definitely possible that I just missed it, because I was just a graduate student, and by mid-2004 I had also mentally checked out of grad school (it took me a few more years to physically check out ;).

Taking and holding ground

One question that occupies my mind a lot is the question of how we learn, as a community, from the research and data being produced in each lab. With data, one answer is to work to make the data public, annotate it, curate it, make it discoverable - all things that I'm interested in.

With research more broadly, though, it's more challenging. Papers are relatively poor methods for communicating the results of research, especially now that we have the Internet and interactive Web sites. Surely there are better venues (perhaps ones like Distill, the interactive visual journal for machine learning research). Regardless, the vast profusion of papers on any possible topic, combined with the array of interdisciplinary methods needed, means that knowledge integration is slow and knowledge diffusion isn't much faster.

I fear this means that when it comes to specific systems and questions, we are potentially forgetting many things that we "know" as people retire or move on to other systems or questions. This is maybe to be expected, but when we confront the level of complexity inherent in biology, with little obvious convergence between systems, it seems problematic to repose our knowledge in dead tree formats.

Mechanistic maps and models for knowledge storage and data integration

So perhaps the solution is maps and models, as I describe above?

In thinking about microbiomes and microbial communities, I'm not sure what form a model would take. At the most concrete and boring level, a directly useful model would be something that took in a bunch of genomic/transcriptomic/proteomic data and evaluated it against everything that we knew, and then sorted it into "expected" and "unexpected". (This is what I discussed a little bit in my talk at OSU.)

The "expected" would be things like the observation of carbon fixation pathways in well-understood autotrophs - "yep, there it is, sort of matches what we already see." The "unexpected" would be things like unannotated or poorly understood genes that were behaving in ways that suggested they were correlated with whatever conditions we were examining. Perhaps we could have multiple bins of unexpected, so that we could separate out things like genes where the genome, transcriptome, and proteome all provided evidence of expression versus situations where we simply saw a transcript with no other kind of data. I don't know.

If I were to indulge in fanciful thinking, I could imagine a sort of Maxwell's Daemon of data integration, sorting data into bins of "boring" and "interesting", churning through data sets looking for a collection of "interesting" that correlated with other data sets produced from the same system. It's likely that such a daemon would have to involve some form of deep correlational analysis and structure identification - deep learning comes to mind. I really don't know.

One interesting question is, how would this interact with experimental biology and experimental biologists? The most immediately useful models might be the ones that worked off of individual genomes, such as flux-balance models; they could be applied to data from new experimental conditions and knockouts, or shifted to apply to strain variants and related species and look for missing genes in known pathways, or new genes that looked potentially interesting.

So I don't know a lot. All I do know is that our current approaches for knowledge integration don't scale to the volume of data we're gathering or (perhaps more importantly) to the scale of the biology we're investigating, and I'm pretty sure computational modeling of some sort has to be brought into the fray in practical ways.

Perhaps one way of thinking about this is to ask what types of computational models would serve as good reference resources, akin to a reference genome. The microbiome world is surprisingly bereft of good reference resources, with the 16s databases and IMG/M serving as two of the big ones; but we clearly need more, along the vein of a community KEGG and other such resources, curated and regularly updated.

Some concluding thoughts

Communication of understanding is key to progress in science; we should work on better ways of doing that. Open science (open data, open source, open access) is one way of better communicating data, computational methods, and results.

One theme that stood out for me from the microbiome workshop at OSU was that of energetics, a point that Stephen Giovannoni made most clearly. To paraphrase, "Microbiome science is limited by the difficulty of assessing the pros and cons of metabolic strategies." The guiding force behind evolution and ecology in the microbial world is energetics, and if we can get a mechanistic handle on energy extraction (autotrophy and heterotrophy) in single genomes and then graduate that to metagenome and community analysis, maybe that will provide a solid stepping stone for progress.

I'm a bit skeptical that the patterns that ecology and evolution can predict will be of immediate use for developing a predictive model. On the other hand, Jesse Zaneveld at the meeting presented on the notion that all happy microbiomes look the same, while all dysfunctional microbiomes are dysfunctional in their own special way; and Jesse pointed towards molecular signatures of dysfunction; so perhaps I'm wrong :).

It may well be that our data is still far too sparse to enable us to build a detailed mechanistic understanding of even simple microbial ecosystems. I wouldn't be surprised by this.

Trent Northen from the JGI concluded in his talk that we need model ecosystems too; absolutely! Perhaps experimental model ecosystems, either natural or fabricated, can serve to identify the computational approaches that will be most useful.

Along this vein, are there a natural set of big questions and core systems for which we could think about models? In the developmental biology world, we have a few big model systems that we focused on (mouse, zebrafish, fruit fly, and worm) - what are the equivalent microbial ecosystems?

All things to think about.

--titus

p.s. There are a ton of references and they can be fairly easily found, but a decent starting point might be Davidson et al., 2002, "A genomic regulatory network for development."

by C. Titus Brown at May 14, 2017 10:00 PM

May 13, 2017

Matthieu Brucher

Analog modeling: Comparing preamps

In a previous post, I explained how I modeled the triode inverter circuit. I’ve decided to put it inside two different plugins, so I’d like to present their differences in four pictures.

Preamps plugins

One plugin will start as a model of the Fender Bassman preamp (inverter circuit, followed by its associated tone stack), and the other as a model of the inverter section of a Vox AC30 (followed by the tone stack of a JCM800). Compared to the default preamp of Audio ToolKit, the behaviors are quite different, all with just a few differences in the values of the components:

30Hz preamps behavior

200Hz preamps behavior

1kHz preamps behavior

10kHz preamps behavior

I will probably add an option to use a different triode model (Audio Toolkit has lots of options, and I definitely need to present the differences in terms of quality and performance), and perhaps also a way of selecting a different tone stack. But for now, the plugins will offer a single model of a triode inverter followed by a tone stack. To model a full amp, you still need to model the final stage and the loudspeaker.

The next picture displays the preamp behavior depending on the triode function used:

Response with different triode functions

The Leach model and the original Koren model don’t behave as well as the other models. It’s probably due to different parameters, but they give a good idea of the behavior of the tube. The modified Munro-Piazza model is a personal modification of the tube function to make the derivative continuous as well. It helps convergence when the state of the tube changes quickly, even though it is clearly not enough to remove all discontinuities.

The following picture shows the cost of the different models, profiled with valgrind:

Triode preamp profiles

Obviously, for the more complex functions, the time is spent computing the logarithm of a double-precision number. This is because the fast math functions in ATK don’t support this function yet. If we use single-precision floats, the cost is divided by 3 for the Koren model (for instance).

As the results are quite close, using floats or doubles is a matter of optimization and precision. In the plugins, I will use floats to maximize performance.

Coming next

In the next two weeks, the two plugins will be released. And depending on feedback and comments, I’ll add more options.


by Matt at May 13, 2017 07:19 AM

May 11, 2017

Leonardo Uieda

Reviews of our Scipy 2017 talk proposal: Bringing the Generic Mapping Tools to Python

Thumbnail image for publication.

This year, Scipy is using a double-open peer-review system, meaning that both authors and reviewers know each other's identities. These are the reviews that we got for our proposal and our replies/comments (posted with permission from the reviewers). My sincerest thanks to all reviewers and editors for their time and effort.

The open review model is great because it increases the transparency of the process and might even result in better reviews. I started signing my reviews a few years ago and I found that I'm more careful with the tone of my review to make sure I don't offend anyone and provide constructive feedback.

Now, on to the reviews!

Review 1 - Paul Celicourt

The paper introduces a Python wrapper for the C-based Generic Mapping Tools used to process and analyze time series and gridded data. The content is well organized, but I encourage the authors to consider the following comments: While the authors promise to demonstrate an initial prototype of the wrapper, it is not sure that a WORKING prototype will be available by the time of the conference as claimed by the authors when looking at the potential functionalities to be implemented and presented in the second paragraph of the extended abstract. Furthermore, it is not clear what would be the functionalities of the initial prototype. On top of that, the approach to the implementation is not fully presented. For instance, the Simplified Wrapper and Interface Generator (SWIG) tool may be used to reduce the workload but the authors do not mention whether the wrapper would be manually developed or using an automated tool such as the SWIG. Finally, the portability of the shared memory process has not been addressed.

Thanks for all your comments, Paul! They are good questions and we should have addressed them better in the abstract.

That is a valid concern regarding the working prototype. We're not sure how much of the prototype will be ready for the conference. We are sure that we'll have something to show, even if it's not complete. The focus of the talk will be on our design decisions, implementation details, and the changes in the GMT modern execution mode on which the Python wrappers are based. We'll run some examples of whatever we have working mostly for the "Oooh"s and "Aaah"s.

The wrapper will be manually generated using ctypes. We chose this over SWIG or Cython because ctypes allows us to write pure Python code. It's a much simpler way of wrapping a C library. Not having any compiled extension modules also greatly facilitates distributing the package across operating systems. The same wrapper code can work on Windows, OSX, and Linux (as long as the GMT shared library is available).

The amount of C functions that we'll have to wrap is not that large. Mainly, we need GMT_Call_Module to run a command (like psxy), GMT_Create_Session for generating the session structure, and GMT_Open_VirtualFile and GMT_Read_VirtualFile for passing data to and from Python. The majority of the work will be in creating the Python functions for each GMT command, documenting them, and parsing the Python function arguments into something that GMT_Call_Module accepts. This work would have to be done manually using SWIG or Cython as well, so ctypes is not a disadvantage with regard to this. There are some more details about this in our initial design and goals.
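For readers unfamiliar with ctypes, here is a minimal sketch of the approach; the library name, numeric constants, and simplified signatures below are illustrative assumptions, not the final wrapper API:

import ctypes

# Load the GMT shared library (the file name varies by platform; this is an assumption).
libgmt = ctypes.CDLL("libgmt.so")

# Declare simplified, illustrative signatures for two of the functions mentioned above.
libgmt.GMT_Create_Session.restype = ctypes.c_void_p
libgmt.GMT_Create_Session.argtypes = [ctypes.c_char_p, ctypes.c_uint,
                                      ctypes.c_uint, ctypes.c_void_p]
libgmt.GMT_Call_Module.restype = ctypes.c_int
libgmt.GMT_Call_Module.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                   ctypes.c_int, ctypes.c_void_p]

# Create a session and run a module, roughly what a Python psxy() wrapper would do.
# The numeric arguments stand in for GMT's real flag constants.
session = libgmt.GMT_Create_Session(b"gmt-python", 2, 0, None)
status = libgmt.GMT_Call_Module(session, b"psxy", 0,
                                b"points.txt -R0/10/0/10 -JX10c -Sc0.2c -P")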

Review 2 - Ricardo Barros Lourenço

The authors submitted a clear abstract, in the sense that they will present a new Python library, which is a binding to the Generic Mapping Tools (GMT) C library, which is widely adopted by the Geosciences community. They were careful in detailing their reasoning in such implementation, and also in analogue initiatives by other groups.

In terms of completeness, the abstract precisely describes that the design plans and some of the implementation would be detailed and explained, as well as a demo of their current version of the library. It was very interesting that the authors, while describing their implementation, also pointed out that the library could be used in other applications not necessarily related to geoscientific applications, by the generation of general line plots, bar graphs, histograms, and 3D surfaces. It would be beneficial to the audience to see how this aspect is sustained, by comparing such capabilities with other libraries (such as Matplotlib and Seaborn) and evaluating their contribution to the geoscientific domain, and also on the expanded related areas.

The abstract is highly compelling to the Earth Sciences community members at the event because GMT is already used for high-quality visualization (both in electronic and in printed outputs - maps - which is an important contribution), and a Python integration could simplify bringing "Pythonic" workflows into it, expanding the possibilities in geoscientific visualization, especially in printed maps.

It would be interesting, aside from a presumed comparison in online visualization with matplotlib and cartopy, if the authors would also discuss in their presentation other possible contributions, such as online tile generation in map servers, which is very expensive in terms of computational resources and is still challenging in an exclusively "Pythonic" environment. Additionally, it would be interesting if the authors provide some clarification on whether there is any limitation on the usage of such a library, more specifically with regard to the high variance in geoscientific data sources, and also in how netCDF containers are consumed in their workflow (considering that these containers don't necessarily conform to a strict standard, allowing users to customize their usage) in terms of the automation of this I/O.

The topic is of high relevance because there are still few options for spatial data visualization in a "fully Pythonic" environment, and none of them is used to plot physical maps in a production setting the way GMT is. Considering these aspects, I recommend this proposal for acceptance.

Thank you, Ricardo, for your incentives and suggestions for the presentation!

I hadn't thought about the potential use in map tiling but we'll keep an eye on that from now on and see if we have anything to say about it. Thanks!

Regarding netCDF, the idea is to leverage the xarray library for I/O and use their Dataset objects as input and output arguments for the grid related GMT commands. There is also the option of giving the Python functions the file name of a grid and have GMT handle I/O, as it already does in the command line. The appeal of using xarray is that it integrates well with numpy and pandas and can be used instead of gmt grdmath (no need to tie your head in knots over RPN anymore!).
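As a rough sketch of that idea (the grid file, variable name, and gmt.grdimage wrapper below are hypothetical, just to illustrate the intended workflow):

import xarray as xr

# xarray handles the netCDF I/O and gives us a labeled, numpy-backed container.
grid = xr.open_dataset("relief.nc")        # hypothetical grid file
anomaly = grid["z"] - grid["z"].mean()     # grid math instead of `gmt grdmath` RPN

# A grid-consuming GMT command wrapper could then accept either the xarray object
# or a file name and let GMT do the I/O itself:
# gmt.grdimage(anomaly, ...)    or    gmt.grdimage("relief.nc", ...)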

Review 3 - Ryan May

Python bindings for GMT, as demonstrated by the authors, are very much in demand within the geoscience community. The work lays out a clear path towards implementation, so it's an important opportunity for the community to be able to offer API and interaction feedback. I feel like this talk would be very well received and would kick off an important dialogue within the geoscience Python community.

Thanks, Ryan! Getting community feedback was the motivation for submitting a talk without having anything ready to show yet. It'll be much easier to see what the community wants and thinks before we have fully committed to an implementation. We're very much open and looking forward to getting a ton of questions!


What would you like to see in a GMT Python library? Let us know if there are any questions/suggestions before the conference. See you at Scipy in July!


Thumbnail image for this post is modified from "ScientificReview" by the Center for Scientific Review which is in the public domain.


Comments? Leave one below or let me know on Twitter @leouieda or in the Software Underground Slack group.

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!


May 11, 2017 12:00 PM

May 09, 2017

Matthieu Brucher

Book review: Getting Started With JUCE

After the announcement of the JUCE 5 release, I played with it a little and then decided to read the only book on JUCE. It’s outdated and covers JUCE 2.1.2. But who knows, it might be a gem?

Content and opinions

The book starts with a chapter on JUCE and its installation. While the main application changed its name from Introjucer to Projucer (probably because of the licence change), the rest seems to be quite similar. The app still generates a Visual Studio solution, an Xcode project, or makefiles.

The second chapter was the one I was most interested in, because I plan on creating new components and GUIs for my plugins. The chapter still feels a little short; we could have worked more on overriding the look and feel, handling more events, and so on. It seems that not much has changed in this area since JUCE 2.1, which is quite an achievement for ROLI and the team. If my components can survive several major releases, I know I can invest time in learning JUCE.

The next chapter feels redundant now that we have C++11/14 or even Boost. JUCE still uses a custom String class, which is too bad; I’m not sure it really brings anything to the library, and other APIs feel equally dated to me (custom data types like int32, file system handling…).

The fourth chapter deals with streaming data and building a small app that can play sound. It is a nice feature, but I have to say I read it even faster than the previous chapter because it was of no interest to me.

Finally, the last chapter ends the book abruptly with a note on some utilities (I can’t even remember them without rereading the chapter) and no real conclusion, leaving the feeling that the book is more a list of tutorials than a real book.

Conclusion

I wouldn’t recommend buying the book, as the tutorials cover the only part that is still relevant. But I can hope for an updated version, one day.

by Matt at May 09, 2017 07:49 AM

May 08, 2017

Continuum Analytics news

Anaconda Joins Forces with Leading Companies to Further Innovate Open Data Science

Monday, May 8, 2017
Travis Oliphant
President, Chief Data Scientist & Co-Founder

In addition to announcing the formation of the GPU Open Analytics Initiative with H2O and MapD, today, we are pleased to announce an exciting collaboration with NVIDIA, H2O and MapD, with a goal of democratizing machine learning to increase performance gains of data science workloads. Using NVIDIA’s Graphics Processing Unit (GPU) technology, Anaconda is mobilizing the Open Data Science movement by helping teams avoid the data transfer process between Central Processing Units (CPUs) and GPUs and move toward their larger business goals. 

The new GPU Data Frame (GDF) will augment the Anaconda platform as the foundational fabric to bring data science technologies together allowing it to take full advantage of GPU performance gains. In most workflows using GPUs, data is first manipulated with the CPU and then loaded to the GPU for analytics. This creates a data transfer “tax” on the overall workflow.   With the new GDF initiative, data scientists will be able to move data easily onto the GPU and do all their manipulation and analytics at the same time without the extra transfer of data. With this collaboration, we are opening the door to an era where innovative AI applications can be deployed into production at an unprecedented pace and often with just a single click.

In a nutshell, this collaboration provides these key benefits:

  • Python Democratization. GPU Data Frame makes it easy to create new optimized data science models and iterate on ideas using the most innovative GPU and AI technologies.

  • Python Acceleration. The standard empowers data scientists with unparalleled acceleration within Python on GPUs for data science workloads, enabling Open Data Science to proliferate across the enterprise.

  • Python Production. Data science teams can move beyond ad-hoc analysis to unearthing game-changing results within production-deployed data science applications that drive measurable business impact.

Anaconda aims to bring the performance, insights and intelligence enterprises need to compete in today’s data-driven economy. We’re excited to be working with NVIDIA, MapD, and H2O as the GPU Data Frame pushes the door to Open Data Science wide open by further empowering the data scientist community with unparalleled innovation, enabling Open Data Science to proliferate across the enterprise.

by swebster at May 08, 2017 08:51 PM

Anaconda Easy Button - Microsoft SQL Server and Python

Tuesday, May 9, 2017
Ian Stokes-Rees
Continuum Analytics

Previously there were many twisty roads that you may have followed if you wanted to use Python on a client system to connect to a Microsoft SQL Server database, and not all of those roads would even get you to your destination. With the news that Microsoft SQL Server 2017 has increased support for Python, by including a subset of Anaconda packages on the server-side, I thought it would be useful to demonstrate how Anaconda delivers the easy button to get Python on the client side connected to Microsoft SQL Server.

This blog post demonstrates how Anaconda and Anaconda Enterprise can be used on the client side to connect Python running on Windows, Mac, or Linux to a SQL Server instance. The instructions should work for many versions of SQL Server, Python, and Anaconda, including Anaconda Enterprise, our commercially oriented version of Anaconda that adds strong collaboration, security, and server deployment capabilities. If you run into any trouble, let us know either through the Anaconda Community Support mailing list or on Twitter @ContinuumIO.

TL;DR: For the Impatient

If you're the kind of person who just wants the punch line and not the story, there are three core steps to connect to an existing SQL Server database:

  1. Install the SQL Server drivers for your platform on the client system. That is described in the Client Driver Installation section below.

  2. conda install pyodbc

  3. Establish a connection to the SQL Server database with an appropriately constructed connection statement:

     conn = pyodbc.connect(
        r'DRIVER={ODBC Driver 13 for SQL Server};' +
        ('SERVER={server},{port};'   +
         'DATABASE={database};'      +
         'UID={username};'           +
         'PWD={password}').format(
                server= 'sqlserver.testnet.corp',
                  port= 1433,
              database= 'AdventureWorksDW2012',
              username= 'tanya',
              password= 'Tanya1234')
    )
    

For cut-and-paste convenience, here's the string:

 (r'DRIVER={ODBC Driver 13 for SQL Server};' +
 ('SERVER={server},{port};'   +
  'DATABASE={database};'      +
  'UID={username};'           +
  'PWD={password}').format(
                server= 'sqlserver.testnet.corp',
                  port= 1433,
              database= 'AdventureWorksDW2012',
              username= 'tanya',
              password= 'Tanya1234')
)
'DRIVER={ODBC Driver 13 for SQL Server};SERVER=sqlserver.testnet.corp,1433;DATABASE=AdventureWorksDW2012;UID=tanya;PWD=Tanya1234'

Hopefully that doesn't look too intimidating!

Here's the scoop: the Python piece is easy (yay Python!). The challenges are installing the platform-specific drivers (step 1) and, if you don't already have a database properly set up, the SQL Server installation, configuration, database loading, and security credentials. Those are the parts the rest of this blog post goes into in more detail, along with a fully worked example of a client-side connection and query.

And you can grab a copy of this blog post as a Jupyter Notebook from Anaconda Cloud.

On With The Story

While this isn't meant to be an exhaustive reference for SQL Server connectivity from Python and Anaconda it does cover several client/server configurations. In all cases I was running SQL Server 2016 on a Windows 10 system. My Linux system was CentOS 6.9 based. My Mac was running macOS 10.12.4, and my client-side Windows system also used Windows 10. The Windows and Mac Python examples were using Anaconda 4.3 with Python 3.6 and pyodbc version 3.0, while the Linux example used Anaconda Enterprise, based on Anaconda 4.2, using Python 2.7.

NOTE: In the examples below the $ symbol indicates the command line prompt. Do not include this in any commands if you cut-and-paste. Your prompt will probably look different!

Server Side Preparation

If you are an experienced SQL Server administrator then you can skip this section. All you need to know are:

  1. The hostname or IP address and port number for your SQL Server instance
  2. The database you want to connect to
  3. The user credentials that will be used to make the connection

The following provides details on how to setup your SQL Server instance to be able to exactly replicate the client-side Python-based connection that follows. If you do not have Microsoft SQL Server it can be downloaded and installed for free and is now available for Windows and Linux. NOTE: The recently released SQL Server 2017 and SQL Server on Azure both require pyodbc version >= 3.2. This blog post has been developed using SQL Server 2016 with pyodbc version 3.0.1.

This demonstration is going to use the Adventure Works sample database provided by Microsoft on CodePlex. There are instructions on how to install this into your SQL Server instance in Step 3 of this blog post, however you can also simply connect to an existing database by adjusting the connection commands below accordingly.

Many of the preparation steps described below are most easily handled using the SQL Server Management Studio which can be downloaded and installed for free.

Additionally this example makes use of the Mixed Authentication Mode which allows SQL Server-based usernames and passwords for database authentication. By default this is not enabled, and only Windows Authentication is permitted, which makes use of the pre-existing Kerberos user authentication token wallet. It should go without saying that you would only change to Mixed Authentication Mode for testing purposes if SQL Server is not already so configured. While the example below focuses on SQL Server Authentication there are also alternatives presented for the Windows Authentication Mode that uses Kerberos tokens.

You should know the hostname and port on which SQL Server is running and be sure you can connect to that hostname (or IP address) and port from the client-side system. The easiest way to test this is with telnet from the command line, executing the command:

$ telnet sqlserver.testnet.corp 1433

Where you would replace sqlserver.testnet.corp with the hostname or IP address of your SQL Server instance and 1433 with the port SQL Server is running on. Port 1433 is the SQL Server default. Executing this command on the client system should return output like:

Trying sqlserver.testnet.corp...
Connected to sqlserver.testnet.corp.
Escape character is '^]'.

telnet> close

At which point you can then type CTRL-] and then close.

If your client system is also Windows you can perform this simple Universal Data Link (UDL) test.

Finally you will need to confirm that you have access credentials for a user known by SQL Server and that the user is permitted to perform SELECT operations (and perhaps others) on the database in question. In this particular example we are making use of a fictional user named Tanya who has the username tanya and a password of Tanya1234. It is a 3-step process to get Tanya access to the Adventure Works database:

  1. The SQL Server user tanya is added as a Login to the DBMS:

    • which can be found in SQL Server Management Studio
    • under Security->Logins
    • right-click on Logins
    • add a New Login...
    • provide a Login name of tanya
    • select SQL Server authentication
    • provide a password of Tanya1234
    • uncheck the option for Enforce password policy (the other two will automatically be unchecked and greyed out)
  2. The database user needs to be added:

    • under Databases->AdventureWorksDW2012->Security->Users
    • right-click on Users
    • add a New User...
    • select SQL user with Login for type
    • add a Username set to tanya
    • add a Login name set to tanya.
  3. Grant tanya permissions on the AdventureWorksDW2012 database by executing the following query:

    use AdventureWorksDW2012; 
    GRANT SELECT, INSERT, DELETE, UPDATE ON SCHEMA::DBO TO tanya;
    

Using the UDL test method described above is a good way to confirm that tanya can connect to the database, even just from localhost, though it does not confirm whether she can perform operations such as SELECT. For that I would recommend installing the free Microsoft Command Line Utilities 13.1 for SQL Server.

Client Driver Installation

You'll now need to get the drivers installed on your client system. There are two parts to this: the platform-specific dynamic libraries for SQL Server, and the Python ODBC interface. You won't be surprised to hear that the platform-specific libraries are harder to get set up, but this should still only take 10-15 minutes and there are established processes for all major operating systems.

Linux

The Linux drivers are available for RHEL, Ubuntu, and SUSE. This is the process you'd have to follow if you are using Anaconda Enterprise, as well as for anyone using a Linux variant with Anaconda installed.

My Linux test system was using CentOS 6.9, so I followed the RHEL6 installation procedure from Microsoft (linked above), which essentially consisted of 3 steps:

  1. Adding the Microsoft RPM repository to the yum configuration
  2. Pre-emptively removing some packages that may cause conflicts (I didn't have these installed)
  3. Using yum to install the msodbcsql RPM for version 13.1 of the drivers

In my case I had to play around with the yum command, and in the end just ran:

$ ACCEPT_EULA=Y yum install msodbcsql

Windows

The Windows drivers are dead easy to install. Download the msodbcsql MSI file, double click to install, and you're in business.

Mac

The SQL Server drivers for Mac need to be installed via Homebrew which is a popular OS X package manager, though not affiliated with nor supported by Apple. Microsoft has created their own tap which is a Homebrew package repository. If you don't already have Homebrew installed you'll need to do that, then Microsoft have provided some simple instructions describing how to add the SQL Server tap and then install the mssql-tools package. The steps are simple enough I'll repeat them here, though check out that link above if you need more details or background.

$ brew tap microsoft/mssql-preview https://github.com/Microsoft/homebrew-mssql-preview
$ brew update
$ brew install mssql-tools

One thing I'll note is that the Homebrew installation output suggested I should execute a command to remove one of the SQL Server drivers. Don't do this! That driver is required. If you've already done it then the way to correct the process is to reset the configuration file by removing and re-adding the package:

$ brew remove  mssql-tools
$ brew install mssql-tools

Install Anaconda

Download and install Anaconda if you don't already have it on your system. There are graphical and command line installers available for Windows, Mac, and Linux. It is about 400 MB to download and a bit over 1 GB installed. If you're looking for a minimal system you can install Miniconda instead (command line only installer) and then a la carte pick the packages you want with conda install commands.

Anaconda Enterprise users or administrators can simply execute the commands below in the conda environment where they want pyodbc to be available.

Python ODBC package

This part is easy. You can just do:

$ conda install pyodbc

And if you're not using Anaconda or prefer pip, then you can also do:

$ pip install pyodbc

NOTE: If you are using the recently released SQL Server 2017 you will need pyodbc >= 3.2. There should be a conda package available for that "shortly" but be sure to check which version you get if you use the conda command above.

Connecting to SQL Server using pyodbc

Now that you've got your server-side and client-side systems setup with the correct software, databases, users, libraries, and drivers it is time to connect. If everything works properly these steps are very simple and work for all platforms. Everything that is platform-specific has been handled elsewhere in the process.

We start by importing the common pyodbc package. This is Microsoft's recommended Python interface to SQL Server. There was an alternative Python interface, pymssql, that at one point was more reliable for SQL Server connections and until quite recently was the only way to get Python on a Mac to connect to SQL Server. However, with Microsoft's renewed support for Python and Microsoft's own Mac Homebrew packages, pyodbc is now the leader for all platforms.

import pyodbc

Use a Python dict to define the configuration parameters for the connection

config = dict(server=   'sqlserver.testnet.corp', # change this to your SQL Server hostname or IP address
              port=      1433,                    # change this to your SQL Server port number [1433 is the default]
              database= 'AdventureWorksDW2012',
              username= 'tanya',
              password= 'Tanya1234')

Create a template connection string that can be re-used.

conn_str = ('SERVER={server},{port};'   +
            'DATABASE={database};'      +
            'UID={username};'           +
            'PWD={password}')

If you are using the Windows Authentication mode where existing authorization tokens are picked up automatically this connection string would be changed to remove UID and PWD entries and replace them with TRUSTED_CONNECTION, as below:

trusted_conn_str = ('SERVER={server};'     +
                    'DATABASE={database};' +
                    'TRUSTED_CONNECTION=yes')

Check your configuration looks right:

config
{'database': 'AdventureWorksDW2012',
 'password': 'Tanya1234',
 'port': 1433,
 'server': 'sqlserver.testnet.corp',
 'username': 'tanya'}

Now open a connection by specifying the driver and filling in the connection string with the connection parameters.

The following connection operation can take tens of seconds to complete.

conn = pyodbc.connect(
    r'DRIVER={ODBC Driver 13 for SQL Server};' +
    conn_str.format(**config)
    )

Executing Queries

Request a cursor from the connection that can be used for queries.

cursor = conn.cursor()

Perform your query.

cursor.execute('SELECT TOP 10 EnglishProductName FROM dbo.DimProduct;')
<pyodbc.Cursor at 0x7f7ca4a82d50>

Loop through to look at the results (an iterable of 1-tuples containing Unicode strings of the results).

for entry in cursor:
    print(entry)
(u'Adjustable Race', )
(u'Bearing Ball', )
(u'BB Ball Bearing', )
(u'Headset Ball Bearings', )
(u'Blade', )
(u'LL Crankarm', )
(u'ML Crankarm', )
(u'HL Crankarm', )
(u'Chainring Bolts', )
(u'Chainring Nut', )

Data Science Happens Here

Now that we've demonstrated how to connect to a SQL Server instance from Windows, Mac and Linux using Anaconda or Anaconda Enterprise it is possible to use T-SQL queries to interact with that database as you normally would.
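For example, the same pyodbc connection can be handed straight to pandas to land query results in a DataFrame; here is a minimal sketch reusing the conn object and query from above:

import pandas as pd

# Run the same query as above, but put the results directly into a DataFrame.
df = pd.read_sql('SELECT TOP 10 EnglishProductName FROM dbo.DimProduct;', conn)
print(df.head())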

Looking to the future, the latest preview release of SQL Server 2017 includes a server-side Python interface built around Anaconda. The team at Microsoft has published lots of great resources on Python and SQL Server connectivity that are worth exploring.

Next Steps

My bet is that if you're reading a blog post on SQL Server and Python (and you can download a Notebook version of it here) then you're using it in a commercial context. Anaconda Enterprise is going to be the best way for you and your organization to make a strategic investment in Open Data Science.

See how Anaconda Enterprise is transforming data science through our webinar series or grab one of our white papers on Enterprise Open Data Science.

Let us help you be successful in your strategic adoption of Python and Anaconda for high-performance enterprise-oriented open data science connected to your existing data sources and systems, such as SQL Server.

by swebster at May 08, 2017 04:46 PM

Data Science And Deep Learning Application Leaders Form GPU Open Analytics Initiative

Monday, May 8, 2017

Continuum Analytics, H2O.ai and MapD Technologies Create Open Common Data Frameworks for GPU In-Memory Analytics

SAN JOSE, CA—May 8, 2017—Continuum Analytics, H2O.ai, and MapD Technologies have announced the formation of the GPU Open Analytics Initiative (GOAI) to create common data frameworks enabling developers and statistical researchers to accelerate data science on GPUs. GOAI will foster the development of a data science ecosystem on GPUs by allowing resident applications to interchange data seamlessly and efficiently. BlazingDB, Graphistry and Gunrock from UC Davis led by CUDA Fellow John Owens have joined the founding members to contribute their technical expertise.

The formation of the Initiative comes at a time when analytics and machine learning workloads are increasingly being migrated to GPUs. However, while individually powerful, these workloads have not been able to benefit from the power of end-to-end GPU computing. A common standard will enable intercommunication between the different data applications and speed up the entire workflow, removing latency and decreasing the complexity of data flows between core analytical applications. 

At the GPU Technology Conference (GTC), NVIDIA’s annual GPU developers’ conference, the Initiative announced its first project: an open source GPU Data Frame with a corresponding Python API. The GPU Data Frame is a common API that enables efficient interchange of data between processes running on the GPU. End-to-end computation on the GPU avoids transfers back to the CPU or copying of in-memory data reducing compute time and cost for high-performance analytics common in artificial intelligence workloads.

Users of the MapD Core database can output the results of a SQL query into the GPU Data Frame, which then can be manipulated by the Continuum Analytics’ Anaconda NumPy-like Python API or used as input into the H2O suite of machine learning algorithms without additional data manipulation. In early internal tests, this approach exhibited order-of-magnitude improvements in processing times compared to passing the data between applications on a CPU. 

“The data science and analytics communities are rapidly adopting GPU computing for machine learning and deep learning. However, CPU-based systems still handle tasks like subsetting and preprocessing training data, which creates a significant bottleneck,” said Todd Mostak, CEO and co-founder of MapD Technologies. “The GPU Data Frame makes it easy to run everything from ingestion to preprocessing to training and visualization directly on the GPU. This efficient data interchange will improve performance, encouraging development of ever more sophisticated GPU-based applications.” 

“GPU Data Frame relies on the Anaconda platform as the foundational fabric that brings data science technologies together to take full advantage of GPU performance gains,” said Travis Oliphant, co-founder and chief data scientist of Continuum Analytics. “Using NVIDIA’s technology, Anaconda is mobilizing the Open Data Science movement by helping teams avoid the data transfer process between CPUs and GPUs and move nimbly toward their larger business goals. The key to producing this kind of innovation are great partners like H2O and MapD.”

“Truly diverse open source ecosystems are essential for adoption - we are excited to start GOAI for GPUs alongside leaders in data and analytics pipeline to help standardize data formats,” said Sri Ambati, CEO and co-founder of H2O.ai. “GOAI is a call for the community of data developers and researchers to join the movement to speed up analytics and GPU adoption in the enterprise.”

The GPU Open Analytics Initiative is actively welcoming participants who are committed to open source and to GPUs as a computing platform. 

Details of the GPU Data Frame can be found at the Initiative’s Github link - 
https://github.com/gpuopenanalytics

In conjunction with this announcement, MapD Technologies has announced the immediate open sourcing of the MapD Core database to foster open analytics on GPUs. Anaconda and H2O already have large open source communities, which can benefit from this project immediately and drive further development to accelerate the adoption of data science and analytics on GPUs. 

About Anaconda Powered by Continuum Analytics
Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads and 4 million unique users to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Learn more at continuum.io

About H2O.ai
H2O.ai is focused on bringing AI to businesses through software. Its flagship product is H2O, the leading open source platform that makes it easy for financial services, insurance and healthcare companies to deploy AI and deep learning to solve complex problems. More than 9,000 organizations and 80,000+ data scientists depend on H2O for critical applications like predictive maintenance and operational intelligence. The company -- which was recently named to the CB Insights AI 100 -- is used by 169 Fortune 500 enterprises, including 8 of the world’s 10 largest banks, 7 of the 10 largest insurance companies and 4 of the top 10 healthcare companies. Notable customers include Capital One, Progressive Insurance, Transamerica, Comcast, Nielsen Catalina Solutions, Macy's, Walgreens and Kaiser Permanente.

About MapD Technologies
MapD Technologies is a next-generation analytics software company. Its technology harnesses the massive parallelism of modern graphics processing units (GPUs) to power lightning-fast SQL queries and visualization of large data sets. The MapD analytics platform includes the MapD Core database and MapD Immerse visualization client. These software products provide analysts and data scientists with the fastest time to insight, performance not possible with traditional CPU-based solutions. MapD software runs on-premise and on all leading cloud providers.

Founded in 2013, MapD Technologies originated from research at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). MapD is funded by GV, In-Q-Tel, New Enterprise Associates (NEA), NVIDIA, Vanedge Capital and Verizon Ventures. The company is headquartered in San Francisco.

Visit MapD at www.mapd.com or follow MapD on Twitter @mapd. For more information or to evaluate MapD, contact sales@mapd.com. Press inquiries, please contact press@mapd.com.

Media Contacts:

Jill Rosenthal
Continuum Analytics
anaconda@inkhouse.com

Mary Fuochi
MapD
press@mapd.com

James Christopherson
H2O.ai
james@vscpr.com

 

 

by swebster at May 08, 2017 12:47 PM

Matthew Rocklin

Dask Release 0.14.3

This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.14.3. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on March 22nd.

As always you can conda install from conda-forge

conda install -c conda-forge dask distributed

or you can pip install from PyPI

pip install dask[complete] --upgrade

Conda packages should be on the default channel within a few days.

Arrays

Sparse Arrays

Dask.arrays now support sparse arrays and mixed dense/sparse arrays.

>>> import dask.array as da

>>> x = da.random.random(size=(10000, 10000, 10000, 10000),
...                      chunks=(100, 100, 100, 100))
>>> x[x < 0.99] = 0

>>> import sparse
>>> s = x.map_blocks(sparse.COO)  # parallel array of sparse arrays

In order to support sparse arrays we did two things:

  1. Made dask.array support ndarray containers other than NumPy, as long as they are API compatible
  2. Made a small sparse array library that is API compatible with numpy.ndarray

This process was pretty easy and could be extended to other systems. This also allows for different kinds of ndarrays in the same Dask array, as long as interactions between the arrays are well defined (using the standard NumPy protocols like __array_priority__ and so on.)

Documentation: http://dask.pydata.org/en/latest/array-sparse.html

Update: there is already a pull request for Masked arrays

Reworked FFT code

The da.fft submodule has been extended to include most of the functions in np.fft, with the caveat that multi-dimensional FFTs will only work along single-chunk dimensions. Still, given that rechunking is decently fast today this can be very useful for large image stacks.
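A minimal sketch of that caveat in practice (the array shape and chunking here are just an example):

import dask.array as da

# A stack of 512x512 images, chunked one image per block.
x = da.random.random((100, 512, 512), chunks=(1, 512, 512))

# An FFT along the last axis works directly: that axis lives in a single chunk.
f_rows = da.fft.fft(x, axis=-1)

# To transform along the stack axis, rechunk so that axis 0 becomes single-chunk first.
f_stack = da.fft.fft(x.rechunk((100, 64, 64)), axis=0)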

Documentation: http://dask.pydata.org/en/latest/array-api.html#fast-fourier-transforms

Constructor Plugins

You can now run arbitrary code whenever a dask array is constructed. This empowers users to build in their own policies like rechunking, warning users, or eager evaluation. A dask.array plugin takes in a dask.array and returns either a new dask array, or returns None, in which case the original will be returned.

>>> def f(x):
...     print('%d bytes' % x.nbytes)

>>> with dask.set_options(array_plugins=[f]):
...     x = da.ones((10, 1), chunks=(5, 1))
...     y = x.dot(x.T)
80 bytes
80 bytes
800 bytes
800 bytes

This can be used, for example, to convert dask.array code into numpy code to identify bugs quickly:

>>> with dask.set_options(array_plugins=[lambda x: x.compute()]):
...     x = da.arange(5, chunks=2)

>>> x  # this was automatically converted into a numpy array
array([0, 1, 2, 3, 4])

Or to warn users if they accidentally produce an array with large chunks:

def warn_on_large_chunks(x):
    shapes = list(itertools.product(*x.chunks))
    nbytes = [x.dtype.itemsize * np.prod(shape) for shape in shapes]
    if any(nb > 1e9 for nb in nbytes):
        warnings.warn("Array contains very large chunks")

with dask.set_options(array_plugins=[warn_on_large_chunks]):
    ...

These features were heavily requested by the climate science community, which tends to serve both highly technical computer scientists, and less technical climate scientists who were running into issues with the nuances of chunking.

DataFrames

Dask.dataframe changes are both numerous and very small, making it difficult to give a representative accounting of recent changes within a blogpost. Typically these include small changes to either track new Pandas development, or to fix slight inconsistencies in corner cases (of which there are many.)

Still, two highlights follow:

Rolling windows with time intervals
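The example below assumes a Dask series s indexed by timestamps at one-second intervals; a minimal construction consistent with the output would look something like this:

import pandas as pd
import dask.dataframe as dd

# Ten rows, one per second, matching the timestamps shown in the output below.
index = pd.date_range('2017-01-01', periods=10, freq='1s')
s = dd.from_pandas(pd.Series(1.0, index=index), npartitions=2)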

>>> s.rolling('2s').count().compute()
2017-01-01 00:00:00    1.0
2017-01-01 00:00:01    2.0
2017-01-01 00:00:02    2.0
2017-01-01 00:00:03    2.0
2017-01-01 00:00:04    2.0
2017-01-01 00:00:05    2.0
2017-01-01 00:00:06    2.0
2017-01-01 00:00:07    2.0
2017-01-01 00:00:08    2.0
2017-01-01 00:00:09    2.0
dtype: float64

Read Parquet data with Arrow

Dask now supports reading Parquet data with both fastparquet (a Numpy/Numba solution) and Arrow and Parquet-CPP.

df = dd.read_parquet('/path/to/mydata.parquet', engine='fastparquet')
df = dd.read_parquet('/path/to/mydata.parquet', engine='arrow')

Hopefully this capability increases the use of both projects and results in greater feedback to those libraries so that they can continue to advance Python’s access to the Parquet format.

Graph Optimizations

Dask performs a few passes of simple linear-time graph optimizations before sending a task graph to the scheduler. These optimizations currently vary by collection type, for example dask.arrays have different optimizations than dask.dataframes. These optimizations can greatly improve performance in some cases, but can also increase overhead, which becomes very important for large graphs.

As Dask has grown into more communities, each with strong and differing performance constraints, we’ve found that we needed to allow each community to define its own optimization schemes. The defaults have not changed, but now you can override them with your own. This can be set globally or with a context manager.

def my_optimize_function(graph, keys):
    """ Takes a task graph and a list of output keys, returns new graph """
    new_graph = {...}
    return new_graph

with dask.set_options(array_optimize=my_optimize_function,
                      dataframe_optimize=None,
                      delayed_optimize=my_other_optimize_function):
    x, y = dask.compute(x, y)

Documentation: http://dask.pydata.org/en/latest/optimize.html#customizing-optimization

Speed improvements

Additionally, task fusion has been significantly accelerated. This is very important for large graphs, particularly in dask.array computations.

Web Diagnostics

The distributed scheduler’s web diagnostic page is now served from within the dask scheduler process. This is both good and bad:

  • Good: It is much easier to make new visuals
  • Bad: Dask and Bokeh now share a single CPU

Because Bokeh and Dask now share the same Tornado event loop we no longer need to send messages between them before sending them out to a web browser. The Bokeh server has full access to all of the scheduler state. This lets us build new diagnostic pages more easily. This setup has been around for a while but was largely used for development. In this release we’ve made the new version the default and turned off the old one.

The cost here is that the Bokeh scheduler can take 10-20% of the CPU use. If you are running a computation that heavily taxes the scheduler then you might want to close your diagnostic pages. Fortunately, this almost never happens. The dask scheduler is typically fast enough to never get close to this limit.

Tornado difficulties

Beware that the current versions of Bokeh (0.12.5) and Tornado (4.5) do not play well together. This has been fixed in development versions, and installing with conda is fine, but if you naively pip install then you may experience bad behavior.

Joblib

The Dask.distributed Joblib backend now includes a scatter= keyword, allowing you to pre-scatter select variables out to all of the Dask workers. This significantly cuts down on overhead, especially on machine learning workloads where most of the data doesn’t change very much.

# Send the training data only once to each worker.
# This sketch assumes scikit-learn's digits dataset and a search estimator are
# already defined, e.g. digits = load_digits() and search = GridSearchCV(...).
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
    search.fit(digits.data, digits.target)

Early trials indicate that computations like scikit-learn’s RandomForest scale nicely on a cluster without any additional code.

Documentation: http://distributed.readthedocs.io/en/latest/joblib.html

Preload scripts

When starting a dask.distributed scheduler or worker people often want to include a bit of custom setup code, for example to configure loggers, authenticate with some network system, and so on. This has always been possible if you start scheduler and workers from within Python but is tricky if you want to use the command line interface. Now you can write your custom code as a separate standalone script and ask the command line interface to run it for you at startup:

# scheduler-setup.py
from distributed.diagnostics.plugin import SchedulerPlugin

class MyPlugin(SchedulerPlugin):
    """ Prints a message whenever a worker is added to the cluster """
    def add_worker(self, scheduler=None, worker=None, **kwargs):
        print("Added a new worker at", worker)

def dask_setup(scheduler):
    plugin = MyPlugin()
    scheduler.add_plugin(plugin)

$ dask-scheduler --preload scheduler-setup.py

This makes it easier for people to adapt Dask to their particular institution.

Documentation: http://distributed.readthedocs.io/en/latest/setup.html#customizing-initialization

Network Interfaces (for infiniband)

Many people use Dask on high performance supercomputers. This hardware differs from typical commodity clusters or cloud services in several ways, including very high performance network interconnects like InfiniBand. Typically these systems also have normal ethernet and other networks. You’re probably familiar with this on your own laptop when you have both ethernet and wireless:

$ ifconfig
lo          Link encap:Local Loopback                       # Localhost
            inet addr:127.0.0.1  Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
eth0        Link encap:Ethernet  HWaddr XX:XX:XX:XX:XX:XX   # Ethernet
            inet addr:192.168.0.101
            ...
ib0         Link encap:Infiniband                           # Fast InfiniBand
            inet addr:172.42.0.101

The heuristics Dask uses to determine the network interface often choose ethernet by default. If you are on an HPC system then this is likely not optimal. You can direct Dask to choose a particular network interface with the --interface keyword:

$ dask-scheduler --interface ib0
distributed.scheduler - INFO -   Scheduler at: tcp://172.42.0.101:8786

$ dask-worker tcp://172.42.0.101:8786 --interface ib0

Efficient as_completed

The as_completed iterator returns futures in the order in which they complete. It is the base of many asynchronous applications using Dask.

>>> x, y, z = client.map(inc, [0, 1, 2])
>>> for future in as_completed([x, y, z]):
...     print(future.result())
2
0
1

It can now also wait to yield an element until its result has arrived as well:

>>> for future, result in as_completed([x, y, z], with_results=True):
...     print(result)
2
0
1

It can also yield all futures (and results) that have finished up to this point, in batches:

>>> for futures in as_completed([x, y, z]).batches():
...    print(client.gather(futures))
(2, 0)
(1,)

Both of these help to decrease the overhead of tight inner loops within asynchronous applications.

Example blogpost here: http://matthewrocklin.com/blog/work/2017/04/19/dask-glm-2
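As a sketch of such a tight inner loop (assuming a local cluster and a trivial inc function, not code from the release itself), new work can be submitted from inside the loop and fed back into the same iterator with as_completed.add:

from distributed import Client, as_completed

def inc(x):
    return x + 1

client = Client()                      # starts a local cluster for this sketch
ac = as_completed(client.map(inc, range(5)), with_results=True)

for future, result in ac:
    if result < 10:                    # keep submitting follow-up work until done
        ac.add(client.submit(inc, result))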

Co-released libraries

This release is aligned with a number of other related libraries, notably Pandas, and several smaller libraries for accessing data, including s3fs, hdfs3, fastparquet, and python-snappy, each of which has seen numerous updates over the past few months. Much of the work on these latter libraries is being coordinated by Martin Durant.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.1 release on March 22nd

  • Antoine Pitrou
  • Dmitry Shachnev
  • Erik Welch
  • Eugene Pakhomov
  • Jeff Reback
  • Jim Crist
  • John A Kirkham
  • Joris Van den Bossche
  • Martin Durant
  • Matthew Rocklin
  • Michal Ficek
  • Noah D Brenowitz
  • Stuart Archibald
  • Tom Augspurger
  • Wes McKinney
  • wikiped

The following people contributed to the dask/distributed repository since the 1.16.1 release on March 22nd

  • Antoine Pitrou
  • Bartosz Marcinkowski
  • Ben Schreck
  • Jim Crist
  • Jens Nie
  • Krisztián Szűcs
  • Lezyes
  • Luke Canavan
  • Martin Durant
  • Matthew Rocklin
  • Phil Elson

May 08, 2017 12:00 AM