## May 23, 2017

### Enthought

#### Enthought Receives 2017 Product of the Year Award From National Instruments LabVIEW Tools Network

Python Integration Toolkit for LabVIEW recognized for extending LabVIEW connectivity and bringing the power of Python to applications in Test, Measurement and the Industrial Internet of Things (IIoT)

AUSTIN, TX – May 23, 2017 – Enthought, a global leader in scientific and analytic computing solutions, was honored this week by National Instruments with the LabVIEW Tools Network Platform Connectivity 2017 Product of the Year Award for its Python Integration Toolkit for LabVIEW.

First released at NIWeek 2016, the Python Integration Toolkit enables fast, two-way communication between LabVIEW and Python. With seamless access to the Python ecosystem of tools, LabVIEW users are able to do more with their data than ever before. For example, using the Toolkit, a user can acquire data from test and measurement tools with LabVIEW, perform signal processing or apply machine learning algorithms in Python, display the results in LabVIEW, and then share them through a Python-enabled web dashboard.

“Python is ideally suited for scientists and engineers due to its simple, yet powerful syntax and the availability of an extensive array of open source tools contributed by a user community from industry and R&D,” said Dr. Tim Diller, Director, IIoT Solutions Group at Enthought. “The Python Integration Toolkit for LabVIEW unites the best elements of two major tools in the science and engineering world and we are honored to receive this award.”

Key benefits of the Python Integration Toolkit for LabVIEW from Enthought:

• Enables fast, two-way communication between LabVIEW and Python
• Provides LabVIEW users seamless access to tens of thousands of mature, well-tested scientific and analytic software packages in the Python ecosystem, including software for machine learning, signal processing, image processing and cloud connectivity
• Speeds development time by providing access to robust, pre-developed Python tools
• Provides a comprehensive out-of-the-box solution that allows users to be up and running immediately

“Add-on software from our third-party developers is an integral part of the NI ecosystem, and we’re excited to recognize Enthought for its achievement with the Python Integration Toolkit for LabVIEW,” said Matthew Friedman, senior group manager of the LabVIEW Tools Network at NI.

The Python Integration Toolkit is available for download via the LabVIEW Tools Network, and also includes the Enthought Canopy analysis environment and Python distribution. Enthought’s training, support and consulting resources are also available to help LabVIEW users get the most value from Python.

For more information on Enthought’s Python Integration Toolkit for LabVIEW, visit www.enthought.com/python-for-LabVIEW.

Enthought is a global leader in scientific and analytic software, consulting, and training solutions serving a customer base comprised of some of the most respected names in the oil and gas, manufacturing, financial services, aerospace, military, government, biotechnology, consumer products and technology industries. The company was founded in 2001 and is headquartered in Austin, Texas, with additional offices in Cambridge, United Kingdom and Pune, India. For more information visit www.enthought.com and connect with Enthought on Twitter, LinkedIn, Google+, Facebook and YouTube.

Since 1976, NI (www.ni.com) has made it possible for engineers and scientists to solve the world’s greatest engineering challenges with powerful platform-based systems that accelerate productivity and drive rapid innovation. Customers from a wide variety of industries – from healthcare to automotive and from consumer electronics to particle physics – use NI’s integrated hardware and software platform to improve the world we live in.

The LabVIEW Tools Network is the NI app store equipping engineers and scientists with certified, third-party add-ons and apps to complete their systems. Developed by industry experts, these cutting-edge technologies expand the power of NI software and modular hardware. Each third-party product is reviewed to meet specific guidelines and ensure compatibility. With hundreds of products available, the LabVIEW Tools Network is part of a rich ecosystem extending the NI Platform to help customers positively impact our world. Learn more about the LabVIEW Tools Network at www.ni.com/labview-tools-network.

LabVIEW, National Instruments, NI, ni.com and NIWeek are trademarks of National Instruments. Enthought, Canopy and Python Integration Toolkit for LabVIEW are trademarks of Enthought, Inc.

Media Contact

Courtenay Godshall, VP, Marketing, +1.512.536.1057, cgodshall@enthought.com

### numfocus

#### Welcome Nancy Nguyen, the new NumFOCUS Events Coordinator!

NumFOCUS is pleased to announce Nancy Nguyen has been hired as our new Events Coordinator. Nancy has over five years of event management experience in the non-profit and higher education sectors. She graduated from The University of Texas at Austin in 2011 with a BA in History. Prior to joining NumFOCUS, Nancy worked in development and fundraising […]

### Matthieu Brucher

#### Announcement: ATKBassPreamp 1.0.0

I’m happy to announce the release of ATKBassPreamp, a model of the Fender Bassman preamplifier stage built on the Audio Toolkit. It is available on Windows and OS X (min. 10.11) in several formats.

ATKBassPreamp

The supported formats are:

• VST2 (32-bit/64-bit on Windows, 64-bit on OS X)
• VST3 (32-bit/64-bit on Windows, 64-bit on OS X)
• Audio Unit (64-bit, OS X)

The plugin files, the previous plugins, and the source code can all be downloaded from SourceForge.

## May 22, 2017

### numfocus

#### NumFOCUS Awards Small Development Grants to Projects

This spring the NumFOCUS Board of Directors awarded targeted small development grants to applicants from or approved by our sponsored and affiliated projects. In the wake of a successful 2016 end-of-year fundraising drive, NumFOCUS wanted to direct the donated funds to our projects in a way that would have impact and visibility to donors and […]

## May 17, 2017

### numfocus

#### What is it like to chair a PyData conference?

Have you ever wondered what it’s like to be in charge of a PyData event? Vincent Warmerdam has collected some thoughts and reflections on his experience chairing this year’s PyData Amsterdam conference: This year I was the chair of PyData Amsterdam and I’d like to share some insights on what that was like. I was on the committee the […]

## May 16, 2017

### Matthieu Brucher

#### Book review: OpenGL Data Visualization Cookbook

This review will actually be quite quick: I haven’t finished the book and I won’t finish it.

The book was published in August 2015 and is based on OpenGL 3. The authors do sometimes note that you could do better with shaders, but the fact is that if you want to run the code they propose, you need the backward-compatibility layer, if it is available. OpenGL 3 was released almost a decade ago; I can't understand why, in 2015, two authors decided that a new book on scientific visualization should use an API that was deprecated a long time ago. What a waste of time.

## May 14, 2017

### Titus Brown

#### How to analyze, integrate, and model large volumes of biological data - some thoughts

This blog post stems from notes I made for a 12 minute talk at the Oregon State Microbiome Initiative, which followed from some previous thinking about data integration on my part -- in particular, Physics ain't biology (and vice versa) and What to do with lots of (sequencing) data.

My talk slides from OSU are here if you're interested.

Thanks to Andy Cameron for his detailed pre-publication peer review - any mistakes remaining are of course mine, not his ;).

Note: During the events below, I was just a graduate student. So my perspective is probably pretty limited. But this is what I saw and remember!

My graduate work was in Eric Davidson's lab, where we studied early development in the sea urchin. Eric had always been very interested in gene expression, and over the preceding decade or two (1980s and onwards) had invested heavily in genomic technologies. This included lots of cDNA macroarrays and BAC libraries, as well as (eventually) the sea urchin genome project.

The sea urchin is a great system for studying early development! You can get literally billions of synchronously developing embryos by fertilizing all the eggs simultaneously; the developing embryo is crystal clear and large enough to be examined using a dissecting scope; sea urchins are available world-wide; early development is mostly invariant with respect to cell lineage (although that comes with a lot of caveats); and sea urchin embryos have been studied since the 1800s, so there was a lot of background literature on the embryology.

##### The challenge: data integration without guiding theory

What we were faced with in the '90s and '00s was a challenge provided by the scads of new molecular data provided by genomics: that of data integration. We had discovered plenty of genes (all the usual homologs of things known in mice and fruit flies and other animals), we had cell-type specific markers, we could measure individual gene expression fairly easily and accurately with qPCR, we had perturbations working with morpholino oligos, and we had reporter assays working quite well with CAT and GFP. And now, between BAC sequencing and cDNA libraries and (eventually) genome sequencing, we had tons of genomic and transcriptomic data available.

How could we make sense of all of this data? It's hard to convey the confusion and squishiness of a lot of this data to anyone who hasn't done hands-on biology research; I would just say that single experiments or even collections of many experiments rarely provided a definitive answer, and usually just led to new questions. This is not rare in science, of course, but it typically took 2-3 years to figure out what a specific transcription factor might be doing in early development, much less nail down its specific upstream and downstream connections. Scale that to the dozens or hundreds of genes involved in early development and, well, it was a lot of people, a lot of confusion, and a lot of discussion.

##### The GRN wiring diagram

To make a longer story somewhat shorter:

Eric ended up leading an effort (together with Hamid Bolouri, Dave McClay, Andy Cameron, and others in the sea urchin community) to build a gene regulatory network that provided a foundation for data integration and exploration. You can see the result here:

http://sugp.caltech.edu/endomes/

This network at its core is essentially a map of the genomic connections between genes (transcriptional regulation of transcription factors, together with downstream connections mediated by specific binding sites and signaling interactions between cells, as well as whatever other information we had). Eric named this "the view from the genome." On top of this is layered several different "views from the nucleus", which charted the different regulatory states initiated by asymmetries such as the localization of beta-catenin to the vegetal pole of the egg, and the location of sperm entry into the egg.

At least when it started, the network served primarily as a map of the interactions - a somewhat opinionated interpretation of both published and unpublished data. Peter et al., 2012 showed that the network could be used for in silico perturbations, but I don't know how much of that has been followed up on. During my time with it, it mainly served as a communications medium and a point of reference for discussions about future experiments, as well as an integrative guide to published work.

What was sort of stunning in hindsight is the extent to which this model became a touchpoint for our lab and (fairly quickly) the community that studied sea urchin early development. Eric presented the network one year at the annual Developmental Biology of the Sea Urchin meeting, and by the next meeting, 18 months later, I remember it showing up in a good portion of talks from other labs. (One of my favorite memories is someone from Dave McClay's lab - I think it was Cyndi Bradham - putting up a view of the GRN inverted to make signaling interactions the core focus, instead of transcriptional regulation; heresy in Eric's lab!)

In essence, the GRN became a community resource fairly quickly. It was provided in both image and interactive form (using BioTapestry), and people felt free to edit and modify the network for their own presentations. It readily enabled in silico thought experiments - "what happens if I knock out this gene? The model predicts this, and this, and this should be downstream, and this other gene should be unaffected" - that quickly led to choosing impactful real experiments. In part because of this, arguments about the effects of specific genes quickly converged to conversations about how to test the arguments (for some definition of "quickly" and "conversation" - sometimes discussions were quite, ahem, robust in Eric's lab and the larger community!)

The GRN also served to highlight the unknowns and the insufficiencies in the model. Eric and others spent a lot of time thinking through questions such as this: "we know that transcription of gene X is repressed by gene Y; but something must still activate gene X. What could it be?" Eventually we did "crazy" things like measure the transcriptional levels and spatial expression patterns of all ~1000 transcription factors found in the sea urchin genome, which could then be directly integrated into the model for further testing.

In short, the GRN was a pretty amazing way for the community of people interested in early development in the sea urchin to communicate about the details. Universal agreement wasn't the major outcome, although I think significant points about early development were settled in part through the model - communication was the outcome.

And, importantly, it served as a central meeting point for data analysis. More on this below.

##### Missed opportunities?

One of the major missed opportunities (in my view, obviously - feel free to disagree, the comment section is below :) was that we never turned the GRN into a model that was super easy for experimentalists to play with. It would have required significant software development effort to make it possible to do a clickable gene knockdown followed by a predicted phenotype readout -- but this hasn't been done yet; apparently it has been tough to find funding for this purpose. Had I stayed in the developmental biology game, I like to think I would have invested significant effort in this kind of approach.

I also don't feel like much time was invested in the community annotation and updating aspect of things. The official model was tightly controlled by a few people (in the traditional scientific "experts know best!" approach) and there was no particular attempt to involve the larger community in annotating or updating the model except through one-on-one conversations or formal publications. It's definitely possible that I just missed it, because I was just a graduate student, and by mid-2004 I had also mentally checked out of grad school (it took me a few more years to physically check out ;).

##### Taking and holding ground

One question that occupies my mind a lot is the question of how we learn, as a community, from the research and data being produced in each lab. With data, one answer is to work to make the data public, annotate it, curate it, make it discoverable - all things that I'm interested in.

With research more broadly, though, it's more challenging. Papers are relatively poor methods for communicating the results of research, especially now that we have the Internet and interactive Web sites. Surely there are better venues (perhaps ones like Distill, the interactive visual journal for machine learning research). Regardless, the vast profusion of papers on any possible topic, combined with the array of interdisciplinary methods needed, means that knowledge integration is slow and knowledge diffusion isn't much faster.

I fear this means that when it comes to specific systems and questions, we are potentially forgetting many things that we "know" as people retire or move on to other systems or questions. This is maybe to be expected, but when we confront the level of complexity inherent in biology, with little obvious convergence between systems, it seems problematic to repose our knowledge in dead-tree formats.

##### Mechanistic maps and models for knowledge storage and data integration

So perhaps the solution is maps and models, as I describe above?

In thinking about microbiomes and microbial communities, I'm not sure what form a model would take. At the most concrete and boring level, a directly useful model would be something that took in a bunch of genomic/transcriptomic/proteomic data and evaluated it against everything that we knew, and then sorted it into "expected" and "unexpected". (This is what I discussed a little bit in my talk at OSU.)

The "expected" would be things like the observation of carbon fixation pathways in well-understood autotrophs - "yep, there it is, sort of matches what we already see." The "unexpected" would be things like unannotated or poorly understood genes that were behaving in ways that suggested they were correlated with whatever conditions we were examining. Perhaps we could have multiple bins of unexpected, so that we could separate out things like genes where the genome, transcriptome, and proteome all provided evidence of expression versus situations where we simply saw a transcript with no other kind of data. I don't know.

If I were to indulge in fanciful thinking, I could imagine a sort of Maxwell's Daemon of data integration, sorting data into bins of "boring" and "interesting", churning through data sets looking for a collection of "interesting" that correlated with other data sets produced from the same system. It's likely that such a daemon would have to involve some form of deep correlational analysis and structure identification - deep learning comes to mind. I really don't know.
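To make the daemon less abstract, here is a deliberately toy sketch (invented gene names, invented expression values, and an arbitrary correlation threshold, none of it from real data): score each gene by how strongly its expression tracks the condition of interest, then bin annotated, well-correlated genes as "expected", unannotated but correlated genes as "unexpected", and the rest as "boring".

```python
import numpy as np

# Toy expression profiles across 6 samples; the condition is off/off/off/on/on/on.
condition = np.array([0, 0, 0, 1, 1, 1], dtype=float)
genes = {
    # name: (annotated?, expression per sample)
    "rbcL":    (True,  np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])),  # known gene, responsive
    "hypo_17": (False, np.array([0.2, 0.3, 0.2, 3.9, 4.1, 4.0])),  # unannotated, responsive
    "gyrA":    (True,  np.array([2.0, 2.1, 1.9, 2.0, 2.2, 1.9])),  # housekeeping, flat
}

expected, unexpected, boring = [], [], []
for name, (annotated, expr) in genes.items():
    r = np.corrcoef(condition, expr)[0, 1]  # correlation with the condition
    if abs(r) < 0.8:
        boring.append(name)        # flat or noisy: ignore
    elif annotated:
        expected.append(name)      # matches what we already know
    else:
        unexpected.append(name)    # correlated but unexplained: interesting
```

A real daemon would of course need far richer structure identification than a single correlation coefficient, as the post says, but the shape - churn through data, sort into bins, surface the "interesting" residue - is the same.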

One interesting question is, how would this interact with experimental biology and experimental biologists? The most immediately useful models might be the ones that worked off of individual genomes, such as flux-balance models; they could be applied to data from new experimental conditions and knockouts, or shifted to apply to strain variants and related species and look for missing genes in known pathways, or new genes that looked potentially interesting.

So I don't know a lot. All I do know is that our current approaches for knowledge integration don't scale to the volume of data we're gathering or (perhaps more importantly) to the scale of the biology we're investigating, and I'm pretty sure computational modeling of some sort has to be brought into the fray in practical ways.

Perhaps one way of thinking about this is to ask what types of computational models would serve as good reference resources, akin to a reference genome. The microbiome world is surprisingly bereft of good reference resources, with the 16S databases and IMG/M serving as two of the big ones; but we clearly need more, in the vein of a community KEGG and other such resources, curated and regularly updated.

##### Some concluding thoughts

Communication of understanding is key to progress in science; we should work on better ways of doing that. Open science (open data, open source, open access) is one way of better communicating data, computational methods, and results.

One theme that stood out for me from the microbiome workshop at OSU was that of energetics, a point that Stephen Giovannoni made most clearly. To paraphrase, "Microbiome science is limited by the difficulty of assessing the pros and cons of metabolic strategies." The guiding force behind evolution and ecology in the microbial world is energetics, and if we can get a mechanistic handle on energy extraction (autotrophy and heterotrophy) in single genomes and then graduate that to metagenome and community analysis, maybe that will provide a solid stepping stone for progress.

I'm a bit skeptical that the patterns that ecology and evolution can predict will be of immediate use for developing a predictive model. On the other hand, Jesse Zaneveld at the meeting presented on the notion that all happy microbiomes look the same, while all dysfunctional microbiomes are dysfunctional in their own special way; and Jesse pointed towards molecular signatures of dysfunction; so perhaps I'm wrong :).

It may well be that our data is still far too sparse to enable us to build a detailed mechanistic understanding of even simple microbial ecosystems. I wouldn't be surprised by this.

Trent Northen from the JGI concluded in his talk that we need model ecosystems too; absolutely! Perhaps experimental model ecosystems, either natural or fabricated, can serve to identify the computational approaches that will be most useful.

Along this vein, are there a natural set of big questions and core systems for which we could think about models? In the developmental biology world, we have a few big model systems that we focused on (mouse, zebrafish, fruit fly, and worm) - what are the equivalent microbial ecosystems?

--titus

p.s. There are a ton of references and they can be fairly easily found, but a decent starting point might be Davidson et al., 2002, "A genomic regulatory network for development."

## May 13, 2017

### Matthieu Brucher

#### Analog modeling: Comparing preamps

In a previous post, I explained how I modeled the triode inverter circuit. I’ve decided to put it inside two different plugins, so I’d like to present their differences in four pictures.

##### Preamp plugins

One plugin will start as a model of the Fender Bassman preamp (its inverter circuit, followed by its associated tone stack); the other will model the inverter section of a Vox AC30, followed by the tone stack of a JCM800. Compared to the default preamp of Audio Toolkit, the behaviors are quite different, even though only a few component values change:

30Hz preamps behavior

200Hz preamps behavior

1kHz preamps behavior

10kHz preamps behavior

I will probably add the option of using a different triode model (Audio Toolkit has lots of options, and I definitely need to present the differences in terms of quality and performance), and perhaps also a way of selecting a different tone stack. But for now, the plugins will offer a single model of a triode inverter followed by a tone stack. To model a full amp, you would still need to model the power stage and the loudspeaker.

The next picture displays the preamp behavior depending on the triode function used:

Response with different triode functions

The Leach model and the original Koren model don’t behave as well as the other models. This is probably due to different parameters, but they still give a good idea of the behavior of the tube. The modified Munro-Piazza model is my own modification of the tube function that makes the derivative continuous as well. It helps convergence when the state of the tube changes quickly, even though it is clearly not enough to remove all discontinuities.

The following picture shows the cost of the different models, profiled with valgrind:

Triode preamp profiles

Obviously, for the more complex functions, most of the time is spent computing the logarithm of a double-precision number. This is because the fast math functions in ATK don’t support the logarithm yet. If we use single-precision floats instead, the cost is divided by three for the Koren model, for instance.

As the results are quite close, choosing between floats and doubles is a trade-off between performance and precision. In the plugins, I will use floats to maximize performance.
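The precision side of that trade-off is easy to sketch numerically (a toy numpy illustration, not ATK's actual fast-math code): single precision carries roughly 7 significant digits against 15-16 for double, which is generally plenty for an audio signal path.

```python
import numpy as np

x = np.float64(1.2345678901234567)

# The same logarithm computed in double and in single precision.
log_double = np.log(x)               # float64: ~15-16 significant digits
log_single = np.log(np.float32(x))   # float32: ~7 significant digits

# The two results differ only around float32's machine epsilon (~1.2e-7),
# far below anything audible in a preamp model.
rel_err = abs(float(log_single) - float(log_double)) / abs(float(log_double))
```

Part of the speedup from floats typically also comes from narrower data moving through caches and SIMD lanes, on top of the cheaper math function itself.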

##### Coming next

In the next two weeks, the two plugins will be released. And depending on feedback and comments, I’ll add more options.

## May 12, 2017

### Trichech

#### Things to Consider When Selling Nasi Box (Boxed Rice Meals)

Selling nasi box (boxed rice meals) in Jakarta has become one of the most popular businesses in recent times, driven by the growing number of events being held. As a result, it can be an abundant and profitable source of income. Many people order boxed rice because it is simple and hassle-free, something you don't get when cooking large quantities of food yourself. Preparing dishes yourself to feed a large number of guests is exhausting and far from cheap.

Today, a boxed-rice business can be a real way to make a living. The income it generates can cover a family's needs, from education and healthcare to other primary and secondary necessities - provided you know and understand what matters in running this culinary business.

When running a boxed-rice business, it is very important to consider the sales location. The address is a key point, because it is the first thing customers look for when they want to place an order. A telephone number that can actually be reached also matters a great deal. Beyond these two things, you need to ensure a steady supply of ingredients to cook; if stock runs short, the business will not run at full capacity, and in the worst case customers will leave for another boxed-rice seller.


## May 11, 2017

### Leonardo Uieda

#### Reviews of our Scipy 2017 talk proposal: Bringing the Generic Mapping Tools to Python

This year, Scipy is using a double-open peer-review system, meaning that both authors and reviewers know each other’s identities. These are the reviews we got for our proposal, along with our replies/comments (posted with permission from the reviewers). My sincerest thanks to all reviewers and editors for their time and effort.

The open review model is great because it increases the transparency of the process and might even result in better reviews. I started signing my reviews a few years ago and I found that I'm more careful with the tone of my review to make sure I don't offend anyone and provide constructive feedback.

Now, on to the reviews!

##### Review 1 - Paul Celicourt

The paper introduces a Python wrapper for the C-based Generic Mapping Tools used to process and analyze time series and gridded data. The content is well organized, but I encourage the authors to consider the following comments: While the authors promise to demonstrate an initial prototype of the wrapper, it is not sure that a WORKING prototype will be available by the time of the conference as claimed by the authors when looking at the potential functionalities to be implemented and presented in the second paragraph of the extended abstract. Furthermore, it is not clear what would be the functionalities of the initial prototype. On top of that, the approach to the implementation is not fully presented. For instance, the Simplified Wrapper and Interface Generator (SWIG) tool may be used to reduce the workload but the authors do not mention whether the wrapper would be manually developed or using an automated tool such as the SWIG. Finally, the portability of the shared memory process has not been addressed.

Thanks for all your comments, Paul! They are good questions and we should have addressed them better in the abstract.

That is a valid concern regarding the working prototype. We're not sure how much of the prototype will be ready for the conference. We are sure that we'll have something to show, even if it's not complete. The focus of the talk will be on our design decisions, implementation details, and the changes in the GMT modern execution mode on which the Python wrappers are based. We'll run some examples of whatever we have working mostly for the "Oooh"s and "Aaah"s.

The wrapper will be manually generated using ctypes. We chose this over SWIG or Cython because ctypes allows us to write pure Python code. It's a much simpler way of wrapping a C library. Not having any compiled extension modules also greatly facilitates distributing the package across operating systems. The same wrapper code can work on Windows, OSX, and Linux (as long as the GMT shared library is available).

The number of C functions that we'll have to wrap is not that large. Mainly, we need GMT_Call_Module to run a command (like psxy), GMT_Create_Session to generate the session structure, and GMT_Open_VirtualFile and GMT_Read_VirtualFile to pass data to and from Python. The majority of the work will be in creating the Python functions for each GMT command, documenting them, and parsing the Python function arguments into something that GMT_Call_Module accepts. This work would have to be done manually with SWIG or Cython as well, so ctypes is not at a disadvantage here. There are some more details in our initial design and goals.
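To make the ctypes approach concrete, here is a minimal sketch of the calling pattern, using the standard C math library as a stand-in for the GMT shared library (which may not be installed everywhere). The mechanics - load the shared library, declare the C signature, call the function from pure Python - are the same ones a wrapper would apply to functions like GMT_Create_Session and GMT_Call_Module.

```python
import ctypes
import ctypes.util

# Locate and load a shared C library. A GMT wrapper would do the same
# with the GMT library, e.g. ctypes.util.find_library("gmt").
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature before calling: double sqrt(double).
# GMT_Call_Module and friends would get their argtypes/restype
# declared the same way, following the C API headers.
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # calls the C function directly from Python
```

Because nothing is compiled, the same Python file runs unchanged on Windows, OSX, and Linux, which is exactly the distribution advantage described above.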

##### Review 2 - Ricardo Barros Lourenço

The authors submitted a clear abstract, in the sense that they will present a new Python library, which is a binding to the Generic Mapping Tools (GMT) C library, which is widely adopted by the Geosciences community. They were careful in detailing their reasoning in such implementation, and also in analogue initiatives by other groups.

In terms of completeness, the abstract precisely describes that the design plans and some of the implementation would be detailed and explained, as well as a demo of their current version of the library. It was very interesting that the authors, while describing their implementation, also pointed out that the library could be used in applications not necessarily related to geoscience, through the generation of general line plots, bar graphs, histograms, and 3D surfaces. It would be beneficial to the audience to see how this aspect is sustained, by comparing such capabilities with other libraries (such as Matplotlib and Seaborn) and evaluating their contribution to the geoscientific domain, and also to the expanded related areas.

The abstract is highly compelling to the Earth Sciences community members at the event because the GMT module is already used for high-quality visualization (both in electronic and in printed outputs - maps - which is an important contribution), and a Python integration could simplify bringing "Pythonic" workflows into it, expanding the possibilities in geoscientific visualization, especially in printed maps.

It would be interesting, aside from a presumed comparison in online visualization with matplotlib and cartopy, if the authors would also discuss in their presentation other possible contributions, such as online tile generation in map servers, which is very expensive in terms of computational resources and is still challenging in an exclusively "Pythonic" environment. Additionally, it would be interesting if the authors provide some clarification on whether there is any limitation on the usage of such a library, more specifically regarding the high variance in geoscientific data sources, and also on how netCDF containers are consumed in their workflow (considering that these containers don't necessarily conform to a strict standard, allowing users to customize their usage) in terms of the automation of this I/O.

The topic is of high relevance because there are still few options for spatial data visualization in a "fully Pythonic" environment, and none of them is used in the process of plotting physical maps in a production setting, as GMT is. Considering these aspects, I recommend this proposal for acceptance.

Thank you, Ricardo, for your incentives and suggestions for the presentation!

I hadn't thought about the potential use in map tiling but we'll keep an eye on that from now on and see if we have anything to say about it. Thanks!

Regarding netCDF, the idea is to leverage the xarray library for I/O and use their Dataset objects as input and output arguments for the grid related GMT commands. There is also the option of giving the Python functions the file name of a grid and have GMT handle I/O, as it already does in the command line. The appeal of using xarray is that it integrates well with numpy and pandas and can be used instead of gmt grdmath (no need to tie your head in knots over RPN anymore!).
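For a taste of what moving away from RPN means in practice, here is a minimal sketch using plain NumPy (whose infix arithmetic semantics xarray's DataArray objects follow); the grid values and the grdmath pipeline shown in the comment are made up purely for illustration:

```python
import numpy as np

# Stand-in for a grid you would otherwise load from a netCDF file
# (with xarray this would be an xarray.DataArray; values are made up).
grid = np.array([[1.0, 2.0],
                 [3.0, 4.0]])

# The RPN pipeline "gmt grdmath input.nc 2 MUL 1 ADD = output.nc"
# becomes ordinary infix arithmetic:
result = grid * 2 + 1

print(result)
```

The same expression would work element-wise on an xarray object loaded from a netCDF grid, which is the integration with numpy and pandas mentioned above.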

### Review 3 - Ryan May

Python bindings for GMT, as demonstrated by the authors, are very much in demand within the geoscience community. The work lays out a clear path towards implementation, so it's an important opportunity for the community to be able to offer API and interaction feedback. I feel this talk would be very well received and would kick off an important dialogue within the geoscience Python community.

Thanks, Ryan! Getting community feedback was the motivation for submitting a talk without having anything ready to show yet. It'll be much easier to see what the community wants and thinks before we have fully committed to an implementation. We're very much open and looking forward to getting a ton of questions!

What would you like to see in a GMT Python library? Let us know if there are any questions/suggestions before the conference. See you at Scipy in July!

Thumbnail image for this post is modified from "ScientificReview" by the Center for Scientific Review which is in the public domain.

Comments? Leave one below or let me know on Twitter @leouieda or in the Software Underground Slack group.

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!

## May 09, 2017

### Matthieu Brucher

#### Book review: Getting Started With JUCE

After the announcement of the JUCE 5 release, I played a little bit with it, and then decided to read the only book on JUCE. It’s outdated and covers JUCE 2.1.2. But who knows, it may be a gem?

#### Content and opinions

The book starts with a chapter on JUCE and its installation. While the main application changed its name from Introjucer to Projucer (probably because of the change in licence), the rest seems to be quite similar. The app still creates a Visual Studio solution, an Xcode project, or makefiles.

The second chapter was the one I was most interested in, because I plan on creating new components and GUIs for my plugins. The chapter still feels a little short; we could have done more with overriding a custom look and feel, handling more events… It seems that not much has changed in this area since JUCE 2.1, which is quite an achievement for ROLI and the team. If my components can last several major releases, I know I can invest time in learning JUCE.

The next chapter feels useless with C++11/14 or even Boost. It seems JUCE still uses a custom String class, which is too bad; I'm not sure it really brings anything to the library. There are also other APIs that are, in my opinion, now deprecated (data types like int32, file system handling…).

The fourth chapter deals with streaming data and building a small app that can play sound. It is a nice feature, but I have to say I read it even faster than the previous chapter because it was of no interest to me.

Finally, the last chapter ends the book with a sudden note on some utilities (I can’t even remember them without reading the chapter again). There is no final conclusion, just a feeling that the book was more a list of tutorials than a real book.

#### Conclusion

I wouldn’t recommend buying the book, as the tutorials cover the only bit that is still relevant. But I can hope for an updated version, one day.

## May 08, 2017

### Continuum Analytics news

#### Anaconda Joins Forces with Leading Companies to Further Innovate Open Data Science

Monday, May 8, 2017
Travis Oliphant
President, Chief Data Scientist & Co-Founder

In addition to announcing the formation of the GPU Open Analytics Initiative with H2O and MapD, today, we are pleased to announce an exciting collaboration with NVIDIA, H2O and MapD, with a goal of democratizing machine learning to increase performance gains of data science workloads. Using NVIDIA’s Graphics Processing Unit (GPU) technology, Anaconda is mobilizing the Open Data Science movement by helping teams avoid the data transfer process between Central Processing Units (CPUs) and GPUs and move toward their larger business goals.

The new GPU Data Frame (GDF) will augment the Anaconda platform as the foundational fabric to bring data science technologies together allowing it to take full advantage of GPU performance gains. In most workflows using GPUs, data is first manipulated with the CPU and then loaded to the GPU for analytics. This creates a data transfer “tax” on the overall workflow.   With the new GDF initiative, data scientists will be able to move data easily onto the GPU and do all their manipulation and analytics at the same time without the extra transfer of data. With this collaboration, we are opening the door to an era where innovative AI applications can be deployed into production at an unprecedented pace and often with just a single click.

In a nutshell, this collaboration provides these key benefits:

• Python Democratization. GPU Data Frame makes it easy to create new optimized data science models and iterate on ideas using the most innovative GPU and AI technologies.

• Python Acceleration. The standard empowers data scientists with unparalleled acceleration within Python on GPUs for data science workloads, enabling Open Data Science to proliferate across the enterprise.

• Python Production. Data science teams can move beyond ad-hoc analysis to unearthing game-changing results within production-deployed data science applications that drive measurable business impact.

Anaconda aims to bring the performance, insights and intelligence enterprises need to compete in today’s data-driven economy. We’re excited to be working with NVIDIA, MapD, and H2O as GPU Data Frame pushes the door to Open Data Science wide open by further empowering the data scientist community with unparalleled innovation, enabling Open Data Science to proliferate across the enterprise.

#### Anaconda Easy Button - Microsoft SQL Server and Python

Tuesday, May 9, 2017
Ian Stokes-Rees
Continuum Analytics


Previously there were many twisty roads that you may have followed if you wanted to use Python on a client system to connect to a Microsoft SQL Server database, and not all of those roads would even get you to your destination. With the news that Microsoft SQL Server 2017 has increased support for Python, by including a subset of Anaconda packages on the server-side, I thought it would be useful to demonstrate how Anaconda delivers the easy button to get Python on the client side connected to Microsoft SQL Server.

This blog post demonstrates how Anaconda and Anaconda Enterprise can be used on the client side to connect Python running on Windows, Mac, or Linux to a SQL Server instance. The instructions should work for many versions of SQL Server, Python and Anaconda, including Anaconda Enterprise, our commercially oriented version of Anaconda that adds strong collaboration, security, and server deployment capabilities. If you run into any trouble let us know either through the Anaconda Community Support mailing list or on Twitter @ContinuumIO.

## TL;DR: For the Impatient

If you're the kind of person who just wants the punch line and not the story, there are three core steps to connect to an existing SQL Server database:

1. Install the SQL Server drivers for your platform on the client system. That is described in the Client Driver Installation section below.

2. conda install pyodbc

3. Establish a connection to the SQL Server database with an appropriately constructed connection statement:

conn = pyodbc.connect(
    r'DRIVER={ODBC Driver 13 for SQL Server};' +
    ('SERVER={server},{port};'
     'DATABASE={database};'
     'UID={username};'
     'PWD={password}').format(
        server='sqlserver.testnet.corp',
        port=1433,
        database='AdventureWorksDW2012',
        username='tanya',
        password='Tanya1234',
    )
)

For cut-and-paste convenience, here's the fully expanded string:

'DRIVER={ODBC Driver 13 for SQL Server};SERVER=sqlserver.testnet.corp,1433;DATABASE=AdventureWorksDW2012;UID=tanya;PWD=Tanya1234'

Hopefully that doesn't look too intimidating!

Here's the scoop: the Python piece is easy (yay Python!), whereas the challenge is installing the platform-specific drivers (Step 1). And if you don't already have a database properly set up, then the SQL Server installation, configuration, database loading, and setting up of appropriate security credentials are the parts the rest of this blog post goes into in more detail, along with a fully worked example of client-side connection and query.

And you can grab a copy of this blog post as a Jupyter Notebook from Anaconda Cloud.

## On With The Story

While this isn't meant to be an exhaustive reference for SQL Server connectivity from Python and Anaconda it does cover several client/server configurations. In all cases I was running SQL Server 2016 on a Windows 10 system. My Linux system was CentOS 6.9 based. My Mac was running macOS 10.12.4, and my client-side Windows system also used Windows 10. The Windows and Mac Python examples were using Anaconda 4.3 with Python 3.6 and pyodbc version 3.0, while the Linux example used Anaconda Enterprise, based on Anaconda 4.2, using Python 2.7.

NOTE: In the examples below the $ symbol indicates the command line prompt. Do not include this in any commands if you cut-and-paste. Your prompt will probably look different!

## Server Side Preparation

If you are an experienced SQL Server administrator then you can skip this section. All you need to know are:

1. The hostname or IP address and port number for your SQL Server instance
2. The database you want to connect to
3. The user credentials that will be used to make the connection

The following provides details on how to set up your SQL Server instance to be able to exactly replicate the client-side Python-based connection that follows. If you do not have Microsoft SQL Server it can be downloaded and installed for free and is now available for Windows and Linux.

NOTE: The recently released SQL Server 2017 and SQL Server on Azure both require pyodbc version >= 3.2. This blog post has been developed using SQL Server 2016 with pyodbc version 3.0.1.

This demonstration is going to use the Adventure Works sample database provided by Microsoft on CodePlex. There are instructions on how to install this into your SQL Server instance in Step 3 of this blog post, however you can also simply connect to an existing database by adjusting the connection commands below accordingly.

Many of the preparation steps described below are most easily handled using SQL Server Management Studio, which can be downloaded and installed for free.

Additionally this example makes use of the Mixed Authentication Mode, which allows SQL Server-based usernames and passwords for database authentication. By default this is not enabled and only Windows Authentication is permitted, which makes use of the pre-existing Kerberos user authentication token wallet. It should go without saying that you would only change to Mixed Authentication Mode for testing purposes if SQL Server is not already so configured.

While the example below focuses on SQL Server Authentication, there are also alternatives presented for the Windows Authentication Mode that uses Kerberos tokens.

You should know the hostname and port on which SQL Server is running, and be sure you can connect to that hostname (or IP address) and port from the client-side system. The easiest way to test this is with telnet, executing the following from the command line:

$ telnet sqlserver.testnet.corp 1433

Where you would replace sqlserver.testnet.corp with the hostname or IP address of your SQL Server instance and 1433 with the port SQL Server is running on. Port 1433 is the SQL Server default. Executing this command on the client system should return output like:

Trying sqlserver.testnet.corp...
Connected to sqlserver.testnet.corp.
Escape character is '^]'.

telnet> close


At which point you can then type CTRL-] and then close.
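If telnet isn't available on the client system, you can run an equivalent reachability check with Python's standard library; sqlserver.testnet.corp and 1433 are the example values from this post, so substitute your own:

```python
import socket

def can_connect(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Substitute your SQL Server hostname/IP and port:
# can_connect('sqlserver.testnet.corp', 1433)
```

A True result only confirms the port is reachable; it says nothing yet about credentials or database permissions.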

If your client system is also Windows you can perform this simple Universal Data Link (UDL) test.


Finally you will need to confirm that you have access credentials for a user known to SQL Server and that the user is permitted to perform SELECT operations (and perhaps others) on the database in question. In this particular example we are making use of a fictional user named Tanya who has the username tanya and a password of Tanya1234. It is a 3 step process to get Tanya access to the Adventure Works database:

1. The SQL Server user tanya is added as a Login to the DBMS:

• which can be found in SQL Server Management Studio
• provide a Login name of tanya
• select SQL Server authentication
• provide a password of Tanya1234
• uncheck the option for Enforce password policy (the other two will automatically be unchecked and greyed out)
2. The database user needs to be added:

• right-click on Users
• select SQL user with Login for type
• add a Username set to tanya
• add a Login name set to tanya.
3. Grant tanya permissions on the AdventureWorksDW2012 database by executing the following query:

use AdventureWorksDW2012;
GRANT SELECT, INSERT, DELETE, UPDATE ON SCHEMA::DBO TO tanya;


Using the UDL test method described above is a good way to confirm that tanya can connect to the database, even just from localhost, though it does not confirm whether she can perform operations such as SELECT. For that I would recommend installing the free Microsoft Command Line Utilities 13.1 for SQL Server.

## Client Driver Installation

You'll now need to get the drivers installed on your client system. There are two parts to this: the platform specific dynamic libraries for SQL Server, and the Python ODBC interface. You won't be surprised to hear that the platform-specific libraries are harder to get setup, but this should still only take 10-15 minutes and there are established processes for all major operating systems.

### Linux

The Linux drivers are available for RHEL, Ubuntu, and SUSE. This is the process you'd follow if you are using Anaconda Enterprise, as well as for anyone using a Linux variant with Anaconda installed.

My Linux test system was using CentOS 6.9, so I followed the RHEL6 installation procedure from Microsoft (linked above), which essentially consisted of 3 steps:

1. Adding the Microsoft RPM repository to the yum configuration
2. Pre-emptively removing some packages that may cause conflicts (I didn't have these installed)
3. Using yum to install the msodbcsql RPM for version 13.1 of the drivers

In my case I had to play around with the yum command, and in the end just doing the following was enough:

$ ACCEPT_EULA=Y yum install msodbcsql

### Windows

The Windows drivers are dead easy to install. Download the msodbcsql MSI file, double click to install, and you're in business.

### Mac

The SQL Server drivers for Mac need to be installed via Homebrew, a popular OS X package manager, though one not affiliated with nor supported by Apple. Microsoft has created their own tap, which is a Homebrew package repository. If you don't already have Homebrew installed you'll need to do that first; Microsoft has provided some simple instructions describing how to add the SQL Server tap and then install the mssql-tools package. The steps are simple enough that I'll repeat them here, though check out the link above if you need more details or background.

$ brew tap microsoft/mssql-preview https://github.com/Microsoft/homebrew-mssql-preview
$ brew update
$ brew install mssql-tools

One thing I'll note is that the Homebrew installation output suggested I should execute a command to remove one of the SQL Server drivers. Don't do this! That driver is required. If you've already done it then the way to correct the process is to reset the configuration file by removing and re-adding the package:

$ brew remove mssql-tools
$ brew install mssql-tools

## Install Anaconda

Download and install Anaconda if you don't already have it on your system. There are graphical and command line installers available for Windows, Mac, and Linux. It is about 400 MB to download and a bit over 1 GB installed. If you're looking for a minimal system you can install Miniconda instead (command line only installer) and then a la carte pick the packages you want with conda install commands.

Anaconda Enterprise users or administrators can simply execute the commands below in the conda environment where they want pyodbc to be available.

## Python ODBC package

This part is easy. You can just do:

$ conda install pyodbc

And if you're not using Anaconda or prefer pip, then you can also do:

$ pip install pyodbc

NOTE: If you are using the recently released SQL Server 2017 you will need pyodbc >= 3.2. There should be a conda package available for that "shortly" but be sure to check which version you get if you use the conda command above.

## Connecting to SQL Server using pyodbc

Now that you've got your server-side and client-side systems set up with the correct software, databases, users, libraries, and drivers, it is time to connect. If everything works properly these steps are very simple and work for all platforms. Everything that is platform-specific has been handled elsewhere in the process.

We start by importing the pyodbc package. This is Microsoft's recommended Python interface to SQL Server. There was an alternative Python interface, pymssql, that at one point was more reliable for SQL Server connections, and until quite recently was the only way to get Python on a Mac to connect to SQL Server. However, with Microsoft's renewed support for Python and Microsoft's own Mac Homebrew packages, pyodbc is now the leader for all platforms.

import pyodbc

Use a Python dict to define the configuration parameters for the connection

config = dict(server=   'sqlserver.testnet.corp',  # change this to your SQL Server hostname or IP address
              port=      1433,                     # change this to your SQL Server port number [1433 is the default]
              database= 'AdventureWorksDW2012',
              username= 'tanya',
              password= 'Tanya1234')

Create a template connection string that can be re-used.

conn_str = ('SERVER={server},{port};'   +
            'DATABASE={database};'      +
            'UID={username};'           +
            'PWD={password}')

If you are using the Windows Authentication mode where existing authorization tokens are picked up automatically this connection string would be changed to remove UID and PWD entries and replace them with TRUSTED_CONNECTION, as below:

trusted_conn_str = ('SERVER={server};'     +
'DATABASE={database};' +
'TRUSTED_CONNECTION=yes')
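Because the template is an ordinary Python format string, you can sanity-check the rendered result before attempting a connection. This self-contained snippet repeats the template and the example values used in this post:

```python
# Example configuration values from this post
config = dict(server='sqlserver.testnet.corp',
              port=1433,
              database='AdventureWorksDW2012',
              username='tanya',
              password='Tanya1234')

# Reusable connection-string template
conn_str = ('SERVER={server},{port};'
            'DATABASE={database};'
            'UID={username};'
            'PWD={password}')

# The DRIVER part is concatenated raw so its braces stay literal
rendered = 'DRIVER={ODBC Driver 13 for SQL Server};' + conn_str.format(**config)
print(rendered)
```

The printed string should match the cut-and-paste connection string shown in the TL;DR section above.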

config

{'database': 'AdventureWorksDW2012',
'port': 1433,
'server': 'sqlserver.testnet.corp',
'username': 'tanya'}

Now open a connection by specifying the driver and filling in the connection string with the connection parameters.

The following connection operation can take 10s of seconds to complete.

conn = pyodbc.connect(
r'DRIVER={ODBC Driver 13 for SQL Server};' +
conn_str.format(**config)
)

## Executing Queries

Request a cursor from the connection that can be used for queries.

cursor = conn.cursor()

cursor.execute('SELECT TOP 10 EnglishProductName FROM dbo.DimProduct;')
<pyodbc.Cursor at 0x7f7ca4a82d50>

Loop through the cursor to look at the results (an iterable of 1-tuples containing unicode strings).

for entry in cursor:
    print(entry)
(u'Adjustable Race', )
(u'Bearing Ball', )
(u'BB Ball Bearing', )
(u'LL Crankarm', )
(u'ML Crankarm', )
(u'HL Crankarm', )
(u'Chainring Bolts', )
(u'Chainring Nut', )
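pyodbc cursors follow the Python DB-API 2.0 specification, so besides iteration they also support fetchone(), fetchall(), and fetchmany(). Since that interface is shared by all DB-API drivers, here is the same pattern sketched with the standard library's sqlite3 module standing in for SQL Server (the table and values below are made up for illustration, not the AdventureWorks data):

```python
import sqlite3

# In-memory stand-in for a table like dbo.DimProduct, purely for illustration
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE DimProduct (EnglishProductName TEXT)')
cursor.executemany('INSERT INTO DimProduct VALUES (?)',
                   [('Adjustable Race',), ('Bearing Ball',), ('LL Crankarm',)])

cursor.execute('SELECT EnglishProductName FROM DimProduct ORDER BY rowid')
rows = cursor.fetchall()   # a list of 1-tuples, like the pyodbc results above
for row in rows:
    print(row)
conn.close()
```

With pyodbc the calls are identical; only the connect() arguments differ.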

## Data Science Happens Here

Now that we've demonstrated how to connect to a SQL Server instance from Windows, Mac and Linux using Anaconda or Anaconda Enterprise it is possible to use T-SQL queries to interact with that database as you normally would.

Looking to the future, the latest preview release of SQL Server 2017 includes a server-side Python interface built around Anaconda. There are also lots of great resources on Python and SQL Server connectivity from the team at Microsoft.

## Next Steps

My bet is that if you're reading a blog post on SQL Server and Python (and you can download a Notebook version of it here) then you're using it in a commercial context. Anaconda Enterprise is going to be the best way for you and your organization to make a strategic investment in Open Data Science.

See how Anaconda Enterprise is transforming data science through our webinar series or grab one of our white papers on Enterprise Open Data Science.

Let us help you be successful in your strategic adoption of Python and Anaconda for high-performance enterprise-oriented open data science connected to your existing data sources and systems, such as SQL Server.

#### Data Science And Deep Learning Application Leaders Form GPU Open Analytics Initiative

Monday, May 8, 2017

Continuum Analytics, H2O.ai and MapD Technologies Create Open Common Data Frameworks for GPU In-Memory Analytics

SAN JOSE, CA—May 8, 2017—Continuum Analytics, H2O.ai, and MapD Technologies have announced the formation of the GPU Open Analytics Initiative (GOAI) to create common data frameworks enabling developers and statistical researchers to accelerate data science on GPUs. GOAI will foster the development of a data science ecosystem on GPUs by allowing resident applications to interchange data seamlessly and efficiently. BlazingDB, Graphistry and Gunrock from UC Davis led by CUDA Fellow John Owens have joined the founding members to contribute their technical expertise.

The formation of the Initiative comes at a time when analytics and machine learning workloads are increasingly being migrated to GPUs. However, while individually powerful, these workloads have not been able to benefit from the power of end-to-end GPU computing. A common standard will enable intercommunication between the different data applications and speed up the entire workflow, removing latency and decreasing the complexity of data flows between core analytical applications.

At the GPU Technology Conference (GTC), NVIDIA’s annual GPU developers’ conference, the Initiative announced its first project: an open source GPU Data Frame with a corresponding Python API. The GPU Data Frame is a common API that enables efficient interchange of data between processes running on the GPU. End-to-end computation on the GPU avoids transfers back to the CPU or copying of in-memory data reducing compute time and cost for high-performance analytics common in artificial intelligence workloads.

Users of the MapD Core database can output the results of a SQL query into the GPU Data Frame, which then can be manipulated by the Continuum Analytics’ Anaconda NumPy-like Python API or used as input into the H2O suite of machine learning algorithms without additional data manipulation. In early internal tests, this approach exhibited order-of-magnitude improvements in processing times compared to passing the data between applications on a CPU.

“The data science and analytics communities are rapidly adopting GPU computing for machine learning and deep learning. However, CPU-based systems still handle tasks like subsetting and preprocessing training data, which creates a significant bottleneck,” said Todd Mostak, CEO and co-founder of MapD Technologies. “The GPU Data Frame makes it easy to run everything from ingestion to preprocessing to training and visualization directly on the GPU. This efficient data interchange will improve performance, encouraging development of ever more sophisticated GPU-based applications.”

“GPU Data Frame relies on the Anaconda platform as the foundational fabric that brings data science technologies together to take full advantage of GPU performance gains,” said Travis Oliphant, co-founder and chief data scientist of Continuum Analytics. “Using NVIDIA’s technology, Anaconda is mobilizing the Open Data Science movement by helping teams avoid the data transfer process between CPUs and GPUs and move nimbly toward their larger business goals. The key to producing this kind of innovation are great partners like H2O and MapD.”

“Truly diverse open source ecosystems are essential for adoption - we are excited to start GOAI for GPUs alongside leaders in data and analytics pipeline to help standardize data formats,” said Sri Ambati, CEO and co-founder of H2O.ai. “GOAI is a call for the community of data developers and researchers to join the movement to speed up analytics and GPU adoption in the enterprise.”

The GPU Open Analytics Initiative is actively welcoming participants who are committed to open source and to GPUs as a computing platform.

## Picture1.jpg

Details of the GPU Data Frame can be found at the Initiative’s Github link -
https://github.com/gpuopenanalytics

In conjunction with this announcement, MapD Technologies has announced the immediate open sourcing of the MapD Core database to foster open analytics on GPUs. Anaconda and H2O already have large open source communities, which can benefit from this project immediately and drive further development to accelerate the adoption of data science and analytics on GPUs.

Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads and 4 million unique users to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Learn more at continuum.io

H2O.ai is focused on bringing AI to businesses through software. Its flagship product is H2O, the leading open source platform that makes it easy for financial services, insurance and healthcare companies to deploy AI and deep learning to solve complex problems. More than 9,000 organizations and 80,000+ data scientists depend on H2O for critical applications like predictive maintenance and operational intelligence. The company -- which was recently named to the CB Insights AI 100 -- is used by 169 Fortune 500 enterprises, including 8 of the world’s 10 largest banks, 7 of the 10 largest insurance companies and 4 of the top 10 healthcare companies. Notable customers include Capital One, Progressive Insurance, Transamerica, Comcast, Nielsen Catalina Solutions, Macy's, Walgreens and Kaiser Permanente.

MapD Technologies is a next-generation analytics software company. Its technology harnesses the massive parallelism of modern graphics processing units (GPUs) to power lightning-fast SQL queries and visualization of large data sets. The MapD analytics platform includes the MapD Core database and MapD Immerse visualization client. These software products provide analysts and data scientists with the fastest time to insight, performance not possible with traditional CPU-based solutions. MapD software runs on-premise and on all leading cloud providers.

Founded in 2013, MapD Technologies originated from research at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). MapD is funded by GV, In-Q-Tel, New Enterprise Associates (NEA), NVIDIA, Vanedge Capital and Verizon Ventures. The company is headquartered in San Francisco.

Media Contacts:

Jill Rosenthal
Continuum Analytics
anaconda@inkhouse.com

Mary Fuochi
MapD
press@mapd.com

James Christopherson
H2O.ai
james@vscpr.com

### Matthew Rocklin

#### Dask Release 0.14.3

This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.14.3. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on March 22nd.

As always you can conda install from conda-forge

conda install -c conda-forge dask distributed


or you can pip install from PyPI

pip install dask[complete] --upgrade


Conda packages should be on the default channel within a few days.

## Arrays

### Sparse Arrays

Dask.arrays now support sparse arrays and mixed dense/sparse arrays.

>>> import dask.array as da

>>> x = da.random.random(size=(10000, 10000, 10000, 10000),
...                      chunks=(100, 100, 100, 100))
>>> x[x < 0.99] = 0

>>> import sparse
>>> s = x.map_blocks(sparse.COO)  # parallel array of sparse arrays


In order to support sparse arrays we did two things:

1. Made dask.array support ndarray containers other than NumPy, as long as they were API compatible
2. Made a small sparse array library that was API compatible with numpy.ndarray

This process was pretty easy and could be extended to other systems. This also allows for different kinds of ndarrays in the same Dask array, as long as interactions between the arrays are well defined (using the standard NumPy protocols like __array_priority__ and so on.)
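The __array_priority__ protocol mentioned above is plain NumPy behavior: when the right operand advertises a higher priority and defines the reflected operator, ndarray returns NotImplemented and lets the other class take over. A minimal sketch (the Wrapped container below is made up for illustration and is not part of Dask or the sparse library):

```python
import numpy as np

class Wrapped:
    """Toy array container that NumPy should defer to."""
    __array_priority__ = 10  # higher than ndarray's default of 0.0

    def __init__(self, data):
        self.data = np.asarray(data)

    def __rmul__(self, other):
        # Called because ndarray.__mul__ defers to the higher-priority class
        return Wrapped(other * self.data)

x = np.ones(3)
result = x * Wrapped([1.0, 2.0, 3.0])
print(type(result).__name__)   # Wrapped, not ndarray
```

This deferral is what lets a Dask array hold and interact with non-NumPy chunk types without NumPy eagerly coercing them.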

### Reworked FFT code

The da.fft submodule has been extended to include most of the functions in np.fft, with the caveat that multi-dimensional FFTs will only work along single-chunk dimensions. Still, given that rechunking is decently fast today this can be very useful for large image stacks.

### Constructor Plugins

You can now run arbitrary code whenever a dask array is constructed. This empowers users to build in their own policies like rechunking, warning users, or eager evaluation. A dask.array plugin takes in a dask.array and returns either a new dask array, or returns None, in which case the original will be returned.

>>> def f(x):
...     print('%d bytes' % x.nbytes)

>>> with dask.set_options(array_plugins=[f]):
...     x = da.ones((10, 1), chunks=(5, 1))
...     y = x.dot(x.T)
80 bytes
80 bytes
800 bytes
800 bytes


This can be used, for example, to convert dask.array code into numpy code to identify bugs quickly:

>>> with dask.set_options(array_plugins=[lambda x: x.compute()]):
...     x = da.arange(5, chunks=2)

>>> x  # this was automatically converted into a numpy array
array([0, 1, 2, 3, 4])


Or to warn users if they accidentally produce an array with large chunks:

def warn_on_large_chunks(x):
    shapes = list(itertools.product(*x.chunks))
    nbytes = [x.dtype.itemsize * np.prod(shape) for shape in shapes]
    if any(nb > 1e9 for nb in nbytes):
        warnings.warn("Array contains very large chunks")

with dask.set_options(array_plugins=[warn_on_large_chunks]):
    ...


These features were heavily requested by the climate science community, which tends to serve both highly technical computer scientists, and less technical climate scientists who were running into issues with the nuances of chunking.

## DataFrames

Dask.dataframe changes are both numerous, and very small, making it difficult to give a representative accounting of recent changes within a blogpost. Typically these include small changes to either track new Pandas development, or to fix slight inconsistencies in corner cases (of which there are many.)

Still, two highlights follow:

### Rolling windows with time intervals

>>> s.rolling('2s').count().compute()
2017-01-01 00:00:00    1.0
2017-01-01 00:00:01    2.0
2017-01-01 00:00:02    2.0
2017-01-01 00:00:03    2.0
2017-01-01 00:00:04    2.0
2017-01-01 00:00:05    2.0
2017-01-01 00:00:06    2.0
2017-01-01 00:00:07    2.0
2017-01-01 00:00:08    2.0
2017-01-01 00:00:09    2.0
dtype: float64


### Read Parquet data with Arrow

Dask now supports reading Parquet data with both fastparquet (a NumPy/Numba solution) and Arrow/Parquet-CPP.

df = dd.read_parquet('/path/to/mydata.parquet', engine='fastparquet')


Hopefully this capability increases the use of both projects and results in greater feedback to those libraries so that they can continue to advance Python’s access to the Parquet format.

## Graph Optimizations

Dask performs a few passes of simple linear-time graph optimizations before sending a task graph to the scheduler. These optimizations currently vary by collection type, for example dask.arrays have different optimizations than dask.dataframes. These optimizations can greatly improve performance in some cases, but can also increase overhead, which becomes very important for large graphs.

As Dask has grown into more communities, each with strong and differing performance constraints, we’ve found that we needed to allow each community to define its own optimization schemes. The defaults have not changed, but now you can override them with your own. This can be set globally or with a context manager.

def my_optimize_function(graph, keys):
    """ Takes a task graph and a list of output keys, returns new graph """
    new_graph = {...}
    return new_graph

with dask.set_options(array_optimize=my_optimize_function,
                      dataframe_optimize=None,
                      delayed_optimize=my_other_optimize_function):
    ...  # computations here use the custom optimizers


### Speed improvements

Additionally, task fusion has been significantly accelerated. This is very important for large graphs, particularly in dask.array computations.
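To give a feel for what task fusion does, here is a toy sketch (not Dask's actual optimizer): a key consumed by exactly one task, and not requested as an output, is folded into its consumer, shrinking the graph the scheduler has to track.

```python
def fuse_linear(graph, outputs):
    """Toy linear task fusion. Tasks are (func, dependency_key) pairs."""
    # Count how many tasks consume each key
    refs = {}
    for task in graph.values():
        if isinstance(task, tuple):
            refs[task[1]] = refs.get(task[1], 0) + 1

    fused = dict(graph)
    changed = True
    while changed:
        changed = False
        for key, task in list(fused.items()):
            if key not in fused or not isinstance(task, tuple):
                continue
            func, dep = task
            if (dep in fused and dep not in outputs
                    and refs.get(dep) == 1 and isinstance(fused[dep], tuple)):
                inner_func, inner_dep = fused.pop(dep)
                # Compose the two single-argument tasks into one fused task
                fused[key] = (lambda x, f=func, g=inner_func: f(g(x)), inner_dep)
                changed = True
    return fused


def get(graph, key):
    """Naive recursive executor for the toy graph format."""
    task = graph[key]
    if isinstance(task, tuple):
        func, dep = task
        return func(get(graph, dep))
    return task


inc = lambda x: x + 1
dbl = lambda x: 2 * x
graph = {'a': 1, 'b': (inc, 'a'), 'c': (dbl, 'b')}
fused = fuse_linear(graph, outputs={'c'})
# 'b' has been folded into 'c'; the computed result is unchanged
```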

## Web Diagnostics

The distributed scheduler’s web diagnostic page is now served from within the dask scheduler process. This is both good and bad:

• Good: It is much easier to make new visuals

Because Bokeh and Dask now share the same Tornado event loop we no longer need to send messages between them before relaying to a web browser. The Bokeh server has full access to all of the scheduler state, which lets us build new diagnostic pages more easily. This has been around for a while but was largely used for development. In this version we've made the new version the default and turned off the old one.

• Bad: The scheduler process takes on extra load

The cost is that the Bokeh server can take 10-20% of the scheduler's CPU use. If you are running a computation that heavily taxes the scheduler then you might want to close your diagnostic pages. Fortunately, this almost never happens; the dask scheduler is typically fast enough to stay well clear of this limit.

Beware that the current versions of Bokeh (0.12.5) and Tornado (4.5) do not play well together. This has been fixed in development versions, and installing with conda is fine, but if you naively pip install then you may experience bad behavior.

## Joblib

The Dask.distributed Joblib backend now includes a scatter= keyword, allowing you to pre-scatter select variables out to all of the Dask workers. This significantly cuts down on overhead, especially on machine learning workloads where most of the data doesn’t change very much.

# Send the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
    search.fit(digits.data, digits.target)


Early trials indicate that computations like scikit-learn’s RandomForest scale nicely on a cluster without any additional code.

## Preload Scripts

When starting a dask.distributed scheduler or worker, people often want to include a bit of custom setup code, for example to configure loggers, authenticate with some network system, and so on. This has always been possible if you start schedulers and workers from within Python, but is tricky if you want to use the command line interface. Now you can write your custom code as a separate standalone script and ask the command line interface to run it for you at startup:

# scheduler-setup.py
from distributed.diagnostics.plugin import SchedulerPlugin

class MyPlugin(SchedulerPlugin):
    """ Prints a message whenever a worker is added to the cluster """
    def add_worker(self, scheduler=None, worker=None, **kwargs):
        print("Added a new worker at", worker)

plugin = MyPlugin()

$ dask-scheduler --preload scheduler-setup.py


This makes it easier for people to adapt Dask to their particular institution.

## Network Interfaces (for infiniband)

Many people use Dask on high performance supercomputers. This hardware differs from typical commodity clusters or cloud services in several ways, including very high performance network interconnects like InfiniBand. Typically these systems also have normal ethernet and other networks. You’re probably familiar with this on your own laptop when you have both ethernet and wireless:

$ ifconfig
lo    Link encap:Local Loopback                      # Localhost
      inet addr:127.0.0.1  Mask:255.0.0.0
      inet6 addr: ::1/128 Scope:Host
eth0  Link encap:Ethernet  HWaddr XX:XX:XX:XX:XX:XX  # Ethernet
      inet addr:192.168.0.101
      ...
ib0   Link encap:Infiniband                          # Fast InfiniBand
      inet addr:172.42.0.101

The default systems Dask uses to determine network interfaces often choose ethernet by default. If you are on an HPC system then this is likely not optimal. You can direct Dask to choose a particular network interface with the --interface keyword:

$ dask-scheduler --interface ib0
distributed.scheduler - INFO -   Scheduler at: tcp://172.42.0.101:8786

On-campus housing is available for approximately $400/wk, which includes breakfast and dinner. Housing registration currently closes April 26th. Registration links for each workshop are under the workshop description; housing is linked there as well, and must be booked separately. Attendees of both weeks of workshops may book housing for both weeks, and attendees of the two-week introductory bioinformatics workshop, ANGUS, may book a full four weeks of housing.

For questions about registration, travel, invitation letters, or other general topics, please contact dibsi.training@gmail.com. For workshop-specific questions, contact the instructors (e-mail links are under each workshop).

--titus

## April 13, 2017

### Matthew Rocklin

#### Streaming Python Prototype

This work is supported by Continuum Analytics, and the Data Driven Discovery Initiative from the Moore Foundation.

This blogpost is about experimental software. The project may change or be abandoned without warning. You should not depend on anything within this blogpost.

This week I built a small streaming library for Python. This was originally an exercise to help me understand streaming systems like Storm, Flink, Spark-Streaming, and Beam, but the end result of this experiment is not entirely useless, so I thought I'd share it. This blogpost will talk about my experience building such a system and what I valued when using it. Hopefully it elevates interest in streaming systems among the Python community.

## Background with Iterators

Python has sequences and iterators. We're used to mapping, filtering and aggregating over lists and generators happily.

seq = [1, 2, 3, 4, 5]
seq = map(inc, seq)
seq = filter(iseven, seq)

>>> sum(seq)  # 2 + 4 + 6
12

If these iterators are infinite, for example if they are coming from some infinite data feed like a hardware sensor or stock market signal, then most of these pieces still work, except for the final aggregation, which we replace with an accumulating aggregation.
def get_data():
    i = 0
    while True:
        i += 1
        yield i

seq = get_data()
seq = map(inc, seq)
seq = filter(iseven, seq)
seq = accumulate(lambda total, x: total + x, seq)

>>> next(seq)  # 2
2
>>> next(seq)  # 2 + 4
6
>>> next(seq)  # 2 + 4 + 6
12

This is usually a fine way to handle infinite data streams. However, this approach becomes awkward if you don't want to block on calling next(seq) and have your program hang until new data comes in. This approach also becomes awkward when you want to branch off your sequence to multiple outputs and consume from multiple inputs. Additionally there are operations like rate limiting, time windowing, etc. that occur frequently but are tricky to implement if you are not comfortable using threads and queues. These complications often push people to a computation model that goes by the name streaming.

To introduce streaming systems in this blogpost I'll use my new tiny library, currently called streams (better name to come in the future). However, if you decide to use streaming systems in your workplace then you should probably use some other, more mature library instead. Common recommendations include the following:

• ReactiveX (RxPy)
• Flink
• Storm (Streamparse)
• Beam
• Spark Streaming

## Streams

We make a stream, which is an infinite sequence of data into which we can emit values and from which we can subscribe to make new streams.

from streams import Stream

source = Stream()

From here we replicate our example above. This follows the standard map/filter/reduce chaining API.

s = (source.map(inc)
           .filter(iseven)
           .accumulate(lambda total, x: total + x))

Note that we haven't pushed any data into this stream yet, nor have we said what should happen when data leaves. So that we can look at results, let's make a list and push data into it when data leaves the stream.
results = []
s.sink(results.append)  # call the append method on every element leaving the stream

And now let's push some data in at the source and see it arrive at the sink:

>>> for x in [1, 2, 3, 4, 5]:
...     source.emit(x)

>>> results
[2, 6, 12]

We've accomplished the same result as our infinite iterator, except that rather than pulling data with next we push data through with source.emit. And we've done all of this at only a 10x slowdown over normal Python iterators :) (this library takes a few microseconds per element rather than CPython's normal 100ns overhead). This will get more interesting in the next few sections.

## Branching

This approach becomes more interesting if we add multiple inputs and outputs.

source = Stream()
s = source.map(inc)

evens = s.filter(iseven)
evens.accumulate(add)

odds = s.filter(isodd)
odds.accumulate(sub)

Or we can combine streams together:

second_source = Stream()
s = combine_latest(second_source, odds).map(sum)

So you may have multiple different input sources updating at different rates, and you may have multiple outputs, perhaps some going to a diagnostics dashboard, others going to long-term storage, others going to a database, etc. A streaming library makes it relatively easy to set up infrastructure and pipe everything to the right locations.

## Time and Back Pressure

When dealing with systems that produce and consume data continuously you often want to control the flow so that the rates of production are not greater than the rates of consumption. For example if you can only write data to a database at 10MB/s, or if you can only make 5000 web requests an hour, then you want to make sure that the other parts of the pipeline don't feed you too much data, too quickly, which would eventually lead to a buildup in one place.

To deal with this, as our operations push data forward they also accept Tornado Futures as a receipt.

Upstream: Hey Downstream! Here is some data for you
Downstream: Thanks Upstream!
            Let me give you a Tornado future in return. Make sure you
            don't send me any more data until that future finishes.
Upstream: Got it, Thanks! I will pass this to the person who gave me
          the data that I just gave to you.

Under normal operation you don't need to think about Tornado futures at all (many Python users aren't familiar with asynchronous programming) but it's nice to know that the library will keep track of balancing out flow. The code below uses @gen.coroutine and yield, common for Tornado coroutines. This is similar to the async/await syntax in Python 3. Again, you can safely ignore it if you're not familiar with asynchronous programming.

@gen.coroutine
def write_to_database(data):
    with connect('my-database:1234/table') as db:
        yield db.write(data)

source = Stream()
(source.map(...)
       .accumulate(...)
       .sink(write_to_database))  # <- sink produces a Tornado future

for data in infinite_feed:
    yield source.emit(data)       # <- that future passes through everything
                                  #    and ends up here to be waited on

There are also a number of operations to help you buffer flow in the right spots, control rate limiting, etc.

source = Stream()
(source.timed_window(interval=0.050)  # Capture all records of the last 50ms into batches
       .filter(len)                   # Remove empty batches
       .map(...)                      # Do work on each batch
       .buffer(10)                    # Allow ten batches to pile up here
       .sink(write_to_database))      # Potentially rate-limiting stage

I've written enough little utilities like timed_window and buffer to discover both that in a full system you would want more of these, and that they are easy to write.
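The push-based machinery underneath operators like these is small. Here is a toy sketch of the core idea (a simplified illustration, not the actual streams codebase, and without the Tornado back-pressure plumbing):

```python
class Stream:
    """A toy push-based stream: each node forwards values to its children."""
    def __init__(self):
        self.children = []

    def _connect(self, child):
        self.children.append(child)
        return child

    def map(self, func):
        return self._connect(MapStream(func))

    def filter(self, predicate):
        return self._connect(FilterStream(predicate))

    def accumulate(self, binop):
        return self._connect(AccumulateStream(binop))

    def sink(self, func):
        return self._connect(SinkStream(func))

    def emit(self, x):
        for child in self.children:
            child.update(x)


class MapStream(Stream):
    def __init__(self, func):
        Stream.__init__(self)
        self.func = func

    def update(self, x):
        self.emit(self.func(x))


class FilterStream(Stream):
    def __init__(self, predicate):
        Stream.__init__(self)
        self.predicate = predicate

    def update(self, x):
        if self.predicate(x):
            self.emit(x)


class AccumulateStream(Stream):
    def __init__(self, binop):
        Stream.__init__(self)
        self.binop = binop
        self.total = None

    def update(self, x):
        self.total = x if self.total is None else self.binop(self.total, x)
        self.emit(self.total)


class SinkStream(Stream):
    def __init__(self, func):
        Stream.__init__(self)
        self.func = func

    def update(self, x):
        self.func(x)


# Rebuild the running example: inc -> iseven -> running sum
results = []
source = Stream()
(source.map(lambda x: x + 1)
       .filter(lambda x: x % 2 == 0)
       .accumulate(lambda total, x: total + x)
       .sink(results.append))

for x in [1, 2, 3, 4, 5]:
    source.emit(x)
# results is now [2, 6, 12], matching the example above
```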
Here is the definition of timed_window:

class timed_window(Stream):
    def __init__(self, interval, child, loop=None):
        self.interval = interval
        self.buffer = []
        self.last = gen.moment

        Stream.__init__(self, child, loop=loop)
        self.loop.add_callback(self.cb)

    def update(self, x, who=None):
        self.buffer.append(x)
        return self.last

    @gen.coroutine
    def cb(self):
        while True:
            L, self.buffer = self.buffer, []
            self.last = self.emit(L)
            yield self.last
            yield gen.sleep(self.interval)

If you are comfortable with Tornado coroutines or asyncio then my hope is that this should feel natural.

## Recursion and Feedback

By connecting the sink of one stream to the emit function of another we can create feedback loops. Here is a stream that produces the Fibonacci sequence. To stop it from overwhelming our local process we added in a rate limiting step:

from streams import Stream

source = Stream()
s = source.sliding_window(2).map(sum)
L = s.sink_to_list()                 # store result in a list
s.rate_limit(0.5).sink(source.emit)  # pipe output back to input

source.emit(0)                       # seed with initial values
source.emit(1)

>>> L
[1, 2, 3, 5]

>>> L  # wait a couple seconds, then check again
[1, 2, 3, 5, 8, 13, 21, 34]

>>> L  # wait a couple seconds, then check again
[1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

Note: due to the time rate-limiting functionality this example relies on an event loop running somewhere in another thread. This is the case, for example, in a Jupyter notebook, or if you have a Dask Client running.

## Things that this doesn't do

If you are familiar with streaming systems then you may say the following:

Let's not get ahead of ourselves; there's way more to a good streaming system than what is presented here. You need to handle parallelism, fault tolerance, out-of-order elements, event/processing times, etc.

… and you would be entirely correct. What is presented here is not in any way a competitor to existing systems like Flink for production-level data engineering problems.
There is a lot of logic that hasn't been built here (and it's good to remember that this project was built at night over a week). Although some of those things, and in particular the distributed computing bits, we may get for free.

## Distributed computing

So, during the day I work on Dask, a Python library for parallel and distributed computing. The core task schedulers within Dask are more than capable of running these kinds of real-time computations. They handle far more complex real-time systems every day, including few-millisecond latencies, node failures, asynchronous computation, etc. People use these features today inside companies, but they tend to roll their own system rather than use a high-level API (indeed, they chose Dask because their system was complex enough or private enough that rolling their own was a necessity). Dask lacks any kind of high-level streaming API today.

Fortunately, the system we described above can be modified fairly easily to use a Dask Client to submit functions rather than run them locally.

from dask.distributed import Client

client = Client()        # start Dask in the background

(source.to_dask()
       .scatter()        # send data to a cluster
       .map(...)         # this happens on the cluster
       .accumulate(...)  # this happens on the cluster
       .gather()         # gather results back to local machine
       .sink(...))       # this happens locally

## Other things that this doesn't do, but could with modest effort

There are a variety of ways that we could improve this with modest cost:

1. Streams of sequences: We can be more efficient if we pass not individual elements through a Stream, but rather lists of elements. This will let us lose the microseconds of overhead that we have now per element and let us operate at pure Python (100ns) speeds.

2. Streams of NumPy arrays / Pandas dataframes: Rather than pass individual records we might pass bits of Pandas dataframes through the stream. So for example, rather than filtering elements we would filter out rows of the dataframe.
Rather than compute at Python speeds we can compute at C speeds. We've built a lot of this logic before for dask.dataframe. Doing this again is straightforward but somewhat time consuming.

3. Annotated elements: We want to pass through event time, processing time, and presumably other metadata.

4. Convenient data IO utilities: We would need some convenient way to move data in and out of Kafka and other common continuous data streams.

None of these things are hard. Many of them are afternoon or weekend projects if anyone wants to pitch in.

## Reasons I like this project

This was originally built strictly for educational purposes. I (and hopefully you) now know a bit more about streaming systems, so I'm calling it a success. It wasn't designed to compete with existing streaming systems, but still there are some aspects of it that I like quite a bit and want to highlight.

1. Lightweight setup: You can import it and go without setting up any infrastructure. It can run (in a limited way) on a Dask cluster or on an event loop, but it's also fully operational in your local Python thread. There is no magic in the common case. Everything up until time-handling runs with tools that you learn in an introductory programming class.

2. Small and maintainable: The codebase is currently a few hundred lines. It is also, I claim, easy for other people to understand. Here is the code for filter:

class filter(Stream):
    def __init__(self, predicate, child):
        self.predicate = predicate
        Stream.__init__(self, child)

    def update(self, x, who=None):
        if self.predicate(x):
            return self.emit(x)

3. Composable with Dask: Handling distributed computing is tricky to do well. Fortunately this project can offload much of that worry to Dask. The dividing line between the two systems is pretty clear and, I think, could lead to a decently powerful and maintainable system if we spend time here.

4.
Low performance overhead: Because this project is so simple it has overheads in the few-microseconds range when in a single process.

5. Pythonic: All other streaming systems were originally designed for Java/Scala engineers. While they have APIs that are clearly well thought through, they are sometimes not ideal for Python users or common Python applications.

## Future Work

This project needs both users and developers. I find it fun and satisfying to work on and so encourage others to play around. The codebase is short and, I think, easily digestible in an hour or two.

This project was built without a real use case (see the project's examples directory for a basic Daskified web crawler). It could use patient users with real-world use cases to test-drive things and hopefully provide PRs adding necessary features.

I genuinely don't know if this project is worth pursuing. This blogpost is a test to see if people have sufficient interest to use and contribute to such a library, or if the best solution is to carry on with any of the fine solutions that already exist.

pip install git+https://github.com/mrocklin/streams

## April 11, 2017

### Enthought

#### Webinar: Get More From Your Core: Applying Artificial Intelligence to CT, Photo, and Well Log Analysis with Virtual Core

What: Presentation, demo, and Q&A with Brendon Hall, Geoscience Product Manager, Enthought

Who should watch this webinar:

• Oil and gas industry professionals who are looking for ways to extract more value from expensive science wells
• Those interested in learning how artificial intelligence and machine learning techniques can be applied to core analysis

VIEW

Geoscientists and petroleum engineers rely on accurate core measurements to characterize reservoirs, develop drilling plans and de-risk play assessments. Whole-core CT scans are now routinely performed on extracted well cores, however the data produced from these scans are difficult to visualize and integrate with other measurements.
Virtual Core automates aspects of core description for geologists, drastically reducing the time and effort required for core description, and its unified visualization interface displays cleansed whole-core CT data alongside core photographs and well logs. It provides tools for geoscientists to analyze core data and extract features from sub-millimeter scale to the entire core.

In this webinar and demo, we'll start by introducing the Clear Core processing pipeline, which automatically removes unwanted artifacts (such as tubing) from the CT image. We'll then show how the machine learning capabilities in Virtual Core can be used to describe the core, extracting features such as bedding planes and dip angle. Finally, we'll show how the data can be viewed and analyzed alongside other core data, such as photographs, wellbore images, well logs, plug measurements, and more.

## What You'll Learn:

• How core CT data, photographs, well logs, borehole images, and more can be integrated into a digital core workshop
• How digital core data can shorten core description timelines and deliver business results faster
• How new features can be extracted from digital core data using artificial intelligence
• Novel workflows that leverage these features, such as identifying parasequences and strategies for determining net pay

VIEW

Presenter: Brendon Hall, Enthought Geoscience Product Manager and Application Engineer

### Additional Resources

Other Blogs and Articles on Virtual Core:

### Matthieu Brucher

#### Audio Toolkit: Create a FIR Filter from a Template (EQ module)

Last week, I published a post on adaptive filtering. It was long overdue, but I actually had one other project on hold for even longer: allowing a user to specify a filter template and let Audio Toolkit figure out a FIR filter from this template.

#### Remez/Parks & McClellan algorithm

The most famous algorithm is the Remez/Parks & McClellan algorithm.
In Matlab, it's called remez, but Remez is actually a more generic algorithm than just FIR determination. The algorithm starts by selecting a few random points on the template where the user set non-zero weights. The zero weights are usually the transition zones, which means that the filter can roam free in these sections. Usually, you don't want to make them too big, especially in bandpass filters.

As the resulting filter has ripples, you can select the weight of each bandwidth in the template. Where the ripples should be small, use a big weight; where they don't matter, use a small one.

Then, the Remez algorithm is all about moving these points to the maxima of the difference between the template and the actual filter. At the end, the result is an optimal filter around the given template, for a given order.

The determination of the result often rests on the selection of the starting points. If all starting points are in only one bandwidth, then the determination of the filter is wrong. As such, Audio Toolkit selects points that are equidistant so that all bandwidths are covered. Of course, if one bandwidth is too small, then the determination will fail.

#### Demo

There are many good papers on the Remez algorithm for FIR determination so I won't take the time to rehash something that lots of people did far better than I could. But I'll try to explain how it goes on a simple example, with the Python script that was used as the reference test case for the development of the plugin.

Instead of using the equidistant start, I used the set of indices coming from the paper I used (and the same for the template). As such, the indices are:

[51 101 341 361 531 671 701 851]

After the optimization, we get the following error function:

Remez Iteration 1

The maximum error is 0.0325 in that case.
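(As an aside, the same Parks-McClellan algorithm is available in SciPy for anyone who wants to experiment with this style of design outside Audio Toolkit. A hedged sketch with illustrative band edges and weights, not the template from this post:)

```python
from scipy import signal

# Design a 72-tap lowpass FIR with the Parks-McClellan (Remez) algorithm.
# Passband 0-0.1, stopband 0.2-0.5 (in cycles/sample); the stopband gets
# a 10x weight, so its ripples come out 10x smaller than the passband's.
taps = signal.remez(72, [0, 0.1, 0.2, 0.5], [1, 0], weight=[1, 10])

# Inspect the resulting transfer function against the template
w, h = signal.freqz(taps)
```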
The algorithm then selects new indices for the next iteration, at the minima and maxima of the current error function:

[ 0 151 296 409 512 595 744 877]

From these indices, we compute the optimal parameters again and then get a new error function (notice that the highlighted points correspond to the previous min/max):

Remez Iteration 2

The maximum error is now 0.162. And we start the selection process again:

[ 0 163 320 409 512 579 718 874]

Once again, we get a new error function:

Remez Iteration 3

The max error is a little bit bigger and is now 0.169. We select new indices:

[ 0 163 320 409 512 579 718 874]

The indices are identical, so at the next iteration the search for the best filter stops. The resulting filter has the following transfer function (the template is in red):

Estimated filter against template

#### Conclusion

There is finally a way of designing filters in Audio Toolkit that doesn't require you to go to Matlab or Python. This can be quite efficient for designing a linear-phase filter on the fly in a plugin. There is probably more work to be done in terms of optimization, but the processing part itself is already fully optimized.

## April 10, 2017

### Continuum Analytics news

#### Open Data Science is a Team Sport

Monday, April 10, 2017
Stephen Kearns
Continuum Analytics

As every March Madness fan knows, athletic talent and coaching are key, but it's how they come together as a unit that determines a team's success. Known for its drama-ridden storylines and endless buzzer beaters, the NCAA's college basketball championship tournaments (both men's and women's) showcase the power of teamwork and dedication. Basketball is a team sport, where the interrelationships of complementary player skills often dictate the game's winner. Everyone must focus and work together for the common good of the team. These same principles hold true for data science teams.
Much like basketball, data science requires a team of players in different positions, including business analysts, data scientists, data engineers, DevOps engineers, and more. However, too many data scientists still function in silos, each working with his/her own tools to manage data sets. Working individually doesn't work on the court and it won't work in data science. Data scientists, and their data science equipment, must function together to work as a team. That's where Open Data Science comes in.

With Open Data Science, team members have their positions, but are able to move around the court and play with flexibility, like a basketball team. Everyone can also score in basketball, and similarly with Open Data Science, all team members, from data engineers to domain experts, are encouraged to contribute wherever their skills intersect with the goals of the project. In fact, in our recent survey, company decision leaders and data scientists revealed that 69 percent of respondents associate "Open Data Science" with collaboration. No longer just a one-person job, data science is a team sport.

Open Data Science is an inclusive movement that not only encourages data scientists to function as a cohesive unit, but also embraces open source tools, so they can work together more easily in a connected ecosystem. Instead of pigeonholing data scientists into using a single language or set of tools, Open Data Science facilitates collaboration and enables data science teams to reap the benefits of all available technologies. Open Data Science brings innovation from every community together, making the latest information readily available to all. Collaboration helps enterprises harness their data faster and extract more value, so don't drop the ball with your organization's data science strategy. Make it a true team effort.

Learn more about the Five Dysfunctions of a Data Science Team in the slides from my latest webinar below, or download the slides here.
## April 07, 2017

### Paul Ivanov

#### March 29th, 2017

What's missing -- feels like there's something missing --
The capacity is there -- the job's not stressful
but I somehow fail at the ignition stage - all this fuel
just sitting around -- un-utilized potential

How do I light that fire?
Set it ablaze in a daze caught up in the haze of comfort

I need to challenge myself, raising tides lift all boats,
but they also drown livestock
cows, horses, and goats, seeking refuge in hills
that once covered in grass now fill up like lifeboats.

Doctors in white coats say "Keep your spirits up" -- hope floats.

## April 04, 2017

### Enthought

#### Enthought Presents the Canopy Platform at the 2017 American Institute of Chemical Engineers (AIChE) Spring Meeting

by: Tim Diller, Product Manager and Scientific Software Developer, Enthought

Last week I attended the AIChE (American Institute of Chemical Engineers) Spring Meeting in San Antonio, Texas. It was a great time of year to visit this cultural gem deep in the heart of Texas (and just down the road from our Austin offices), with plenty of good food, sights and sounds to take in on top of the conference and its sessions.

The AIChE Spring Meeting focuses on applications of chemical engineering in industry, and Enthought was invited to present a poster and deliver a "vendor perspective" talk on the Canopy Platform for Process Monitoring and Optimization as part of the "Big Data Analytics" track. This was my first time at AIChE, so some of the names were new, but in a lot of ways it felt very similar to many other engineering conferences I have participated in over the years (for instance, ASME (American Society of Mechanical Engineers), SAE (Society of Automotive Engineers), etc.).
This event underscored that regardless of industry, engineers are bringing the same kinds of practical ingenuity to bear on similar kinds of problems, and with the cost of data acquisition and storage plummeting in the last decade, many engineers are now sitting on more data than they know how to effectively handle.

## What exactly is "big data"? Does it really matter for solving hard engineering problems?

One theme that came up time and again in the "Big Data Analytics" sessions Enthought participated in was what exactly "big data" is. In many circles, a good working definition of what makes data "big" is that it exceeds the size of the physical RAM on the machine doing the computation, so that something other than simply loading the data into memory has to be done to make meaningful computations; thus a working definition of some tens of GB delimits "big" data from "small". For others, and many at the conference indeed, a more mundane definition of "big" means that the data set doesn't fit within the row or column limits of a Microsoft Excel worksheet.

But the question of whether your data is "big" is really a moot one as far as we at Enthought are concerned; being "big" just adds complexity to an already hard problem, and the kind of complexity is an implementation detail dependent on the details of the problem at hand. And that relates to the central message of my talk, which was that an analytics platform (in this case our Canopy Platform) should abstract away the tedious complexities and help an expert get to the heart of the hard problem at hand.

At AIChE, the "hard problems" at hand seemed invariably to involve one or both of two things: (1) increasing safety/reliability, and (2) increasing plant output. To solve these problems, two general kinds of activity were on display: different pattern recognition algorithms and tools, and modeling, typically through some kind of regression-based approach.
Both of these things are straightforward in the Canopy Platform.

## What is the Canopy Platform?

If you're using Python for science or engineering, you have probably used or heard of Canopy, Enthought's Python-based data analytics application offering an integrated code editor and interactive command prompt, package manager, documentation browser, debugger, variable browser, data import tool, and lots of hidden features like support for many kinds of proxy systems that work behind the scenes to make a seamless work environment in enterprise settings.

However, this is just one part of the Canopy Platform. Over the years, Enthought has been building other components and related technologies that work together in an integrated way to support the engineer/analyst/scientist solving hard problems.

At the center of this is the Enthought Python Distribution, with runtime interpreters for Python 2.7 and 3.x and over 450 pre-built Python packages for scientific computing, including tools for machine learning and the kind of regression modeling that was shown in some of the other presentations in the Big Data sessions. Other components of the Canopy Platform include interface modules for Excel (PyXLL) and for National Instruments' LabVIEW software (Python Integration Toolkit for LabVIEW), among others.

A key component of our Canopy Platform is our Deployment Server, which simplifies the tricky tasks of deploying proprietary applications and packages or creating customized, reproducible Python environments inside an organization, especially behind a firewall or on an air-gapped network.

Finally (and this is what we were really showing off at the AIChE Big Data Analytics session), there are the Data Catalog and the Cloud Compute layers within the Canopy Platform.
The Data Catalog provides an indexed interface to potentially heterogeneous data sources, making them available for search and query based on various kinds of metadata. These sources can range from a simple network directory with a collection of HDF5 files to a server hosting files with the Byzantine complexity of the IRIG 106 Ch. 10 Digital Recorder Standard used by US military test flight ranges. The nice thing about the Data Catalog is that it lets you query and select data based on computed metadata, for example “factory A, on Tuesdays when Ethylene output was below 10kg/hr”, or, in a test flight data example, “test flights involving a T-38 that exceeded 10,000 ft but stayed subsonic.” With the Cloud Compute layer, an expert user can write code and test it locally on some subset of data from the Data Catalog. Then, when it is working to satisfaction, he or she can publish the code as a computational kernel to run on some other, larger subset of the data in the Data Catalog, using remote compute resources, which might be an HPC cluster or an Apache Spark server. That kernel is then available to other users in the organization, who do not have to understand the algorithm to run it on other data queries. In the demo below, I showed hooking up the Data Catalog to some historical factory data stored on a remote machine. The Data Catalog allows selection of subsets of the data set for inspection and ad hoc analysis. Here, three channels are compared using a time window set on the time series data shown on the top plot. Then, using a locally developed and tested compute kernel, I did a principal component analysis on the frequencies of the channel data for a subset of the data in the Data Catalog. Then I published the kernel and ran it on the entire data set using the remote compute resource.
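The Data Catalog itself is proprietary, but the idea of selecting data sets by computed metadata can be sketched with pandas. Every column name, value, and file path below is made up purely to illustrate the “factory A, on Tuesdays, below 10 kg/hr” style of query:

```python
import pandas as pd

# Hypothetical metadata index: one row per recorded data set
catalog = pd.DataFrame({
    "factory":        ["A", "A", "B", "A"],
    "weekday":        ["Tue", "Wed", "Tue", "Tue"],
    "ethylene_kg_hr": [8.5, 12.0, 9.0, 11.5],
    "path":           ["a1.h5", "a2.h5", "b1.h5", "a3.h5"],
})

# "factory A, on Tuesdays when ethylene output was below 10 kg/hr"
hits = catalog[(catalog.factory == "A")
               & (catalog.weekday == "Tue")
               & (catalog.ethylene_kg_hr < 10)]
print(hits.path.tolist())   # → ['a1.h5']
```

The point is that the query runs against the (small) metadata index, so only the matching files ever need to be fetched for analysis.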
After the compute kernel has been published and run on the entire data set, the result explorer tool enables further interaction. Ultimately, the Canopy Platform is for building and distributing applications that solve hard problems. Some of the products we have built on the platform are available today (for instance, Canopy Geoscience and Virtual Core); others are in the prototype stage or have been developed for other companies with proprietary components and are not publicly available. It was exciting to participate in the Big Data Analytics track this year, to see what others are doing in this area, and to be a part of many interesting and fruitful discussions. Thanks to Ivan Castillo and Chris Reed at Dow for arranging our participation.

### Matthieu Brucher

#### Announcement: ATKChorus 1.1.0 and ATKUniversalVariableDelay 1.1.0

I’m happy to announce the update of the chorus and the universal variable delay based on the Audio Toolkit. They are available on Windows and OS X (min. 10.11) in different formats. This release fixes the noises that can arise in some configurations. ATKChorus ATKUniversalVariableDelay The supported formats are:

• VST2 (32bits/64bits on Windows, 64bits on OS X)
• VST3 (32bits/64bits on Windows, 64bits on OS X)
• Audio Unit (64bits, OS X)

Direct link for ATKChorus. Direct link for ATKUniversalVariableDelay. The files, as well as the previous plugins and the source code, can be downloaded from SourceForge.

## April 03, 2017

### numfocus

#### IBM Brings Jupyter and Spark to the Mainframe

NumFOCUS Platinum Sponsor IBM has been doing wonderful work to support one of our fiscally sponsored projects, Project Jupyter. Brian Granger over at the Jupyter Blog has the details… “For the past few years, Project Jupyter has been collaborating with IBM on a number of initiatives.
Much of this work has happened in the Jupyter Incubation Program, […] ## March 30, 2017 ### Continuum Analytics news #### Anaconda Leader to Speak at TDWI Accelerate Boston Thursday, March 30, 2017 Chief Data Scientist and Co-Founder Travis Oliphant to Discuss the Power of the Python Ecosystem and Open Data Science BOSTON, Mass.—March 30, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that Travis Oliphant, chief data scientist and co-founder, will be speaking at TDWI Accelerate Boston on April 4 at 1:30pm EST. As one of the leading conferences on Big Data and data science, Accelerate brings together the brightest and best data minds in the industry to discuss the future of data science and analytics. In his session, titled “How to Leverage Python, the Fastest-Growing Open Source Tool for Data Scientists,” Oliphant will highlight the power of the Python ecosystem and Open Data Science. With thirteen million downloads and counting, Anaconda remains the most popular Python distribution. Oliphant will specifically address the burgeoning ecosystem forming around the Python and Anaconda combination, including new offerings such as Anaconda Cloud and conda-forge. WHO: Travis Oliphant, chief data scientist and co-founder, Anaconda Powered By Continuum Analytics WHAT: How to Leverage Python, the Fastest-Growing Open Source Tool for Data Scientists WHEN: April 4, 1:30-2:10 p.m. EST WHERE: The Boston Marriott, Copley Place, 110 Huntington Ave, Boston, MA 02116 REGISTER: HERE ### About Anaconda Powered by Continuum Analytics Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads to date. 
Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io.

### Media Contact:

Jill Rosenthal InkHouse anaconda@inkhouse.com

### Leonardo Uieda

#### Talk proposal for Scipy 2017: Bringing the Generic Mapping Tools to Python

This is the proposal for a talk that I co-authored with Paul Wessel and submitted to Scipy 2017. Fingers crossed that they'll accept it! If not, this post can serve as an example of what not to do for next time :) The submission is on the GenericMappingTools/scipy2017 Github repository. It covers our initial ideas about the GMT Python interface. UPDATE (8 May 2017): The proposal was accepted! UPDATE (11 May 2017): I've posted the reviews we got for the proposal along with some comments and replies.

### Abstract

The Generic Mapping Tools (GMT) is an open-source software package widely used in the geosciences to process and visualize time series and gridded data. Maps generated by GMT are ubiquitous in scientific publications in areas such as seismology and oceanography. We present a new GMT Python wrapper library built by the GMT team. We will show the design plans and internal implementation, and demonstrate an initial prototype of the library. Our wrapper connects to the GMT C API using ctypes and allows input and output using data from numpy ndarrays and xarray Datasets. The library is still in the early stages of design and implementation and we are eager for contributions and feedback from the Scipy community.

### Extended Abstract

The Generic Mapping Tools (GMT) is an open-source software package widely used in the geosciences to process and visualize time series and gridded data.
GMT is a command-line tool written in C that is able to generate high-quality figures and maps in the Postscript format. Maps generated by GMT are ubiquitous in scientific publications in areas such as seismology and oceanography. GMT has a large, feature-rich, mature, and well-tested code base that has benefited from over 28 years of development and heavy usage within the scientific community. It is no wonder that there have been at least three attempts to bridge the gap between GMT and Python: gmtpy, pygmt, and PyGMT. Of the three, only gmtpy has had any development activity since 2014, and it is the only project with documentation. gmtpy interfaces with GMT through subprocesses, piping standard input and output to the GMT command-line application. Piping has its limitations because all data are handled as text, making it difficult or impossible to pass binary data such as netCDF grids. On the Python side, the two main libraries for plotting data on maps are the matplotlib basemap toolkit and Cartopy. Both libraries rely on matplotlib as the backend for generating figures. Basemap is known to have its limitations (e.g., this post by Filipe Fernandes). Cartopy is a great improvement over basemap that fixes some of those limitations. However, Cartopy is still bound by the speed and memory usage constraints of matplotlib when it comes to very large and complex maps. We present a new GMT Python wrapper library, gmt-python, built by the GMT team. We will show the design plans and internal implementation, and demonstrate an initial prototype of the library. Starting in version 5, GMT introduced a C API that is exposed through a shared library. The API allows access to GMT modules as C functions and provides mechanisms for input and output of binary data through shared memory. Our Python wrapper connects to this shared library using the ctypes standard library module.
The wrapper code can thus be written in pure Python, which greatly simplifies packaging and distribution. Input and output will be handled using the GMT C API's "virtual files", which allow access to memory shared between C and Python. We will implement a thin conversion layer between native scientific Python data types and the GMT internal data structures. Thus, we can accept input and produce output as numpy ndarrays and pandas DataFrames for tabular data and xarray Datasets for netCDF grids. Support for displaying figures inline in the Jupyter notebook is planned from the start, by retrieving PNG previews of the Postscript figures. This will allow GMT to be seamlessly integrated into the rich scientific Python ecosystem. Internally, the gmt-python library will contain low-level wrappers around the GMT C API functions and data structures. Users will interact with a higher-level API in which each GMT module is represented by a function. The module functions can take arguments as a single string representing the command-line arguments ("-Rg -JN180"), as keyword arguments (R='g', J='N180'), or as long-form aliases (region='g', projection='N180'). The Python interface relies on new features in GMT that are currently under development in the trunk of the SVN repository, mainly the "modern" execution mode, which greatly simplifies the building of Postscript figures and maps. We are working in close collaboration with the rest of the GMT core developers to make changes on the GMT side as we exercise the new C API and discover bugs and missing features. The capabilities of GMT go beyond the geosciences: it can produce high-quality line plots, bar graphs, histograms, and 3D surfaces. Significant barriers to entry for GMT have been the complexities of programming in bash and the many command-line options and their meanings. The Python wrapper library can serve as a backend for new, easier-to-use APIs, making GMT more accessible while retaining the high quality of figures.
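As a rough illustration (not the actual gmt-python code), the three calling conventions described above could be normalized by a small argument builder. The alias table here is a made-up two-entry subset, and the result is sorted only so the three forms compare equal in this sketch:

```python
# Map long-form aliases to single-letter GMT options (illustrative subset)
ALIASES = {"region": "R", "projection": "J"}

def build_args(arg_str=None, **kwargs):
    """Build a GMT argument list from a raw command-line string,
    short keyword arguments, or long-form aliases."""
    if arg_str is not None:
        return sorted(arg_str.split())
    args = []
    for key, value in kwargs.items():
        option = ALIASES.get(key, key)      # fall back to the short form
        args.append("-{}{}".format(option, value))
    return sorted(args)

print(build_args("-Rg -JN180"))                   # → ['-JN180', '-Rg']
print(build_args(R="g", J="N180"))                # → ['-JN180', '-Rg']
print(build_args(region="g", projection="N180"))  # → ['-JN180', '-Rg']
```

All three spellings collapse to the same flag list, which is the property that lets a wrapper expose friendlier names without changing what reaches the underlying C API.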
Work on gmt-python is still in the early stages of design and implementation. A prototype is not yet available but is expected in time for the conference in July. We are open and eager for contributions and feedback from the Scipy community. Comments? Leave one below or let me know on Twitter @leouieda or in the Software Underground Slack group. Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!

## March 28, 2017

### Mark Fenner

#### DC SVD II: From Values to Vectors

In our last installment, we discussed solutions to the secular equation. These solutions are the eigenvalues (and/or) singular values of matrices with a particular form. Since this post is otherwise light on technical content, I’ll dive into those matrix forms now. Setting up the Secular Equation to Solve the Eigen and Singular Problems In dividing-and-conquering […]

### Matthieu Brucher

#### Audio Toolkit: Recursive Least Square Filter (Adaptive module)

I started working on adaptive filtering a long time ago, but could never figure out why my simple implementation of the RLS algorithm failed. Well, there was a typo in the reference book! Now that this is fixed, let’s see what this guy does.

#### Algorithm

The RLS algorithm learns an input signal based on its past and predicts new values from it. As such, it can be used to learn periodic signals, but also noise. The basis is to predict a new value from the past, compare it to the actual value, and update the set of coefficients. The update itself is governed by a memory (forgetting) time constant: the higher the value, the slower the update. Once the filter has learned enough, the learning stage can be shut off, and the filter can be used to select frequencies.
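A minimal numpy sketch of the predict/compare/update loop described above (this is a generic textbook RLS predictor, not ATK's implementation; the order, forgetting factor, and initialization constant are illustrative choices):

```python
import numpy as np

def rls_predict(signal, order=10, lam=0.99, delta=100.0):
    """Minimal RLS predictor: estimate each sample from the previous
    `order` samples.  `lam` is the forgetting factor (the "memory":
    the closer to 1, the slower the update); `delta` initializes the
    inverse correlation matrix."""
    w = np.zeros(order)                  # filter coefficients
    P = np.eye(order) * delta            # inverse correlation matrix
    predictions = np.zeros(len(signal))
    for n in range(order, len(signal)):
        u = signal[n - order:n][::-1]    # most recent samples first
        y = w @ u                        # predict the new value
        predictions[n] = y
        e = signal[n] - y                # prediction error
        k = P @ u / (lam + u @ P @ u)    # gain vector
        w = w + k * e                    # update the coefficients
        P = (P - np.outer(k, u @ P)) / lam   # update inverse correlation
    return predictions

# Learn a sinusoid; after a short learning phase the prediction
# tracks the input closely.
t = np.arange(2000)
x = np.sin(0.1 * t)
pred = rls_predict(x)
print(np.abs(x - pred)[1500:].mean())    # small once converged
```

Freezing `w` after convergence corresponds to switching off the learning stage and using the filter as a fixed frequency-selective filter.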
#### Results

Let’s start with a simple sinusoidal signal and see whether an order-10 filter can learn it: Sinusoidal signal learnt with RLS As can be seen, at the beginning the filter is still learning, as its output doesn’t match the input. After a short time it does match (zooming in on the signal shows that there is a latency and that the amplitudes do not exactly match). Let’s see how it does for more complex signals by adding two additional, slightly out-of-tune sinusoids: Three out-of-tune sinusoids learnt with RLS Once again, after a short time the learning phase is stable, and we can switch it off and the signal is estimated properly. Now let’s try something a little more complex and try to denoise an input signal. Filtered noise The original noise in blue is estimated in green, and the residual noise is in red. Obviously, we don’t do a great job here, but let’s see what is actually attenuated: Filtered noise in the spectral domain So the middle of the bandwidth is better attenuated than the sides, which is expected in a way. Now, what does that do to a signal we try to denoise? Denoised signal Obviously, the signal is denoised, but also amplified! And the same happens in the spectral domain. Denoised signal in the spectral domain When looking at the estimated transfer function, the picture is a little clearer: Estimated spectral transfer function Our noise is actually between 0.6 and 1.2 rad/s (from sampling frequency/10 to sampling frequency/5), and the RLS filter underestimates it a little but doesn’t cut the high frequencies, which can lead to ringing… Learning the noise is also quite costly: Learning cost Learning was only activated during half the total processing time…

#### Conclusion

RLS filters are interesting for following a signal. Obviously this filter is just the start of this new module, and I hope I’ll have real denoising filters at some point.
This filter will be available in ATK 2.0.0 and is already in the develop branch, along with the Python example scripts.

### Matthew Rocklin

#### Dask and Pandas and XGBoost

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

## Summary

This post talks about distributing Pandas Dataframes with Dask and then handing them over to distributed XGBoost for training. More generally, it discusses the value of launching multiple distributed systems in the same shared-memory processes and smoothly handing data back and forth between them.

## Introduction

XGBoost is a well-loved library for a popular class of machine learning algorithms, gradient boosted trees. It is used widely in business and is one of the most popular solutions in Kaggle competitions. For larger datasets or faster training, XGBoost also comes with its own distributed computing system that lets it scale to multiple machines on a cluster. Fantastic. Distributed gradient boosted trees are in high demand. However, before we can use distributed XGBoost we need to do three things:

1. Prepare and clean our possibly large data, probably with a lot of Pandas wrangling
2. Set up XGBoost master and workers
3. Hand our cleaned data from a bunch of distributed Pandas dataframes to XGBoost workers across our cluster

This ends up being surprisingly easy. This blogpost gives a quick example using Dask.dataframe to do distributed Pandas data wrangling, then using a new dask-xgboost package to set up an XGBoost cluster inside the Dask cluster and perform the handoff. After this example we’ll talk about the general design and what this means for other distributed systems.
## Example

We have a ten-node cluster with eight cores each (m4.2xlarges on EC2):

```python
import dask
from dask.distributed import Client, progress

>>> client = Client('172.31.33.0:8786')
>>> client.restart()
<Client: scheduler='tcp://172.31.33.0:8786' processes=10 cores=80>
```

We load the Airlines dataset using dask.dataframe (just a bunch of Pandas dataframes spread across a cluster) and do a bit of preprocessing:

```python
import dask.dataframe as dd

# Subset of the columns to use
cols = ['Year', 'Month', 'DayOfWeek', 'Distance',
        'DepDelay', 'CRSDepTime', 'UniqueCarrier', 'Origin', 'Dest']

# Create the dataframe
df = dd.read_csv('s3://dask-data/airline-data/20*.csv', usecols=cols,
                 storage_options={'anon': True})

df = df.sample(frac=0.2)  # XGBoost requires a bit of RAM, we need a larger cluster

is_delayed = (df.DepDelay.fillna(16) > 15)  # column of labels
del df['DepDelay']  # Remove delay information from training dataframe
df['CRSDepTime'] = df['CRSDepTime'].clip(upper=2399)

df, is_delayed = dask.persist(df, is_delayed)  # start work in the background
```

This loaded a few hundred pandas dataframes from CSV data on S3. We then had to downsample because how we are going to use XGBoost in the future seems to require a lot of RAM. I am not an XGBoost expert. Please forgive my ignorance here. At the end we have two dataframes:

• df: data from which we will learn if flights are delayed
• is_delayed: whether or not those flights were delayed

Data scientists familiar with Pandas will probably be familiar with the code above. Dask.dataframe is very similar to Pandas, but operates on a cluster.
```python
>>> df.head()
        Year  Month  DayOfWeek  CRSDepTime UniqueCarrier Origin Dest  Distance
182193  2000      1          2         800            WN    LAX  OAK       337
83424   2000      1          6        1650            DL    SJC  SLC       585
346781  2000      1          5        1140            AA    ORD  LAX      1745
375935  2000      1          2        1940            DL    PHL  ATL       665
309373  2000      1          4        1028            CO    MCI  IAH       643

>>> is_delayed.head()
182193    False
83424     False
346781    False
375935    False
309373    False
Name: DepDelay, dtype: bool
```

### Categorize and One Hot Encode

XGBoost doesn’t want to work with text data like destination=”LAX”. Instead we create new indicator columns for each of the known airports and carriers. This expands our data into many boolean columns. Fortunately Dask.dataframe has convenience functions for all of this baked in (thank you Pandas!):

```python
>>> df2 = dd.get_dummies(df.categorize()).persist()
```

This expands our data out considerably, but makes it easier to train on.

```python
>>> len(df2.columns)
685
```

### Split and Train

Great, now we’re ready to split our distributed dataframes:

```python
data_train, data_test = df2.random_split([0.9, 0.1], random_state=1234)
labels_train, labels_test = is_delayed.random_split([0.9, 0.1], random_state=1234)
```

Start up a distributed XGBoost instance and train on this data:

```python
%%time
import dask_xgboost as dxgb

params = {'objective': 'binary:logistic', 'nround': 1000,
          'max_depth': 16, 'eta': 0.01, 'subsample': 0.5,
          'min_child_weight': 1, 'tree_method': 'hist',
          'grow_policy': 'lossguide'}

bst = dxgb.train(client, params, data_train, labels_train)

CPU times: user 355 ms, sys: 29.7 ms, total: 385 ms
Wall time: 54.5 s
```

Great, so we were able to train an XGBoost model on this data in about a minute using our ten machines. What we get back is just a plain XGBoost Booster object.
```python
>>> bst
<xgboost.core.Booster at 0x7fa1c18c4c18>
```

We could use this on normal Pandas data locally:

```python
import xgboost as xgb

pandas_df = data_test.head()
dtest = xgb.DMatrix(pandas_df)

>>> bst.predict(dtest)
array([ 0.464578  ,  0.46631625,  0.47434333,  0.47245741,  0.46194169], dtype=float32)
```

Or we can use dask-xgboost again to predict on our distributed holdout data, getting back another Dask series:

```python
>>> predictions = dxgb.predict(client, bst, data_test).persist()
>>> predictions
Dask Series Structure:
npartitions=93
None    float32
None        ...
         ...
None        ...
None        ...
Name: predictions, dtype: float32
Dask Name: _predict_part, 93 tasks
```

### Evaluate

We can bring these predictions to the local process and use normal Scikit-learn operations to evaluate the results:

```python
>>> from sklearn.metrics import roc_auc_score, roc_curve
>>> print(roc_auc_score(labels_test.compute(),
...                     predictions.compute()))
0.654800768411
```

```python
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(labels_test.compute(), predictions.compute())
# Taken from http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
plt.figure(figsize=(8, 8))
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
```

We might want to play with our parameters above or try different data to improve our solution. The point here isn’t that we predicted airline delays well; it’s that if you are a data scientist who knows Pandas and XGBoost, everything we did above seemed pretty familiar. There wasn’t a whole lot of new material in the example above. We’re using the same tools as before, just at a larger scale.
## Analysis

OK, now that we’ve demonstrated that this works, let’s talk a bit about what just happened and what it means generally for cooperation between distributed services.

### What dask-xgboost does

The dask-xgboost project is pretty small and pretty simple (200 TLOC). Given a Dask cluster of one central scheduler and several distributed workers, it starts up an XGBoost scheduler in the same process running the Dask scheduler and starts up an XGBoost worker within each of the Dask workers. They share the same physical processes and memory spaces. Dask was built to support this kind of situation, so this is relatively easy. Then we ask the Dask.dataframe to fully materialize in RAM and we ask where all of the constituent Pandas dataframes live. We tell each Dask worker to give all of the Pandas dataframes that it has to its local XGBoost worker and then just let XGBoost do its thing. Dask doesn’t power XGBoost; it just sets it up, gives it data, and lets it do its work in the background. People often ask what machine learning capabilities Dask provides, and how they compare with other distributed machine learning libraries like H2O or Spark’s MLLib. For gradient boosted trees, the 200-line dask-xgboost package is the answer. Dask has no need to implement such an algorithm because XGBoost already exists, works well, and provides Dask users with a fully featured and efficient solution. Because both Dask and XGBoost can live in the same Python process they can share bytes between each other without cost, can monitor each other, etc. These two distributed systems co-exist together in multiple processes in the same way that NumPy and Pandas operate together within a single process. Sharing distributed processes with multiple systems can be really beneficial if you want to use multiple specialized services easily and avoid large monolithic frameworks.
### Connecting to other distributed systems

A while ago I wrote a similar blogpost about hosting TensorFlow from Dask in exactly the same way that we’ve done here. It was similarly easy to set up TensorFlow alongside Dask, feed it data, and let TensorFlow do its thing. Generally speaking, this “serve other libraries” approach is how Dask operates when possible. We’re only able to cover the breadth of functionality that we do today because we lean heavily on the existing open source ecosystem. Dask.arrays use NumPy arrays, Dask.dataframes use Pandas, and now the answer to gradient boosted trees with Dask is just to make it really, really easy to use distributed XGBoost. Ta da! We get a fully featured solution that is maintained by other devoted developers, and the entire connection process was done over a weekend (see dmlc/xgboost #2032 for details). Since this has come out we’ve had requests to support other distributed systems, like Elemental, and to do general hand-offs to MPI computations. If we’re able to start both systems with the same set of processes then all of this is pretty doable. Many of the challenges of inter-system collaboration go away when you can hand numpy arrays between the workers of one system and the workers of the other system within the same processes.

## Acknowledgements

Thanks to Tianqi Chen and Olivier Grisel for their help when building and testing dask-xgboost. Thanks to Will Warner for his help in editing this post.

## March 23, 2017

### Continuum Analytics news

#### The Conda Configuration Engine for Power Users

Tuesday, April 4, 2017 Kale Franz Continuum Analytics

Released last fall, conda 4.2 brought with it configuration superpowers. The capabilities are extensive, and they're designed with conda power users, devops engineers, and sysadmins in mind. Configuration information comes from four basic sources:

1. hard-coded defaults,
2. configuration files,
3. environment variables, and
4. command-line arguments.
Each time a conda process initializes, an operating context is built that merges configuration sources in a cascading fashion. Command-line arguments hold the highest precedence, and hard-coded defaults the lowest. The configuration file search path has been dramatically expanded. In order from lowest to highest priority, and directly from the conda code:

```python
SEARCH_PATH = (
    '/etc/conda/.condarc',
    '/etc/conda/condarc',
    '/etc/conda/condarc.d/',
    '/var/lib/conda/.condarc',
    '/var/lib/conda/condarc',
    '/var/lib/conda/condarc.d/',
    '$CONDA_ROOT/.condarc',
    '$CONDA_ROOT/condarc',
    '$CONDA_ROOT/condarc.d/',
    '~/.conda/.condarc',
    '~/.conda/condarc',
    '~/.conda/condarc.d/',
    '~/.condarc',
    '$CONDA_PREFIX/.condarc',
    '$CONDA_PREFIX/condarc',
    '$CONDA_PREFIX/condarc.d/',
    '$CONDARC',
)
```
where environment variables and the user home directory are expanded on first use. $CONDA_ROOT is automatically set to the root environment prefix (and shouldn't be set by users), and $CONDA_PREFIX is automatically set for activated conda environments. Thus, conda environments can have their own individualized and customized configurations. For the ".d" directories in the search path, conda will read, in sorted order, any (and only) files ending with .yml or .yaml extensions. The $CONDARC environment variable can be any path to a file having a .yml or .yaml extension or containing "condarc" in the file name; it can also be a directory. Environment variables hold second-highest precedence, and all configuration parameters can be specified as environment variables. To convert from the condarc file-based configuration parameter name to the environment variable parameter name, make the name all uppercase and prepend CONDA_. For example, conda's always_yes configuration parameter can be specified using a CONDA_ALWAYS_YES environment variable. Configuration parameters in some cases have aliases. For example, setting always_yes: true or yes: true in a configuration file is equivalent to the command-line flag --yes. They're all also equivalent to the CONDA_ALWAYS_YES=true and CONDA_YES=true environment variables. A validation error is thrown if multiple parameters aliased to each other are specified within a single configuration source. There are three basic configuration parameter types: primitive, map, and sequence. Each follows a slightly different set of merge rules. The primitive configuration parameter is the easiest to merge. Within the linearized chain of information sources, the last source that sets the parameter wins. There is one caveat: if the parameter is trailed by a #!final flag, the merge cascade stops for that parameter. (Indeed, the markup concept is borrowed from the !important rule in CSS.)
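The parameter-name-to-environment-variable rule described above is mechanical enough to sketch in a couple of lines (the helper function is ours, not part of conda):

```python
def to_env_var(param_name):
    """Convert a condarc configuration parameter name to its
    environment variable form: uppercase it and prepend CONDA_."""
    return "CONDA_" + param_name.upper()

print(to_env_var("always_yes"))      # → CONDA_ALWAYS_YES
print(to_env_var("proxy_servers"))   # → CONDA_PROXY_SERVERS
```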
While still giving end-users extreme flexibility in most cases, we also give sysadmins the ability to lock down as much configuration as needed by making files read-only.

Map configuration parameters have elements that are key-value pairs. Merges happen at the per-key level. Given two files with the contents

```yaml
# file: /etc/conda/condarc.d/proxies.yml
proxy_servers:
  https: http://prod-proxy

# file: ~/.conda/condarc.d/proxies.yml
proxy_servers:
  http: http://dev-proxy:1080
  https: http://dev-proxy:1081
```

the merged proxy_servers configuration will be

```yaml
proxy_servers:
  http: http://dev-proxy:1080
  https: http://dev-proxy:1081
```

However, by modifying the contents of the first file to be

```yaml
# file: /etc/conda/condarc.d/proxies.yml
proxy_servers:
  https: http://prod-proxy  #!final
```

the merged settings will be

```yaml
proxy_servers:
  http: http://dev-proxy:1080
  https: http://prod-proxy
```

Note that the !final flag here acts at the per-key level. A !final flag can also be set for the parameter as a whole. With the first file again changed to

```yaml
# file: /etc/conda/condarc.d/proxies.yml
proxy_servers:  #!final
  https: http://prod-proxy
```

the merged settings will be

```yaml
proxy_servers:
  https: http://prod-proxy
```

with no http key defined.

Sequence parameter merges are the most involved. Consider the contents of these three files:

```yaml
# file: /etc/conda/condarc
channels:
  - one
  - two

# file: ~/.condarc
channels:
  - three
  - four

# file: $CONDA_PREFIX/.condarc
channels:
  - five
  - six
```

The final merged configuration will be

```yaml
channels:
  - five
  - six
  - three
  - four
  - one
  - two
```
Sequence order within each individual configuration source is preserved, while still respecting the sources' overall precedence. Just like map parameters, a !final flag can be used for a sequence parameter as a whole. However, the !final flag does not apply to individual elements of sequence parameters; instead, !top and !bottom flags are available. Modifying the sequence example to the following

```yaml
# file: /etc/conda/condarc
channels:
  - one  #!top
  - two

# file: ~/.condarc
channels:  #!final
  - three
  - four  #!bottom

# file: $CONDA_PREFIX/.condarc
channels:
  - five
  - six
```

will yield a final merged configuration

```yaml
channels:
  - one
  - three
  - two
  - four
```

Managing all of these new sources of configuration could become difficult without some new tools. The most basic is conda config --validate, which simply exits 0 if conda's configured state passes all validation tests. The command conda config --describe (recently added in 4.3.16) gives a detailed description of available configuration parameters. We've also added the commands conda config --show-sources and conda config --show. The first displays all of the configuration information conda recognizes, in its non-merged form, broken out per source. The second gives the final, merged values for all configuration parameters. Conda's configuration engine gives power users tools for ultimate control. If you've read to this point, that's probably you. And as a conda power user, please consider participating in the conda canary program. Be on the cutting edge, and also help influence new conda features and behaviors before they're solidified in general availability releases.

### Matthew Rocklin

#### Dask Release 0.14.1

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation. I’m pleased to announce the release of Dask version 0.14.1. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on February 27th. As always, you can conda install from conda-forge:

```
conda install -c conda-forge dask distributed
```

or pip install from PyPI:

```
pip install dask[complete] --upgrade
```

## Arrays

Recent work in distributed computing and machine learning has motivated new performance-oriented and usability changes to how we handle arrays.

### Automatic chunking and operation on NumPy arrays

Many interactions between Dask arrays and NumPy arrays work smoothly.
NumPy arrays are made lazy and are appropriately chunked to match the operation and the Dask array.

```python
>>> x = np.ones(10)                 # a numpy array
>>> y = da.arange(10, chunks=(5,))  # a dask array
>>> z = x + y                       # combined become a dask.array
>>> z
dask.array<add, shape=(10,), dtype=float64, chunksize=(5,)>

>>> z.compute()
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])
```

### Reshape

Reshaping distributed arrays is simple in simple cases, and can be quite complex in complex cases. Reshape now supports a much broader set of shape transformations where any dimension is collapsed or merged into other dimensions.

```python
>>> x = da.ones((2, 3, 4, 5, 6), chunks=(2, 2, 2, 2, 2))
>>> x.reshape((6, 2, 2, 30, 1))
dask.array<reshape, shape=(6, 2, 2, 30, 1), dtype=float64, chunksize=(3, 1, 2, 6, 1)>
```

This operation ends up being quite useful in a number of distributed array cases.

### Optimize Slicing to Minimize Communication

Dask.array slicing optimizations are now careful to produce graphs that avoid situations that could cause excess inter-worker communication. The details of how they do this are a bit out of scope for a short blogpost, but the history here is interesting.

Historically dask.arrays were used almost exclusively by researchers with large on-disk arrays stored as HDF5 or NetCDF files. These users primarily used the single-machine multi-threaded scheduler. We heavily tailored Dask array optimizations to this situation and made that community pretty happy. Now, as some of that community switches to cluster computing on larger datasets, the optimization goals shift a bit. We have tons of distributed disk bandwidth but really want to avoid communicating large results between workers. Supporting both use cases is possible and I think that we've achieved that in this release so far, but it's starting to require increasing levels of care.

### Micro-optimizations

With distributed computing also comes larger graphs and a growing importance of graph-creation overhead.
This has been optimized somewhat in this release. We expect this to be a focus going forward.

## DataFrames

### Set_index

Set_index is smarter in two ways:

1. If you set_index on a column that happens to be sorted then we'll identify that and avoid a costly shuffle. This was always possible with the sorted= keyword but users rarely used this feature. Now this is automatic.
2. Similarly when setting the index we can look at the size of the data and determine if there are too many or too few partitions and rechunk the data while shuffling. This can significantly improve performance if there are too many partitions (a common case).

### Shuffle performance

We've micro-optimized some parts of dataframe shuffles. Big thanks to the Pandas developers for the help here. This accelerates set_index, joins, groupby-applies, and so on.

### Fastparquet

The fastparquet library has seen a lot of use lately and has undergone a number of community bugfixes. Importantly, fastparquet now supports Python 2. We strongly recommend Parquet as the standard data storage format for Dask dataframes (and Pandas DataFrames).

dask/fastparquet #87

## Distributed Scheduler

### Replay remote exceptions

Debugging is hard in part because exceptions happen on remote machines where normal debugging tools like pdb can't reach. Previously we were able to bring back the traceback and exception, but you couldn't dive into the stack trace to investigate what went wrong:

```python
def div(x, y):
    return x / y

>>> future = client.submit(div, 1, 0)
>>> future
<Future: status: error, key: div-4a34907f5384bcf9161498a635311aeb>

>>> future.result()  # getting result re-raises exception locally
<ipython-input-3-398a43a7781e> in div()
      1 def div(x, y):
----> 2     return x / y
ZeroDivisionError: division by zero
```

Now Dask can bring a failing task and all necessary data back to the local machine and rerun it so that users can leverage the normal Python debugging toolchain.
```python
>>> client.recreate_error_locally(future)
<ipython-input-3-398a43a7781e> in div(x, y)
      1 def div(x, y):
----> 2     return x / y
ZeroDivisionError: division by zero
```

Now if you're in IPython or a Jupyter notebook you can use the %debug magic to jump into the stacktrace, investigate local variables, and so on.

```python
In [8]: %debug
> <ipython-input-3-398a43a7781e>(2)div()
      1 def div(x, y):
----> 2     return x / y
ipdb> pp x
1
ipdb> pp y
0
```

dask/distributed #894

### Async/await syntax

Dask.distributed uses Tornado for network communication and Tornado coroutines for concurrency. Normal users rarely interact with Tornado coroutines; they aren't familiar to most people, so we opted instead to copy the concurrent.futures API. However some complex situations are much easier to solve if you know a little bit of async programming.

Fortunately, the Python ecosystem seems to be embracing this change towards native async code with the async/await syntax in Python 3. In an effort to motivate people to learn async programming and to gently nudge them towards Python 3, Dask.distributed now supports async/await in a few cases.

You can wait on a dask Future

```python
async def f():
    future = client.submit(func, *args, **kwargs)
    result = await future
```

You can put the as_completed iterator into an async for loop

```python
async for future in as_completed(futures):
    result = await future
    ... do stuff with result ...
```

And, because Tornado supports the await protocols, you can also use the existing shadow concurrency API (everything prepended with an underscore) with await. (This was doable before.)

```python
results = client.gather(futures)         # synchronous
...
results = await client._gather(futures)  # asynchronous
```

If you're in Python 2 you can always do this with normal yield and the tornado.gen.coroutine decorator.

dask/distributed #952

### Inproc transport

In the last release we enabled Dask to communicate over more things than just TCP. In practice this doesn't come up (TCP is pretty useful).
However in this release we now support single-machine "clusters" where the clients, scheduler, and workers are all in the same process and transfer data cost-free over in-memory queues. This allows the in-memory user community to use some of the more advanced features (asynchronous computation, spill-to-disk support, web diagnostics) that are only available in the distributed scheduler. This is on by default if you create a cluster with LocalCluster without using Nanny processes.

```python
>>> from dask.distributed import LocalCluster, Client
>>> cluster = LocalCluster(nanny=False)
>>> client = Client(cluster)
>>> client
<Client: scheduler='inproc://192.168.1.115/8437/1' processes=1 cores=4>

>>> from threading import Lock         # Not serializable
>>> lock = Lock()                      # Won't survive going over a socket
>>> [future] = client.scatter([lock])  # Yet we can send to a worker
>>> future.result()                    # ... and back
<unlocked _thread.lock object at 0x7fb7f12d08a0>
```

dask/distributed #919

### Connection pooling for inter-worker communications

Workers now maintain a pool of sustained connections between each other. This pool is of a fixed size and removes connections with a least-recently-used policy. It avoids re-connection delays when transferring data between workers. In practice this shaves off a millisecond or two from every communication.

This is actually a revival of an old feature that we had turned off last year when it became clear that the performance here wasn't a problem. Along with other enhancements, this takes our round-trip latency down to 11ms on my laptop.

```python
In [10]: %%time
    ...: for i in range(1000):
    ...:     future = client.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 4.96 s, sys: 348 ms, total: 5.31 s
Wall time: 11.1 s
```

There may be room for improvement here though. For comparison, here is the same test with concurrent.futures.ProcessPoolExecutor.
```python
In [14]: e = ProcessPoolExecutor(8)

In [15]: %%time
    ...: for i in range(1000):
    ...:     future = e.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 320 ms, sys: 56 ms, total: 376 ms
Wall time: 442 ms
```

Also, just to be clear, this measures total roundtrip latency, not overhead. Dask's distributed scheduler overhead remains in the low hundreds of microseconds.

dask/distributed #935

There has been activity around Dask and machine learning:

• dask-learn is undergoing some performance enhancements. It turns out that when you offer distributed grid search, people quickly want to scale up their computations to hundreds of thousands of trials.
• dask-glm now has a few decent algorithms for convex optimization. The authors wrote a blogpost very recently if you're interested: Developing Convex Optimization Algorithms in Dask
• dask-xgboost lets you hand distributed data in Dask dataframes or arrays directly to a distributed XGBoost system (which Dask will nicely set up and tear down for you). This was a nice example of easy hand-off between two distributed services running in the same processes.

## Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.0 release on February 27th

• Antoine Pitrou
• Brian Martin
• Elliott Sales de Andrade
• Erik Welch
• Francisco de la Peña
• jakirkham
• Jim Crist
• Jitesh Kumar Jha
• Julien Lhermitte
• Martin Durant
• Matthew Rocklin
• Markus Gonser
• Talmaj

The following people contributed to the dask/distributed repository since the 1.16.0 release on February 27th

• Antoine Pitrou
• Ben Schreck
• Elliott Sales de Andrade
• Martin Durant
• Matthew Rocklin
• Phil Elson

### numfocus

#### PyData Atlanta Meetup Celebrates 1 Year and over 1,000 members

PyData Atlanta holds a meetup at MailChimp, where Jim Crozier spoke about analyzing NFL data with PySpark.
Atlanta tells a new story about data

by Rob Clewley

In late 2015, the three of us (Tony Fast, Neel Shivdasani, and myself) had been regularly nerding out about data over beers and becoming fast friends. We were […]

## March 22, 2017

### Matthew Rocklin

#### Developing Convex Optimization Algorithms in Dask

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

## Summary

We build distributed optimization algorithms with Dask. We show both simple examples and also benchmarks from a nascent dask-glm library for generalized linear models. We also talk about the experience of learning Dask to do this kind of work.

This blogpost is co-authored by Chris White (Capital One) who knows optimization and Matthew Rocklin (Continuum Analytics) who knows distributed computing.

## Introduction

Many machine learning and statistics models (such as logistic regression) depend on convex optimization algorithms like Newton's method, stochastic gradient descent, and others. These optimization algorithms are both pragmatic (they're used in many applications) and mathematically interesting. As a result these algorithms have been the subject of study by researchers and graduate students around the world for years, both in academia and in industry.

Things got interesting about five or ten years ago when datasets grew beyond the size of working memory and "Big Data" became a buzzword. Parallel and distributed solutions for these algorithms have become the norm, and a researcher's skillset now has to extend beyond linear algebra and optimization theory to include parallel algorithms and possibly even network programming, especially if you want to explore and create more interesting algorithms.

However, relatively few people understand both mathematical optimization theory and the details of distributed systems.
Typically algorithmic researchers depend on the APIs of distributed computing libraries like Spark or Flink to implement their algorithms. In this blogpost we explore the extent to which Dask can be helpful in these applications. We approach this from two perspectives:

1. Algorithmic researcher (Chris): someone who knows optimization and iterative algorithms like Conjugate Gradient, Dual Ascent, or GMRES but isn't so hot on distributed computing topics like sockets, MPI, load balancing, and so on
2. Distributed systems developer (Matt): someone who knows how to move bytes around and keep machines busy but doesn't know the right way to do a line search or handle a poorly conditioned matrix

## Prototyping Algorithms in Dask

Given knowledge of algorithms and of NumPy array computing, it is easy to write parallel algorithms with Dask. For a range of complicated algorithmic structures we have two straightforward choices:

1. Use parallel multi-dimensional arrays to construct algorithms from common operations like matrix multiplication, SVD, and so on. This mirrors mathematical algorithms well but lacks some flexibility.
2. Create algorithms by hand that track operations on individual chunks of in-memory data and dependencies between them. This is very flexible but requires a bit more care.

Coding up either of these options from scratch can be a daunting task, but with Dask it can be as simple as writing NumPy code.

Let's build up an example of fitting a large linear regression model using both built-in array parallelism and fancier, more customized parallelization features that Dask offers. The dask.array module helps us to easily parallelize standard NumPy functionality using the same syntax; we'll start there.

### Data Creation

Dask has many ways to create dask arrays; to get us started quickly prototyping, let's create some random data in a way that should look familiar to NumPy users.
```python
import dask
import dask.array as da
import numpy as np
from dask.distributed import Client

client = Client()

## create inputs with a bunch of independent normals
beta = np.random.random(100)  # random beta coefficients, no intercept
X = da.random.normal(0, 1, size=(1000000, 100), chunks=(100000, 100))
y = X.dot(beta) + da.random.normal(0, 1, size=1000000, chunks=(100000,))

## make sure all chunks are ~equally sized
X, y = dask.persist(X, y)
client.rebalance([X, y])
```

Observe that X is a dask array stored in 10 chunks, each of size (100000, 100). Also note that X.dot(beta) runs smoothly for both numpy and dask arrays, so we can write code that basically works in either world.

Caveat: if X is a numpy array and beta is a dask array, X.dot(beta) will output an in-memory numpy array. This is usually not desirable, as you want to carefully choose when to load something into memory. One fix is to use multipledispatch to handle odd edge cases; for a starting example, check out the dot code here.

Dask also has convenient visualization features built in that we will leverage; below we visualize our data in its 10 independent chunks:

### Array Programming

If you can write iterative array-based algorithms in NumPy, then you can write iterative parallel algorithms in Dask.

As we've already seen, Dask inherits much of the NumPy API that we are familiar with, so we can write simple NumPy-style iterative optimization algorithms that will leverage the parallelism dask.array has built in already.
For example, if we want to naively fit a linear regression model on the data above, we are trying to solve the following convex optimization problem:

$$\min_{\beta} \; \|y - X\beta\|_2^2$$

Recall that in non-degenerate situations this problem has a closed-form solution that is given by:

$$\beta^* = (X^T X)^{-1} X^T y$$

We can compute $\beta^*$ using the above formula with Dask:

```python
## naive solution
beta_star = da.linalg.solve(X.T.dot(X), X.T.dot(y))

>>> abs(beta_star.compute() - beta).max()
0.0024817567237768179
```

Sometimes a direct solve is too costly, and we want to solve the above problem using only simple matrix-vector multiplications. To this end, let's take this one step further and actually implement a gradient descent algorithm which exploits parallel matrix operations. Recall that gradient descent iteratively refines an initial estimate of beta via the update:

$$\beta^+ = \beta - \alpha \nabla f(\beta), \quad \text{where} \quad \nabla f(\beta) = 2 X^T (X\beta - y)$$

where $\alpha$ can be chosen based on a number of different "step-size" rules; for the purposes of exposition, we will stick with a constant step-size:

```python
## quick step-size calculation to guarantee convergence
_, s, _ = np.linalg.svd(2 * X.T.dot(X))
step_size = 1 / s[0] - 1e-8  # just under 1 / (largest singular value)

## define some parameters
max_steps = 100
tol = 1e-8
beta_hat = np.zeros(100)  # initial guess

for k in range(max_steps):
    Xbeta = X.dot(beta_hat)
    func = ((y - Xbeta) ** 2).sum()
    gradient = 2 * X.T.dot(Xbeta - y)

    ## Update
    obeta = beta_hat
    beta_hat = beta_hat - step_size * gradient
    new_func = ((y - X.dot(beta_hat)) ** 2).sum()
    beta_hat, func, new_func = dask.compute(beta_hat, func, new_func)  # <--- Dask code

    ## Check for convergence
    change = np.absolute(beta_hat - obeta).max()
    if change < tol:
        break
```

```python
>>> abs(beta_hat - beta).max()
0.0024817567259038942
```

It's worth noting that almost all of this code is exactly the same as the equivalent NumPy code. Because Dask.array and NumPy share the same API, it's pretty easy for people who are already comfortable with NumPy to get started with distributed algorithms right away.
The only thing we had to change was how we produce our original data (da.random.normal instead of np.random.normal) and the call to dask.compute at the end of the update step. The dask.compute call tells Dask to go ahead and actually evaluate everything we've told it to do so far (Dask is lazy by default). Otherwise, all of the mathematical operations, matrix multiplies, slicing, and so on are exactly the same as with NumPy, except that Dask.array builds up a chunk-wise parallel computation for us and Dask.distributed can execute that computation in parallel.

To better appreciate all the scheduling that is happening in one update step of the above algorithm, here is a visualization of the computation necessary to compute beta_hat and the new function value new_func:

Each rectangle is an in-memory chunk of our distributed array and every circle is a numpy function call on those in-memory chunks. The Dask scheduler determines where and when to run all of these computations on our cluster of machines (or just on the cores of our laptop).

#### Array Programming + dask.delayed

Now that we've seen how to use the built-in parallel algorithms offered by Dask.array, let's go one step further and talk about writing more customized parallel algorithms. Many distributed "consensus" based algorithms in machine learning are based on the idea that each chunk of data can be processed independently in parallel, with each worker sending its guess for the optimal parameter value to some master node. The master then computes a consensus estimate for the optimal parameters and reports it back to all of the workers. Each worker then processes their chunk of data given this new information, and the process continues until convergence.

From a parallel computing perspective this is a pretty simple map-reduce procedure. Any distributed computing framework should be able to handle this easily. We'll use this as a very simple example of how to use Dask's more customizable parallel options.
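Stripped of any framework, the consensus loop just described fits in a few lines. The sketch below is ours, not from dask-glm: `local_estimate` is a hypothetical stand-in for whatever per-chunk work a real method does, and the map step runs sequentially here where Dask would run it in parallel.

```python
import numpy as np

def local_estimate(X_chunk, y_chunk, z):
    # hypothetical per-chunk update: pull the chunk's local least-squares
    # solution halfway toward the current consensus estimate z
    beta = np.linalg.lstsq(X_chunk, y_chunk, rcond=None)[0]
    return 0.5 * (beta + z)

rng = np.random.RandomState(0)
beta_true = np.array([2.0, -1.0])
chunks = []
for _ in range(4):  # four "workers", each holding one chunk of data
    X = rng.randn(100, 2)
    chunks.append((X, X.dot(beta_true)))

z = np.zeros(2)  # the consensus estimate held by the "master"
for _ in range(20):
    # map: every chunk computes its own estimate (under Dask these calls
    # would be wrapped in dask.delayed and run in parallel)
    betas = [local_estimate(X, y, z) for X, y in chunks]
    # reduce: the master averages them into a new consensus and broadcasts it
    z = np.mean(betas, axis=0)
```

Under dask.delayed the map step becomes `dask.delayed(local_estimate)(X, y, z)` per chunk followed by a single `dask.compute(*...)`, which is exactly the shape the ADMM loop in this post takes.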
One such algorithm is the Alternating Direction Method of Multipliers, or ADMM for short. For the sake of this post, we will consider the work done by each worker to be a black box. We will also be considering a regularized version of the problem above, namely:

$$\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

At the end of the day, all we will do is:

• create NumPy functions which define how each chunk updates its parameter estimates
• wrap those functions in dask.delayed
• call dask.compute and process the individual estimates, again using NumPy

First we need to define some local functions that the chunks will use to update their individual parameter estimates, and import the black box local_update step from dask_glm; we will also need the so-called shrinkage operator (which is the proximal operator for the $\ell_1$-norm in our problem):

```python
from dask_glm.algorithms import local_update

def local_f(beta, X, y, z, u, rho):
    return ((y - X.dot(beta)) ** 2).sum() + (rho / 2) * np.dot(beta - z + u, beta - z + u)

def local_grad(beta, X, y, z, u, rho):
    return 2 * X.T.dot(X.dot(beta) - y) + rho * (beta - z + u)

def shrinkage(beta, t):
    return np.maximum(0, beta - t) - np.maximum(0, -beta - t)

## set some algorithm parameters
max_steps = 10
lamduh = 7.2
rho = 1.0

(n, p) = X.shape
nchunks = X.npartitions

XD = X.to_delayed().flatten().tolist()  # A list of pointers to remote numpy arrays
yD = y.to_delayed().flatten().tolist()  # ... one for each chunk

# the initial consensus estimate
z = np.zeros(p)

# an array of the individual "dual variables" and parameter estimates,
# one for each chunk of data
u = np.array([np.zeros(p) for i in range(nchunks)])
betas = np.array([np.zeros(p) for i in range(nchunks)])

for k in range(max_steps):

    # process each chunk in parallel, using the black-box 'local_update' magic
    new_betas = [dask.delayed(local_update)(xx, yy, bb, z, uu, rho,
                                            f=local_f,
                                            fprime=local_grad)
                 for xx, yy, bb, uu in zip(XD, yD, betas, u)]
    new_betas = np.array(dask.compute(*new_betas))

    # everything else is NumPy code occurring at "master"
    beta_hat = 0.9 * new_betas + 0.1 * z

    # create consensus estimate
    zold = z.copy()
    ztilde = np.mean(beta_hat + np.array(u), axis=0)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))

    # update dual variables
    u += beta_hat - z
```

```python
>>> # Number of coefficients zeroed out due to L1 regularization
>>> print((z == 0).sum())
12
```

There is of course a little bit more work occurring in the above algorithm, but it should be clear that the distributed operations are not one of the difficult pieces. Using dask.delayed we were able to express a simple map-reduce algorithm like ADMM with similarly simple Python for loops and delayed function calls. Dask.delayed is keeping track of all of the function calls we wanted to make and what other function calls they depend on. For example, all of the local_update calls can happen independently of each other, but the consensus computation blocks on all of them.

We hope that both parallel algorithms shown above (gradient descent, ADMM) were straightforward to someone reading with an optimization background. These implementations run well on a laptop, a single multi-core workstation, or a thousand-node cluster if necessary. We've been building somewhat more sophisticated implementations of these algorithms (and others) in dask-glm.
They are more sophisticated from an optimization perspective (stopping criteria, step size, asynchronicity, and so on) but remain just as simple from a distributed computing perspective.

## Experiment

We compare dask-glm implementations against Scikit-Learn on a laptop, and then show them running on a cluster. A reproducible notebook is available here.

We're building more sophisticated versions of the algorithms above in dask-glm. This project has convex optimization algorithms for gradient descent, proximal gradient descent, Newton's method, and ADMM. These implementations extend the implementations above by also thinking about stopping criteria, step sizes, and other niceties that we avoided above for simplicity.

In this section we show off these algorithms by performing a simple numerical experiment that compares the numerical performance of proximal gradient descent and ADMM alongside Scikit-Learn's LogisticRegression and SGD implementations on a single machine (a personal laptop), and then follows up by scaling the dask-glm options to a moderate cluster.

Disclaimer: these experiments are crude. We're using artificial data, and we're not tuning parameters or even finding parameters at which these algorithms produce results of the same accuracy. The goal of this section is just to give a general feeling of how things compare.

We create data

```python
## size of problem (no. observations)
N = 8e6
chunks = 1e6
seed = 20009
beta = (np.random.random(15) - 0.5) * 3

X = da.random.random((N, len(beta)), chunks=chunks)
y = make_y(X, beta=np.array(beta), chunks=chunks)

X, y = dask.persist(X, y)
client.rebalance([X, y])
```

And run each of our algorithms as follows:

```python
# Dask-GLM Proximal Gradient
result = proximal_grad(X, y, lamduh=alpha)

# Dask-GLM ADMM
X2 = X.rechunk((1e5, None)).persist()  # ADMM prefers smaller chunks
y2 = y.rechunk(1e5).persist()
result = admm(X2, y2, lamduh=alpha)

# Scikit-Learn LogisticRegression
nX, ny = dask.compute(X, y)  # sklearn wants numpy arrays
result = LogisticRegression(penalty='l1', C=1).fit(nX, ny).coef_

# Scikit-Learn Stochastic Gradient Descent
result = SGDClassifier(loss='log',
                       penalty='l1',
                       l1_ratio=1,
                       n_iter=10,
                       fit_intercept=False).fit(nX, ny).coef_
```

We then compare with the $L_{\infty}$ norm (largest absolute difference):

```python
abs(result - beta).max()
```

Times and $L_\infty$ distance from the true "generative beta" for these parameters are shown in the table below:

| Algorithm | Error | Duration (s) |
|---|---|---|
| Proximal Gradient | 0.0227 | 128 |
| ADMM | 0.0125 | 34.7 |
| LogisticRegression | 0.0132 | 79 |
| SGDClassifier | 0.0456 | 29.4 |

Again, please don't take these numbers too seriously: these algorithms all solve regularized problems, so we don't expect the results to necessarily be close to the underlying generative beta (even asymptotically). The numbers above are meant to demonstrate that they all return results which were roughly the same distance from the beta above. Also, dask-glm is using a full four-core laptop while scikit-learn is restricted to use a single core.

In the sections below we include profile plots for proximal gradient and ADMM. These show the operations that each of eight threads was doing over time. You can mouse over rectangles/tasks and zoom in using the zoom tools in the upper right. You can see the difference in complexity of the algorithms. ADMM is much simpler from Dask's perspective but also saturates hardware better for this chunksize.
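Proximal gradient descent, benchmarked above but never spelled out, is only a small twist on the earlier gradient descent loop: take a gradient step on the smooth squared-loss term, then apply the shrinkage operator to handle the $\ell_1$ penalty. The following NumPy-only sketch is ours (toy data and parameter choices are assumptions, not dask-glm's implementation):

```python
import numpy as np

def shrinkage(beta, t):
    # soft-thresholding: the proximal operator of t * ||.||_1
    return np.maximum(0, beta - t) - np.maximum(0, -beta - t)

def proximal_grad_step(beta, X, y, lam, step):
    grad = 2 * X.T.dot(X.dot(beta) - y)  # gradient of ||y - X beta||^2
    return shrinkage(beta - step * grad, step * lam)

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
beta_true = np.array([1.0, 0.0, -2.0])
y = X.dot(beta_true)  # noiseless toy problem

# constant step size just under 1/L, where L is the largest
# singular value of the Hessian 2 * X^T X
L = np.linalg.svd(2 * X.T.dot(X))[1][0]
beta = np.zeros(3)
for _ in range(500):
    beta = proximal_grad_step(beta, X, y, lam=0.1, step=1 / L)
```

With a small penalty and noiseless data the iterates land very close to `beta_true`; dask-glm's version adds stopping criteria and step-size rules on top of exactly this update.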
#### Profile Plot for Proximal Gradient Descent

#### Profile Plot for ADMM

The general takeaway here is that dask-glm performs comparably to Scikit-Learn on a single machine. If your problem fits in memory on a single machine you should continue to use Scikit-Learn and Statsmodels. The real benefit of the dask-glm algorithms is that they scale and can run efficiently on data that is larger than memory by operating from disk on a single computer or on a cluster of computers working together.

### Cluster Computing

As a demonstration, we run a larger version of the data above on a cluster of eight m4.2xlarges on EC2 (8 cores and 30GB of RAM each).

We create a larger dataset with 800,000,000 rows and 15 columns across eight processes.

```python
N = 8e8
chunks = 1e7
seed = 20009
beta = (np.random.random(15) - 0.5) * 3

X = da.random.random((N, len(beta)), chunks=chunks)
y = make_y(X, beta=np.array(beta), chunks=chunks)

X, y = dask.persist(X, y)
```

We then run the same proximal_grad and admm operations from before:

```python
# Dask-GLM Proximal Gradient
result = proximal_grad(X, y, lamduh=alpha)

# Dask-GLM ADMM
X2 = X.rechunk((1e6, None)).persist()  # ADMM prefers smaller chunks
y2 = y.rechunk(1e6).persist()
result = admm(X2, y2, lamduh=alpha)
```

Proximal grad completes in around seventeen minutes while ADMM completes in around four minutes. Profiles for the two computations are included below:

#### Profile Plot for Proximal Gradient Descent

We include only the first few iterations here. Otherwise this plot is several megabytes.

Link to fullscreen plot

#### Profile Plot for ADMM

Link to fullscreen plot

These both obtained similar $L_{\infty}$ errors to what we observed before:

| Algorithm | Error | Duration (s) |
|---|---|---|
| Proximal Gradient | 0.0306 | 1020 |
| ADMM | 0.00159 | 270 |

This time, though, we had to be careful about a couple of things:

1. We explicitly deleted the old data after rechunking (ADMM prefers different chunksizes than proximal_gradient) because our full dataset, 100GB, is close enough to our total distributed RAM (240GB) that it's a good idea to avoid keeping replicas around needlessly. Things would have run fine, but spilling excess data to disk would have negatively affected performance.
2. We set the OMP_NUM_THREADS=1 environment variable to avoid over-subscribing our CPUs. Surprisingly, not doing so led both to worse performance and to non-deterministic results; this is an issue that we're still tracking down.

### Analysis

The algorithms in dask-glm are new and need development, but are in a usable state by people comfortable operating at this technical level. Additionally, we would like to attract other mathematical and algorithmic developers to this work. We've found that Dask provides a nice balance between being flexible enough to support interesting algorithms, while being managed enough to be usable by researchers without a strong background in distributed systems. In this section we're going to discuss the things that we learned from both Chris's (mathematical algorithms) and Matt's (distributed systems) perspectives, and then talk about possible future work. We encourage people to pay attention to future work; we're open to collaboration and think that this is a good opportunity for new researchers to meaningfully engage.

#### Chris's perspective

1. Creating distributed algorithms with Dask was surprisingly easy; there is still a small learning curve around when to call things like persist, compute, rebalance, and so on, but that can't be avoided. Using Dask for algorithm development has been a great learning environment for understanding the unique challenges associated with distributed algorithms (including communication costs, among others).
2. Getting the particulars of algorithms correct is non-trivial; there is still work to be done in better understanding the tolerance settings vs.
accuracy tradeoffs that are occurring in many of these algorithms, as well as fine-tuning the convergence criteria for increased precision.
3. On the software development side, reliably testing optimization algorithms is hard. Finding provably correct optimality conditions that should be satisfied and that are also numerically stable has been a challenge for me.
4. Working on algorithms in isolation is not nearly as fun as collaborating on them; please join the conversation and contribute!
5. Most importantly from my perspective, I've found there is a surprisingly large amount of misunderstanding in "the community" surrounding what optimization algorithms do in the world of predictive modeling, what problems they each individually solve, and whether or not they are interchangeable for a given problem. For example, Newton's method can't be used to optimize an $\ell_1$-regularized problem, and the coefficient estimates from an $\ell_1$-regularized problem are fundamentally (and numerically) different from those of an $\ell_2$-regularized problem (and from those of an unregularized problem). My own personal goal is that the API for dask-glm exposes these subtle distinctions more transparently and leads to more thoughtful modeling decisions "in the wild".

#### Matt's perspective

This work triggered a number of concrete changes within the Dask library:

1. We can convert Dask.dataframes to Dask.arrays. This is particularly important because people want to do pre-processing with dataframes but then switch to efficient multi-dimensional arrays for algorithms.
2. We had to unify the single-machine scheduler and distributed scheduler APIs a bit, notably adding a persist function to the single-machine scheduler. This was particularly important because Chris generally prototyped on his laptop but we wanted to write code that was effective on clusters.
3. Scheduler overhead can be a problem for the iterative dask-array algorithms (gradient descent, proximal gradient descent, BFGS).
This is particularly a problem because NumPy is very fast. Often our tasks take only a few milliseconds, which makes Dask's overhead of 200 µs per task very relevant (this is why you see whitespace in the profile plots above). We've started resolving this problem in a few ways, like more aggressive task fusion and lower overheads generally, but this will be a medium-term challenge. In practice for dask-glm we've started handling this just by choosing chunksizes well. I suspect that for dask-glm in particular we'll just develop auto-chunksize heuristics that will mostly solve this problem. However, we expect this problem to recur in other work with scientists on HPC systems who have similar situations.
4. A couple of things can be tricky for algorithmic users:
   1. Placing the calls to asynchronously start computation (persist, compute). In practice Chris did a good job here and then I came through and tweaked things afterwards. The web diagnostics ended up being crucial to identify issues.
   2. Avoiding accidentally calling NumPy functions on dask.arrays and vice versa. We've improved this on the dask.array side, and dask.arrays now operate intelligently when given numpy arrays. Changing this on the NumPy side is harder until NumPy protocols change (which is planned).

#### Future work

There are a number of things we would like to do, both in terms of measurement and for the dask-glm project itself. We welcome people to voice their opinions (and join development) on the following issues:

1. Asynchronous Algorithms
2. User APIs
3. Extend GLM families
4. Write more extensive, rigorous algorithm testing, both for satisfying provable optimality criteria and for robustness to various input data
5. Begin work on smart initialization routines

What is your perspective here, gentle reader? Both Matt and Chris can use help on this project. We hope that some of the issues above provide seeds for community engagement.
We welcome other questions, comments, and contributions either as github issues or comments below.

## Acknowledgements

Thanks also go to Hussain Sultan (Capital One) and Tom Augspurger for collaboration on Dask-GLM and to Will Warner (Continuum) for reviewing and editing this post.

### numfocus

#### nteract: Building on top of Jupyter (from a rich REPL toolkit to interactive notebooks)

This post originally appeared on the nteract blog. nteract builds upon the very successful foundations of Jupyter. I think of Jupyter as a brilliantly rich REPL toolkit. A typical REPL (Read-Eval-Print-Loop) is an interpreter that takes input from the user and prints results (on stdout and stderr). Here’s the standard Python interpreter; a REPL many […]

## March 21, 2017

### Matthieu Brucher

#### Announcement: Audio TK 1.5.0

ATK is updated to 1.5.0 with new features oriented around preamplifiers and optimizations. It is also now compiled on Appveyor: https://ci.appveyor.com/project/mbrucher/audiotk. Thanks to Travis and Appveyor, binaries for the releases are now published on Github. On all platforms we compile static and shared libraries. On Linux, builds for gcc 5, gcc 6, clang 3.8 and clang 3.9 are generated; on OS X, XCode 7 and XCode 8 builds are available as universal binaries; and on Windows, 32-bit and 64-bit builds with dynamic or static runtime (no shared libraries in the static case) are also generated.

Download link: ATK 1.5.0

Changelog:

1.5.0
* Adding a follower class solid state preamplifier with Python wrappers
* Adding a Dempwolf model for tube filters with Python wrappers
* Adding a Munro-Piazza model for tube filters with Python wrappers
* Optimized distortion and preamplifier filters by using fmath exp calls

1.4.1
* Vectorized x4 the IIR part of the IIR filter
* Vectorized delay filters
* Fixed bug in gain filters

## March 20, 2017

### Continuum Analytics news

#### Announcing Anaconda Project: Data Science Project Encapsulation and Deployment, the Easy Way!
Monday, March 20, 2017

Christine Doig, Sr. Data Scientist, Product Manager
Kristopher Overholt, Product Manager

One year ago, we presented Anaconda and Docker: Better Together for Reproducible Data Science. In that blog post, we described our vision and a foundational approach to portable and reproducible data science using Anaconda and Docker. This approach embraced the philosophy of Open Data Science, in which data scientists can connect the powerful data science experience of Anaconda with the tools that they know and love, which today include Jupyter notebooks, machine learning frameworks, data analysis libraries, big data computations and connectivity, visualization toolkits, high-performance numerical libraries and more. We also discussed how data scientists could use Anaconda to develop data science analyses on their local machine, then use Docker to deploy those same analyses into production. This was the state of data science encapsulation and deployment that we presented last year:

## project-1.png

In this blog post, we’ll be diving deeper into how we’ve created a standard data science project encapsulation approach that helps data scientists deploy secure, scalable and reproducible projects across an entire team with Anaconda. This blog post also provides more details about how we’re using Anaconda and Docker for encapsulation and containerization of data science projects to power the data science deployment functionality in the next generation of Anaconda Enterprise, which augments our truly end-to-end data science platform.

### Supercharge Your Data Science with More Than Just Dockerfiles!

The reality is, as much as Docker is loved and used by the DevOps community, it is not the preferred tool or entry point for data scientists looking to deploy their applications.
Using Docker alone as a data science encapsulation strategy still requires data scientists to coordinate with their IT and DevOps teams to write Dockerfiles, install the required system libraries in their containers, and orchestrate and deploy their Docker containers into production. Having data scientists worry about infrastructure details and DevOps tooling takes away time from their most valuable skills: finding insights in data, modeling and running experiments, and delivering consumable data-driven applications to their team and end users. Data scientists enjoy using the packages they know and love with Anaconda along with conda environments, and wish it were as easy to deploy data science projects as it is to get Anaconda running on their laptop. By working directly with our amazing customers and users and listening to the needs of their data science teams over the last five years, we have clearly identified how Anaconda and Docker can be used together for data science project encapsulation and as a more useful abstraction layer for data scientists: Anaconda Projects.

### The Next Generation of Portable and Reproducible Data Science with Anaconda

As part of the next generation of data science encapsulation, reproducibility and deployment, we are happy to announce the release of Anaconda Project with the latest release of Anaconda! Download the latest version of Anaconda 4.3.1 to get started with Anaconda Project today. Or, if you already have Anaconda, you can install Anaconda Project using the following command:

conda install anaconda-project

Anaconda Project makes it easy to encapsulate data science projects and makes them fully portable and deployment-ready. It automates the configuration and setup of data science projects, such as installing the necessary packages and dependencies, downloading data sets and required files, setting environment variables for credentials or runtime configuration, and running commands.
Anaconda Project is an open source tool created by Continuum Analytics that delivers lightweight, efficient encapsulation and portability of data science projects. Learn more by checking out the Anaconda Project documentation. Anaconda Project makes it easy to reproduce your data science analyses, share data science projects with others, run projects across different platforms, or deploy data science applications with a single click in Anaconda Enterprise. Whether you’re running a project locally or deploying a project with Anaconda Enterprise, you are using the same project encapsulation standard: an Anaconda Project. We’re bringing you the next generation of true Open Data Science deployment in 2017 with Anaconda:

## project-2.png

#### New Release of Anaconda Navigator with Support for Anaconda Projects

As part of this release of Anaconda Project, we’ve added easy data science project creation and encapsulation to the familiar Anaconda Navigator experience, which is a graphical interface for your Anaconda environments and data science tools. You can easily create, edit, and upload Anaconda Projects to Anaconda Cloud through a graphical interface:

## anaconda-project-a (1).gif

Download the latest version of Anaconda 4.3.1 to get started with Anaconda Navigator and Anaconda Project today. Or, if you already have Anaconda, you can install the latest version of Anaconda Navigator using the following command:

conda install anaconda-navigator

When you’re using Anaconda Project with Navigator, you can create a new project and specify its dependencies, or you can import an existing conda environment file (environment.yaml) or pip requirements file (requirements.txt).
#### Anaconda Project examples:

• Image classifier web application using Tensorflow and Flask
• Live Python and R notebooks that retrieve the latest stock market data
• Interactive Bokeh and Shiny applications for data clustering, cross filtering, and data exploration
• Interactive visualizations of data sets with Bokeh, including streaming data
• Machine learning models with REST APIs

To get started even quicker with portable data science projects, refer to the example Anaconda Projects on Anaconda Cloud.

### Deploying Secure and Scalable Data Science Projects with Anaconda Enterprise

The new data science deployment and collaboration functionality in Anaconda Enterprise leverages Anaconda Project plus industry-standard containerization with Docker and enterprise-ready container orchestration technology with Kubernetes. This productionization and deployment strategy makes it easy to create and deploy data science projects with a single click for projects that use Python 2, Python 3, or R (including their dependencies in C++, Fortran, Java, etc.), or anything else you can build with the 730+ packages in Anaconda.

## project-3.png

### From Data Science Development to Deployment with Anaconda Projects and Anaconda Enterprise

All of this is possible without having to edit Dockerfiles directly, install system packages in your Docker containers, or manually deploy Docker containers into production. Anaconda Enterprise handles all of that for you, so you can get back to doing data science analysis. The result is that any project that a data scientist can create on their machine with Anaconda can be deployed to an Anaconda Enterprise cluster in a secure, scalable, and highly available manner with just a single click, including live notebooks, interactive applications, machine learning models with REST APIs, or any other projects that leverage the 730+ packages in Anaconda.
## anaconda-project-b (1).gif

Anaconda is such a foundational and ubiquitous data science platform that other lightweight data science workspaces and workbenches are using Anaconda as a necessary core component for their portable and reproducible data science. Anaconda is the leading Open Data Science platform powered by Python and empowers data scientists with a truly integrated experience and support for end-to-end workflows. Why would you want your data science team using Anaconda in production with anything other than Anaconda Enterprise? Anaconda Enterprise is a true end-to-end data science platform that integrates with all of the most popular tools and platforms and provides your data science team with an on-premises package repository, secure enterprise notebook collaboration, data science and analytics on Hadoop/Spark, and secure and scalable data science deployment. Anaconda Enterprise also includes support for all of the 730+ Open Data Science packages in Anaconda. Finally, Anaconda Scale is the only recommended and certified method for deploying Anaconda to a Hadoop cluster for PySpark or SparkR jobs.

### Getting Started with Anaconda Enterprise and Anaconda Projects

Anaconda Enterprise uses Anaconda Project and Docker as its standard project encapsulation and deployment format to enable simple one-click deployments of secure and scalable data science applications for your entire data science team. Are you interested in using Anaconda Enterprise in your organization to deploy data science projects, including live notebooks, machine learning models, dashboards, and interactive applications? Access to the next generation of Anaconda Enterprise v5, which features one-click secure and scalable data science deployments, is now available as a technical preview as part of the Anaconda Enterprise Innovator Program. Join the Anaconda Enterprise v5 Innovator Program today to discover the powerful data science deployment capabilities for yourself.
Anaconda Enterprise handles your secure and scalable data science project encapsulation and deployment requirements so that your data science team can focus on data exploration and analysis workflows and spend less time worrying about infrastructure and DevOps tooling.

## March 16, 2017

### Titus Brown

#### Registration reminder for our two-week summer workshop on high-throughput sequencing data analysis!

Our two-week summer workshop (announcement, direct link) is shaping up quite well, but the application deadline is today! So if you're interested, you should apply sometime before the end of the day. (We'll leave applications open as long as it's March 17th somewhere in the world.)

Some updates and expansions on the original announcement --

• we'll be training attendees in high-performance computing, in the service of doing bioinformatics analyses. To that end, we've received a large grant from NSF XSEDE, and we'll be using JetStream for our analyses.
• we have limited financial support that will be awarded after acceptances are issued in a week.

Here's the original announcement below:

## ANGUS: Analyzing High Throughput Sequencing Data

June 26-July 8, 2017
University of California, Davis

• Zero-entry - no experience required or expected!
• Hands-on training in using the UNIX command line to analyze your sequencing data.
• Friendly, helpful instructors and TAs!
• Summer sequencing camp - meet and talk science with great people!
• Now in its eighth year!

The workshop fee will be $500 for the two weeks, and on-campus room and board is available for $500/week. Applications will close March 17th. International and industry applicants are more than welcome!

Please see http://ivory.idyll.org/dibsi/ANGUS.html for more information, and contact dibsi.training@gmail.com if you have questions or suggestions.
--titus

### numfocus

#### Facebook Makes Sophisticated Forecasting Techniques Available to Non-Experts Thanks to Stan, a NumFOCUS Sponsored Project

Prophet is built on Stan. Facebook recently announced that they have made their forecasting tool, Prophet, open source. This is great news for data scientists and business analysts alike; forecasting is an important but tricky process that is critical to many organizations, both for-profit and non-profit. The Prophet forecasting tool is able […]

## March 14, 2017

### Thomas Wiecki

#### Random-Walk Bayesian Deep Networks: Dealing with Non-Stationary Data

(c) 2017 by Thomas Wiecki -- Quantopian Inc.

Most problems solved by Deep Learning are stationary. A cat is always a cat. The rules of Go have remained stable for 2,500 years, and will likely stay that way. However, what if the world around you is changing? This is common, for example, when applying Machine Learning in Quantitative Finance. Markets are constantly evolving, so features that are predictive in some time period might lose their edge while other patterns emerge. Usually, quants would just retrain their classifiers every once in a while. This approach of just re-estimating the same model on more recent data is very common. I find that to be a pretty unsatisfying way of modeling, as there are certain shortfalls:

• The estimation window should be long, so as to incorporate as much training data as possible.
• The estimation window should be short, so as to incorporate only the most recent data, as old data might be obsolete.
• When you have no estimate of how fast the world around you is changing, there is no principled way of setting the window length to balance these two objectives.

Certainly there is something to be learned even from past data; we just need to instill our models with a sense of time and recency. Enter random-walk processes.
Ever since I learned about them in the stochastic volatility model they have become one of my favorite modeling tricks. Basically, they allow you to turn every static model into a time-sensitive one. You can read more about the details of random-walk priors here, but the central idea is that, in any time-series model, rather than assuming a parameter to be constant over time, we allow it to change gradually, following a random walk. For example, take a logistic regression:

$$Y_i = f(\beta X_i)$$

where $f$ is the logistic function and $\beta$ is our learnable parameter. If we assume that our data is not iid and that $\beta$ is changing over time, we need a different $\beta_i$ for every $i$:

$$Y_i = f(\beta_i X_i)$$

Of course, this will just overfit, so we need to constrain our $\beta_i$ somehow. We will assume that while $\beta_i$ is changing over time, it will do so rather gradually, by placing a random-walk prior on it:

$$\beta_t \sim \mathcal{N}(\beta_{t-1}, s^2)$$

So $\beta_t$ is allowed to deviate only a little bit (determined by the step-width $s$) from its previous value $\beta_{t-1}$. $s$ can be thought of as a stability parameter -- how fast is the world around you changing.
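To build intuition for the role of $s$, here is a small NumPy sketch (separate from the PyMC3 model below) that draws paths from this random-walk prior; the helper name is mine:

```python
import numpy as np

rng = np.random.RandomState(42)

def random_walk_prior_draw(n_steps, s, beta0=0.0):
    # beta_t ~ Normal(beta_{t-1}, s**2), starting from beta0:
    # equivalent to a cumulative sum of Normal(0, s**2) increments.
    return beta0 + np.cumsum(rng.normal(0.0, s, size=n_steps))

slow = random_walk_prior_draw(100, s=0.05)  # a slowly drifting world
fast = random_walk_prior_draw(100, s=1.0)   # a rapidly changing world
```

A small $s$ produces a path that barely moves (close to a constant parameter), while a large $s$ lets the parameter wander freely; inferring $s$ from data is exactly how the model learns how fast the world is changing.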

Let's first generate some toy data and then implement this model in PyMC3. We will then use this same trick in a Neural Network with hidden layers.

If you would like a more complete introduction to Bayesian Deep Learning, see my recent ODSC London talk. This blog post takes things one step further so definitely read further below.

In [1]:
%matplotlib inline
import pymc3 as pm
import theano.tensor as T
import theano
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
from sklearn import datasets
from sklearn.preprocessing import scale

import warnings
from scipy import VisibleDeprecationWarning
warnings.filterwarnings("ignore", category=VisibleDeprecationWarning)

sns.set_context('notebook')


### Generating data

First, let's generate some toy data -- a simple binary classification problem that's linearly separable. To introduce the non-stationarity, we will rotate this data around its center across time. Feel free to skip over the next few code cells.

In [2]:
X, Y = sklearn.datasets.make_blobs(n_samples=1000, centers=2, random_state=1)
X = scale(X)
colors = Y.astype(str)
colors[Y == 0] = 'r'
colors[Y == 1] = 'b'

interval = 20
subsample = X.shape[0] // interval
chunk = np.arange(0, X.shape[0]+1, subsample)
degs = np.linspace(0, 360, len(chunk))

sep_lines = []

for ii, (i, j, deg) in enumerate(list(zip(np.roll(chunk, 1), chunk, degs))[1:]):
    theta = np.deg2rad(deg)  # rotation angle for this chunk, in radians
    c, s = np.cos(theta), np.sin(theta)
    R = np.matrix([[c, -s], [s, c]])

    X[i:j, :] = X[i:j, :].dot(R)

In [4]:
import base64
from tempfile import NamedTemporaryFile

VIDEO_TAG = """<video controls>
<source src="data:video/x-m4v;base64,{0}" type="video/mp4">
Your browser does not support the video tag.
</video>"""

def anim_to_html(anim):
    if not hasattr(anim, '_encoded_video'):
        anim.save("test.mp4", fps=20, extra_args=['-vcodec', 'libx264'])
        # read the saved video file back in for base64 encoding
        with open("test.mp4", "rb") as f:
            video = f.read()
        anim._encoded_video = base64.b64encode(video).decode('utf-8')
    return VIDEO_TAG.format(anim._encoded_video)

from IPython.display import HTML

def display_animation(anim):
    plt.close(anim._fig)
    return HTML(anim_to_html(anim))
from matplotlib import animation

# First set up the figure, the axis, and the plot element we want to animate
fig, ax = plt.subplots()
ims = [] #l, = plt.plot([], [], 'r-')
for i in np.arange(0, len(X), 10):
    ims.append([ax.scatter(X[:i, 0], X[:i, 1], color=colors[:i])])

ax.set(xlabel='X1', ylabel='X2')
# call the animator.  blit=True means only re-draw the parts that have changed.
anim = animation.ArtistAnimation(fig, ims,
                                 interval=500, blit=True)

display_animation(anim)

Out[4]:
Your browser does not support the video tag.

The last frame of the video, where all the data is plotted, is what a classifier with no sense of time would see. Thus, the problem we set up is impossible to solve when ignoring time, but trivial once you take it into account.

How would we classically solve this? We could train a separate classifier on each subset, but as I wrote above, we would need to get the window length right, and each classifier would see less data overall.
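To make the classic per-window approach concrete, here is a minimal NumPy-only sketch (the helper names are mine, not from the post) that refits an independent logistic regression on each time window:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=500):
    # Plain gradient-descent logistic regression (no intercept term).
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def fit_per_window(X, y, n_windows):
    # The classic workaround: one independent classifier per time window.
    # Each fit only ever sees 1/n_windows of the data, and nothing ties
    # neighboring windows together.
    bounds = np.linspace(0, len(X), n_windows + 1).astype(int)
    return [fit_logistic(X[i:j], y[i:j])
            for i, j in zip(bounds[:-1], bounds[1:])]
```

The random-walk prior introduced below replaces this hard windowing with a soft constraint that couples the weights of adjacent time chunks.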

## Random-Walk Logistic Regression in PyMC3

In [5]:
from pymc3 import HalfNormal, GaussianRandomWalk, Bernoulli
from pymc3.math import sigmoid
import theano.tensor as tt

X_shared = theano.shared(X)
Y_shared = theano.shared(Y)

n_dim = X.shape[1] # 2

with pm.Model() as random_walk_perceptron:
    step_size = pm.HalfNormal('step_size', sd=np.ones(n_dim),
                              shape=n_dim)

    # This is the central trick: PyMC3 already comes with this distribution
    w = pm.GaussianRandomWalk('w', sd=step_size,
                              shape=(interval, 2))

    weights = tt.repeat(w, X_shared.shape[0] // interval, axis=0)

    class_prob = sigmoid(tt.batched_dot(X_shared, weights))

    # Binary classification -> Bernoulli likelihood
    pm.Bernoulli('out', class_prob, observed=Y_shared)


OK, if you understand the stochastic volatility model, the first two lines should look fairly familiar. We are creating 2 random-walk processes. As allowing the weights to change on every new data point is overkill, we subsample. The repeat turns the vector [t, t+1, t+2] into [t, t, t, t+1, t+1, ...] so that it matches the number of data points.

Next, we would usually just apply a single dot product, but here we have many weights to apply to the input data, so we need to call dot in a loop. That is what tt.batched_dot does. In the end, we just get probabilities (predictions) for our Bernoulli likelihood.
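The shapes in these two steps can be checked with a plain NumPy analogue, where np.repeat and an einsum row-wise dot product stand in for tt.repeat and tt.batched_dot:

```python
import numpy as np

rng = np.random.RandomState(0)
n, interval = 1000, 20
X = rng.randn(n, 2)                 # data, one row per observation
w = rng.randn(interval, 2)          # one weight vector per time chunk

# tt.repeat: [w_0, w_1, ...] -> [w_0, w_0, ..., w_1, w_1, ...]
weights = np.repeat(w, n // interval, axis=0)   # shape (n, 2)

# tt.batched_dot: one dot product per matching row of X and weights
logits = np.einsum('ij,ij->i', X, weights)      # shape (n,)
class_prob = 1.0 / (1.0 + np.exp(-logits))
```

Every observation within a chunk shares the same weight vector, and consecutive chunks get consecutive rows of w.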

On to the inference. In PyMC3 we recently improved NUTS in many different places. One of those is automatic initialization. If you just call pm.sample(n_iter), we will first run ADVI to estimate the diagonal mass matrix and find a starting point. This usually makes NUTS run quite robustly.

In [6]:
with random_walk_perceptron:
    trace_perceptron = pm.sample(2000)

Auto-assigning NUTS sampler...
Average ELBO = -90.867: 100%|██████████| 200000/200000 [01:13<00:00, 2739.70it/s]
Finished [100%]: Average ELBO = -90.869
100%|██████████| 2000/2000 [00:39<00:00, 50.58it/s]


Let's look at the learned weights over time:

In [7]:
plt.plot(trace_perceptron['w'][:, :, 0].T, alpha=.05, color='r');
plt.plot(trace_perceptron['w'][:, :, 1].T, alpha=.05, color='b');
plt.xlabel('time'); plt.ylabel('weights'); plt.title('Optimal weights change over time'); sns.despine();


As you can see, the weights are slowly changing over time. What does the learned hyperplane look like? In the plot below, the points are still the training data but the background color codes the class probability learned by the model.

In [8]:
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
grid_2d = np.tile(grid_2d, (interval, 1))
dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)

X_shared.set_value(grid_2d)
Y_shared.set_value(dummy_out)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace_perceptron, model=random_walk_perceptron, samples=500)

def create_surface(X, Y, grid, ppc, fig=None, ax=None):
    artists = []
    cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
    contour = ax.contourf(*grid, ppc, cmap=cmap)
    artists.extend(contour.collections)
    artists.append(ax.scatter(X[Y==0, 0], X[Y==0, 1], color='b'))
    artists.append(ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r'))
    _ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X1', ylabel='X2')
    return artists

fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip(list(zip(np.roll(chunk, 1), chunk))[1:], list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:]):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].mean(axis=0).reshape(100, 100), fig=fig, ax=ax)
    axs.append(a)

anim2 = animation.ArtistAnimation(fig, axs,
                                  interval=1000)
display_animation(anim2)

100%|██████████| 500/500 [00:23<00:00, 24.47it/s]

Out[8]:
Your browser does not support the video tag.

Nice, we can see that the random-walk logistic regression adapts its weights to perfectly separate the two point clouds.

## Random-Walk Neural Network

In the previous example, we had a very simple linearly classifiable problem. Can we extend this same idea to non-linear problems and build a Bayesian Neural Network with weights adapting over time?

If you haven't, I recommend you read my original post on Bayesian Deep Learning where I more thoroughly explain how a Neural Network can be implemented and fit in PyMC3.

Let's generate some toy data that is not linearly separable and again rotate it around its center.

In [9]:
from sklearn.datasets import make_moons
X, Y = make_moons(noise=0.2, random_state=0, n_samples=5000)
X = scale(X)

colors = Y.astype(str)
colors[Y == 0] = 'r'
colors[Y == 1] = 'b'

interval = 20
subsample = X.shape[0] // interval
chunk = np.arange(0, X.shape[0]+1, subsample)
degs = np.linspace(0, 360, len(chunk))

sep_lines = []

for ii, (i, j, deg) in enumerate(list(zip(np.roll(chunk, 1), chunk, degs))[1:]):
    theta = np.deg2rad(deg)  # rotation angle for this chunk, in radians
    c, s = np.cos(theta), np.sin(theta)
    R = np.matrix([[c, -s], [s, c]])

    X[i:j, :] = X[i:j, :].dot(R)

In [28]:
fig, ax = plt.subplots()
ims = []
for i in np.arange(0, len(X), 10):
    ims.append((ax.scatter(X[:i, 0], X[:i, 1], color=colors[:i]),))

ax.set(xlabel='X1', ylabel='X2')
anim = animation.ArtistAnimation(fig, ims,
                                 interval=500, blit=True)

display_animation(anim)

Out[28]:
Your browser does not support the video tag.

Looks a bit like Yin and Yang; who knew we'd be creating art in the process.

On to the model. Rather than have all the weights in the network follow random-walks, we will just have the first hidden layer change its weights. The idea is that the higher layers learn stable higher-order representations while the first layer is transforming the raw data so that it appears stationary to the higher layers. We can of course also place random-walk priors on all weights, or only on those of higher layers, whatever assumptions you want to build into the model.

In [11]:
np.random.seed(123)

ann_input = theano.shared(X)
ann_output = theano.shared(Y)

n_hidden = [2, 5]

# Initialize random weights between each layer
init_1 = np.random.randn(X.shape[1], n_hidden[0]).astype(theano.config.floatX)
init_2 = np.random.randn(n_hidden[0], n_hidden[1]).astype(theano.config.floatX)
init_out = np.random.randn(n_hidden[1]).astype(theano.config.floatX)

with pm.Model() as neural_network:
    # Weights from input to hidden layer
    step_size = pm.HalfNormal('step_size', sd=np.ones(n_hidden[0]),
                              shape=n_hidden[0])

    weights_in_1 = pm.GaussianRandomWalk('w1', sd=step_size,
                                         shape=(interval, X.shape[1], n_hidden[0]),
                                         testval=np.tile(init_1, (interval, 1, 1)))

    weights_in_1_rep = tt.repeat(weights_in_1,
                                 ann_input.shape[0] // interval, axis=0)

    weights_1_2 = pm.Normal('w2', mu=0, sd=1.,
                            shape=(1, n_hidden[0], n_hidden[1]),
                            testval=init_2)

    weights_1_2_rep = tt.repeat(weights_1_2,
                                ann_input.shape[0], axis=0)

    weights_2_out = pm.Normal('w3', mu=0, sd=1.,
                              shape=(1, n_hidden[1]),
                              testval=init_out)

    weights_2_out_rep = tt.repeat(weights_2_out,
                                  ann_input.shape[0], axis=0)

    # Build neural network using tanh activation function
    act_1 = tt.tanh(tt.batched_dot(ann_input, weights_in_1_rep))
    act_2 = tt.tanh(tt.batched_dot(act_1, weights_1_2_rep))
    act_out = tt.nnet.sigmoid(tt.batched_dot(act_2, weights_2_out_rep))

    # Binary classification -> Bernoulli likelihood
    out = pm.Bernoulli('out', act_out, observed=ann_output)


Hopefully that's not too incomprehensible. It is basically applying the principles from the random-walk logistic regression but adding another hidden layer.

I also want to take the opportunity to look at what the Bayesian approach to Deep Learning offers. Usually, we fit these models using point estimates like the MLE or the MAP. Let's see how well that works on a structurally more complex model like this one:

In [12]:
import scipy.optimize
with neural_network:
    map_est = pm.find_MAP(fmin=scipy.optimize.fmin_l_bfgs_b)

In [13]:
plt.plot(map_est['w1'].reshape(20, 4));


Some of the weights are changing, maybe it worked? How well does it fit the training data:

In [14]:
ppc = pm.sample_ppc([map_est], model=neural_network, samples=1)
print('Accuracy on train data = {:.2f}%'.format((ppc['out'] == Y).mean() * 100))

100%|██████████| 1/1 [00:00<00:00,  6.32it/s]
Accuracy on train data = 76.64%


Now on to estimating the full posterior, as a proper Bayesian would:

In [15]:
with neural_network:
    trace = pm.sample(1000, tune=200)

Auto-assigning NUTS sampler...
Average ELBO = -538.86: 100%|██████████| 200000/200000 [13:06<00:00, 254.43it/s]
Finished [100%]: Average ELBO = -538.69
100%|██████████| 1000/1000 [1:22:05<00:00,  4.97s/it]

In [16]:
plt.plot(trace['w1'][200:, :, 0, 0].T, alpha=.05, color='r');
plt.plot(trace['w1'][200:, :, 0, 1].T, alpha=.05, color='b');
plt.plot(trace['w1'][200:, :, 1, 0].T, alpha=.05, color='g');
plt.plot(trace['w1'][200:, :, 1, 1].T, alpha=.05, color='c');

plt.xlabel('time'); plt.ylabel('weights'); plt.title('Optimal weights change over time'); sns.despine();


In [17]:
ppc = pm.sample_ppc(trace, model=neural_network, samples=100)
print('Accuracy on train data = {:.2f}%'.format(((ppc['out'].mean(axis=0) > .5) == Y).mean() * 100))

100%|██████████| 100/100 [00:00<00:00, 112.04it/s]
Accuracy on train data = 96.72%


I think this is worth highlighting. The point estimate did not do well at all, but by estimating the whole posterior we were able to model the data much more accurately. I'm not quite sure why that is the case. It's possible that we either did not find the true MAP because the optimizer can't deal with the correlations in the posterior as well as NUTS can, or the MAP is just not a good point. See my other blog post on hierarchical models for why the MAP is a terrible choice for some models.

On to the fireworks. What does this actually look like:

In [18]:
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
grid_2d = np.tile(grid_2d, (interval, 1))
dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)

ann_input.set_value(grid_2d)
ann_output.set_value(dummy_out)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)

fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip(list(zip(np.roll(chunk, 1), chunk))[1:], list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:]):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].mean(axis=0).reshape(100, 100), fig=fig, ax=ax)
    axs.append(a)

anim2 = animation.ArtistAnimation(fig, axs,
interval=1000);
display_animation(anim2)

100%|██████████| 500/500 [00:58<00:00,  7.82it/s]

Out[18]:
Your browser does not support the video tag.

Holy shit! I can't believe that actually worked. Just for fun, let's also make use of the fact that we have the full posterior and plot our uncertainty of our prediction (the background now encodes posterior standard-deviation where red means high uncertainty).

In [19]:
fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip(list(zip(np.roll(chunk, 1), chunk))[1:], list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:]):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].std(axis=0).reshape(100, 100),
                       fig=fig, ax=ax)
    axs.append(a)

anim2 = animation.ArtistAnimation(fig, axs,
                                  interval=1000)
display_animation(anim2)

Out[19]:
Your browser does not support the video tag.

## Conclusions

In this blog post I explored the possibility of extending Neural Networks in (to my knowledge) new ways, enabled by expressing them in a Probabilistic Programming framework. Using a classic point estimate did not provide a good fit for the data; only full posterior inference using MCMC allowed us to fit this model adequately. What is quite nice is that we did not have to do anything special for the inference in PyMC3; just calling pymc3.sample() gave stable results on this complex model.

Initially I built the model allowing all parameters to change, but realizing that we can selectively choose which layers to change felt like a profound insight. If you expect the raw data to change, but the higher-level representations to remain stable, as was the case here, we allow the bottom hidden layers to change. If we instead imagine e.g. handwriting recognition, where your handwriting might change over time, we would expect lower level features (lines, curves) to remain stable but allow changes in how we combine them. Finally, if the world remains stable but the labels change, we would place a random-walk process on the output layer. Of course, if you don't know, you can have every layer change its weights over time and give each one a separate step-size parameter which would allow the model to figure out which layers change (high step-size), and which remain stable (low step-size).

In terms of quantitative finance, this type of model allows us to train on much larger data sets ranging back a long time. A lot of that data is still useful for building up stable hidden representations, even if for prediction you still want your model to use its most up-to-date state of the world. No need to define a window length or discard valuable training data.

In [24]:
%load_ext watermark
%watermark -v -m -p numpy,scipy,sklearn,theano,pymc3,matplotlib

The watermark extension is already loaded. To reload it, use: %reload_ext watermark
CPython 3.6.0
IPython 5.1.0

numpy 1.11.3
scipy 0.18.1
sklearn 0.18.1
theano 0.9.0beta1.dev-9f1aaacb6e884ebcff9e249f19848db8aa6cb1b2
pymc3 3.0
matplotlib 2.0.0

compiler   : GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)
system     : Darwin
release    : 16.4.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit


### Matthieu Brucher

#### AudioToolkit: creating a simple plugin with WDL-OL

Audio Toolkit was started several years ago. There are now more than a dozen plugins based on the platform, as well as applications using it, but I never wrote a tutorial explaining how to use it; users had to find out for themselves. This changes today.

# Building Audio Toolkit

Let’s start by building Audio ToolKit. It uses CMake to ease the pain of supporting several platforms, although you can build it without CMake if you generate config.h yourself.

You will require Boost, Eigen and FFTW if you want to test the library and ensure that everything is all right.

## Windows

Windows may be the most complicated platform. This stems from the fact that the runtime is different for each version of the Microsoft compiler (except from 2015 onward), and it is usually not the one your DAW uses (and thus probably not the one your users’ DAW uses).

So the first question is which kind of build you need. For a plugin, I think you clearly want the static runtime; for an app, I would suggest the dynamic runtime. To select it, set MSVC_RUNTIME to Static or Dynamic in the CMake GUI. Choose the matching output as well: static libraries for a plugin, shared libraries for an application.

Note that tests require the shared libraries.

## macOS Sierra/OS X

On OS X, just create the default Xcode project. You may also want to generate ATK with CMAKE_OSX_ARCHITECTURES set to i386 to get a 32-bit version, or to both i386 and x86_64 for a universal binary (I’ll use i386 in this tutorial).

The same rules for static/shared apply here.

## Linux

For Linux, WDL-OL does not provide plugin support, but suffice it to say that the ideas in the next section are still the relevant ones.

# Building a plugin with WDL-OL

I’ll use the same simple code to generate a plugin that does more or less nothing except copy data from its input to its output.

## Common code

Start by using the duplicate.py script to create your own plugin. Use a “1-1” PLUG_CHANNEL_IO value to create a mono plugin (this is in resource.h). More advanced configurations can be seen on the ATK plugin repository.

Now, we need an input and an output filter for our pipeline. Let’s add them to our plugin class:

#include <ATK/Core/InPointerFilter.h>
#include <ATK/Core/OutPointerFilter.h>

and new members:

ATK::InPointerFilter<double> inFilter;
ATK::OutPointerFilter<double> outFilter;

Now, in the constructor’s initialization list, add the following:

inFilter(nullptr, 1, 0, false), outFilter(nullptr, 1, 0, false)

and in the constructor body:

outFilter.set_input_port(0, &inFilter, 0);
Reset();

This is required to setup the pipeline and initialize the internal variables.
In Reset() put the following:

int sampling_rate = GetSampleRate();

if(sampling_rate != outFilter.get_output_sampling_rate())
{
  inFilter.set_input_sampling_rate(sampling_rate);
  inFilter.set_output_sampling_rate(sampling_rate);
  outFilter.set_input_sampling_rate(sampling_rate);
  outFilter.set_output_sampling_rate(sampling_rate);
}

This ensures that all the sampling rates are consistent. While this is not strictly required for a plain copy pipeline, for EQs and modeling filters it is mandatory. ATK also requires the pipeline to be consistent, so you can’t connect filters whose input/output sampling rates don’t match. Some filters can change rates, like the oversampling and undersampling ones, but they are the exception, not the rule.

And now, the only thing that remains is to actually trigger the pipeline:

inFilter.set_pointer(inputs[0], nFrames);
outFilter.set_pointer(outputs[0], nFrames);
outFilter.process(nFrames);

Now, the WDL-OL projects must be adapted. On both platforms, this is quite straightforward: set the include paths and add the libraries for the link stage.

## Windows

For Windows, you need an ATK build matching each of your Debug/Release configurations. In the project properties, add the ATK include folder under Project->Properties->C++->Preprocessor->Additional Include Directories.

## macOS Sierra/OS X

On macOS, after setting the header search paths, add the ATK libraries to the Link Binary With Libraries list for each target you want to build.

# Conclusion

That’s it!

In the end, I hope I have shown that it is easy to build something with Audio ToolKit.

### Enthought

#### Webinar: Using Python and LabVIEW Together to Rapidly Solve Engineering Problems

When: On-Demand (Live webcast took place March 28, 2017)
What: Presentation, demo, and Q&A with Collin Draughon, Software Product Manager, National Instruments, and Andrew Collette, Scientific Software Developer, Enthought

View Now  If you missed the live session, fill out the form to view the recording!

Engineers and scientists all over the world are using Python and LabVIEW to solve hard problems in manufacturing and test automation, by taking advantage of the vast ecosystem of Python software.  But going from an engineer’s proof-of-concept to a stable, production-ready version of Python, smoothly integrated with LabVIEW, has long been elusive.

In this on-demand webinar and demo, we take a LabVIEW data acquisition app and extend it with Python’s machine learning capabilities, to automatically detect and classify equipment vibration.  Using a modern Python platform and the Python Integration Toolkit for LabVIEW, we show how easy and fast it is to install heavy-hitting Python analysis libraries, take advantage of them from live LabVIEW code, and finally deploy the entire solution, Python included, using LabVIEW Application Builder.

In this webinar, you’ll see how easy it is to solve an engineering problem by using LabVIEW and Python together.

## What You’ll Learn:

• How Python’s machine learning libraries can simplify a hard engineering problem
• How to extend an existing LabVIEW VI using Python analysis libraries
• How to quickly bundle Python and LabVIEW code into an installable app

## Who Should Watch:

• Engineers and managers interested in extending LabVIEW with Python’s ecosystem
• People who need to easily share and deploy software within their organization
• Current LabVIEW users who are curious what Python brings to the table
• Current Python users in organizations where LabVIEW is used

### How LabVIEW users can benefit from Python:

• High-level, general purpose programming language ideally suited to the needs of engineers, scientists, and analysts
• Huge, international user base representing industries such as aerospace, automotive, manufacturing, military and defense, research and development, biotechnology, geoscience, electronics, and many more
• Tens of thousands of available packages, ranging from advanced 3D visualization frameworks to nonlinear equation solvers
• Simple, beginner-friendly syntax and fast learning curve


Presenters:

• Collin Draughon, Software Product Manager, National Instruments
• Andrew Collette, Scientific Software Developer, Enthought; core developer of the Python Integration Toolkit for LabVIEW

Quickly and efficiently access scientific and engineering tools for signal processing, machine learning, image and array processing, web and cloud connectivity, and much more. With only minimal coding on the Python side, this extraordinarily simple interface provides access to all of Python’s capabilities.

• What is the Python Integration Toolkit for LabVIEW?

The Python Integration Toolkit for LabVIEW provides a seamless bridge between Python and LabVIEW. With fast two-way communication between environments, your LabVIEW project can benefit from thousands of mature, well-tested software packages in the Python ecosystem.

Run Python and LabVIEW side by side, and exchange data live. Call Python functions directly from LabVIEW, and pass arrays and other numerical data natively. Automatic type conversion virtually eliminates the “boilerplate” code usually needed to communicate with non-LabVIEW components.
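On the Python side, the code LabVIEW ends up calling is ordinary functions operating on numerical data. As a purely hypothetical sketch (the function name, signature, and threshold are made up for illustration, not part of the Toolkit’s API):

```python
# Hypothetical Python-side analysis function a LabVIEW VI might call.
# Name, signature, and threshold are illustrative, not the Toolkit's API.
import numpy as np

def classify_vibration(samples, threshold=1.0):
    """Toy stand-in for an analysis routine: return 1 if the RMS
    amplitude of the signal exceeds the threshold, else 0."""
    rms = float(np.sqrt(np.mean(np.square(samples))))
    return int(rms > threshold)

print(classify_vibration([0.1, -0.1, 0.2]))   # quiet signal -> 0
print(classify_vibration([3.0, -3.0, 3.0]))   # strong signal -> 1
```

Because such a function takes and returns plain numbers and arrays, the Toolkit’s automatic type conversion can pass LabVIEW data to it without wrapper code.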

Develop and test your code quickly with Enthought Canopy, a complete integrated development environment and supported Python distribution included with the Toolkit.

• What is LabVIEW?

LabVIEW is a software platform made by National Instruments, used widely in industries such as semiconductors, telecommunications, aerospace, manufacturing, electronics, and automotive for test and measurement applications. In August 2016, Enthought released the Python Integration Toolkit for LabVIEW, which is a “bridge” between the LabVIEW and Python environments.

• Who is Enthought?

Enthought is a global leader in software, training, and consulting solutions using the Python programming language.

The post Webinar: Using Python and LabVIEW Together to Rapidly Solve Engineering Problems appeared first on Enthought Blog.

### numfocus

#### Technical preview: Native GPU programming with CUDAnative.jl (Julia)

This post originally appeared on the Julialang.org blog, 14 Mar 2017, by Tim Besard. After 2 years of slow but steady development, we would like to announce the first preview release of native GPU programming capabilities for Julia. You can now write your CUDA kernels in Julia, albeit with some restrictions, making it possible to use […]

#### Some fun with π in Julia

This post originally appeared on the Julialang.org blog, 14 Mar 2017, by Simon Byrne, Luis Benet and David Sanders, and is available as a Jupyter notebook. π in Julia (Simon Byrne): Like most technical languages, Julia provides a variable constant for π. However Julia’s handling is a […]