May 23, 2017

Enthought

Enthought Receives 2017 Product of the Year Award From National Instruments LabVIEW Tools Network

Python Integration Toolkit for LabVIEW recognized for extending LabVIEW connectivity and bringing the power of Python to applications in Test, Measurement and the Industrial Internet of Things (IIoT)

AUSTIN, TX – May 23, 2017 Enthought, a global leader in scientific and analytic computing solutions, was honored this week by National Instruments with the LabVIEW Tools Network Platform Connectivity 2017 Product of the Year Award for its Python Integration Toolkit for LabVIEW.

First released at NIWeek 2016, the Python Integration Toolkit enables fast, two-way communication between LabVIEW and Python. With seamless access to the Python ecosystem of tools, LabVIEW users are able to do more with their data than ever before. For example, using the Toolkit, a user can acquire data from test and measurement tools with LabVIEW, perform signal processing or apply machine learning algorithms in Python, display the results in LabVIEW, then share them using a Python-enabled web dashboard.
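
To make that workflow concrete, here is a minimal, hypothetical sketch of the kind of Python-side routine a LabVIEW VI might call through the Toolkit. The function name is illustrative only; the LabVIEW wiring and the Toolkit's own calling API are omitted, and only standard NumPy/SciPy is assumed.

import numpy as np
from scipy import signal

def power_spectrum(samples, sample_rate_hz):
    """Return (frequencies, power) for a 1-D array of acquired samples."""
    freqs, power = signal.welch(np.asarray(samples, dtype=float),
                                fs=sample_rate_hz, nperseg=1024)
    return freqs, power

if __name__ == "__main__":
    # Stand-in for data acquired in LabVIEW: a noisy 50 Hz tone sampled at 10 kHz.
    t = np.arange(0, 1, 1.0 / 10000)
    data = np.sin(2 * np.pi * 50 * t) + 0.1 * np.random.randn(t.size)
    freqs, power = power_spectrum(data, 10000)
    print(freqs[np.argmax(power)])  # peak near 50 Hz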

“Python is ideally suited for scientists and engineers due to its simple, yet powerful syntax and the availability of an extensive array of open source tools contributed by a user community from industry and R&D,” said Dr. Tim Diller, Director, IIoT Solutions Group at Enthought. “The Python Integration Toolkit for LabVIEW unites the best elements of two major tools in the science and engineering world and we are honored to receive this award.”

Key benefits of the Python Integration Toolkit for LabVIEW from Enthought:

  • Enables fast, two-way communication between LabVIEW and Python
  • Provides LabVIEW users seamless access to tens of thousands of mature, well-tested scientific and analytic software packages in the Python ecosystem, including software for machine learning, signal processing, image processing and cloud connectivity
  • Speeds development time by providing access to robust, pre-developed Python tools
  • Provides a comprehensive out-of-the-box solution that allows users to be up and running immediately

“Add-on software from our third-party developers is an integral part of the NI ecosystem, and we’re excited to recognize Enthought for its achievement with the Python Integration Toolkit for LabVIEW,” said Matthew Friedman, senior group manager of the LabVIEW Tools Network at NI.

The Python Integration Toolkit is available for download via the LabVIEW Tools Network, and also includes the Enthought Canopy analysis environment and Python distribution. Enthought’s training, support and consulting resources are also available to help LabVIEW users maximize their value in leveraging Python.

For more information on Enthought’s Python Integration Toolkit for LabVIEW, visit www.enthought.com/python-for-LabVIEW.

Additional Resources

Product Information

Python Integration Toolkit for LabVIEW product page

Download a free trial of the Python Integration Toolkit for LabVIEW

Webinars

Webinar: Using Python and LabVIEW to Rapidly Solve Engineering Problems | Enthought
April 2017

Webinar: Introducing the New Python Integration Toolkit for LabVIEW from Enthought
September 2016

About Enthought

Enthought is a global leader in scientific and analytic software, consulting, and training solutions serving a customer base comprised of some of the most respected names in the oil and gas, manufacturing, financial services, aerospace, military, government, biotechnology, consumer products and technology industries. The company was founded in 2001 and is headquartered in Austin, Texas, with additional offices in Cambridge, United Kingdom and Pune, India. For more information visit www.enthought.com and connect with Enthought on Twitter, LinkedIn, Google+, Facebook and YouTube.

About NI

Since 1976, NI (www.ni.com) has made it possible for engineers and scientists to solve the world’s greatest engineering challenges with powerful platform-based systems that accelerate productivity and drive rapid innovation. Customers from a wide variety of industries – from healthcare to automotive and from consumer electronics to particle physics – use NI’s integrated hardware and software platform to improve the world we live in.

About the LabVIEW Tools Network

The LabVIEW Tools Network is the NI app store equipping engineers and scientists with certified, third-party add-ons and apps to complete their systems. Developed by industry experts, these cutting-edge technologies expand the power of NI software and modular hardware. Each third-party product is reviewed to meet specific guidelines and ensure compatibility. With hundreds of products available, the LabVIEW Tools Network is part of a rich ecosystem extending the NI Platform to help customers positively impact our world. Learn more about the LabVIEW Tools Network at www.ni.com/labview-tools-network.

LabVIEW, National Instruments, NI, ni.com and NIWeek are trademarks of National Instruments. Enthought, Canopy and Python Integration Toolkit for LabVIEW are trademarks of Enthought, Inc.

Media Contact

Courtenay Godshall, VP, Marketing, +1.512.536.1057, cgodshall@enthought.com

The post Enthought Receives 2017 Product of the Year Award From National Instruments LabVIEW Tools Network appeared first on Enthought Blog.

by cgodshall at May 23, 2017 06:18 PM

numfocus

Welcome Nancy Nguyen, the new NumFOCUS Events Coordinator!

NumFOCUS is pleased to announce Nancy Nguyen has been hired as our new Events Coordinator. Nancy has over five years of event management experience in the non-profit and higher education sectors. She graduated from The University of Texas at Austin in 2011 with a BA in History. Prior to joining NumFOCUS, Nancy worked in development and fundraising […]

by Gina Helfrich at May 23, 2017 04:48 PM

Matthieu Brucher

Announcement: ATKBassPreamp 1.0.0

I’m happy to announce the release of a plugin modeling the Fender Bassman preamplifier stage, based on the Audio Toolkit. It is available on Windows and OS X (min. 10.11) in several formats.

ATKBassPreamp

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKBassPreamp.

The files, as well as the previous plugins and the source code, can be downloaded from SourceForge.


by Matt at May 23, 2017 07:47 AM

May 22, 2017

numfocus

NumFOCUS Awards Small Development Grants to Projects

This spring the NumFOCUS Board of Directors awarded targeted small development grants to applicants from or approved by our sponsored and affiliated projects. In the wake of a successful 2016 end-of-year fundraising drive, NumFOCUS wanted to direct the donated funds to our projects in a way that would have impact and visibility to donors and […]

by Gina Helfrich at May 22, 2017 09:52 PM

May 17, 2017

numfocus

What is it like to chair a PyData conference?

Have you ever wondered what it’s like to be in charge of a PyData event? Vincent Warmerdam has collected some thoughts and reflections on his experience chairing this year’s PyData Amsterdam conference: This year I was the chair of PyData Amsterdam and I’d like to share some insights on what that was like. I was on the committee the […]

by Gina Helfrich at May 17, 2017 03:09 PM

May 16, 2017

Matthieu Brucher

Book review: OpenGL Data Visualization Cookbook

This review will actually be quite quick: I haven’t finished the book and I won’t finish it.

The book was published in August 2015 and is based on OpenGL 3. The authors sometimes say that you can use shaders to do better, but the fact is that if you want to execute the code they propose, you need to use the backward compatibility layer, if it's available. OpenGL deprecated that API almost a decade ago; I can't understand why, in 2015, two guys decided that a new book on scientific visualization should build on an API that was deprecated such a long time ago. What a waste of time.

by Matt at May 16, 2017 07:06 AM

May 14, 2017

Titus Brown

How to analyze, integrate, and model large volumes of biological data - some thoughts

This blog post stems from notes I made for a 12 minute talk at the Oregon State Microbiome Initiative, which followed from some previous thinking about data integration on my part -- in particular, Physics ain't biology (and vice versa) and What to do with lots of (sequencing) data.

My talk slides from OSU are here if you're interested.

Thanks to Andy Cameron for his detailed pre-publication peer review - any mistakes remaining are of course mine, not his ;).


Note: During the events below, I was just a graduate student. So my perspective is probably pretty limited. But this is what I saw and remember!

My graduate work was in Eric Davidson's lab, where we studied early development in the sea urchin. Eric had always been very interested in gene expression, and over the preceding decade or two (1980s and onwards) had invested heavily in genomic technologies. This included lots of cDNA macroarrays and BAC libraries, as well as (eventually) the sea urchin genome project.

The sea urchin is a great system for studying early development! You can get literally billions of synchronously developing embryos by fertilizing all the eggs simultaneously; the developing embryo is crystal clear and large enough to be examined using a dissecting scope; sea urchins are available world-wide; early development is mostly invariant with respect to cell lineage (although that comes with a lot of caveats); and sea urchin embryos have been studied since the 1800s, so there was a lot of background literature on the embryology.

The challenge: data integration without guiding theory

What we were faced with in the '90s and '00s was a challenge provided by the scads of new molecular data provided by genomics: that of data integration. We had discovered plenty of genes (all the usual homologs of things known in mice and fruit flies and other animals), we had cell-type specific markers, we could measure individual gene expression fairly easily and accurately with qPCR, we had perturbations working with morpholino oligos, and we had reporter assays working quite well with CAT and GFP. And now, between BAC sequencing and cDNA libraries and (eventually) genome sequencing, we had tons of genomic and transcriptomic data available.

How could we make sense of all of this data? It's hard to convey the confusion and squishiness of a lot of this data to anyone who hasn't done hands-on biology research; I would just say that single experiments or even collections of many experiments rarely provided a definitive answer, and usually just led to new questions. This is not rare in science, of course, but it typically took 2-3 years to figure out what a specific transcription factor might be doing in early development, much less nail down its specific upstream and downstream connections. Scale that to the dozens or 100s of genes involved in early development and, well, it was a lot of people, a lot of confusion, and a lot of discussion.

The GRN wiring diagram

To make a longer story somewhat shorter:

Eric ended up leading an effort (together with Hamid Bolouri, Dave McClay, Andy Cameron, and others in the sea urchin community) to build a gene regulatory network that provided a foundation for data integration and exploration. You can see the result here:

http://sugp.caltech.edu/endomes/

This network at its core is essentially a map of the genomic connections between genes (transcriptional regulation of transcription factors, together with downstream connections mediated by specific binding sites and signaling interactions between cells, as well as whatever other information we had). Eric named this "the view from the genome." On top of this are layered several different "views from the nucleus", which charted the different regulatory states initiated by asymmetries such as the localization of beta-catenin to the vegetal pole of the egg, and the location of sperm entry into the egg.

At least when it started, the network served primarily as a map of the interactions - a somewhat opinionated interpretation of both published and unpublished data. Peter et al., 2012 showed that the network could be used for in silico perturbations, but I don't know how much has been followed up on. During my experiences with it, it mainly served as a communications medium and a point of reference for discussions about future experiments as well as an integrative guide to published work.

What was sort of stunning in hindsight is the extent to which this model became a touchpoint for our lab and (fairly quickly) the community that studied sea urchin early development. Eric presented the network one year at the annual Developmental Biology of the Sea Urchin meeting, and by the next meeting, 18 months later, I remember it showing up in a good portion of talks from other labs. (One of my favorite memories is someone from Dave McClay's lab - I think it was Cyndi Bradham - putting up a view of the GRN inverted to make signaling interactions the core focus, instead of transcriptional regulation; heresy in Eric's lab!)

In essence, the GRN became a community resource fairly quickly. It was provided in both image and interactive form (using BioTapestry), and people felt free to edit and modify the network for their own presentations. It readily enabled in silico thought experiments - "what happens if I knock out this gene? The model predicts this, and this, and this should be downstream, and this other gene should be unaffected" that quickly led to choosing impactful actual experiments. In part because of this, arguments about the effects of specific genes quickly converged to conversation about how to test the arguments (for some definition of "quickly" and "conversation" - sometimes discussions were quite, ahem, robust in Eric's lab and the larger community!)
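
As a toy illustration of that kind of in silico reasoning (and emphatically not the actual endomesoderm GRN or the BioTapestry model; the gene names and wiring below are invented), a knockout "thought experiment" on a tiny Boolean network might look like this:

def simulate(rules, state, knockouts=(), steps=10):
    """Iterate Boolean update rules until the network state stops changing."""
    state = dict(state)
    for gene in knockouts:
        state[gene] = False
    for _ in range(steps):
        new_state = {g: (False if g in knockouts else rule(state))
                     for g, rule in rules.items()}
        if new_state == state:
            break
        state = new_state
    return state

# Hypothetical wiring: a maternal input activates geneA; geneA activates geneB;
# geneB represses geneC, which is otherwise on.
rules = {
    "maternal_input": lambda s: s["maternal_input"],
    "geneA": lambda s: s["maternal_input"],
    "geneB": lambda s: s["geneA"],
    "geneC": lambda s: not s["geneB"],
}
start = {"maternal_input": True, "geneA": False, "geneB": False, "geneC": True}
print(simulate(rules, start))                        # wild type: geneB on, geneC off
print(simulate(rules, start, knockouts=["geneA"]))   # knockout predicts geneB off, geneC derepressed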

The GRN also served to highlight the unknowns and the insufficiencies in the model. Eric and others spent a lot of time thinking through questions such as this: "we know that transcription of gene X is repressed by gene Y; but something must still activate gene X. What could it be?" Eventually we did "crazy" things like measure the transcriptional levels and spatial expression patterns of all ~1000 transcription factors found in the sea urchin genome, which could then be directly integrated into the model for further testing.

In short, the GRN was a pretty amazing way for the community of people interested in early development in the sea urchin to communicate about the details. Universal agreement wasn't the major outcome, although I think significant points about early development were settled in part through the model - communication was the outcome.

And, importantly, it served as a central meeting point for data analysis. More on this below.

Missed opportunities?

One of the major missed opportunities (in my view, obviously - feel free to disagree, the comment section is below :) was that we never turned the GRN into a model that was super easy for experimentalists to play with. It would have required significant software development effort to make it possible to do click-able gene knockdown followed by predicted phenotype readout -- but this hasn't been done yet; apparently it has been tough to find funding for this purpose. Had I stayed in the developmental biology game, I like to think I would have invested significant effort in this kind of approach.

I also don't feel like much time was invested in the community annotation and updating aspect of things. The official model was tightly controlled by a few people (in the traditional scientific "experts know best!" approach) and there was no particular attempt to involve the larger community in annotating or updating the model except through 1-1 conversations or formal publications. It's definitely possible that I just missed it, because I was just a graduate student, and by mid-2004 I had also mentally checked out of grad school (it took me a few more years to physically check out ;).

Taking and holding ground

One question that occupies my mind a lot is the question of how we learn, as a community, from the research and data being produced in each lab. With data, one answer is to work to make the data public, annotate it, curate it, make it discoverable - all things that I'm interested in.

With research more broadly, though, it's more challenging. Papers are relatively poor methods for communicating the results of research, especially now that we have the Internet and interactive Web sites. Surely there are better venues (perhaps ones like Distill, the interactive visual journal for machine learning research). Regardless, the vast profusion of papers on any possible topic, combined with the array of interdisciplinary methods needed, means that knowledge integration is slow and knowledge diffusion isn't much faster.

I fear this means that when it comes to specific systems and questions, we are potentially forgetting many things that we "know" as people retire or move on to other systems or questions. This is maybe to be expected, but when we confront the level of complexity inherent in biology, with little obvious convergence between systems, it seems problematic to repose our knowledge in dead-tree formats.

Mechanistic maps and models for knowledge storage and data integration

So perhaps the solution is maps and models, as I describe above?

In thinking about microbiomes and microbial communities, I'm not sure what form a model would take. At the most concrete and boring level, a directly useful model would be something that took in a bunch of genomic/transcriptomic/proteomic data and evaluated it against everything that we knew, and then sorted it into "expected" and "unexpected". (This is what I discussed a little bit in my talk at OSU.)

The "expected" would be things like the observation of carbon fixation pathways in well-understood autotrophs - "yep, there it is, sort of matches what we already see." The "unexpected" would be things like unannotated or poorly understood genes that were behaving in ways that suggested they were correlated with whatever conditions we were examining. Perhaps we could have multiple bins of unexpected, so that we could separate out things like genes where the genome, transcriptome, and proteome all provided evidence of expression versus situations where we simply saw a transcript with no other kind of data. I don't know.

If I were to indulge in fanciful thinking, I could imagine a sort of Maxwell's Daemon of data integration, sorting data into bins of "boring" and "interesting", churning through data sets looking for a collection of "interesting" that correlated with other data sets produced from the same system. It's likely that such a daemon would have to involve some form of deep correlational analysis and structure identification - deep learning comes to mind. I really don't know.

One interesting question is, how would this interact with experimental biology and experimental biologists? The most immediately useful models might be the ones that worked off of individual genomes, such as flux-balance models; they could be applied to data from new experimental conditions and knockouts, or shifted to apply to strain variants and related species and look for missing genes in known pathways, or new genes that looked potentially interesting.

So I don't know a lot. All I do know is that our current approaches for knowledge integration don't scale to the volume of data we're gathering or (perhaps more importantly) to the scale of the biology we're investigating, and I'm pretty sure computational modeling of some sort has to be brought into the fray in practical ways.

Perhaps one way of thinking about this is to ask what types of computational models would serve as good reference resources, akin to a reference genome. The microbiome world is surprisingly bereft of good reference resources, with the 16s databases and IMG/M serving as two of the big ones; but we clearly need more, along the vein of a community KEGG and other such resources, curated and regularly updated.

Some concluding thoughts

Communication of understanding is key to progress in science; we should work on better ways of doing that. Open science (open data, open source, open access) is one way of better communicating data, computational methods, and results.

One theme that stood out for me from the microbiome workshop at OSU was that of energetics, a point that Stephen Giovannoni made most clearly. To paraphrase, "Microbiome science is limited by the difficulty of assessing the pros and cons of metabolic strategies." The guiding force behind evolution and ecology in the microbial world is energetics, and if we can get a mechanistic handle on energy extraction (autotrophy and heterotrophy) in single genomes and then graduate that to metagenome and community analysis, maybe that will provide a solid stepping stone for progress.

I'm a bit skeptical that the patterns that ecology and evolution can predict will be of immediate use for developing a predictive model. On the other hand, Jesse Zaneveld at the meeting presented on the notion that all happy microbiomes look the same, while all dysfunctional microbiomes are dysfunctional in their own special way; and Jesse pointed towards molecular signatures of dysfunction; so perhaps I'm wrong :).

It may well be that our data is still far too sparse to enable us to build a detailed mechanistic understanding of even simple microbial ecosystems. I wouldn't be surprised by this.

Trent Northen from the JGI concluded in his talk that we need model ecosystems too; absolutely! Perhaps experimental model ecosystems, either natural or fabricated, can serve to identify the computational approaches that will be most useful.

Along this vein, are there a natural set of big questions and core systems for which we could think about models? In the developmental biology world, we have a few big model systems that we focused on (mouse, zebrafish, fruit fly, and worm) - what are the equivalent microbial ecosystems?

All things to think about.

--titus

p.s. There are a ton of references and they can be fairly easily found, but a decent starting point might be Davidson et al., 2002, "A genomic regulatory network for development."

by C. Titus Brown at May 14, 2017 10:00 PM

May 13, 2017

Matthieu Brucher

Analog modeling: Comparing preamps

In a previous post, I explained how I modeled the triode inverter circuit. I’ve decided to put it inside two different plugins, so I’d like to present their differences in four pictures.

Preamps plugins

The first plugin will model the Fender Bassman preamp (inverter circuit, followed by its associated tone stack), and the other will model the inverter section of a Vox AC30 (followed by the tone stack of a JCM800). Compared to the default preamp of Audio ToolKit, the behaviors are quite different, even though only a few component values differ:

[Figures: preamp behavior at 30 Hz, 200 Hz, 1 kHz and 10 kHz]

I will probably add the option of using a different triode model (Audio Toolkit has lots of options, and I definitely need to present the differences in terms of quality and performance), and perhaps also a way of selecting a different tone stack. But for now, each plugin will offer a single model of a triode inverter followed by a tone stack. To model a full amp, you still need to model the final stage and the loudspeaker.

The next picture displays the preamp behavior depending on the triode function used:

Response with different triode functions

The Leach model and the original Koren model don’t behave as well as the other models. It’s probably due to different parameters, but they give a good idea of the behavior of the tube. The modified Munro-Piazza model is a personal modification of the tube function that makes the derivative continuous as well. It helps convergence when the state of the tube changes fast, even though it is clearly not enough to remove all discontinuities.
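
For orientation, the plate-current equation behind the original Koren model mentioned above fits in a few lines. This is a sketch of the published formula with Koren's commonly quoted 12AX7 parameters, not the Audio Toolkit implementation or its parameter set:

import numpy as np

# Koren 12AX7 parameters (commonly quoted fit values, for illustration only).
MU, EX, KG1, KP, KVB = 100.0, 1.4, 1060.0, 600.0, 300.0

def koren_plate_current(v_gk, v_pk):
    """Plate current (A) from grid-cathode and plate-cathode voltages."""
    e1 = (v_pk / KP) * np.log1p(np.exp(KP * (1.0 / MU + v_gk / np.sqrt(KVB + v_pk ** 2))))
    return 2.0 * np.power(np.maximum(e1, 0.0), EX) / KG1

# Grid at -2 V, plate swept from 50 V to 300 V: current rises to a few mA.
print(koren_plate_current(-2.0, np.linspace(50.0, 300.0, 6)))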

The following picture shows the cost of the different models, profiled with valgrind:

Triode preamp profiles

Obviously, for the more complex functions, most of the time is spent computing the logarithm of a double-precision number. This is because the fast math functions in ATK don’t support this function yet. If we use single-precision floats, the cost is divided by 3 for the Koren model (for instance).

As the results are quite close, using floats or doubles is a matter of optimization and precision. In the plugins, I will use floats to maximize performance.

Coming next

In the next two weeks, the two plugins will be released. And depending on feedback and comments, I’ll add more options.


by Matt at May 13, 2017 07:19 AM

May 12, 2017

Trichech

Things to Consider When Selling Nasi Box (Boxed Rice Meals)

Selling nasi box (boxed rice meals) in Jakarta has become one of the most popular businesses lately, because the number and variety of events being held keeps growing. As a result, this business can become an abundant and profitable source of income. The reason many people order boxed rice is that it is simple and hassle-free, something you do not get when cooking food in large quantities yourself. Preparing dishes yourself to serve a large number of guests is tiring and far from cheap.

Today, a boxed rice business can be a way to make a living. The income from this business can cover all of a family's needs, from education and healthcare costs to other primary and secondary needs. Those needs can be met as long as you know and understand the things that must be considered in running this culinary business.

In running a boxed rice business it is very important to consider the sales location. The address is a key point because it is the first thing customers will look for when they want to order. A phone number that can be reached also matters a great deal. Besides these two things, you also need to ensure the availability of raw ingredients for cooking, because if supplies run short the business will not run at its best. In the worst case, customers will leave and switch to another boxed rice seller.

by admin at May 12, 2017 02:05 AM


May 11, 2017

Leonardo Uieda

Reviews of our Scipy 2017 talk proposal: Bringing the Generic Mapping Tools to Python


This year, Scipy is using a double-open peer-review system, meaning that both authors and reviewers know each other's identities. These are the reviews that we got for our proposal and our replies/comments (posted with permission from the reviewers). My sincerest thanks to all reviewers and editors for their time and effort.

The open review model is great because it increases the transparency of the process and might even result in better reviews. I started signing my reviews a few years ago and I found that I'm more careful with the tone of my review to make sure I don't offend anyone and provide constructive feedback.

Now, on to the reviews!

Review 1 - Paul Celicourt

The paper introduces a Python wrapper for the C-based Generic Mapping Tools used to process and analyze time series and gridded data. The content is well organized, but I encourage the authors to consider the following comments: While the authors promise to demonstrate an initial prototype of the wrapper, it is not sure that a WORKING prototype will be available by the time of the conference as claimed by the authors when looking at the potential functionalities to be implemented and presented in the second paragraph of the extended abstract. Furthermore, it is not clear what would be the functionalities of the initial prototype. On top of that, the approach to the implementation is not fully presented. For instance, the Simplified Wrapper and Interface Generator (SWIG) tool may be used to reduce the workload but the authors do not mention whether the wrapper would be manually developed or using an automated tool such as the SWIG. Finally, the portability of the shared memory process has not been addressed.

Thanks for all your comments, Paul! They are good questions and we should have addressed them better in the abstract.

That is a valid concern regarding the working prototype. We're not sure how much of the prototype will be ready for the conference. We are sure that we'll have something to show, even if it's not complete. The focus of the talk will be on our design decisions, implementation details, and the changes in the GMT modern execution mode on which the Python wrappers are based. We'll run some examples of whatever we have working mostly for the "Oooh"s and "Aaah"s.

The wrapper will be manually generated using ctypes. We chose this over SWIG or Cython because ctypes allows us to write pure Python code. It's a much simpler way of wrapping a C library. Not having any compiled extension modules also greatly facilitates distributing the package across operating systems. The same wrapper code can work on Windows, OSX, and Linux (as long as the GMT shared library is available).

The amount of C functions that we'll have to wrap is not that large. Mainly, we need GMT_Call_Module to run a command (like psxy), GMT_Create_Session for generating the session structure, and GMT_Open_VirtualFile and GMT_Read_VirtualFile for passing data to and from Python. The majority of the work will be in creating the Python functions for each GMT command, documenting them, and parsing the Python function arguments into something that GMT_Call_Module accepts. This work would have to be done manually using SWIG or Cython as well, so ctypes is not a disadvantage with regard to this. There are some more details about this in our initial design and goals.
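
For the curious, a rough ctypes sketch of that approach might look something like the following. This is simplified: the library name, session flags, and the module-mode constant need to be checked against the GMT 5/6 headers, and the real wrapper adds error handling and the virtual-file machinery described above.

import ctypes

libgmt = ctypes.CDLL("libgmt.so")  # "libgmt.dylib" on OSX, "gmt.dll" on Windows

libgmt.GMT_Create_Session.restype = ctypes.c_void_p
libgmt.GMT_Create_Session.argtypes = [ctypes.c_char_p, ctypes.c_uint,
                                      ctypes.c_uint, ctypes.c_void_p]
libgmt.GMT_Call_Module.restype = ctypes.c_int
# The last argument is really void* (it can carry option structs); for the
# plain "command-line string" mode a char* is enough for this sketch.
libgmt.GMT_Call_Module.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                   ctypes.c_int, ctypes.c_char_p]
libgmt.GMT_Destroy_Session.restype = ctypes.c_int
libgmt.GMT_Destroy_Session.argtypes = [ctypes.c_void_p]

session = libgmt.GMT_Create_Session(b"gmt-python", 2, 0, None)
# Run a module exactly as if its arguments came from the command line.
status = libgmt.GMT_Call_Module(session, b"gmtdefaults", 0, b"-D")
assert status == 0
libgmt.GMT_Destroy_Session(session)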

Review 2 - Ricardo Barros Lourenço

The authors submitted a clear abstract, in the sense that they will present a new Python library, which is a binding to the Generic Mapping Tools (GMT) C library, which is widely adopted by the Geosciences community. They were careful in detailing their reasoning in such implementation, and also in analogue initiatives by other groups.

In terms of completeness, the abstract precisely describes that the design plans and some of the implementation would be detailed and explained, as well on a demo of their current version of the library. It was very interesting that the authors, while describing their implementation, also pointed that the library could be used in other applications not necessarily related to geoscientific applications, by the generation of general line plots, bar graphs, histograms, and 3D surfaces. It would be beneficial to the audience to see how this aspect is sustained, by comparing such capabilities with other libraries (such as Matplotlib and Seaborn) and evaluating their contribution to the geoscientific domain, and also on the expanded related areas.

The abstract is highly compelling to the Earth Sciences community members at the event because the GMT module is already used for high-quality visualization (both in electronic, but also in printed outputs - maps - which is an important contribution to) , but with a Python integration it could simplify the integration of "Pythonic" workflows into it, expanding the possibilities in geoscientific visualization, especially in printed maps.

It would be interesting, aside from a presumed comparison in online visualization with matplotlib and cartopy, if the authors would also discuss in their presentation other possible contributions, such as online tile generation in map servers, which is very expensive in terms of computational resources and is still is challenging in an exclusive "Pythonic" environment. Additionally, it would be interesting if the authors provide some clarification if there is any limitation on the usage of such library, more specifically to the high variance in geoscientific data sources, and also in how netCDF containers are consumed in their workflow (considering that these containers don't necessarily conform to a strict standard, allowing users to customize their usage) in terms of the automation of this I/O.

The topic of high relevance because there is still few options for spatial data visualization in a "fully pythonic" environment, and none of them is used in the process of plotting physical maps, in a production setting, such as GMT is. Considering these aspects, I recommend such proposal for acceptance.

Thank you, Ricardo, for your incentives and suggestions for the presentation!

I hadn't thought about the potential use in map tiling but we'll keep an eye on that from now on and see if we have anything to say about it. Thanks!

Regarding netCDF, the idea is to leverage the xarray library for I/O and use their Dataset objects as input and output arguments for the grid related GMT commands. There is also the option of giving the Python functions the file name of a grid and have GMT handle I/O, as it already does in the command line. The appeal of using xarray is that it integrates well with numpy and pandas and can be used instead of gmt grdmath (no need to tie your head in knots over RPN anymore!).
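
As a small, hypothetical example of that xarray-based workflow (the file and variable names below are made up; any netCDF grid would do):

import xarray as xr

grid = xr.open_dataset("relief.nc")            # read a netCDF grid
anomaly = grid["z"] - grid["z"].mean()         # grdmath-style arithmetic, in plain Python
anomaly.to_netcdf("relief_anomaly.nc")         # hand the result back to GMT as a file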

Review 3 - Ryan May

Python bindings for GMT, as demonstrated by the authors, are very much in demand within the geoscience community. The work lays out a clear path towards implementation, so it's an important opportunity for the community to be able offer API and interaction feedback. I feel like this talk would be very well received and kick off an important dialogue within the geoscience Python community.

Thanks, Ryan! Getting community feedback was the motivation for submitting a talk without having anything ready to show yet. It'll be much easier to see what the community wants and thinks before we have fully committed to an implementation. We're very much open and looking forward to getting a ton of questions!


What would you like to see in a GMT Python library? Let us know if there are any questions/suggestions before the conference. See you at Scipy in July!


Thumbnail image for this post is modified from "ScientificReview" by the Center for Scientific Review which is in the public domain.


Comments? Leave one below or let me know on Twitter @leouieda or in the Software Underground Slack group.

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!


May 11, 2017 12:00 PM

May 09, 2017

Matthieu Brucher

Book review: Getting Started With JUCE

After the announce of JUCE 5 release, I played a little bit with it, and then decided to read the only book on JUCE. It’s outdated and tackles JUCE 2.1.2. But who knows, it may be a gem?

Content and opinions

The book starts with a chapter on JUCE and its installation. If the main application changed its name from Introjucer to Projucer (probably because of the change in licence), the rest seems to be quite similar. This app still creates a Visual Studio solution, a Xcode project or makefiles.

The second chapter was the one I was interested in the most, because I plan on creating new component and GUIs for my plugins. The chapter still feels a little bit short, we could have worked more with overwriting custom look and feel, handling more events… It seems that not much changed in this area since JUCE 2.1, so this is quite an achievement for ROLI and the team. If in the future my component can last several major releases, I know I can invest time in learning JUCE.

The next chapter feels useless with C++11/14 or even Boost. It seems JUCE still uses a custom String class, which is too bad, not sure it really brings anything to the library, and there are other APIs that are now deprecated in my opinion (data types like int32, file system handling…)

The fourth chapter deals with streaming data and building a small app that can play sound. It is a nice feature, but I have to say I read it even faster than the previous chapter because it was of no interest to me.

Finally, the last chapter ends the book with a sudden note on some utilities (I can’t even remember them without reading the chapter again), no final conclusion, but a feeling that the book was more a list of tutorials than a real book.

Conclusion

I wouldn’t recommend on buying the book, as the tutorials cover the only bit that is still relevant. But I can hope for an updated version, one day.

by Matt at May 09, 2017 07:49 AM

May 08, 2017

Continuum Analytics news

Anaconda Joins Forces with Leading Companies to Further Innovate Open Data Science

Monday, May 8, 2017
Travis Oliphant
President, Chief Data Scientist & Co-Founder

In addition to announcing the formation of the GPU Open Analytics Initiative with H2O and MapD, today we are pleased to announce an exciting collaboration with NVIDIA, H2O and MapD, with the goal of democratizing machine learning and increasing the performance of data science workloads. Using NVIDIA’s Graphics Processing Unit (GPU) technology, Anaconda is mobilizing the Open Data Science movement by helping teams avoid the data transfer process between Central Processing Units (CPUs) and GPUs and move toward their larger business goals.

The new GPU Data Frame (GDF) will augment the Anaconda platform as the foundational fabric to bring data science technologies together allowing it to take full advantage of GPU performance gains. In most workflows using GPUs, data is first manipulated with the CPU and then loaded to the GPU for analytics. This creates a data transfer “tax” on the overall workflow.   With the new GDF initiative, data scientists will be able to move data easily onto the GPU and do all their manipulation and analytics at the same time without the extra transfer of data. With this collaboration, we are opening the door to an era where innovative AI applications can be deployed into production at an unprecedented pace and often with just a single click.
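
The GPU Data Frame API itself is still being defined by the initiative, so as a stand-in, here is a minimal numba.cuda sketch of the underlying idea: move data to the GPU once, run successive processing steps against the device-resident array, and only copy back at the end. It assumes a CUDA-capable GPU and the cudatoolkit package available through Anaconda; it is an illustration of avoiding intermediate transfers, not the GDF API.

import numpy as np
from numba import cuda

@cuda.jit
def scale(x, factor):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= factor

@cuda.jit
def shift(x, offset):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += offset

data = np.random.rand(1000000)
d_data = cuda.to_device(data)              # one host-to-device transfer
blocks = (d_data.size + 255) // 256
scale[blocks, 256](d_data, 2.0)            # both stages work on the same
shift[blocks, 256](d_data, -1.0)           # device-resident array: no CPU round trip
result = d_data.copy_to_host()             # one device-to-host transfer
print(result[:5])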

In a nutshell, this collaboration provides these key benefits:

  • Python Democratization. GPU Data Frame makes it easy to create new optimized data science models and iterate on ideas using the most innovative GPU and AI technologies.

  • Python Acceleration. The standard empowers data scientists with unparalleled acceleration within Python on GPUs for data science workloads, enabling Open Data Science to proliferate across the enterprise.

  • Python Production. Data science teams can move beyond ad-hoc analysis to unearthing game-changing results within production-deployed data science applications that drive measurable business impact.

Anaconda aims to bring the performance, insights and intelligence enterprises need to compete in today’s data-driven economy. We’re excited to be working with NVIDIA, MapD and H2O as the GPU Data Frame pushes the door to Open Data Science wide open by further empowering the data scientist community with unparalleled innovation, enabling Open Data Science to proliferate across the enterprise.

by swebster at May 08, 2017 08:51 PM

Anaconda Easy Button - Microsoft SQL Server and Python

Tuesday, May 9, 2017
Ian Stokes-Rees
Continuum Analytics

Previously there were many twisty roads that you may have followed if you wanted to use Python on a client system to connect to a Microsoft SQL Server database, and not all of those roads would even get you to your destination. With the news that Microsoft SQL Server 2017 has increased support for Python, by including a subset of Anaconda packages on the server-side, I thought it would be useful to demonstrate how Anaconda delivers the easy button to get Python on the client side connected to Microsoft SQL Server.

This blog post demonstrates how Anaconda and Anaconda Enterprise can be used on the client side to connect Python running on Windows, Mac, or Linux to a SQL Server instance. The instructions should work for many versions of SQL Server, Python and Anaconda, including Anaconda Enterprise, our commercially oriented version of Anaconda that adds strong collaboration, security, and server deployment capabilities. If you run into any trouble, let us know either through the Anaconda Community Support mailing list or on Twitter @ContinuumIO.

TL;DR: For the Impatient

If you're the kind of person who just wants the punch line and not the story, there are three core steps to connect to an existing SQL Server database:

  1. Install the SQL Server drivers for your platform on the client system. That is described in the Client Driver Installation section below.

  2. conda install pyodbc

  3. Establish a connection to the SQL Server database with an appropriately constructed connection statement:

     conn = pyodbc.connect(
        r'DRIVER={ODBC Driver 13 for SQL Server};' +
        ('SERVER={server},{port};'   +
         'DATABASE={database};'      +
         'UID={username};'           +
         'PWD={password}').format(
                server= 'sqlserver.testnet.corp',
                  port= 1433,
              database= 'AdventureWorksDW2012',
              username= 'tanya',
              password= 'Tanya1234')
    )
    

For cut-and-paste convenience, here's the string:

 (r'DRIVER={ODBC Driver 13 for SQL Server};' +
 ('SERVER={server},{port};'   +
  'DATABASE={database};'      +
  'UID={username};'           +
  'PWD={password}').format(
                server= 'sqlserver.testnet.corp',
                  port= 1433,
              database= 'AdventureWorksDW2012',
              username= 'tanya',
              password= 'Tanya1234')
)
'DRIVER={ODBC Driver 13 for SQL Server};SERVER=sqlserver.testnet.corp,1433;DATABASE=AdventureWorksDW2012;UID=tanya;PWD=Tanya1234'

Hopefully that doesn't look too intimidating!

Here's the scoop: the Python piece is easy (yay Python!), whereas the challenge is installing the platform-specific drivers (Step 1). If you don't already have a database properly set up, then SQL Server installation, configuration, database loading, and setting up appropriate security credentials are the parts that the rest of this blog post goes into in more detail, along with a fully worked out example of a client-side connection and query.

And you can grab a copy of this blog post as a Jupyter Notebook from Anaconda Cloud.

On With The Story

While this isn't meant to be an exhaustive reference for SQL Server connectivity from Python and Anaconda it does cover several client/server configurations. In all cases I was running SQL Server 2016 on a Windows 10 system. My Linux system was CentOS 6.9 based. My Mac was running macOS 10.12.4, and my client-side Windows system also used Windows 10. The Windows and Mac Python examples were using Anaconda 4.3 with Python 3.6 and pyodbc version 3.0, while the Linux example used Anaconda Enterprise, based on Anaconda 4.2, using Python 2.7.

NOTE: In the examples below the $ symbol indicates the command line prompt. Do not include this in any commands if you cut-and-paste. Your prompt will probably look different!

Server Side Preparation

If you are an experienced SQL Server administrator then you can skip this section. All you need to know are:

  1. The hostname or IP address and port number for your SQL Server instance
  2. The database you want to connect to
  3. The user credentials that will be used to make the connection

The following provides details on how to setup your SQL Server instance to be able to exactly replicate the client-side Python-based connection that follows. If you do not have Microsoft SQL Server it can be downloaded and installed for free and is now available for Windows and Linux. NOTE: The recently released SQL Server 2017 and SQL Server on Azure both require pyodbc version >= 3.2. This blog post has been developed using SQL Server 2016 with pyodbc version 3.0.1.

This demonstration is going to use the Adventure Works sample database provided by Microsoft on CodePlex. There are instructions on how to install this into your SQL Server instance in Step 3 of this blog post; however, you can also simply connect to an existing database by adjusting the connection commands below accordingly.

Many of the preparation steps described below are most easily handled using the SQL Server Management Studio which can be downloaded and installed for free.

Additionally this example makes use of the Mixed Authentication Mode which allows SQL Server-based usernames and passwords for database authentication. By default this is not enabled, and only Windows Authentication is permitted, which makes use of the pre-existing Kerberos user authentication token wallet. It should go without saying that you would only change to Mixed Authentication Mode for testing purposes if SQL Server is not already so configured. While the example below focuses on SQL Server Authentication there are also alternatives presented for the Windows Authentication Mode that uses Kerberos tokens.

You should know the hostname and port on which SQL Server is running and be sure you can connect to that hostname (or IP address) and port from the client-side system. The easiest way to test this is with telnet from the command line, executing the command:

$ telnet sqlserver.testnet.corp 1433

Where you would replace sqlserver.testnet.corp with the hostname or IP address of your SQL Server instance and 1433 with the port SQL Server is running on. Port 1433 is the SQL Server default. Executing this command on the client system should return output like:

Trying sqlserver.testnet.corp...
Connected to sqlserver.testnet.corp.
Escape character is '^]'.

telnet> close

At which point you can then type CTRL-] and then close.

If your client system is also Windows you can perform this simple Universal Data Link (UDL) test.

Finally you will need to confirm that you have access credentials for a user known by SQL Server and that the user is permitted to perform SELECT operations (and perhaps others) on the database in question. In this particular example we are making use of a fictional user named Tanya who has the username tanya and a password of Tanya1234. It is a 3-step process to get Tanya access to the Adventure Works database:

  1. The SQL Server user tanya is added as a Login to the DBMS:

    • which can be found in SQL Server Management Studio
    • under Security->Logins
    • right-click on Logins
    • add a New Login...
    • provide a Login name of tanya
    • select SQL Server authentication
    • provide a password of Tanya1234
    • uncheck the option for Enforce password policy (the other two will automatically be unchecked and greyed out)
  2. The database user needs to be added:

    • under Databases->AdventureWorksDW2012->Security->Users
    • right-click on Users
    • add a New User...
    • select SQL user with Login for type
    • add a Username set to tanya
    • add a Login name set to tanya.
  3. Grant tanya permissions on the AdventureWorksDW2012 database by executing the following query:

    use AdventureWorksDW2012; 
    GRANT SELECT, INSERT, DELETE, UPDATE ON SCHEMA::DBO TO tanya;
    

Using the UDL test method described above is a good way to confirm that tanya can connect to the database, even just from localhost, though it does not confirm whether she can perform operations such as SELECT. For that I would recommend installing the free Microsoft Command Line Utilities 13.1 for SQL Server.
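
For example, once those utilities are installed, a quick check that tanya can actually run a SELECT might look like this (hostname, port and query adjusted to your own setup):

$ sqlcmd -S sqlserver.testnet.corp,1433 -U tanya -P Tanya1234 -d AdventureWorksDW2012 -Q "SELECT TOP 5 EnglishProductName FROM dbo.DimProduct;"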

Client Driver Installation

You'll now need to get the drivers installed on your client system. There are two parts to this: the platform-specific dynamic libraries for SQL Server, and the Python ODBC interface. You won't be surprised to hear that the platform-specific libraries are harder to get set up, but this should still only take 10-15 minutes and there are established processes for all major operating systems.

Linux

The Linux drivers are available for RHEL, Ubuntu, and SUSE. This is the process that you'd have to follow if you are using Anaconda Enterprise, as well as for anyone using a Linux variant with Anaconda installed.

My Linux test system was using CentOS 6.9, so I followed the RHEL6 installation procedure from Microsoft (linked above), which essentially consisted of 3 steps:

  1. Adding the Microsoft RPM repository to the yum configuration
  2. Pre-emptively removing some packages that may cause conflicts (I didn't have these installed)
  3. Using yum to install the msodbcsql RPM for version 13.1 of the drivers

In my case I had to play around with the yum command, and in the end just ran:

$ ACCEPT_EULA=Y yum install msodbcsql

Windows

The Windows drivers are dead easy to install. Download the msodbcsql MSI file, double click to install, and you're in business.

Mac

The SQL Server drivers for Mac need to be installed via Homebrew, a popular OS X package manager that is not affiliated with nor supported by Apple. Microsoft has created their own tap, which is a Homebrew package repository. If you don't already have Homebrew installed you'll need to do that first; Microsoft has provided some simple instructions describing how to add the SQL Server tap and then install the mssql-tools package. The steps are simple enough that I'll repeat them here, though check out the link above if you need more details or background.

$ brew tap microsoft/mssql-preview https://github.com/Microsoft/homebrew-mssql-preview
$ brew update
$ brew install mssql-tools

One thing I'll note is that the Homebrew installation output suggested I should execute a command to remove one of the SQL Server drivers. Don't do this! That driver is required. If you've already done it then the way to correct the process is to reset the configuration file by removing and re-adding the package:

$ brew remove  mssql-tools
$ brew install mssql-tools

Install Anaconda

Download and install Anaconda if you don't already have it on your system. There are graphical and command line installers available for Windows, Mac, and Linux. It is about 400 MB to download and a bit over 1 GB installed. If you're looking for a minimal system you can install Miniconda instead (command line only installer) and then a la carte pick the packages you want with conda install commands.

Anaconda Enterprise users or administrators can simply execute the commands below in the conda environment where they want pyodbc to be available.

Python ODBC package

This part is easy. You can just do:

$ conda install pyodbc

And if you're not using Anaconda or prefer pip, then you can also do:

$ pip install pyodbc

NOTE: If you are using the recently released SQL Server 2017 you will need pyodbc >= 3.2. There should be a conda package available for that "shortly" but be sure to check which version you get if you use the conda command above.
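
As an optional sanity check (assuming the install above succeeded), you can confirm the pyodbc version and, on recent pyodbc releases, list the ODBC drivers it can see:

import pyodbc

print(pyodbc.version)      # needs to be >= 3.2 for SQL Server 2017 / Azure SQL
print(pyodbc.drivers())    # pyodbc 4.x only; should include 'ODBC Driver 13 for SQL Server'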

Connecting to SQL Server using pyodbc

Now that you've got your server-side and client-side systems set up with the correct software, databases, users, libraries, and drivers, it is time to connect. If everything works properly these steps are very simple and work for all platforms. Everything that is platform-specific has been handled elsewhere in the process.

We start by importing the common pyodbc package. This is Microsoft's recommended Python interface to SQL Server. There was an alternative Python interface, pymssql, that at one point was more reliable for SQL Server connections and, until quite recently, was the only way to get Python on a Mac to connect to SQL Server. However, with Microsoft's renewed support for Python and Microsoft's own Mac Homebrew packages, pyodbc is now the leader for all platforms.

import pyodbc

Use a Python dict to define the configuration parameters for the connection

config = dict(server=   'sqlserver.testnet.corp', # change this to your SQL Server hostname or IP address
              port=      1433,                    # change this to your SQL Server port number [1433 is the default]
              database= 'AdventureWorksDW2012',
              username= 'tanya',
              password= 'Tanya1234')

Create a template connection string that can be re-used.

conn_str = ('SERVER={server},{port};'   +
            'DATABASE={database};'      +
            'UID={username};'           +
            'PWD={password}')

If you are using the Windows Authentication mode where existing authorization tokens are picked up automatically this connection string would be changed to remove UID and PWD entries and replace them with TRUSTED_CONNECTION, as below:

trusted_conn_str = ('SERVER={server};'     +
                    'DATABASE={database};' +
                    'TRUSTED_CONNECTION=yes')

Check your configuration looks right:

config
{'database': 'AdventureWorksDW2012',
 'password': 'Tanya1234',
 'port': 1433,
 'server': 'sqlserver.testnet.corp',
 'username': 'tanya'}

Now open a connection by specifying the driver and filling in the connection string with the connection parameters.

The following connection operation can take tens of seconds to complete.

conn = pyodbc.connect(
    r'DRIVER={ODBC Driver 13 for SQL Server};' +
    conn_str.format(**config)
    )

Executing Queries

Request a cursor from the connection that can be used for queries.

cursor = conn.cursor()

Perform your query.

cursor.execute('SELECT TOP 10 EnglishProductName FROM dbo.DimProduct;')
<pyodbc.Cursor at 0x7f7ca4a82d50>

Loop through to look at the results (an iterable of 1-tuples containing unicode strings of the results).

for entry in cursor:
    print(entry)
(u'Adjustable Race', )
(u'Bearing Ball', )
(u'BB Ball Bearing', )
(u'Headset Ball Bearings', )
(u'Blade', )
(u'LL Crankarm', )
(u'ML Crankarm', )
(u'HL Crankarm', )
(u'Chainring Bolts', )
(u'Chainring Nut', )

Data Science Happens Here

Now that we've demonstrated how to connect to a SQL Server instance from Windows, Mac and Linux using Anaconda or Anaconda Enterprise, it is possible to use T-SQL queries to interact with that database as you normally would.
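
One natural next step, not covered above, is to pull query results straight into a pandas DataFrame over the same pyodbc connection (pandas officially targets SQLAlchemy connections, so recent versions may emit a warning, but a raw DBAPI connection like this works):

import pandas as pd

df = pd.read_sql("SELECT TOP 100 EnglishProductName FROM dbo.DimProduct;", conn)
print(df.head())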

Looking to the future, the latest preview release of SQL Server 2017 includes a server-side Python interface built around Anaconda. There are lots of great resources on Python and SQL Server connectivity from the team at Microsoft, and here are a few that you may find particularly interesting:

Next Steps

My bet is that if you're reading a blog post on SQL Server and Python (and you can download a Notebook version of it here) then you're using it in a commercial context. Anaconda Enterprise is going to be the best way for you and your organization to make a strategic investment in Open Data Science.

See how Anaconda Enterprise is transforming data science through our webinar series or grab one of our white papers on Enterprise Open Data Science.

Let us help you be successful in your strategic adoption of Python and Anaconda for high-performance enterprise-oriented open data science connected to your existing data sources and systems, such as SQL Server.

by swebster at May 08, 2017 04:46 PM

Data Science And Deep Learning Application Leaders Form GPU Open Analytics Initiative

Monday, May 8, 2017

Continuum Analytics, H2O.ai and MapD Technologies Create Open Common Data Frameworks for GPU In-Memory Analytics

SAN JOSE, CA—May 8, 2017—Continuum Analytics, H2O.ai, and MapD Technologies have announced the formation of the GPU Open Analytics Initiative (GOAI) to create common data frameworks enabling developers and statistical researchers to accelerate data science on GPUs. GOAI will foster the development of a data science ecosystem on GPUs by allowing resident applications to interchange data seamlessly and efficiently. BlazingDB, Graphistry and Gunrock from UC Davis led by CUDA Fellow John Owens have joined the founding members to contribute their technical expertise.

The formation of the Initiative comes at a time when analytics and machine learning workloads are increasingly being migrated to GPUs. However, while individually powerful, these workloads have not been able to benefit from the power of end-to-end GPU computing. A common standard will enable intercommunication between the different data applications and speed up the entire workflow, removing latency and decreasing the complexity of data flows between core analytical applications. 

At the GPU Technology Conference (GTC), NVIDIA’s annual GPU developers’ conference, the Initiative announced its first project: an open source GPU Data Frame with a corresponding Python API. The GPU Data Frame is a common API that enables efficient interchange of data between processes running on the GPU. End-to-end computation on the GPU avoids transfers back to the CPU or copying of in-memory data reducing compute time and cost for high-performance analytics common in artificial intelligence workloads.

Users of the MapD Core database can output the results of a SQL query into the GPU Data Frame, which then can be manipulated by the Continuum Analytics’ Anaconda NumPy-like Python API or used as input into the H2O suite of machine learning algorithms without additional data manipulation. In early internal tests, this approach exhibited order-of-magnitude improvements in processing times compared to passing the data between applications on a CPU. 

“The data science and analytics communities are rapidly adopting GPU computing for machine learning and deep learning. However, CPU-based systems still handle tasks like subsetting and preprocessing training data, which creates a significant bottleneck,” said Todd Mostak, CEO and co-founder of MapD Technologies. “The GPU Data Frame makes it easy to run everything from ingestion to preprocessing to training and visualization directly on the GPU. This efficient data interchange will improve performance, encouraging development of ever more sophisticated GPU-based applications.” 

“GPU Data Frame relies on the Anaconda platform as the foundational fabric that brings data science technologies together to take full advantage of GPU performance gains,” said Travis Oliphant, co-founder and chief data scientist of Continuum Analytics. “Using NVIDIA’s technology, Anaconda is mobilizing the Open Data Science movement by helping teams avoid the data transfer process between CPUs and GPUs and move nimbly toward their larger business goals. The key to producing this kind of innovation are great partners like H2O and MapD.”

“Truly diverse open source ecosystems are essential for adoption - we are excited to start GOAI for GPUs alongside leaders in data and analytics pipeline to help standardize data formats,” said Sri Ambati, CEO and co-founder of H2O.ai. “GOAI is a call for the community of data developers and researchers to join the movement to speed up analytics and GPU adoption in the enterprise.”

The GPU Open Analytics Initiative is actively welcoming participants who are committed to open source and to GPUs as a computing platform. 

Details of the GPU Data Frame can be found at the Initiative’s Github link - 
https://github.com/gpuopenanalytics

In conjunction with this announcement, MapD Technologies has announced the immediate open sourcing of the MapD Core database to foster open analytics on GPUs. Anaconda and H2O already have large open source communities, which can benefit from this project immediately and drive further development to accelerate the adoption of data science and analytics on GPUs. 

About Anaconda Powered by Continuum Analytics
Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads and 4 million unique users to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Learn more at continuum.io

About H2O.ai
H2O.ai is focused on bringing AI to businesses through software. Its flagship product is H2O, the leading open source platform that makes it easy for financial services, insurance and healthcare companies to deploy AI and deep learning to solve complex problems. More than 9,000 organizations and 80,000+ data scientists depend on H2O for critical applications like predictive maintenance and operational intelligence. The company -- which was recently named to the CB Insights AI 100 -- is used by 169 Fortune 500 enterprises, including 8 of the world’s 10 largest banks, 7 of the 10 largest insurance companies and 4 of the top 10 healthcare companies. Notable customers include Capital One, Progressive Insurance, Transamerica, Comcast, Nielsen Catalina Solutions, Macy's, Walgreens and Kaiser Permanente.

About MapD Technologies
MapD Technologies is a next-generation analytics software company. Its technology harnesses the massive parallelism of modern graphics processing units (GPUs) to power lightning-fast SQL queries and visualization of large data sets. The MapD analytics platform includes the MapD Core database and MapD Immerse visualization client. These software products provide analysts and data scientists with the fastest time to insight, performance not possible with traditional CPU-based solutions. MapD software runs on-premise and on all leading cloud providers.

Founded in 2013, MapD Technologies originated from research at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). MapD is funded by GV, In-Q-Tel, New Enterprise Associates (NEA), NVIDIA, Vanedge Capital and Verizon Ventures. The company is headquartered in San Francisco.

Visit MapD at www.mapd.com or follow MapD on Twitter @mapd. For more information or to evaluate MapD, contact sales@mapd.com. Press inquiries, please contact press@mapd.com.

Media Contacts:

Jill Rosenthal
Continuum Analytics
anaconda@inkhouse.com

Mary Fuochi
MapD
press@mapd.com

James Christopherson
H2O.ai
james@vscpr.com


by swebster at May 08, 2017 12:47 PM

Matthew Rocklin

Dask Release 0.14.3

This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.14.3. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on March 22nd.

As always you can conda install from conda-forge

conda install -c conda-forge dask distributed

or you can pip install from PyPI

pip install dask[complete] --upgrade

Conda packages should be on the default channel within a few days.

Arrays

Sparse Arrays

Dask.arrays now support sparse arrays and mixed dense/sparse arrays.

>>> import dask.array as da

>>> x = da.random.random(size=(10000, 10000, 10000, 10000),
...                      chunks=(100, 100, 100, 100))
>>> x[x < 0.99] = 0

>>> import sparse
>>> s = x.map_blocks(sparse.COO)  # parallel array of sparse arrays

In order to support sparse arrays we did two things:

  1. Made dask.array support ndarray containers other than NumPy, as long as they were API compatible
  2. Made a small sparse array library that was API compatible to the numpy.ndarray

This process was pretty easy and could be extended to other systems. This also allows for different kinds of ndarrays in the same Dask array, as long as interactions between the arrays are well defined (using the standard NumPy protocols like __array_priority__ and so on.)

Documentation: http://dask.pydata.org/en/latest/array-sparse.html

Update: there is already a pull request for Masked arrays

Reworked FFT code

The da.fft submodule has been extended to include most of the functions in np.fft, with the caveat that multi-dimensional FFTs will only work along single-chunk dimensions. Still, given that rechunking is decently fast today this can be very useful for large image stacks.

Documentation: http://dask.pydata.org/en/latest/array-api.html#fast-fourier-transforms
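As a rough sketch (the array shape here is made up), an FFT along a dimension that consists of a single chunk looks like this:

import dask.array as da

# Keep the transformed (last) axis in a single chunk
x = da.random.random(size=(1000, 4096), chunks=(100, 4096))
X = da.fft.fft(x, axis=-1)   # lazy until X.compute()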

Constructor Plugins

You can now run arbitrary code whenever a dask array is constructed. This empowers users to build in their own policies like rechunking, warning users, or eager evaluation. A dask.array plugin receives each newly constructed dask array and returns either a new dask array to use in its place, or None, in which case the original array is kept.

>>> def f(x):
...     print('%d bytes' % x.nbytes)

>>> with dask.set_options(array_plugins=[f]):
...     x = da.ones((10, 1), chunks=(5, 1))
...     y = x.dot(x.T)
80 bytes
80 bytes
800 bytes
800 bytes

This can be used, for example, to convert dask.array code into numpy code to identify bugs quickly:

>>> with dask.set_options(array_plugins=[lambda x: x.compute()]):
...     x = da.arange(5, chunks=2)

>>> x  # this was automatically converted into a numpy array
array([0, 1, 2, 3, 4])

Or to warn users if they accidentally produce an array with large chunks:

def warn_on_large_chunks(x):
    shapes = list(itertools.product(*x.chunks))
    nbytes = [x.dtype.itemsize * np.prod(shape) for shape in shapes]
    if any(nb > 1e9 for nb in nbytes):
        warnings.warn("Array contains very large chunks")

with dask.set_options(array_plugins=[warn_on_large_chunks]):
    ...

These features were heavily requested by the climate science community, which tends to serve both highly technical computer scientists, and less technical climate scientists who were running into issues with the nuances of chunking.

DataFrames

Dask.dataframe changes are both numerous, and very small, making it difficult to give a representative accounting of recent changes within a blogpost. Typically these include small changes to either track new Pandas development, or to fix slight inconsistencies in corner cases (of which there are many.)

Still, two highlights follow:

Rolling windows with time intervals

>>> s.rolling('2s').count().compute()
2017-01-01 00:00:00    1.0
2017-01-01 00:00:01    2.0
2017-01-01 00:00:02    2.0
2017-01-01 00:00:03    2.0
2017-01-01 00:00:04    2.0
2017-01-01 00:00:05    2.0
2017-01-01 00:00:06    2.0
2017-01-01 00:00:07    2.0
2017-01-01 00:00:08    2.0
2017-01-01 00:00:09    2.0
dtype: float64
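The series s in this excerpt is assumed to be a Dask Series with a datetime index; a minimal sketch that produces comparable input might look like this:

import pandas as pd
import dask.dataframe as dd

# Ten one-second observations in a single-partition Dask Series
index = pd.date_range('2017-01-01', periods=10, freq='s')
s = dd.from_pandas(pd.Series(1.0, index=index), npartitions=1)
s.rolling('2s').count().compute()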

Read Parquet data with Arrow

Dask now supports reading Parquet data with both fastparquet (a NumPy/Numba solution) and Parquet-CPP via Arrow.

df = dd.read_parquet('/path/to/mydata.parquet', engine='fastparquet')
df = dd.read_parquet('/path/to/mydata.parquet', engine='arrow')

Hopefully this capability increases the use of both projects and results in greater feedback to those libraries so that they can continue to advance Python’s access to the Parquet format.

Graph Optimizations

Dask performs a few passes of simple linear-time graph optimizations before sending a task graph to the scheduler. These optimizations currently vary by collection type, for example dask.arrays have different optimizations than dask.dataframes. These optimizations can greatly improve performance in some cases, but can also increase overhead, which becomes very important for large graphs.

As Dask has grown into more communities, each with strong and differing performance constraints, we’ve found that we needed to allow each community to define its own optimization schemes. The defaults have not changed, but now you can override them with your own. This can be set globally or with a context manager.

def my_optimize_function(graph, keys):
    """ Takes a task graph and a list of output keys, returns new graph """
    new_graph = {...}
    return new_graph

with dask.set_options(array_optimize=my_optimize_function,
                      dataframe_optimize=None,
                      delayed_optimize=my_other_optimize_function):
    x, y = dask.compute(x, y)

Documentation: http://dask.pydata.org/en/latest/optimize.html#customizing-optimization
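As a minimal illustration (not the recommended set of passes), a custom optimizer might only cull tasks that are not needed for the requested outputs, using the cull helper from dask.optimize:

import dask
from dask.optimize import cull

def my_optimize_function(graph, keys, **kwargs):
    """ Keep only the tasks required to produce the requested output keys """
    optimized, dependencies = cull(graph, keys)
    return optimized

with dask.set_options(array_optimize=my_optimize_function):
    ...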

Speed improvements

Additionally, task fusion has been significantly accelerated. This is very important for large graphs, particularly in dask.array computations.

Web Diagnostics

The distributed scheduler’s web diagnostic page is now served from within the dask scheduler process. This is both good and bad:

  • Good: It is much easier to make new visuals
  • Bad: Dask and Bokeh now share a single CPU

Because Bokeh and Dask now share the same Tornado event loop we no longer need to send messages between them before sending them out to a web browser. The Bokeh server has full access to all of the scheduler state, which lets us build new diagnostic pages more easily. This setup has been around for a while but was largely used for development; in this version we've made it the default and turned off the old one.

The cost here is that the Bokeh server can take 10-20% of the CPU. If you are running a computation that heavily taxes the scheduler then you might want to close your diagnostic pages. Fortunately, this almost never happens; the dask scheduler is typically fast enough to never get close to this limit.

Tornado difficulties

Beware that the current versions of Bokeh (0.12.5) and Tornado (4.5) do not play well together. This has been fixed in development versions, and installing with conda is fine, but if you naively pip install then you may experience bad behavior.

Joblib

The Dask.distributed Joblib backend now includes a scatter= keyword, allowing you to pre-scatter select variables out to all of the Dask workers. This significantly cuts down on overhead, especially on machine learning workloads where most of the data doesn’t change very much.

# Send the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
    search.fit(digits.data, digits.target)

Early trials indicate that computations like scikit-learn’s RandomForest scale nicely on a cluster without any additional code.

Documentation: http://distributed.readthedocs.io/en/latest/joblib.html
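To make the snippet above concrete, here is a rough end-to-end sketch, assuming scikit-learn is installed and a scheduler is running at localhost:8786; the estimator and parameter grid are illustrative, not from the original post:

import distributed.joblib  # registers the dask.distributed backend with joblib
from joblib import parallel_backend
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

digits = load_digits()
param_space = {'n_estimators': [20, 50, 100], 'max_depth': [3, 5, None]}
search = RandomizedSearchCV(RandomForestClassifier(), param_space,
                            n_iter=5, n_jobs=-1)

# Send the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
    search.fit(digits.data, digits.target)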

Preload scripts

When starting a dask.distributed scheduler or worker people often want to include a bit of custom setup code, for example to configure loggers, authenticate with some network system, and so on. This has always been possible if you start scheduler and workers from within Python but is tricky if you want to use the command line interface. Now you can write your custom code as a separate standalone script and ask the command line interface to run it for you at startup:

# scheduler-setup.py
from distributed.diagnostics.plugin import SchedulerPlugin

class MyPlugin(SchedulerPlugin):
    """ Prints a message whenever a worker is added to the cluster """
    def add_worker(self, scheduler=None, worker=None, **kwargs):
        print("Added a new worker at", worker)

def dask_setup(scheduler):
    plugin = MyPlugin()
    scheduler.add_plugin(plugin)

dask-scheduler --preload scheduler-setup.py

This makes it easier for people to adapt Dask to their particular institution.

Documentation: http://distributed.readthedocs.io/en/latest/setup.html#customizing-initialization

Network Interfaces (for infiniband)

Many people use Dask on high performance supercomputers. This hardware differs from typical commodity clusters or cloud services in several ways, including very high performance network interconnects like InfiniBand. Typically these systems also have normal ethernet and other networks. You’re probably familiar with this on your own laptop when you have both ethernet and wireless:

$ ifconfig
lo          Link encap:Local Loopback                       # Localhost
            inet addr:127.0.0.1  Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
eth0        Link encap:Ethernet  HWaddr XX:XX:XX:XX:XX:XX   # Ethernet
            inet addr:192.168.0.101
            ...
ib0         Link encap:Infiniband                           # Fast InfiniBand
            inet addr:172.42.0.101

The heuristics Dask uses to determine a network interface often choose ethernet by default. If you are on an HPC system then this is likely not optimal. You can direct Dask to choose a particular network interface with the --interface keyword:

$ dask-scheduler --interface ib0
distributed.scheduler - INFO -   Scheduler at: tcp://172.42.0.101:8786

$ dask-worker tcp://172.42.0.101:8786 --interface ib0

Efficient as_completed

The as_completed iterator returns futures in the order in which they complete. It is the base of many asynchronous applications using Dask.

>>> x, y, z = client.map(inc, [0, 1, 2])
>>> for future in as_completed([x, y, z]):
...     print(future.result())
2
0
1

It can now also wait to yield an element until its result has arrived as well:

>>> for future, result in as_completed([x, y, z], with_results=True):
...     print(result)
2
0
1

It can also yield all futures (and results) that have finished up to this point:

>>> for futures in as_completed([x, y, z]).batches():
...    print(client.gather(futures))
(2, 0)
(1,)

Both of these help to decrease the overhead of tight inner loops within asynchronous applications.

Example blogpost here: http://matthewrocklin.com/blog/work/2017/04/19/dask-glm-2

Co-released libraries

This release is aligned with a number of other related libraries, notably Pandas, and several smaller libraries for accessing data, including s3fs, hdfs3, fastparquet, and python-snappy, each of which has seen numerous updates over the past few months. Much of the work on these latter libraries is being coordinated by Martin Durant.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.1 release on March 22nd

  • Antoine Pitrou
  • Dmitry Shachnev
  • Erik Welch
  • Eugene Pakhomov
  • Jeff Reback
  • Jim Crist
  • John A Kirkham
  • Joris Van den Bossche
  • Martin Durant
  • Matthew Rocklin
  • Michal Ficek
  • Noah D Brenowitz
  • Stuart Archibald
  • Tom Augspurger
  • Wes McKinney
  • wikiped

The following people contributed to the dask/distributed repository since the 1.16.1 release on March 22nd

  • Antoine Pitrou
  • Bartosz Marcinkowski
  • Ben Schreck
  • Jim Crist
  • Jens Nie
  • Krisztián Szűcs
  • Lezyes
  • Luke Canavan
  • Martin Durant
  • Matthew Rocklin
  • Phil Elson

May 08, 2017 12:00 AM

May 02, 2017

Enthought

Webinar: Python for Data Science: A Tour of Enthought’s Professional Training Course

What: A guided walkthrough and Q&A about Enthought’s technical training course “Python for Data Science and Machine Learning” with VP of Training Solutions, Dr. Michael Connell

Who Should Watch: individuals, team leaders, and learning & development coordinators who are looking to better understand the options to increase professional capabilities in Python for data science and machine learning applications



Enthought’s Python for Data Science training course is designed to accelerate the development of skill and confidence in using Python’s core data science tools — including the standard Python language, the fast array programming package NumPy, and the Pandas data analysis package, as well as tools for database access (DBAPI2, SQLAlchemy), machine learning (scikit-learn), and visual exploration (Matplotlib, Seaborn).

In this webinar, we give you the key information and insight you need to evaluate whether Enthought’s Python for Data Science course is the right solution to advance your professional data science skills in Python, including:

  • Who will benefit most from the course
  • A guided tour through the course topics
  • What skills you’ll take away from the course, how the instructional design supports that
  • What the experience is like, and why it is different from other training alternatives (with a sneak peek at actual course materials)
  • What previous course attendees say about the course



Presenter: Dr. Michael Connell, VP, Enthought Training Solutions

Ed.D, Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT


Considering Moving to Python for Data Science?

Then Enthought’s Python for Data Science training course is definitely for you! This class has been particularly appealing to people who have been using other tools like R or SAS (or even Excel) for their data science work and want to start applying their analytic skills using the Python toolset.  And it’s no wonder — Python has been identified as the most popular coding language for five years in a row for good reason.

One reason for Python’s broad popularity across a range of disciplines is its efficiency and ease-of-use. Many people consider Python more fun to work in than other languages (and we agree!). Another reason for its popularity among data analysts and data scientists in particular is Python’s extensive (and growing) open source library of powerful tools for preparing, visualizing, analyzing, and modeling data.

Python is also an extraordinarily comprehensive toolset – it supports everything from interactive analysis to automation to software engineering to web app development within a single language and plays very well with other languages like C/C++ or FORTRAN so you can continue leveraging your existing code libraries written in those other languages.

Many organizations are moving to Python so they can consolidate all of their technical work streams under a single comprehensive toolset. In the first part of this class we’ll give you the fundamentals you need to switch from another language to Python and then we cover the core tools that will enable you to do in Python what you were doing with other tools, only faster!

Additional Resources

Upcoming Open Python for Data Science Sessions:
Austin, TX, June 12-16, 2017
San Jose, CA, July 17-21, 2017
Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.
Download Enthought’s Machine Learning with Python’s Scikit-Learn Cheat Sheets
Download Enthought’s Pandas Cheat Sheets

The post Webinar: Python for Data Science: A Tour of Enthought’s Professional Training Course appeared first on Enthought Blog.

by admin at May 02, 2017 08:03 PM

Matthieu Brucher

Using Audio ToolKit with JUCE 5

As some may have seen online, ROLI released a new version of JUCE. The nice thing is that they added a new tier for people like me who don’t sell plugins but who don’t want to release their code under the GPL license for various reasons (for me, it was formerly the incompatibility between the VST3 license and the GPL).

With JUCE 5, you have support for all major APIs, from VST2 to Audio Unit v3, and also AAX or VST3, and you can develop your own plugins. The caveat with this tier is that you get a splash screen and tracking of your users (actually, there is a flag to remove both). The advantage is that on macOS there are no more SDK conflicts, and I now have Audio Unit v3 support.

So I’ve started playing with Projucer and built a barebones ATK plugin that doesn’t do anything. What I can say is that the worst part is handling universal binaries and supporting 32-bit plugins, as the JUCE project builder overwrites all my changes. Even adding ATK is painful with the project manager.

So instead, I’m going the WDL-OL route here, keeping this ATKJUCE plugin as the simple plugin I’ll duplicate by changing its name and content. I have my own builders that build the plugins and create the installers, all while keeping the same JUCE core code (it is shared by all plugins).

The next step is trying to make sense of the API to build a nicer GUI than what I currently have (probably something flat). The GUI tutorials are short and too basic, although WDL-OL was no better in that respect, apart from having more examples.

by Matt at May 02, 2017 07:21 AM

May 01, 2017

numfocus

NumFOCUS projects participate in Docathon 2017

Seven NumFOCUS sponsored projects participated in Docathon 2017: IPython, Project Jupyter, Julia, Matplotlib, pandas, SunPy, and yt. The Docathon is like a hackathon but is focused on developing material and tools for documentation. Documentation is one of the most important components of the open science ecosystem—and it’s everywhere! From examples that provide inspiration for things you can […]

by Gina Helfrich at May 01, 2017 02:00 PM

April 28, 2017

Leonardo Uieda

Thoughts from the Introduction to Python Workshop at UH Manoa


Last week, I taught a 3-day Python workshop at the Department of Geology and Geophysics of the University of Hawaii at Manoa, where I'm currently doing a postdoc. It covered the basics of computer programming with Python, starting from the very beginning. Below are thoughts and information about the workshop, the demographics of people who signed up, and the feedback that I got from the participants.

See the workshop page and Github repository for more information and links to material used.

My goals

I wanted this to be a hands-on workshop of the basic concepts needed to use Python for research. Participants who complete the workshop should be able to use Python to gather data from one or more files, process the data, run an analysis, make publication quality figures, and save the output. Most importantly, I wanted participants to know what they should type into Google to learn more about Python.

Materials

The class is based on a mixture of the Software Carpentry lessons Plotting and Programming in Python (under development) and Programming with Python. However, I use temperature data from Berkeley Earth instead of the Gapminder and inflammation data used by Software Carpentry. For example, our goal for the second day of the workshop was to reproduce a figure of the average temperature variation in Hawaii from the Berkeley Earth website.

On the last day, we finished with some code that processed a list of country names to download the respective data file (using requests), load it into Python, make a figure, and save it to a different folder (see this Jupyter notebook for the code).

I also used a few techniques from the Software Carpentry Instructor Training, mainly the shared class notes (I used Google Docs instead of Etherpad) and colored sticky notes. The sticky notes were in two colors: pink and blue. Learners kept the blue sticky note on their laptop lid if everything was OK. They put up the pink one if they have a problem or need help. This way, myself and the teaching assistants can know at a glance who needs help. I had them write positive and negative feedback on the sticky notes at the end of the workshop.

The notebooks that I created during class and some notes for myself are in the Github repository (in the notebooks and notes folders, respectively).

Who attended

I asked all participants to sign up through a Google Form that asked a few questions regarding their operating system, background in programming, and position at the university. The file demographics.csv in the Github repository has the anonymous information from this form. I wrote some code to analyze the data and generate the figures below using pandas and matplotlib. You can find it in the demographics-analysis.ipynb Jupyter notebook (also in the Github repo).
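As a purely hypothetical sketch of that kind of analysis (the column name below is made up; the real code lives in the notebook):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('demographics.csv')
# 'experience' is a placeholder column name for illustration only
df['experience'].value_counts().plot.bar()
plt.ylabel('Number of participants')
plt.tight_layout()
plt.show()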

First, let's look at how many people signed up and then actually attended each day.

Number of attendants per day of the workshop.

We were very lucky that most people who signed up also attended the workshop on the first day, even though there was no sign up fee. It seems that the workshop really fills a need in the community! Some people gave up after the first day. Maybe it was too fast, or too basic, or life just happened. I can't really tell because I failed to collect feedback at the end of each day. That is something to keep in mind for the next iteration: get feedback every day.

The experience level of participants was more evenly distributed than I expected. I was very pleased with the number of people who had never programmed before. But the distribution made it challenging to keep everyone motivated and following along. From the feedback (see below), it seems that I managed it well enough.

Not surprisingly, most participants who already programmed know Matlab. What was a bit surprising is how few people reported experience with Fortran. Is this a reflection of the age of participants (a lot of young grad students)? The number of Fortran users does correlate with the number of faculty who signed up, so maybe yes.

I was very pleased to have someone from the Nā Kūpuna Senior Citizen Visitor Program and a not insignificant number of faculty. We even had an "interested citizen" who studies film production and education (a personal friend)!

Feedback

Feedback on the colored sticky notes.

This is a synthesis from the feedback given by participants on the last day (using the pink and blue sticky notes):

The Good (number of sticky notes in parentheses):

  • Instructor style (5)
  • Examples and exercises (5)
  • Dense but efficient (learn a lot in little time) (3)
  • Using real data (3)
  • Simple and accessible level (3)
  • Shared notes (1)

The Bad:

  • Too short (5)
  • Too fast (3)
  • Too slow (3)
  • Hard to multi-task (pay attention + notes + exercise) (2)
  • No TA on Monday (2)
  • Instructor took too many tangents when teaching (1)
  • Too many people (1)
  • Ran late (1)
  • Jupyter Google maps example failed (1)

It was a very funny coincidence that the exact same number of people complained about the pace being either too fast or too slow.

Lessons learned

I was glad to see that the hands-on approach worked and that the students appreciated using real data during the exercises. We had very little time (6h total) to cover a lot of material, so it's no wonder that people thought it was too short, and I maybe didn't explain some concepts (like for and if) quite as thoroughly as I would have liked. Regarding the pace, it's hard to satisfy everyone. I expect that novices might have found the pace a bit too fast and more experienced programmers found it too slow (but I don't have data to back that up). Since only 6 people complained about the pace, I guess it wasn't too bad. Not having a TA on Monday (the first day) was not good because that is when the most serious problems occur (Jupyter won't start, where is my Python?, lost my files, etc.). The next two days of the workshop were much smoother thanks to the generous help of volunteer TAs Sam Murphy and Julie Schnurr. We also didn't get to cover the last few topics on the last day (functions and getting data from headers).

A few things that I would have done differently:

  • Get feedback through sticky notes at the end of each day. The Software Carpentry material actually recommends this but I completely forgot.
  • Use more pair programming activities. This is also something recommended by Software Carpentry and that I had planned on doing. In the end, I left this as optional and didn't explicitly pair learners. A lot of people were interacting naturally but I would have liked to see more of it.

Have you taught or participated in a workshop like this before? What were your experiences?


Comments? Leave one below or let me know on Twitter @leouieda or in the Software Underground Slack group.

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!


April 28, 2017 12:00 PM

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly(ish) about the work done on Dask and related projects during the previous week. This log covers work done between 2017-04-20 and 2017-04-28. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Development in Dask and Dask-related projects during the last week includes the following notable changes:

  1. Improved Joblib support, accelerating existing Scikit-Learn code
  2. A dask-glm powered LogisticRegression estimator that is scikit-learn compatible
  3. Additional Parquet support by Arrow
  4. Sparse arrays
  5. Better spill-to-disk behavior
  6. AsyncIO compatible Client
  7. TLS (SSL) support
  8. NumPy __array_ufunc__ protocol

Joblib

Scikit learn parallelizes most of their algorithms with Joblib, which provides a simple interface for embarrassingly parallel computations. Dask has been able to hijack joblib code and serve as the backend for some time now, but it had some limitations, particularly because we would repeatedly send data back and forth from a worker to client for every batch of computations.

import distributed.joblib
from joblib import Parallel, parallel_backend

with parallel_backend('dask.distributed', scheduler_host='HOST:PORT'):
    # normal Joblib code

Now there is a scatter= keyword, which allows you to pre-scatter select variables out to all of the Dask workers. This significantly cuts down on overhead, especially on machine learning workloads where most of the data doesn’t change very much.

# Send the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
    search.fit(digits.data, digits.target)

Early trials indicate that computations like scikit-learn’s RandomForest scale nicely on a cluster without any additional code.

This is particularly nice because it allows Dask and Scikit-Learn to play well together without having to introduce Dask within the Scikit-Learn codebase at all. From a maintenance perspective this combination is very attractive.

Work done by Jim Crist in dask/distributed #1022

Dask-GLM Logistic Regression

The convex optimization solvers in the dask-glm project allow us to solve common machine learning and statistics problems in parallel and at scale. Historically this young library has contained only optimization solvers and relatively little in the way of user API.

This week dask-glm grew new LogisticRegression and LinearRegression estimators that expose the scalable convex optimization algorithms within dask-glm through a Scikit-Learn compatible interface. This can both speedup solutions on a single computer or provide solutions for datasets that were previously too large to fit in memory.

from dask_glm.estimators import LogisticRegression

est = LogisticRegression()
est.fit(my_dask_array, labels)

This notebook compares performance to the latest release of scikit-learn on a 5,000,000 dataset running on a single machine. Dask-glm beats scikit-learn by a factor of four, which is also roughly the number of cores on the development machine. However in response this notebook by Olivier Grisel shows the development version of scikit-learn (with a new algorithm) beating out dask-glm by a factor of six. This just goes to show you that being smarter about your algorithms is almost always a better use of time than adopting parallelism.

Work done by Tom Augspurger and Chris White in dask/dask-glm #40

Parquet with Arrow

The Parquet format is quickly becoming a standard for parallel and distributed dataframes. There are currently two Parquet reader/writers accessible from Python, fastparquet a NumPy/Numba solution, and Parquet-CPP a C++ solution with wrappers provided by Arrow. Dask.dataframe has supported parquet for a while now with fastparquet.

However, users will now have an option to use Arrow instead by switching the engine= keyword in the dd.read_parquet function.

df = dd.read_parquet('/path/to/mydata.parquet', engine='fastparquet')
df = dd.read_parquet('/path/to/mydata.parquet', engine='arrow')

Hopefully this capability increases the use of both projects and results in greater feedback to those libraries so that they can continue to advance Python’s access to the Parquet format. As a gentle reminder, you can typically get much faster query times by switching from CSV to Parquet. This is often much more effective than parallel computing.

Work by Wes McKinney in dask/dask #2223.

Sparse Arrays

There is a small multi-dimensional sparse array library here: https://github.com/mrocklin/sparse. It allows us to represent arrays compactly in memory when most entries are zero. This differs from the standard solution in scipy.sparse, which can only support arrays of dimension two (matrices) and not greater.

pip install sparse

>>> import numpy as np
>>> x = np.random.random(size=(10, 10, 10, 10))
>>> x[x < 0.9] = 0
>>> x.nbytes
80000

>>> import sparse
>>> s = sparse.COO(x)
>>> s
<COO: shape=(10, 10, 10, 10), dtype=float64, nnz=1074>

>>> s.nbytes
12888

>>> sparse.tensordot(s, s, axes=((1, 0, 3), (2, 1, 0))).sum(axis=1)
array([ 100.93868073,  128.72312323,  119.12997217,  118.56304153,
        133.24522101,   98.33555365,   90.25304866,   98.99823973,
        100.57555847,   78.27915528])

Additionally, this sparse library more faithfully follows the numpy.ndarray API, which is exactly what dask.array expects. Because of this close API matching dask.array is able to parallelize around sparse arrays just as easily as it parallelizes around dense numpy arrays. This gives us a decent distributed multidimensional sparse array library relatively cheaply.

>>> import dask.array as da
>>> x = da.random.random(size=(10000, 10000, 10000, 10000),
...                      chunks=(100, 100, 100, 100))
>>> x[x < 0.9] = 0

>>> s = x.map_blocks(sparse.COO)  # parallel array of sparse arrays

Work on the sparse library is so far by myself and Jake VanderPlas and is available here. Work connecting this up to Dask.array is in dask/dask #2234.

Better spill to disk behavior

I’ve been playing with a 50GB sample of the 1TB Criteo dataset on my laptop (this is where I’m using sparse arrays). To make computations flow a bit faster I’ve improved the performance of Dask’s spill-to-disk policies.

Now, rather than depend on (cloud)pickle we use Dask’s network protocol, which handles data more efficiently, compresses well, and has special handling for common and important types like NumPy arrays and things built out of NumPy arrays (like sparse arrays).

As a result reading and writing excess data to disk is significantly faster. When performing machine learning computations (which are fairly heavy-weight) disk access is now fast enough that I don’t notice it in practice and running out of memory doesn’t significantly impact performance.

This is only really relevant when using common types (like numpy arrays) and when your computation to disk access ratio is relatively high (such as is the case for analytic workloads), but it was a simple fix and yielded a nice boost to my personal productivity.
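This serialization machinery is internal, but for illustration, something along these lines round-trips a NumPy array through distributed's protocol (a rough sketch, not something you normally call yourself):

import numpy as np
from distributed.protocol import serialize, deserialize

x = np.arange(1000000, dtype='float64')
header, frames = serialize(x)      # NumPy arrays get special, efficient handling
y = deserialize(header, frames)    # rebuild the array from raw frames
assert (x == y).all()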

Work by myself in dask/distributed #946.

AsyncIO compatible Client

The Dask.distributed scheduler maintains a fully asynchronous API for use with non-blocking systems like Tornado or AsyncIO. Because Dask supports Python 2, all of our internal code is written with Tornado. While Tornado and AsyncIO can work together, this generally requires a bit of excess book-keeping, like turning Tornado futures into AsyncIO futures, etc.

Now there is an AsyncIO specific Client that only includes non-blocking methods that are AsyncIO native. This allows for more idiomatic asynchronous code in Python 3.

async with AioClient('scheduler-address:8786') as c:
    future = c.submit(func, *args, **kwargs)
    result = await future

Work by Krisztián Szűcs in dask/distributed #1029.

TLS (SSL) support

TLS (previously called SSL) is a common and trusted solution to authentication and encryption. It is a commonly requested feature at companies and institutions where intra-network security is important. This is currently being worked on at dask/distributed #1034. I encourage anyone this may affect to engage on that pull request.

Work by Antoine Pitrou in dask/distributed #1034 and previously by Marius van Niekerk in dask/distributed #866.

NumPy __array_ufunc__

This recent change in NumPy (literally merged as I was typing this blogpost) allows other array libraries to take control of the existing NumPy ufuncs, so if you call something like np.exp(my_dask_array) this will no longer convert to a NumPy array, but will rather call the appropriate dask.array.exp function. This is a big step towards writing generic array code that works both on NumPy arrays as well as other array projects like dask.array, xarray, bcolz, sparse, etc.
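Once both sides implement the protocol, code like the following keeps the computation lazy; this is a sketch of the intended behavior, assuming versions of NumPy and Dask that are new enough to support it:

import numpy as np
import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))
y = np.exp(x)      # dispatches to da.exp rather than converting to a NumPy array
print(type(y))     # still a lazy dask array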

As with all large changes in NumPy this was accomplished through a collaboration of many people. PR in numpy/numpy #8247.

April 28, 2017 12:00 AM

April 25, 2017

Matthieu Brucher

Announcement: ATKStereoUniversalDelay 1.0.0

I’m happy to announce the release of a stereo delay based on the Audio Toolkit that allows ping-pong-like effects. It is available on Windows and OS X (min. 10.11) in different formats.

ATKStereoUniversalDelay

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKStereoUniversalDelay.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.


by Matt at April 25, 2017 07:40 AM

April 24, 2017

numfocus

Moore Foundation gives grant to support NumFOCUS Diversity & Inclusion in Scientific Computing initiatives

As part of our mission to support and promote better science through support of the open source scientific software community, NumFOCUS champions technical progress through diversity. NumFOCUS recognizes that the open source data science community is currently highly homogenous. We believe that diverse contributors and community members produce better science and better projects. NumFOCUS strives […]

by Gina Helfrich at April 24, 2017 07:23 PM

Anyone Can Do Astronomy with Python and Open Data

Ole Moeller-Nilsson, CTO at Pivigo, was kind enough to share his insights on how a beginner can easily get started exploring astronomy using Python. This blog post grew out of a presentation he gave at PyData London meetup on March 7th. Python is a great language for science, and specifically for astronomy. The various packages […]

by Gina Helfrich at April 24, 2017 03:00 PM

April 23, 2017

Titus Brown

A (revised and updated) shotgun metagenome workshop at UC Santa Cruz

We just finished teaching a second version of our two-day shotgun metagenome analysis workshop, this time at UC Santa Cruz (the first one was in October 2016, at Scripps Institute of Oceanography). Harriet Alexander led the workshop and Phillip Brooks and I co-taught; Luiz Irber, Shannon Joslin, and Taylor Reiter TAed. The workshop was hosted by Professor Marilou Sison-Mangus at the Earth and Marine Sciences Building.

(Note that Harriet will be running an expanded version of this workshop at our summer institute, July 17-21. Registration is still open!)

About 30-35 people came the first day, and about 30 were there on the second.

Some good - new lessons!

In addition to our old lessons on Illumina read QC, assembly with MEGAHIT, annotation with Prokka, and quantification with Salmon, we introduced two new lessons --

For all of this we used subset data from Hu et al. (the Banfield Lab), 2016, which is a great low-complexity metagenome.

More good - using XSEDE Jetstream instead of Amazon Web Services!

This was the first genomics workshop in many years where we didn't use Amazon Web Services - we used XSEDE Jetstream instead. See our login instructions here.

Why are we abandoning Amazon? Two reasons --

  • while we've been teaching it for almost 8 years now, the conversion rate seems to be very low: AFAICT our students aren't using it, because it costs money and their advisors don't want to pay for AWS when they can use institutional resources. (This is anecdotal.)
  • since sometime before October 2016, Amazon changed their registration system so that newly registered people cannot start up instances for a few hours after their first try. This is death on half-day and two-day workshops. (You can read a bit more about it here.) There seems to be nothing that AWS folk can do to help us, so we are giving up.

I am happy to report that Jetstream went more smoothly than AWS in almost every way and seems to perfectly meet our needs for training! We may have more to say about it after our summer institute's use.

I also suspect that people will be more inclined to use Jetstream if they can get allocations on it for free; there was significant interest in this during the workshop.

Other good --

  • As always, the people that attended the workshop were fantastic, and dealt with our occasional hiccups pretty well!
  • We managed to pretty smoothly move between the command line and the Jupyter Notebook for two of the lessons, which was pretty cool.
  • We managed to implement a simple demo of a tetramer nucleotide frequency clustering system using sourmash and t-SNE - see the notebook on github (which should be run after the initial steps in the binning lesson).

There is no bad or ugly

Nothing went wrong! Which I guess is a 'good' all on its own!

There were a few minor issues with the Jetstream desktop, and some problems with starting up Jetstream instances every now and then, and the guest network at UCSC blocked port 8000 (which we used for Jupyter), but most of the time we could work around these issues.

Feedback from participants

The in-person feedback (which is admittedly always kinder than the anonymous feedback :) was excellent - students really liked the hands-on teaching style (Carpentry-style, but with copy/paste) and the slow teaching pace with lots of time for questions was well received.

Misc notes

As always, our materials are available under CC0 on github - the URL is https://github.com/ngs-docs/2017-ucsc-metagenomics.

--titus

by C. Titus Brown at April 23, 2017 10:00 PM

April 21, 2017

numfocus

NumFOCUS Welcomes SunPy, Our Newest Fiscally Sponsored Project

​NumFOCUS is pleased to announce the addition of SunPy to our fiscally sponsored projects. SunPy is a community-developed, free and open-source software library for solar physics based on Python. The aim of the SunPy project is to provide the software tools necessary so that anyone can analyze solar data. SunPy is written using the Python programming language […]

by Gina Helfrich at April 21, 2017 05:04 PM

April 20, 2017

Continuum Analytics news

Two Peas in a Pod: Anaconda + IBM Cognitive Systems

Thursday, April 20, 2017
Travis Oliphant
President, Chief Data Scientist & Co-Founder

There is no question that deep learning has come out to play across a wide range of sectors—finance, marketing, pharma, legal...the list goes on. What’s more, from now until 2022, the deep learning market is expected to grow more than 65 percent. Clearly, companies are increasingly looking deeply at this popular machine learning approach to help fulfill business needs. Deep learning makes it possible to process giant datasets with billions of elements and extract useful predictive models. Deep learning is transforming the businesses of leading consumer Web and mobile app companies and is also being adopted by more traditional business enterprises. 

That’s why this week we are pleased to announce the availability of Anaconda on IBM’s Cognitive Systems, the company’s high performance deep learning platform, highlighting the fact that Anaconda is regarded as an important capability for developers building cognitive solutions. The platform empowers these developers and data scientists to build and deploy deep learning applications that are ready to scale. Anaconda is also integrating with the IBM PowerAI software distribution that makes it simpler for companies to take advantage of Power performance and GPU optimization for data intensive cognitive workloads. 

At Anaconda, we’re helping leading businesses across the world, like IBM, solve the world’s most challenging problems—from improving medical treatments to discovering planets to predicting effects of public policy—by handing them tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence. This news reiterates the importance of Open Data Science in all factors of business. 

Want to learn more about this news? Read the press release here

by swebster at April 20, 2017 03:05 PM

NeuralEnsemble

PyNN 0.9.0 released

I'm happy to announce the release of PyNN 0.9.0!

This version of PyNN adopts the new, simplified Neo object model, first released as Neo 0.5.0, for the data structures returned by Population.get_data(). For more information on the new Neo API, see the Neo release notes

The main difference for a PyNN user is that the AnalogSignalArray class has been renamed to AnalogSignal, and similarly the Segment.analogsignalarrays attribute is now called Segment.analogsignals.
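As a minimal sketch of what this looks like in user code (assuming the NEST backend is installed; any supported simulator would do):

import pyNN.nest as sim

sim.setup(timestep=0.1)
pop = sim.Population(10, sim.IF_cond_exp())
pop.record('v')
sim.run(100.0)

data = pop.get_data()                       # a Neo Block
vm = data.segments[0].analogsignals[0]      # previously segments[0].analogsignalarrays[0]
sim.end()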

What is PyNN?

PyNN (pronounced 'pine') is a simulator-independent language for building neuronal network models.

In other words, you can write the code for a model once, using the PyNN API and the Python programming language, and then run it without modification on any simulator that PyNN supports (currently NEURON, NEST and Brian as well as the SpiNNaker and BrainScaleS neuromorphic hardware systems).

Even if you don't wish to run simulations on multiple simulators, you may benefit from writing your simulation code using PyNN's powerful, high-level interface. In this case, you can use any neuron or synapse model supported by your simulator, and are not restricted to the standard models.

The code is released under the CeCILL licence (GPL-compatible).

by Andrew Davison (noreply@blogger.com) at April 20, 2017 11:11 AM

April 19, 2017

Matthew Rocklin

Asynchronous Optimization Algorithms with Dask

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

In a previous post we built convex optimization algorithms with Dask that ran efficiently on a distributed cluster and were important for a broad class of statistical and machine learning algorithms.

We now extend that work by looking at asynchronous algorithms. We show the following:

  1. APIs within Dask to build asynchronous computations generally, not just for machine learning and optimization
  2. Reasons why asynchronous algorithms are valuable in machine learning
  3. A concrete asynchronous algorithm (Async ADMM) and its performance on a toy dataset

This blogpost is co-authored by Chris White (Capital One) who knows optimization and Matthew Rocklin (Continuum Analytics) who knows distributed computing.

Reproducible notebook available here

Asynchronous vs Blocking Algorithms

When we say asynchronous we contrast it against synchronous or blocking.

In a blocking algorithm you send out a bunch of work and then wait for the result. Dask’s normal .compute() interface is blocking. Consider the following computation where we score a bunch of inputs in parallel and then find the best:

import dask

scores = [dask.delayed(score)(x) for x in L]  # many lazy calls to the score function
best = dask.delayed(max)(scores)
best = best.compute()  # Trigger all computation and wait until complete

This blocks. We can’t do anything while it runs. If we’re in a Jupyter notebook we’ll see a little asterisk telling us that we have to wait.

A Jupyter notebook cell blocking on a dask computation

In a non-blocking or asynchronous algorithm we send out work and track results as they come in. We are still able to run commands locally while our computations run in the background (or on other computers in the cluster). Dask has a variety of asynchronous APIs, but the simplest is probably the concurrent.futures API where we submit functions and then can wait and act on their return.

from dask.distributed import Client, as_completed
client = Client('scheduler-address:8786')

# Send out several computations
futures = [client.submit(score, x) for x in L]

# Find max as results arrive
best = 0
for future in as_completed(futures):
    score = future.result()
    if score > best:
        best = score

These two solutions are computationally equivalent. They do the same work and run in the same amount of time. The blocking dask.delayed solution is probably simpler to write down but the non-blocking futures + as_completed solution lets us be more flexible.

For example, if we get a score that is good enough then we might stop early. If we find that certain kinds of values are giving better scores than others then we might submit more computations around those values while cancelling others, changing our computation during execution.

This ability to monitor and adapt a computation during execution is one reason why people choose asynchronous algorithms. In the case of optimization algorithms we are doing a search process and frequently updating parameters. If we are able to update those parameters more frequently then we may be able to slightly improve every subsequently launched computation. Asynchronous algorithms enable increased flow of information around the cluster in comparison to more lock-step batch-iterative algorithms.
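For instance, a minimal sketch of the early-stopping pattern described above, where score, L, and GOOD_ENOUGH stand in for problem-specific pieces:

from dask.distributed import Client, as_completed

client = Client('scheduler-address:8786')
futures = [client.submit(score, x) for x in L]

best = 0
for future in as_completed(futures):
    best = max(best, future.result())
    if best > GOOD_ENOUGH:
        client.cancel(futures)   # stop any work that is still pending
        break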

Asynchronous ADMM

In our last blogpost we showed a simplified implementation of Alternating Direction Method of Multipliers (ADMM) with dask.delayed. We saw that in a distributed context it performed well when compared to a more traditional distributed gradient descent. This algorithm works by solving a small optimization problem on every chunk of our data using our current parameter estimates, bringing these back to the local process, combining them, and then sending out new computation on updated parameters.

Now we alter this algorithm to update asynchronously, so that our parameters change continuously as partial results come in in real-time. Instead of sending out and waiting on batches of results, we now consume and emit a constant stream of tasks with slightly improved parameter estimates.

We show three algorithms in sequence:

  1. Synchronous: The original synchronous algorithm
  2. Asynchronous-single: updates parameters with every new result
  3. Asynchronous-batched: updates with all results that have come in since we last updated.

Setup

We create fake data

import numpy as np
import dask.array as da
from dask import persist  # imports assumed by this snippet

n, k, chunksize = 50000000, 100, 50000

beta = np.random.random(k) # random beta coefficients, no intercept
zero_idx = np.random.choice(len(beta), size=10)
beta[zero_idx] = 0 # set some parameters to 0
X = da.random.normal(0, 1, size=(n, k), chunks=(chunksize, k))
y = X.dot(beta) + da.random.normal(0, 2, size=n, chunks=(chunksize,)) # add noise

X, y = persist(X, y)  # trigger computation in the background

We define local functions for ADMM. These correspond to solving an l1-regularized Linear regression problem:

def local_f(beta, X, y, z, u, rho):
    return ((y - X.dot(beta)) **2).sum() + (rho / 2) * np.dot(beta - z + u,
                                                              beta - z + u)

def local_grad(beta, X, y, z, u, rho):
    return 2 * X.T.dot(X.dot(beta) - y) + rho * (beta - z + u)


def shrinkage(beta, t):
    return np.maximum(0, beta - t) - np.maximum(0, -beta - t)

from functools import partial

# local_update is the black-box chunk solver from the previous ADMM blogpost
local_update2 = partial(local_update, f=local_f, fprime=local_grad)

lamduh = 7.2 # regularization parameter

# algorithm parameters
rho = 1.2
abstol = 1e-4
reltol = 1e-2

z = np.zeros(p)  # the initial consensus estimate

# an array of the individual "dual variables" and parameter estimates,
# one for each chunk of data
u = np.array([np.zeros(p) for i in range(nchunks)])
betas = np.array([np.zeros(p) for i in range(nchunks)])

Finally because ADMM doesn’t want to work on distributed arrays, but instead on lists of remote numpy arrays (one numpy array per chunk of the dask.array) we convert each our Dask.arrays into a list of dask.delayed objects:

XD = X.to_delayed().flatten().tolist() # a list of numpy arrays, one for each chunk
yD = y.to_delayed().flatten().tolist()

Synchronous ADMM

In this algorithm we send out many tasks to run, collect their results, update parameters, and repeat. In this simple implementation we continue for a fixed amount of time but in practice we would want to check some convergence criterion.

start = time.time()

while time.time() - start < MAX_TIME:
    # process each chunk in parallel, using the black-box 'local_update' function
    betas = [delayed(local_update2)(xx, yy, bb, z, uu, rho)
             for xx, yy, bb, uu in zip(XD, yD, betas, u)]
    betas = np.array(da.compute(*betas))  # collect results back

    # Update Parameters
    ztilde = np.mean(betas + np.array(u), axis=0)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    u += betas - z  # update dual variables

    # track convergence metrics
    update_metrics()

Asynchronous ADMM

In the asynchronous version we send out only enough tasks to occupy all of our workers. We collect results one by one as they finish, update parameters, and then send out a new task.

# Submit enough tasks to occupy our current workers
starting_indices = np.random.choice(nchunks, size=ncores*2, replace=True)
futures = [client.submit(local_update, XD[i], yD[i], betas[i], z, u[i],
                           rho, f=local_f, fprime=local_grad)
           for i in starting_indices]
index = dict(zip(futures, starting_indices))

# An iterator that returns results as they come in
pool = as_completed(futures, with_results=True)

start = time.time()
count = 0

while time.time() - start < MAX_TIME:
    # Get next completed result
    future, local_beta = next(pool)
    i = index.pop(future)
    betas[i] = local_beta
    count += 1

    # Update parameters (this could be made more efficient)
    ztilde = np.mean(betas + np.array(u), axis=0)

    if count < nchunks:  # artificially inflate beta in the beginning
        ztilde *= nchunks / (count + 1)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    update_metrics()

    # Submit new task to the cluster
    i = random.randint(0, nchunks - 1)
    u[i] += betas[i] - z
    new_future = client.submit(local_update2, XD[i], yD[i], betas[i], z, u[i], rho)
    index[new_future] = i
    pool.add(new_future)

Batched Asynchronous ADMM

With enough distributed workers we find that our parameter-updating loop on the client can be the limiting factor. After profiling it seems that our client was bound not by updating parameters, but rather by computing the performance metrics that we are going to use for the convergence plots below (so not actually a limitation in practice). However we decided to leave this in because it is good practice for what is likely to occur in larger clusters, where the single machine that updates parameters is possibly overwhelmed by a high volume of updates from the workers. To resolve this, we build in batching.

Rather than update our parameters one by one, we update them with however many results have come in so far. This provides a natural defense against a slow client: the approach smoothly shifts our algorithm back toward the synchronous solution when the client becomes overwhelmed (though again, at this scale we’re fine).

Conveniently, the as_completed iterator has a .batches() method that iterates over all of the results that have come in so far.

# ... same setup as before

pool = as_completed(futures, with_results=True)

batches = pool.batches()            # <<<--- this is new

while time.time() - start < MAX_TIME:

    # Get all tasks that have come in since we checked last time
    batch = next(batches)           # <<<--- this is new
    for future, result in batch:
        i = index.pop(future)
        betas[i] = result
        count += 1

    ztilde = np.mean(betas + np.array(u), axis=0)
    if count < nchunks:
        ztilde *= nchunks / (count + 1)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    update_metrics()

    # Submit as many new tasks as we collected
    for _ in batch:                 # <<<--- this is new
        i = random.randint(0, nchunks - 1)
        u[i] += betas[i] - z
        new_fut = client.submit(local_update2, XD[i], yD[i], betas[i], z, u[i], rho)
        index[new_fut] = i
        pool.add(new_fut)

Visual Comparison of Algorithms

To show the qualitative difference between the algorithms we include profile plots of each. Note the following:

  1. Synchronous has blocks of full CPU use followed by blocks of no use
  2. The asynchronous methods are smoother
  3. The asynchronous single-update method has a lot of whitespace / time when CPUs are idling. This is artificial, and happens because the code that tracks convergence diagnostics for the plots below is wasteful and sits inside the client's inner loop
  4. We intentionally leave in this wasteful code so that we can reduce it by batching in the third plot, which is more saturated.

You can zoom in using the tools to the upper right of each plot. You can view the full profile in a full window by clicking on the “View full page” link.

Synchronous

View full page

Asynchronous single-update

View full page

Asynchronous batched-update

View full page

Plot Convergence Criteria

Primal residual for async-admm

Analysis

To get a better sense of what these plots convey, recall that optimization problems always come in pairs: the primal problem is typically the main problem of interest, and the dual problem is a closely related problem that provides information about the constraints in the primal problem. Perhaps the most famous example of duality is the Max-flow-min-cut Theorem from graph theory. In many cases, solving both of these problems simultaneously leads to gains in performance, which is what ADMM seeks to do.

In our case, the constraint in the primal problem is that all workers must agree on the optimum parameter estimate. Consequently, we can think of the dual variables (one for each chunk of data) as measuring the “cost” of agreement for their respective chunks. Intuitively, they will start out small and grow incrementally to find the right “cost” for each worker to have consensus. Eventually, they will level out at an optimum cost.

So:

  • the primal residual plot measures the amount of disagreement; “small” values imply agreement
  • the dual residual plot measures the total “cost” of agreement; this increases until the correct cost is found

The plots then tell us the following:

  • the cost of agreement is higher for asynchronous algorithms, which makes sense because each worker is always working with a slightly out-of-date global parameter estimate, making consensus harder
  • blocked ADMM doesn’t update at all until shortly after 5 seconds have passed, whereas async has already had time to converge. (In practice with real data, we would probably specify that all workers need to report in every K updates).
  • asynchronous algorithms take a little while for the information to properly diffuse, but once that happens they converge quickly.
  • both asynchronous and synchronous converge almost immediately; this is most likely due to a high degree of homogeneity in the data (which was generated to fit the model well). Our next experiment should involve real world data.

What we could have done better

Analysis-wise, we expect richer results from performing this same experiment on a real-world data set that isn’t as homogeneous as the current toy dataset.

Performance-wise, we can get much better CPU saturation by doing two things:

  1. Not running our convergence diagnostics, or making them much faster
  2. Not running full np.mean computations over all of beta when we’ve only updated a few elements. Instead we should maintain a running aggregation of these results (see the sketch below).
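
For the second point, a minimal sketch of such a running aggregation (my own illustration, not code from this experiment): keep the sum of betas + u up to date as each chunk reports, rather than recomputing the full mean every time.

running_total = (betas + u).sum(axis=0)   # maintained across iterations

def incorporate(i, new_beta, new_u):
    # Swap chunk i's old contribution for its new one, then recompute
    # ztilde in O(p) instead of O(nchunks * p).
    global running_total
    running_total += (new_beta + new_u) - (betas[i] + u[i])
    betas[i], u[i] = new_beta, new_u
    return running_total / nchunks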

With these two changes (each of which is easy) we’re fairly confident that we can scale out to decently large clusters while still saturating hardware.

April 19, 2017 12:00 AM

April 18, 2017

Enthought

Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way

This is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

No dataset is perfect and most datasets that we have to deal with on a day-to-day basis have values missing, often represented by “NA” or “NaN”. One of the reasons why the Pandas library is as popular as it is in the data science community is because of its capabilities in handling data that contains NaN values.

But spending time looking up the relevant Pandas commands might be cumbersome when you are exploring raw data or prototyping your data analysis pipeline. This is one of the places where the Canopy Data Import Tool helps make data munging faster and easier, by simplifying the task of identifying missing values in your raw data and removing/replacing them.

Why are missing values a problem you ask? We can answer that question in the context of machine learning. scikit-learn and TensorFlow are popular and widely used libraries for machine learning in Python. Both of them caution the user about missing values in their datasets. Various machine learning algorithms expect all the input values to be numerical and to hold meaning. Both of the libraries suggest removing rows and/or columns that contain missing values.

If removing the missing values is not an option, given the size of your dataset, then they suggest replacing the missing values. The scikit-learn library provides an Imputer class, which can be used to replace missing values. See the scikit-learn documentation for an example of how the Imputer class is used. Similarly, the decode_csv function in the TensorFlow library can be passed a record_defaults argument, which will replace missing values in the dataset. See the TensorFlow documentation for specifics.

The Data Import Tool provides capabilities to handle missing values in your dataset because we strongly believe that discovering and handling missing values in your dataset is a part of the data import and cleaning phase and not the analysis phase of the data science process.

Digging into the specifics, here we’ll compare how you can go about handling missing values with three typical scenarios, first using the Pandas library, then contrasting with the Data Import Tool:

  1. Identifying missing values in data
  2. Replacing missing values in data, and
  3. Removing missing values from data.

Note: Pandas’ internal representation of your data is called a DataFrame. A DataFrame is simply a tabular data structure, similar to a spreadsheet or a SQL table.


Identifying Missing Values – The Hard Way: Using Pandas

If you are interested in identifying missing values in a row/column of a DataFrame, you need to understand the isnull, any, all methods on a DataFrame.
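
For the common case of NaN values, those methods combine roughly like this (a small made-up DataFrame for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', None]})

df.isnull()                    # element-wise boolean mask of missing values
df.isnull().any()              # which columns contain at least one missing value
df[df.isnull().any(axis=1)]    # rows that contain at least one missing value
df.isnull().all()              # columns that are entirely missing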

Taking a detour, we have so far described missing values as being represented by NA or NaN. Instead, what if the missing values in a column are values that aren’t of the same type as the rest of the cells in the column, say for example a string in a column containing integers? Detecting such values in Pandas is not trivial.

Identifying Missing Values – The Easy Way: Using the Data Import Tool


Highlighting null values using the Data Import Tool

Instead of giving you the column names and index values of the cells containing missing values, the Data Import Tool shows them to you. Simply checking the `Highlight Missing Values` checkbox in the bottom-left corner of the Data Import Tool will paint the DataFrame to show you the cells that contain missing values. Further, the Data Import Tool understands that your data file might have errors, like having a string value in a column otherwise containing integers. The Data Import Tool highlights the cell and displays the underlying content too.

The Data Import Tool can highlight missing value cells, helping you easily identify columns or rows containing NaN values



Replacing Missing Values – The Hard Way: Using Pandas

While Pandas does a great job at handling column operations even if the columns contain NaN values, our data analysis workflow might need us to replace the missing values in our data.

After spending a little time browsing through the Pandas documentation, you will come across the `fillna` method on a DataFrame, which can be used to replace missing values. The arguments you pass to the fillna method will determine what value the missing values in your DataFrame are replaced with and how the underlying column dtypes change after replacing the missing values.

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
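
For example, a few typical calls (values here are illustrative):

df['a'].fillna(0)                     # replace NaNs in one column with 0
df.fillna(value={'a': 0, 'b': ''})    # per-column replacement values
df.fillna(method='ffill')             # propagate the last valid observation forward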

Replacing Missing Values – The Easy Way: Using the Data Import Tool

With the Data Import Tool, you can replace missing values by right-clicking on the column containing missing values and selecting the appropriate Fill Missing Values item. Opting to replace missing values in the column with a specific value will open an additional dialog, prompting you to enter the value.

Fill missing values

Replace missing values in your DataFrame using the Canopy Data Import Tool


Removing Missing Values – The Hard Way: Using Pandas

While removing columns or rows containing missing values might be a little extreme, it might be necessary. Pandas suggests that you use the dropna method on the DataFrame to drop columns or rows that contain missing values. The arguments you pass to the dropna method will determine what rows/columns are removed from the DataFrame.

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
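
A few typical calls, again for illustration:

df.dropna()                    # drop rows containing any missing value
df.dropna(how='all')           # drop rows where every value is missing
df.dropna(axis=1, how='any')   # drop columns containing any missing value
df.dropna(thresh=2)            # keep rows with at least 2 non-missing values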

Removing Missing Values – The Easy Way: Using the Data Import Tool

With the Data Import Tool on the other hand, you can remove rows/columns containing missing values by selecting the “Delete Empty Columns” or “Delete Empty Rows” item from the “Transform” menu. An additional dialog will pop up asking you how lenient you want to be in removing rows/columns containing missing values – if you choose ‘any’, the Data Import Tool will remove rows/columns that contain any missing values; if you choose ‘all’, the Data Import Tool will only remove those rows/columns which contain only missing values.

Delete Empty Rows & Columns

Delete empty cells in rows/columns using the Canopy Data Import Tool

Delete Empty Columns

Choose to delete columns containing any null value or columns full of null values using the Canopy Data Import Tool

Finally, we have data that contains no missing values. So far, we’ve used the DIT to easily discover the missing values in our dataset and to remove/replace the missing values. Then, by clicking on ‘Use DataFrame’, you can import the dataset as a pandas DataFrame into the IPython workspace of the Canopy Editor. If you’re a data scientist, your data is now devoid of missing values and can be converted to arrays or variables and passed on to scikit-learn, TensorFlow or any other Machine Learning library of your choice.

Ready to try the Canopy Data Import Tool?

Download Canopy (free) and click on the icon to start a free trial of the Data Import Tool today

This is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way


Additional resources:

Watch a 2-minute demo video to see how the Canopy Data Import Tool works:

See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging:

The post Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way appeared first on Enthought Blog.

by Rahul Poruri at April 18, 2017 02:21 PM

Pierre de Buyl

A concise derivation of the Wiener-Khinchin theorem

While teaching a class on statistical physics, I found myself unhappy with textbook derivations of the Wiener-Khinchin theorem. I worked my way to a very short derivation that is free of integral bounds manipulations and of holes, or so I believe.

In either the books by Balakrishnan (Elements of Nonequilibrium Statistical Mechanics, Ane Books, 2008), Risken (The Fokker-Planck Equation, 2nd edition, Springer-Verlag, 1989), Coffey-Kalmykov-Waldron (The Langevin Equation, 2nd edition, World Scientific, 2004) or MathWorld, I could not find a short derivation that would cleanly take into account the averaging procedure or that would not resort to splittings of the domain of integration. I present here one derivation and a numerical illustration with Python.

by Pierre de Buyl at April 18, 2017 08:00 AM

Matthieu Brucher

Announcement: Audio TK 2.0.0

ATK is updated to 2.0.0 with a major refactoring to ensure signed/unsigned consistency, new Adaptive module and EQ design. Complex-valued filters are also now available to allow simultaneous dual channel processes and advanced filters like complex LMS filters.

Thanks to Travis CI and AppVeyor, binaries for the releases are now uploaded to GitHub. On all platforms we compile static and shared libraries. On Linux, builds with gcc 5, gcc 6, clang 3.8 and clang 3.9 are generated; on OS X, XCode 7 and XCode 8 builds are available as universal binaries; and on Windows, 32-bit and 64-bit builds, with dynamic or static runtime (no shared libraries in the static case), are also generated.

Download link: ATK 2.0.0

Changelog:
2.0.0
* Refactored fixed line delays (performance improvement)
* Allow new filters to have unconnected inputs (can only be changed inside a filter)
* Refactored the stereo universal delay line to allow more simultaneous channels (renamed to MultipleUniversalDelayLineFilter)
* ATK now allows complex-valued filters, with filters to convert between real and complex
* Added a BlockLMSFilter with Python wrappers
* Added a LMSFilter with Python wrappers
* Added a RemezBasedCoefficients with Python wrappers to be used with FIRFilter to generate a FIR filter from a template
* Added a RLSFilter with Python wrappers
* Support for IPP as a FFT backend
* Refactored the API for global unsigned consistency


by Matt at April 18, 2017 07:10 AM

April 17, 2017

Titus Brown

Workshops posted for DIBSI - July 10-15, July 17-21

As part of our Summer Institute in Data Intensive Biology, we will be running nine week-long computational workshops from July 10 to July 21 at the University of California, Davis.

Week 1: July 10-15

Week 2: July 17-21

All workshops will take place at UC Davis; please see the venue information for details.

Workshops may extend into the evening hours; please plan on devoting the entire time to the workshop. Workshops are $350/wk.

On-campus housing information is available for approximately $400/wk, which includes breakfast and dinner. Housing registration currently closes April 26th.

Registration links for each workshop are under the workshop description; housing is linked there as well, and must be booked separately. Attendees of both weeks of workshops may book housing for both weeks, and attendees of the two-week introductory bioinformatics workshop, ANGUS may book a full four weeks of housing.

For questions about registration, travel, invitation letters, or other general topics, please contact dibsi.training@gmail.com. For workshop specific questions, contact the instructors (e-mail links are under each workshop).

--titus

by C. Titus Brown at April 17, 2017 10:00 PM

April 13, 2017

Matthew Rocklin

Streaming Python Prototype

This work is supported by Continuum Analytics, and the Data Driven Discovery Initiative from the Moore Foundation.

This blogpost is about experimental software. The project may change or be abandoned without warning. You should not depend on anything within this blogpost.

This week I built a small streaming library for Python. This was originally an exercise to help me understand streaming systems like Storm, Flink, Spark-Streaming, and Beam, but the end result of this experiment is not entirely useless, so I thought I’d share it. This blogpost will talk about my experience building such a system and what I valued when using it. Hopefully it elevates interest in streaming systems among the Python community.

Background with Iterators

Python has sequences and iterators. We’re used to mapping, filtering and aggregating over lists and generators happily.
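
The snippets below assume a couple of tiny helpers along these lines (the original leaves them implicit); accumulate here is the streaming reduction from toolz:

from toolz import accumulate   # accumulate(binop, seq) yields running results

def inc(x):
    return x + 1

def iseven(x):
    return x % 2 == 0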

seq = [1, 2, 3, 4, 5]
seq = map(inc, seq)
seq = filter(iseven, seq)

>>> sum(seq) # 2 + 4 + 6
12

If these iterators are infinite, for example if they are coming from some infinite data feed like a hardware sensor or stock market signal then most of these pieces still work except for the final aggregation, which we replace with an accumulating aggregation.

def get_data():
    i = 0
    while True:
        i += 1
        yield i

seq = get_data()
seq = map(inc, seq)
seq = filter(iseven, seq)
seq = accumulate(lambda total, x: total + x, seq)

>>> next(seq)  # 2
2
>>> next(seq)  # 2 + 4
6
>>> next(seq)  # 2 + 4 + 6
12

This is usually a fine way to handle infinite data streams. However this approach becomes awkward if you don’t want to block on calling next(seq) and have your program hang until new data comes in. This approach also becomes awkward when you want to branch off your sequence to multiple outputs and consume from multiple inputs. Additionally there are operations like rate limiting, time windowing, etc. that occur frequently but are tricky to implement if you are not comfortable using threads and queues. These complications often push people to a computation model that goes by the name streaming.

To introduce streaming systems in this blogpost I’ll use my new tiny library, currently called streams (better name to come in the future). However if you decide to use streaming systems in your workplace then you should probably use some other more mature library instead. Common recommendations include the following:

  • ReactiveX (RxPy)
  • Flink
  • Storm (Streamparse)
  • Beam
  • Spark Streaming

Streams

We make a stream, which is an infinite sequence of data into which we can emit values and from which we can subscribe to make new streams.

from streams import Stream
source = Stream()

From here we replicate our example above. This follows the standard map/filter/reduce chaining API.

s = (source.map(inc)
           .filter(iseven)
           .accumulate(lambda total, x: total + x))

Note that we haven’t pushed any data into this stream yet, nor have we said what should happen when data leaves. So that we can look at results, let’s make a list and push data into it when data leaves the stream.

results = []
s.sink(results.append)  # call the append method on every element leaving the stream

And now let’s push some data in at the source and see it arrive at the sink:

>>> for x in [1, 2, 3, 4, 5]:
...     source.emit(x)

>>> results
[2, 6, 12]

We’ve accomplished the same result as our infinite iterator, except that rather than pulling data with next we push data through with source.emit. And we’ve done all of this at only a 10x slowdown over normal Python iterators :) (this library takes a few microseconds per element rather than CPython’s normal 100ns overhead).

This will get more interesting in the next few sections.

Branching

This approach becomes more interesting if we add multiple inputs and outputs.

from operator import add, sub

source = Stream()
s = source.map(inc)
evens = s.filter(iseven)
evens.accumulate(add)

odds = s.filter(isodd)
odds.accumulate(sub)

Or we can combine streams together

second_source = Stream()
s = combine_latest(second_source, odds).map(sum)

So you may have multiple different input sources updating at different rates and you may have multiple outputs, perhaps some going to a diagnostics dashboard, others going to long-term storage, others going to a database, etc.. A streaming library makes it relatively easy to set up infrastructure and pipe everything to the right locations.

Time and Back Pressure

When dealing with systems that produce and consume data continuously you often want to control the flow so that the rates of production are not greater than the rates of consumption. For example if you can only write data to a database at 10MB/s or if you can only make 5000 web requests an hour then you want to make sure that the other parts of the pipeline don’t feed you too much data, too quickly, which would eventually lead to a buildup in one place.

To deal with this, as our operations push data forward they also accept Tornado Futures as a receipt.

Upstream: Hey Downstream! Here is some data for you
Downstream: Thanks Upstream!  Let me give you a Tornado future in return.
            Make sure you don't send me any more data until that future
            finishes.
Upstream: Got it, Thanks!  I will pass this to the person who gave me the
          data that I just gave to you.

Under normal operation you don’t need to think about Tornado futures at all (many Python users aren’t familiar with asynchronous programming) but it’s nice to know that the library will keep track of balancing out flow. The code below uses the @gen.coroutine decorator and yield statements common to Tornado coroutines. This is similar to the async/await syntax in Python 3. Again, you can safely ignore it if you’re not familiar with asynchronous programming.

@gen.coroutine
def write_to_database(data):
    with connect('my-database:1234/table') as db:
        yield db.write(data)

source = Stream()
(source.map(...)
       .accumulate(...)
       .sink(write_to_database))  # <- sink produces a Tornado future

for data in infinite_feed:
    yield source.emit(data)       # <- that future passes through everything
                                  #    and ends up here to be waited on

There are also a number of operations to help you buffer flow in the right spots, control rate limiting, etc..

source = Stream()
(source.timed_window(interval=0.050)  # Capture all records of the last 50ms into batches
       .filter(len)                   # Remove empty batches
       .map(...)                      # Do work on each batch
       .buffer(10)                    # Allow ten batches to pile up here
       .sink(write_to_database))      # Potentially rate-limiting stage

I’ve written enough little utilities like timed_window and buffer to discover both that in a full system you would want more of these, and that they are easy to write. Here is the definition of timed_window:

class timed_window(Stream):
    def __init__(self, interval, child, loop=None):
        self.interval = interval
        self.buffer = []
        self.last = gen.moment

        Stream.__init__(self, child, loop=loop)
        self.loop.add_callback(self.cb)

    def update(self, x, who=None):
        self.buffer.append(x)
        return self.last

    @gen.coroutine
    def cb(self):
        while True:
            L, self.buffer = self.buffer, []
            self.last = self.emit(L)
            yield self.last
            yield gen.sleep(self.interval)

If you are comfortable with Tornado coroutines or asyncio then my hope is that this should feel natural.
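
As another example of how small these utilities are, here is what a buffer node might look like written in the same style (a sketch of my own, not necessarily the library's actual implementation):

from streams import Stream
from tornado import gen
from tornado.queues import Queue

class buffer(Stream):
    def __init__(self, n, child, loop=None):
        self.queue = Queue(maxsize=n)    # holds at most n pending elements
        Stream.__init__(self, child, loop=loop)
        self.loop.add_callback(self.cb)

    def update(self, x, who=None):
        # put() returns a future that resolves once there is room in the
        # queue, which is what applies back pressure to the upstream node
        return self.queue.put(x)

    @gen.coroutine
    def cb(self):
        while True:
            x = yield self.queue.get()
            yield self.emit(x)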

Recursion and Feedback

By connecting the sink of one stream to the emit function of another we can create feedback loops. Here is a stream that produces the Fibonacci sequence. To stop it from overwhelming our local process we added in a rate-limiting step:

from streams import Stream
source = Stream()
s = source.sliding_window(2).map(sum)
L = s.sink_to_list()  # store result in a list

s.rate_limit(0.5).sink(source.emit)  # pipe output back to input

source.emit(0)  # seed with initial values
source.emit(1)
>>> L
[1, 2, 3, 5]

>>> L  # wait a couple seconds, then check again
[1, 2, 3, 5, 8, 13, 21, 34]

>>> L  # wait a couple seconds, then check again
[1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

Note: due to the time rate-limiting functionality this example relied on an event loop running somewhere in another thread. This is the case for example in a Jupyter notebook, or if you have a Dask Client running.

Things that this doesn’t do

If you are familiar with streaming systems then you may say the following:

Lets not get ahead of ourselves; there’s way more to a good streaming system than what is presented here. You need to handle parallelism, fault tolerance, out-of-order elements, event/processing times, etc..

… and you would be entirely correct. What is presented here is not in any way a competitor to existing systems like Flink for production-level data engineering problems. There is a lot of logic that hasn’t been built here (and it’s good to remember that this project was built at night over a week).

Some of those things, though, and in particular the distributed computing bits, we may get for free.

Distributed computing

So, during the day I work on Dask, a Python library for parallel and distributed computing. The core task schedulers within Dask are more than capable of running these kinds of real-time computations. They handle far more complex real-time systems every day including few-millisecond latencies, node failures, asynchronous computation, etc.. People use these features today inside companies, but they tend to roll their own system rather than use a high-level API (indeed, they chose Dask because their system was complex enough or private enough that rolling their own was a necessity). Dask lacks any kind of high-level streaming API today.

Fortunately, the system we described above can be modified fairly easily to use a Dask Client to submit functions rather than run them locally.

from dask.distributed import Client
client = Client()       # start Dask in the background

(source.to_dask()
       .scatter()        # send data to a cluster
       .map(...)         # this happens on the cluster
       .accumulate(...)  # this happens on the cluster
       .gather()         # gather results back to local machine
       .sink(...))       # This happens locally

Other things that this doesn’t do, but could with modest effort

There are a variety of ways that we could improve this with modest cost:

  1. Streams of sequences: We can be more efficient if we pass not individual elements through a Stream, but rather lists of elements. This will let us lose the microseconds of overhead that we have now per element and let us operate at pure Python (100ns) speeds.
  2. Streams of NumPy arrays / Pandas dataframes: Rather than pass individual records we might pass bits of Pandas dataframes through the stream. So for example rather than filtering elements we would filter out rows of the dataframe. Rather than compute at Python speeds we can compute at C speeds. We’ve built a lot of this logic before for dask.dataframe. Doing this again is straightforward but somewhat time consuming.
  3. Annotate elements: we want to pass through event time, processing time, and presumably other metadata
  4. Convenient Data IO utilities: We would need some convenient way to move data in and out of Kafka and other common continuous data streams.

None of these things are hard. Many of them are afternoon or weekend projects if anyone wants to pitch in.

Reasons I like this project

This was originally built strictly for educational purposes. I (and hopefully you) now know a bit more about streaming systems, so I’m calling it a success. It wasn’t designed to compete with existing streaming systems, but still there are some aspects of it that I like quite a bit and want to highlight.

  1. Lightweight setup: You can import it and go without setting up any infrastructure. It can run (in a limited way) on a Dask cluster or on an event loop, but it’s also fully operational in your local Python thread. There is no magic in the common case. Everything up until time-handling runs with tools that you learn in an introductory programming class.
  2. Small and maintainable: The codebase is currently a few hundred lines. It is also, I claim, easy for other people to understand. Here is the code for filter:

    class filter(Stream):
        def __init__(self, predicate, child):
            self.predicate = predicate
            Stream.__init__(self, child)
    
        def update(self, x, who=None):
            if self.predicate(x):
                return self.emit(x)
    
  3. Composable with Dask: Handling distributed computing is tricky to do well. Fortunately this project can offload much of that worry to Dask. The dividing line between the two systems is pretty clear and, I think, could lead to a decently powerful and maintainable system if we spend time here.
  4. Low performance overhead: Because this project is so simple it has overheads in the few-microseconds range when in a single process.
  5. Pythonic: All other streaming systems were originally designed for Java/Scala engineers. While they have APIs that are clearly well thought through they are sometimes not ideal for Python users or common Python applications.

Future Work

This project needs both users and developers.

I find it fun and satisfying to work on and so encourage others to play around. The codebase is short and, I think, easily digestible in an hour or two.

This project was built without a real use case (see the project’s examples directory for a basic Daskified web crawler). It could use patient users with real-world use cases to test-drive things and hopefully provide PRs adding necessary features.

I genuinely don’t know if this project is worth pursuing. This blogpost is a test to see if people have sufficient interest to use and contribute to such a library or if the best solution is to carry on with any of the fine solutions that already exist.

pip install git+https://github.com/mrocklin/streams

April 13, 2017 12:00 AM

April 11, 2017

Enthought

Webinar- Get More From Your Core: Applying Artificial Intelligence to CT, Photo, and Well Log Analysis with Virtual Core

What: Presentation, demo, and Q&A with Brendon Hall, Geoscience Product Manager, Enthought

Who should watch this webinar:

  • Oil and gas industry professionals who are looking for ways to extract more value from expensive science wells
  • Those interested in learning how artificial intelligence and machine learning techniques can be applied to core analysis



Geoscientists and petroleum engineers rely on accurate core measurements to characterize reservoirs, develop drilling plans and de-risk play assessments. Whole-core CT scans are now routinely performed on extracted well cores; however, the data produced from these scans are difficult to visualize and integrate with other measurements.

Virtual Core automates aspects of core description for geologists, drastically reducing the time and effort required for core description, and its unified visualization interface displays cleansed whole-core CT data alongside core photographs and well logs. It provides tools for geoscientists to analyze core data and extract features from sub-millimeter scale to the entire core.

In this webinar and demo, we’ll start by introducing the Clear Core processing pipeline, which automatically removes unwanted artifacts (such as tubing) from the CT image. We’ll then show how the machine learning capabilities in Virtual Core can be used to describe the core, extracting features such as bedding planes and dip angle. Finally, we’ll show how the data can be viewed and analyzed alongside other core data, such as photographs, wellbore images, well logs, plug measurements, and more.

What You’ll Learn:

  • How core CT data, photographs, well logs, borehole images, and more can be integrated into a digital core workshop
  • How digital core data can shorten core description timelines and deliver business results faster
  • How new features can be extracted from digital core data using artificial intelligence
  • Novel workflows that leverage these features, such as identifying parasequences and strategies for determining net pay


Presenter:

Brendon Hall, Geoscience Product Manager and Application Engineer, Enthought

Additional Resources

Other Blogs and Articles on Virtual Core:

The post Webinar- Get More From Your Core: Applying Artificial Intelligence to CT, Photo, and Well Log Analysis with Virtual Core appeared first on Enthought Blog.

by Brendon Hall at April 11, 2017 01:00 PM

Matthieu Brucher

Audio Toolkit: Create a FIR Filter from a Template (EQ module)

Last week, I published a post on adaptive filtering. It was long overdue, but I actually had one other project on hold for even longer: allowing a user to specify a filter template and let Audio Toolkit figure out a FIR filter from this template.

Remez/Parks & McClellan algorithm

The most famous algorithm is the Remez/Parks & McClellan algorithm. In Matlab, it’s called remez, but Remez is actually a more generic algorithm than just FIR determination.

The algorithm starts by selecting a few random points on the template where the user set non-zero weights. The zero weights are usually the transition zones, which means that the filter can roam free in these sections. Usually, you don’t want to have them too big, especially in bandpass filters. As the resulting filter has ripples, this means that you can select the weight of each band in the template. Where the ripples should be small, use a big weight; where they don’t matter, use a small one.

Then, the Remez algorithm is all about moving these points to the maximum of the difference between the template and the actual filter. At the end, the result is an optimal filter around the given template, for a given order.

The determination of the result often rests on the selection of the starting points. If all starting points are in only one band, then the determination of the filter is wrong. As such, Audio Toolkit selects points that are equidistant so that all bands are covered. Of course, if one band is too small, then the determination will fail.
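
If you want to experiment with the same Parks & McClellan approach outside Audio Toolkit, SciPy ships an implementation. A rough low-pass template could look like this (the band edges, weights and number of taps here are made up for illustration and are not the values used in the demo below):

from scipy import signal

fs = 48000.0  # sampling rate in Hz
# Passband up to 4 kHz, transition zone 4-6 kHz, stopband above 6 kHz
taps = signal.remez(101, [0, 4000, 6000, fs / 2], [1, 0],
                    weight=[10, 1],   # heavier weight => smaller passband ripples
                    Hz=fs)

w, h = signal.freqz(taps, worN=2048)  # inspect the resulting transfer function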

Demo

There are many good papers on the Remez algorithm for FIR determination so I won’t take the time to rehash something that lots of people did far better than I could. But I’ll try to explain how it goes on a simple example, with the Python script that was used as the reference test case for the development of the plugin.

Instead of using the equidistant start, I used the set of indices coming from the paper I used (and same for the template). As such, the indices are:

[51 101 341 361 531 671 701 851]

After the optimization, we get the following error function:
Remez Iteration 1

The maximum error is 0.0325 in that case. The algorithm then selects new indices for the next iteration, at the minima and maxima of the current error function:

[  0 151 296 409 512 595 744 877]

From these indices, we compute the optimal parameters again and then get a new error function (notice that the highlighted points correspond to the previous min/max)

Remez Iteration 2

The maximum error is now 0.162. And we start the selection process again:

[  0 163 320 409 512 579 718 874]

Once again, we get a new error function:

Remez Iteration 3

The max error is a little bit bigger and is now 0.169. We select new indices:

[  0 163 320 409 512 579 718 874]

The indices are identical, so at the next iteration the search for the best filter stops.

The resulting filter has the following transfer function (the template is in red)

Estimated filter against template

Conclusion

There is finally a way of designing FIR filters in Audio Toolkit that doesn’t require you to go to Matlab or Python. This can be quite efficient for designing a linear-phase filter on the fly in a plugin. There is probably more work to be done in terms of optimization, but the processing part itself is already fully optimized.


by Matt at April 11, 2017 07:51 AM

April 10, 2017

Continuum Analytics news

Open Data Science is a Team Sport

Monday, April 10, 2017
Stephen Kearns
Continuum Analytics

As every March Madness fan knows, athletic talent and coaching are key, but it’s how they come together as a unit that determines a team’s success. Known for its drama-ridden storylines and endless buzzer beaters, the NCAA’s college basketball championship tournaments (both men’s and women’s) showcase the power of teamwork and dedication. Basketball is a team sport, where the interrelationships of complementary player skills often dictate the game’s winner. Everyone must focus and work together for the common good of the team. These same principles hold true for data science teams.

Much like basketball, data science requires a team of players in different positions, including business analysts, data scientists, data engineers, DevOps engineers, and more. However, too many data scientists still function in silos, each working with his/her own tools to manage data sets. Working individually doesn’t work on the court and it won’t work in data science. Data scientists, and their data science equipment, must function together to work as a team. That’s where Open Data Science comes in.

With Open Data Science, team members have their positions, but are able to move around the court and play with flexibility, like a basketball team. Everyone can also score in basketball, and similarly with Open Data Science, all team members, from data engineers to domain experts, are encouraged to contribute wherever their skills intersect with the goals of the project. In fact, in our recent survey of company decision leaders and data scientists, 69 percent of respondents associated “Open Data Science” with collaboration. No longer just a one-person job, data science is a team sport.

Open Data Science is an inclusive movement that not only encourages data scientists to function as a cohesive unit, but also embraces open source tools, so they can work together more easily in a connected ecosystem. Instead of pigeonholing data scientists into using a single language or set of tools, Open Data Science facilitates collaboration and enables data science teams to reap the benefits of all available technologies. Open Data Science brings innovation from every community together, making the latest information readily available to all. Collaboration helps enterprises harness their data faster and extract more value—so, don’t drop the ball with your organization’s data science strategy. Make it a true team effort.

Learn more about the Five Dysfunctions of a Data Science team in slides from my latest webinar below, or download the slides here.

 

The Five Dysfunctions of a Data Science Team from Continuum Analytics

by swebster at April 10, 2017 03:35 PM

April 07, 2017

Paul Ivanov

March 29th, 2017

What's missing -- feels like there's something missing --
The capacity is there -- the job's not stressful but
I somehow fail at the ignition stage - all this
fuel just sitting around -- un-utilized potential
How do I light that fire? Set it ablaze
in a daze caught up in the haze of comfort
I need to challenge myself, raising tides lift
all boats, but they also drown 
livestock
cows, horses, and goats, seeking refuge in hills
that once covered in grass now fill up like
lifeboats. 
Doctors in white coats 
say "Keep your spirits up" -- hope floats.

by Paul Ivanov at April 07, 2017 07:00 AM

April 04, 2017

Enthought

Enthought Presents the Canopy Platform at the 2017 American Institute of Chemical Engineers (AIChE) Spring Meeting

by: Tim Diller, Product Manager and Scientific Software Developer, Enthought

Last week I attended the AIChE (American Institute of Chemical Engineers) Spring Meeting in San Antonio, Texas. It was a great time of year to visit this cultural gem deep in the heart of Texas (and just down the road from our Austin offices), with plenty of good food, sights and sounds to take in on top of the conference and its sessions.

The AIChE Spring Meeting focuses on applications of chemical engineering in industry, and Enthought was invited to present a poster and deliver a “vendor perspective” talk on the Canopy Platform for Process Monitoring and Optimization as part of the “Big Data Analytics” track. This was my first time at AIChE, so some of the names were new, but in a lot of ways it felt very similar to many other engineering conferences I have participated in over the years (for instance, ASME (American Society of Mechanical Engineers), SAE (Society of Automotive Engineers), etc.).

This event underscored that regardless of industry, engineers are bringing the same kinds of practical ingenuity to bear on similar kinds of problems, and with the cost of data acquisition and storage plummeting in the last decade, many engineers are now sitting on more data than they know how to effectively handle.

What exactly is “big data”? Does it really matter for solving hard engineering problems?

One theme that came up time and again in the “Big Data Analytics” sessions Enthought participated in was what exactly “big data” is. In many circles, a good working definition of what makes data “big” is that it exceeds the size of the physical RAM on the machine doing the computation, so that something other than simply loading the data into memory has to be done to make meaningful computations, and thus a working definition of some tens of GB delimits “big” data from “small”.

For others, and many at the conference indeed, a more mundane definition of “big” means that the data set doesn’t fit within the row or column limits of a Microsoft Excel Worksheet.

But the question of whether your data is “big” is really a moot one as far as we at Enthought are concerned; really, being “big” just adds complexity to an already hard problem, and the kind of complexity is an implementation detail dependent on the details of the problem at hand.

And that relates to the central message of my talk, which was that an analytics platform (in this case I was talking about our Canopy Platform) should abstract away the tedious complexities, and help an expert get to the heart of the hard problem at hand.

At AIChE, the “hard problems” at hand seemed invariably to involve one or both of two things: (1) increasing safety/reliability, and (2) increasing plant output.

To solve these problems, two general kinds of activity were on display: different pattern recognition algorithms and tools, and modeling, typically through some kind of regression-based approach. Both of these things are straightforward in the Canopy Platform.

The Canopy Platform is a collection of related technologies that work together in an integrated way to support the scientist/analyst/engineer.

What is the Canopy Platform?

If you’re using Python for science or engineering, you have probably used or heard of Canopy, Enthought’s Python-based data analytics application offering an integrated code editor and interactive command prompt, package manager, documentation browser, debugger, variable browser, data import tool, and lots of hidden features like support for many kinds of proxy systems that work behind the scenes to make a seamless work environment in enterprise settings.

However, this is just one part of the Canopy Platform. Over the years, Enthought has been building other components and related technologies that work together in an integrated way to support the engineer/analyst/scientist solving hard problems.

At the center of this is the Enthought Python Distribution, with runtime interpreters for Python 2.7 and 3.x and over 450 pre-built Python packages for scientific computing, including tools for machine learning and the kind of regression modeling that was shown in some of the other presentations in the Big Data sessions. Other components of the Canopy Platform include interface modules for Excel (PyXLL) and for National Instruments’ LabVIEW software (Python Integration Toolkit for LabVIEW), among others.

A key component of our Canopy Platform is our Deployment Server, which simplifies the tricky tasks of deploying proprietary applications and packages or creating customized, reproducible Python environments inside an organization, especially behind a firewall or an air-gapped network.

Finally, (and this is what we were really showing off at the AIChE Big Data Analytics session) there are the Data Catalog and the Cloud Compute layers within the Canopy Platform.

The Data Catalog provides an indexed interface to potentially heterogeneous data sources, making them available for search and query based on various kinds of metadata.

The Data Catalog provides an indexed interface to potentially heterogeneous data sources. These can range from a simple network directory with a collection of HDF5 files to a server hosting files with the Byzantine complexity of the IRIG 106 Ch. 10 Digital Recorder Standard used by US military test flight ranges. The nice thing about the Data Catalog is that it lets you query and select data based on computed metadata, for example “factory A, on Tuesdays when Ethylene output was below 10kg/hr”, or in a test flight data example “test flights involving a T-38 that exceeded 10,000 ft but stayed subsonic.”

With the Cloud Compute layer, an expert user can write code and test it locally on some subset of data from the Data Catalog. Then, when it is working to satisfaction, he or she can publish the code as a computational kernel to run on some other, larger subset of the data in the Data Catalog, using remote compute resources, which might be an HPC cluster or an Apache Spark server. That kernel is then available to other users in the organization, who do not have to understand the algorithm to run it on other data queries.

In the demo below, I showed hooking up the Data Catalog to some historical factory data stored on a remote machine.

Data Catalog View: The Data Catalog allows selection of subsets of the data set for inspection and ad hoc analysis. Here, three channels are compared using a time window set on the time series data shown on the top plot.

Then using a locally tested and developed compute kernel, I did a principal component analysis on the frequencies of the channel data for a subset of the data in the Data Catalog. Then I published the kernel and ran it on the entire data set using the remote compute resource.

After the compute kernel has been published and run on the entire data set, then the result explorer tool enables further interactions.

Ultimately, the Canopy Platform is for building and distributing applications that solve hard problems.  Some of the products we have built on the platform are available today (for instance, Canopy Geoscience and Virtual Core), others are in prototype stage or have been developed for other companies with proprietary components and are not publicly available.

It was exciting to participate in the Big Data Analytics track this year, to see what others are doing in this area, and to be a part of many interesting and fruitful discussions. Thanks to Ivan Castillo and Chris Reed at Dow for arranging our participation.

The post Enthought Presents the Canopy Platform at the 2017 American Institute of Chemical Engineers (AIChE) Spring Meeting appeared first on Enthought Blog.

by Tim Diller at April 04, 2017 08:13 PM

Matthieu Brucher

Announcement: ATKChorus 1.1.0 and ATKUniversalVariableDelay 1.1.0

I’m happy to announce the update of the chorus and the universal variable delay based on the Audio Toolkit. They are available on Windows and OS X (min. 10.11) in different formats.

This release fixes the noises that can arise in some configurations.

ATKChorus
ATKUniversalVariableDelay

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKChorus.
Direct link for ATKUniversalVariableDelay.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.


by Matt at April 04, 2017 07:17 AM

April 03, 2017

numfocus

IBM Brings Jupyter and Spark to the Mainframe

NumFOCUS Platinum Sponsor IBM has been doing wonderful work to support one of our fiscally sponsored projects, Project Jupyter. Brian Granger over at the Jupyter Blog has the details… “For the past few years, Project Jupyter has been collaborating with IBM on a number of initiatives. Much of this work has happened in the Jupyter Incubation Program, […]

by NumFOCUS Staff at April 03, 2017 12:00 AM

March 30, 2017

Continuum Analytics news

Anaconda Leader to Speak at TDWI Accelerate Boston

Thursday, March 30, 2017

Chief Data Scientist and Co-Founder Travis Oliphant to Discuss the Power of the Python Ecosystem and Open Data Science

BOSTON, Mass.—March 30, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that Travis Oliphant, chief data scientist and co-founder, will be speaking at TDWI Accelerate Boston on April 4 at 1:30pm EST. As one of the leading conferences on Big Data and data science, Accelerate brings together the brightest and best data minds in the industry to discuss the future of data science and analytics.

In his session, titled “How to Leverage Python, the Fastest-Growing Open Source Tool for Data Scientists,” Oliphant will highlight the power of the Python ecosystem and Open Data Science. With thirteen million downloads and counting, Anaconda remains the most popular Python distribution. Oliphant will specifically address the burgeoning ecosystem forming around the Python and Anaconda combination, including new offerings such as Anaconda Cloud and conda-forge.

WHO: Travis Oliphant, chief data scientist and co-founder, Anaconda Powered By Continuum Analytics
WHAT: How to Leverage Python, the Fastest-Growing Open Source Tool for Data Scientists
WHEN: April 4, 1:30-2:10 p.m. EST
WHERE: The Boston Marriott, Copley Place, 110 Huntington Ave, Boston, MA 02116
REGISTER: HERE

###

About Anaconda Powered by Continuum Analytics

Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io.

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

by swebster at March 30, 2017 01:06 PM

Leonardo Uieda

Talk proposal for Scipy 2017: Bringing the Generic Mapping Tools to Python


This is the proposal for talk that I co-authored with Paul Wessel and submitted to Scipy 2017. Fingers crossed that they'll accept it! If not, this post can serve as an example of what not to do for next time :)

The submission is on the GenericMappingTools/scipy2017 Github repository. It covers our initial ideas about the GMT Python interface.

UPDATE (8 May 2017): The proposal was accepted!

UPDATE (11 May 2017): I've posted the reviews we got for the proposal along with some comments and replies.


Abstract

The Generic Mapping Tools (GMT) is an open-source software package widely used in the geosciences to process and visualize time series and gridded data. Maps generated by GMT are ubiquitous in scientific publications in areas such as seismology and oceanography. We present a new GMT Python wrapper library built by the GMT team. We will show the design plans, internal implementations, and demonstrate an initial prototype of the library. Our wrapper connects to the GMT C API using ctypes and allows input and output using data from numpy ndarrays and xarray Datasets. The library is still in early stages of design and implementation and we are eager for contributions and feedback from the Scipy community.

Extended Abstract

The Generic Mapping Tools (GMT) is an open-source software package widely used in the geosciences to process and visualize time series and gridded data. GMT is a command-line tool written in C that is able to generate high quality figures and maps using the Postscript format. Maps generated by GMT are ubiquitous in scientific publications in areas such as seismology and oceanography. GMT has a large, feature rich, mature, and well tested code base. It has benefited from over 28 years of development and heavy usage within the scientific community. It is no wonder that there have been at least three attempts to bridge the gap between GMT and Python: gmtpy, pygmt, and PyGMT. Of the three, only gmtpy has had any development activity since 2014 and is the only project that has documentation. gmtpy interfaces with GMT through subprocesses. It pipes standard input and output to the GMT command-line application. Piping has its limitations because all data are handled as text, making it difficult or impossible to pass binary data such as netCDF grids. On the Python side, the two main libraries for plotting data on maps are the matplotlib basemap toolkit and Cartopy. Both libraries rely on matplotlib as the backend for generating figures. Basemap is known to have its limitations (e.g., this post by Filipe Fernandes). Cartopy is a great improvement over basemap that fixes some of those limitations. However, Cartopy is still bound by the speed and memory usage constraints of matplotlib when it comes to very large and complex maps.

We present a new GMT Python wrapper library gmt-python built by the GMT team. We will show the design plans, internal implementations, and demonstrate an initial prototype of the library. Starting in version 5, GMT introduced a C API that is exposed through a shared library. The API allows access to GMT modules as C functions and provides mechanisms for input and output of binary data through shared memory. Our Python wrapper connects to this shared library using the ctypes standard library module. The wrapper code can thus be written in pure Python, which greatly simplifies packaging and distribution. Input and output will be handled using the GMT C API "virtual files" that allow access to shared memory between C and Python. We will implement a thin conversion layer between native scientific Python data types and the GMT internal data structures. Thus, we can accept input and produce output as numpy ndarrays and pandas DataFrames for tabular data and xarray Datasets for netCDF grids. Support for displaying figures inline in the Jupyter notebook is planned from the start by retrieving PNG previews of the Postscript figures. This will allow GMT to be seamlessly integrated into the rich scientific Python ecosystem.

Internally, the gmt-python library will contain low-level wrappers around the GMT C API functions and data structures. Users will interact with a higher-level API in which each GMT module is represented by a function. The module functions can take arguments as a single string representing the command-line arguments ("-Rg -JN180"), as keyword arguments (R='g', J='N180'), or as long-form aliases (region='g', projection='N180'). The Python interface relies on new features in GMT that are currently under development in the trunk of the SVN repository, mainly the "modern" execution mode, which greatly simplifies the building of Postscript figures and maps. We are working in close collaboration with the rest of the GMT core developers to make changes on the GMT side as we exercise the new C API and discover bugs and missing features.
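
To make the planned interface more concrete, here is a purely hypothetical sketch of calling one module; the package name, the choice of module, and the long-form aliases other than region and projection are placeholders rather than a finalized API.

import gmt  # hypothetical package name for the wrapper

# Three equivalent ways of passing the same options to a module function:
gmt.pscoast("-Rg -JN180 -Ggray")                         # command-line string
gmt.pscoast(R='g', J='N180', G='gray')                   # keyword arguments
gmt.pscoast(region='g', projection='N180', land='gray')  # long-form aliases ('land' is a placeholder)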

The capabilities of GMT go beyond the geosciences. It can produce high-quality line plots, bar graphs, histograms, and 3D surfaces. Significant barriers to entry for GMT have been the complexities of programming in bash and the many command-line options and their meanings. The Python wrapper library can serve as a backend for new and easier-to-use APIs, making GMT more accessible while retaining the high quality of figures. Work on gmt-python is still in the early stages of design and implementation. A prototype is not yet available but is expected to be ready in time for the conference in July. We are open and eager for contributions and feedback from the Scipy community.


Comments? Leave one below or let me know on Twitter @leouieda or in the Software Underground Slack group.

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!


March 30, 2017 12:00 PM

March 28, 2017

Mark Fenner

DC SVD II: From Values to Vectors

In our last installment, we discussed solutions to the secular equation. These solutions are the eigenvalues (and/or) singular values of matrices with a particular form. Since this post is otherwise light on technical content, I’ll dive into those matrix forms now. Setting up the Secular Equation to Solve the Eigen and Singular Problems In dividing-and-conquering […]

by Mark Fenner at March 28, 2017 02:11 PM

Matthieu Brucher

Audio Toolkit: Recursive Least Square Filter (Adaptive module)

I’ve started working on adaptive filtering a long time ago, but could never figure out why my simple implementation of the RLS algorithm failed. Well, there was a typo in the reference book!

Now that this is fixed, let’s see what this guy does.

Algorithm

The RLS algorithm learns an input signal based on its past and predicts new values from it. As such, it can be used to learn periodic signals, but also noise. The basis is to predict a new value based on the past, compare it to the actual value, and update the set of coefficients. The update itself is governed by a memory time constant: the higher the value, the slower the update.

Once the filter has learned enough, the learning stage can be shut off, and the filter can be used to select frequencies.
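
For readers who prefer code, here is a minimal NumPy sketch of the textbook RLS recursion described above; it is a generic illustration, not ATK's implementation, and the order and forgetting factor are arbitrary choices.

import numpy as np

def rls_update(x, d, w, P, lam=0.99):
    # x: the last `order` input samples, d: the current sample to predict,
    # w: coefficient vector, P: inverse correlation matrix, lam: forgetting factor
    y = w @ x                        # prediction from the past samples
    e = d - y                        # prediction error
    Px = P @ x
    k = Px / (lam + x @ Px)          # gain vector
    w = w + k * e                    # coefficient update
    P = (P - np.outer(k, Px)) / lam  # update the inverse correlation matrix
    return y, w, P

order = 10
w = np.zeros(order)
P = np.eye(order) * 1e3              # weak prior on the coefficients
signal = np.sin(2 * np.pi * 0.01 * np.arange(2000))
prediction = np.zeros_like(signal)
for n in range(order, len(signal)):
    x = signal[n - order:n][::-1]    # most recent samples first
    prediction[n], w, P = rls_update(x, signal[n], w, P)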

Results

Let’s start with a simple sinusoidal signal and see if an order-10 filter can be used to learn it:

Sinusoidal signal learnt with RLS

As can be seen, at the beginning the filter is still learning and doesn’t match the input. After a short time it does match (zooming in on the signal shows that there is a latency and that the amplitudes do not exactly match).

Let’s see how it does for more complex signals. Let’s add two additional, slightly out-of-tune sinusoids:

Three out-of-tune sinusoids learnt with RLS

Once again, after a short time, the learning phase is stable, and we can switch it off and the signal is estimated properly.

Let’s try now something a little bit more complex, and try to denoise an input signal.
Filtered noise

The original noise in blue is estimated in green, and the remainder noise is in red. Obviously, we don’t do a great job here, but let’s see what is actually attenuated:
Filtered noise in the spectral domain

So the middle of the bandwidth is better attenuated than the sides, which is expected in a way.

Now, what does that do to a signal we try to denoise?
Denoised signal

Obviously, the signal is denoised, but also increased! And the same happens in the spectral domain.
Denoised signal in the spectral domain

When looking at the estimated function, the picture is a little bit clearer:
Estimated spectral transfer function

Our noise is actually between 0.6 and 1.2 rad/s (from sampling frequency/10 to sampling frequency/5), and the RLS filter underestimates these a little bit but doesn’t cut the high frequencies, which can lead to ringing…

Also, learning the noise is itself quite costly:
Learning cost

Learning was only activated during half the total processing time…

Conclusion

RLS filters are interesting for following a signal. Obviously this filter is just the start of this new module, and I hope I’ll have real denoising filters at some point.

This filter will be available in ATK 2.0.0 and is already in the develop branch with the Python example scripts.


by Matt at March 28, 2017 07:34 AM

Matthew Rocklin

Dask and Pandas and XGBoost

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

This post talks about distributing Pandas Dataframes with Dask and then handing them over to distributed XGBoost for training.

More generally it discusses the value of launching multiple distributed systems in the same shared-memory processes and smoothly handing data back and forth between them.

Introduction

XGBoost is a well-loved library for a popular class of machine learning algorithms, gradient boosted trees. It is used widely in business and is one of the most popular solutions in Kaggle competitions. For larger datasets or faster training, XGBoost also comes with its own distributed computing system that lets it scale to multiple machines on a cluster. Fantastic. Distributed gradient boosted trees are in high demand.

However before we can use distributed XGBoost we need to do three things:

  1. Prepare and clean our possibly large data, probably with a lot of Pandas wrangling
  2. Set up XGBoost master and workers
  3. Hand our cleaned data from a bunch of distributed Pandas dataframes to XGBoost workers across our cluster

This ends up being surprisingly easy. This blogpost gives a quick example using Dask.dataframe to do distributed Pandas data wrangling, then using a new dask-xgboost package to set up an XGBoost cluster inside the Dask cluster and perform the handoff.

After this example we’ll talk about the general design and what this means for other distributed systems.

Example

We have a ten-node cluster with eight cores each (m4.2xlarges on EC2)

import dask
from dask.distributed import Client, progress

>>> client = Client('172.31.33.0:8786')
>>> client.restart()
<Client: scheduler='tcp://172.31.33.0:8786' processes=10 cores=80>

We load the Airlines dataset using dask.dataframe (just a bunch of Pandas dataframes spread across a cluster) and do a bit of preprocessing:

import dask.dataframe as dd

# Subset of the columns to use
cols = ['Year', 'Month', 'DayOfWeek', 'Distance',
        'DepDelay', 'CRSDepTime', 'UniqueCarrier', 'Origin', 'Dest']

# Create the dataframe
df = dd.read_csv('s3://dask-data/airline-data/20*.csv', usecols=cols,
                  storage_options={'anon': True})

df = df.sample(frac=0.2) # XGBoost requires a bit of RAM, we need a larger cluster

is_delayed = (df.DepDelay.fillna(16) > 15)  # column of labels
del df['DepDelay']  # Remove delay information from training dataframe

df['CRSDepTime'] = df['CRSDepTime'].clip(upper=2399)

df, is_delayed = dask.persist(df, is_delayed)  # start work in the background

This loaded a few hundred pandas dataframes from CSV data on S3. We then had to downsample because the way we are going to use XGBoost below seems to require a fair amount of RAM. I am not an XGBoost expert. Please forgive my ignorance here. At the end we have two dataframes:

  • df: Data from which we will learn if flights are delayed
  • is_delayed: Whether or not those flights were delayed.

Data scientists familiar with Pandas will probably be familiar with the code above. Dask.dataframe is very similar to Pandas, but operates on a cluster.

>>> df.head()
        Year  Month  DayOfWeek  CRSDepTime UniqueCarrier Origin Dest  Distance
182193  2000      1          2         800            WN    LAX  OAK       337
83424   2000      1          6        1650            DL    SJC  SLC       585
346781  2000      1          5        1140            AA    ORD  LAX      1745
375935  2000      1          2        1940            DL    PHL  ATL       665
309373  2000      1          4        1028            CO    MCI  IAH       643
>>> is_delayed.head()
182193    False
83424     False
346781    False
375935    False
309373    False
Name: DepDelay, dtype: bool

Categorize and One Hot Encode

XGBoost doesn’t want to work with text data like destination="LAX". Instead we create new indicator columns for each of the known airports and carriers. This expands our data into many boolean columns. Fortunately Dask.dataframe has convenience functions for all of this baked in (thank you Pandas!).

>>> df2 = dd.get_dummies(df.categorize()).persist()

This expands our data out considerably, but makes it easier to train on.

>>> len(df2.columns)
685

Split and Train

Great, now we’re ready to split our distributed dataframes

data_train, data_test = df2.random_split([0.9, 0.1],
                                         random_state=1234)
labels_train, labels_test = is_delayed.random_split([0.9, 0.1],
                                                    random_state=1234)

Start up a distributed XGBoost instance, and train on this data

%%time
import dask_xgboost as dxgb

params = {'objective': 'binary:logistic', 'nround': 1000,
          'max_depth': 16, 'eta': 0.01, 'subsample': 0.5,
          'min_child_weight': 1, 'tree_method': 'hist',
          'grow_policy': 'lossguide'}

bst = dxgb.train(client, params, data_train, labels_train)

CPU times: user 355 ms, sys: 29.7 ms, total: 385 ms
Wall time: 54.5 s

Great, so we were able to train an XGBoost model on this data in about a minute using our ten machines. What we get back is just a plain XGBoost Booster object.

>>> bst
<xgboost.core.Booster at 0x7fa1c18c4c18>

We could use this on normal Pandas data locally

import xgboost as xgb
pandas_df = data_test.head()
dtest = xgb.DMatrix(pandas_df)

>>> bst.predict(dtest)
array([ 0.464578  ,  0.46631625,  0.47434333,  0.47245741,  0.46194169], dtype=float32)

Or we can use dask-xgboost again to predict on our distributed holdout data, getting back another Dask series.

>>> predictions = dxgb.predict(client, bst, data_test).persist()
>>> predictions
Dask Series Structure:
npartitions=93
None    float32
None        ...
         ...
None        ...
None        ...
Name: predictions, dtype: float32
Dask Name: _predict_part, 93 tasks

Evaluate

We can bring these predictions to the local process and use normal Scikit-learn operations to evaluate the results.

>>> from sklearn.metrics import roc_auc_score, roc_curve
>>> print(roc_auc_score(labels_test.compute(),
...                     predictions.compute()))
0.654800768411
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(labels_test.compute(), predictions.compute())
# Plotting code adapted from
# http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
plt.figure(figsize=(8, 8))
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

We might want to play with our parameters above or try different data to improve our solution. The point here isn’t that we predicted airline delays well, it was that if you are a data scientist who knows Pandas and XGBoost, everything we did above seemed pretty familiar. There wasn’t a whole lot of new material in the example above. We’re using the same tools as before, just at a larger scale.

Analysis

OK, now that we’ve demonstrated that this works, let’s talk a bit about what just happened and what that means generally for cooperation between distributed services.

What dask-xgboost does

The dask-xgboost project is pretty small and pretty simple (200 TLOC). Given a Dask cluster of one central scheduler and several distributed workers it starts up an XGBoost scheduler in the same process running the Dask scheduler and starts up an XGBoost worker within each of the Dask workers. They share the same physical processes and memory spaces. Dask was built to support this kind of situation, so this is relatively easy.

Then we ask the Dask.dataframe to fully materialize in RAM and we ask where all of the constituent Pandas dataframes live. We tell each Dask worker to give all of the Pandas dataframes that it has to its local XGBoost worker and then just let XGBoost do its thing. Dask doesn’t power XGBoost; it just sets it up, gives it data, and lets it do its work in the background.
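
As a rough illustration of this co-location (this is not the dask-xgboost source; the function below is only a stand-in), dask.distributed's Client.run executes a function inside every worker process, which is the kind of hook that lets a second system start its own workers inside those same processes:

from dask.distributed import Client

client = Client('172.31.33.0:8786')   # scheduler address from the example above

def worker_pid():
    import os
    return os.getpid()

# One entry per Dask worker process; an XGBoost worker started this way would
# live in exactly these processes and share their memory.
print(client.run(worker_pid))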

People often ask what machine learning capabilities Dask provides and how they compare with other distributed machine learning libraries like H2O or Spark’s MLLib. For gradient boosted trees the 200-line dask-xgboost package is the answer. Dask has no need to make such an algorithm because XGBoost already exists, works well, and provides Dask users with a fully featured and efficient solution.

Because both Dask and XGBoost can live in the same Python process they can share bytes between each other without cost, can monitor each other, and so on. These two distributed systems co-exist together in multiple processes in the same way that NumPy and Pandas operate together within a single process. Sharing distributed processes with multiple systems can be really beneficial if you want to use multiple specialized services easily and avoid large monolithic frameworks.

Connecting to Other distributed systems

A while ago I wrote a similar blogpost about hosting TensorFlow from Dask in exactly the same way that we’ve done here. It was similarly easy to set up TensorFlow alongside Dask, feed it data, and let TensorFlow do its thing.

Generally speaking this “serve other libraries” approach is how Dask operates when possible. We’re only able to cover the breadth of functionality that we do today because we lean heavily on the existing open source ecosystem. Dask.arrays use Numpy arrays, Dask.dataframes use Pandas, and now the answer to gradient boosted trees with Dask is just to make it really really easy to use distributed XGBoost. Ta da! We get a fully featured solution that is maintained by other devoted developers, and the entire connection process was done over a weekend (see dmlc/xgboost #2032 for details).

Since this has come out we’ve had requests to support other distributed systems like Elemental and to do general hand-offs to MPI computations. If we’re able to start both systems with the same set of processes then all of this is pretty doable. Many of the challenges of inter-system collaboration go away when you can hand numpy arrays between the workers of one system to the workers of the other system within the same processes.

Acknowledgements

Thanks to Tianqi Chen and Olivier Grisel for their help when building and testing dask-xgboost. Thanks to Will Warner for his help in editing this post.

March 28, 2017 12:00 AM

March 23, 2017

Continuum Analytics news

The Conda Configuration Engine for Power Users

Tuesday, April 4, 2017
Kale Franz
Continuum Analytics

Released last fall, conda 4.2 brought with it configuration superpowers. The capabilities are extensive, and they're designed with conda power users, devops engineers, and sysadmins in mind.

Configuration information comes from four basic sources:

  1. hard-coded defaults,
  2. configuration files,
  3. environment variables, and
  4. command-line arguments.

Each time a conda process initializes, an operating context is built that in a cascading fashion merges configuration sources. Command-line arguments hold the highest precedence, and hard-coded defaults the lowest.

The configuration file search path has been dramatically expanded. In order from lowest to highest priority, and directly from the conda code,

SEARCH_PATH = ( 
    '/etc/conda/.condarc', 
    '/etc/conda/condarc', 
    '/etc/conda/condarc.d/', 
    '/var/lib/conda/.condarc', 
    '/var/lib/conda/condarc', 
    '/var/lib/conda/condarc.d/', 
    '$CONDA_ROOT/.condarc', 
    '$CONDA_ROOT/condarc', 
    '$CONDA_ROOT/condarc.d/', 
    '~/.conda/.condarc', 
    '~/.conda/condarc', 
    '~/.conda/condarc.d/', 
    '~/.condarc', 
    '$CONDA_PREFIX/.condarc', 
    '$CONDA_PREFIX/condarc', 
    '$CONDA_PREFIX/condarc.d/', 
    '$CONDARC', 
)

where environment variables and user home directory are expanded on first use. $CONDA_ROOT is automatically set to the root environment prefix (and shouldn't be set by users), and $CONDA_PREFIX is automatically set for activated conda environments. Thus, conda environments can have their own individualized and customized configurations. For the ".d" directories in the search path, conda will read in sorted order any (and only) files ending with .yml or .yaml extensions. The $CONDARC environment variable can be any path to a file having a .yml or .yaml extension, or containing "condarc" in the file name; it can also be a directory.

Environment variables hold second-highest precedence, and all configuration parameters can be specified as environment variables. To convert from the condarc file-based configuration parameter name to the environment variable parameter name, make the name all uppercase and prepend CONDA_. For example, conda's always_yes configuration parameter can be specified using a CONDA_ALWAYS_YES environment variable.
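
As a quick illustration (assuming conda is available on the PATH), the same parameter can be set for a single scripted call through its environment-variable form:

import os
import subprocess

# CONDA_ALWAYS_YES is the environment-variable spelling of always_yes
env = dict(os.environ, CONDA_ALWAYS_YES="true")
subprocess.run(["conda", "install", "numpy"], env=env)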

Configuration parameters in some cases have aliases. For example, setting always_yes: true or yes: true in a configuration file is equivalent to the command-line flag --yes. They're all also equivalent to both CONDA_ALWAYS_YES=true and CONDA_YES=true environment variables. A validation error is thrown if multiple parameters aliased to each other are specified within a single configuration source.

There are three basic configuration parameter types: primitive, map, and sequence. Each follows a slightly different set of merge rules.

The primitive configuration parameter is the easiest to merge. Within the linearized chain of information sources, the last source that sets the parameter wins. There is one caveat: if the parameter is trailed by a #!final flag, the merge cascade stops for that parameter. (Indeed, the markup concept is borrowed from the !important rule in CSS.) While still giving end-users extreme flexibility in most cases, we also give sysadmins the ability to lock down as much configuration as needed by making files read-only.

Map configuration parameters have elements that are key-value pairs. Merges are at the per-key level. Given two files with the contents

# file: /etc/conda/condarc.d/proxies.yml
proxy_servers:
  https: http://prod-proxy

# file: ~/.conda/condarc.d/proxies.yml
proxy_servers: 
  http: http://dev-proxy:1080 
  https: http://dev-proxy:1081

the merged proxy_servers configuration will be

proxy_servers: 
  http: http://dev-proxy:1080 
  https: http://dev-proxy:1081

However, by modifying the contents of the first file to be

# file: /etc/conda/condarc.d/proxies.yml
proxy_servers:
  https: http://prod-proxy #!final

the merged settings will be

proxy_servers: 
  http: http://dev-proxy:1080 
  https: http://prod-proxy

Note the use of the !final flag acts at the per-key level. A !final flag can also be set for the parameter as a whole. With the first file again changed to

# file: /etc/conda/condarc.d/proxies.yml
proxy_servers: #!final 
  https: http://prod-proxy

the merged settings will be

proxy_servers: 
  https: http://prod-proxy

with no http key defined.

The sequence parameter merges are the most involved. Consider contents of the three files

# file: /etc/conda/condarc
channels: 
  - one 
  - two

# file: ~/.condarc
channels: 
  - three 
  - four

# file: $CONDA_PREFIX/.condarc
channels: 
  - five 
  - six

the final merged configuration will be

channels: 
  - five 
  - six 
  - three 
  - four 
  - one 
  - two

Sequence order within each individual configuration source is preserved, while still respecting sources' overall precedence. Just like map parameters, a !final flag can be used for a sequence parameter as a whole. However, the !final flag does not apply to individual elements of sequence parameters, and instead !top and !bottom flags are available. Modifying the sequence example to the following

# file: /etc/conda/condarc
channels: 
  - one #!top 
  - two

# file: ~/.condarc
channels: #!final 
  - three 
  - four #!bottom

# file: $CONDA_PREFIX/.condarc
channels: 
  - five 
  - six

will yield a final merged configuration

channels: 
  - one 
  - three 
  - two 
  - four

Managing all of these new sources of configuration could become difficult without some new tools. The most basic is conda config --validate, which simply exits 0 if conda's configured state passes all validation tests. The command conda config --describe (recently added in 4.3.16) gives a detailed description of available configuration parameters.

We've also added the commands conda config --show-sources and conda config --show. The first displays all of the configuration information conda recognizes--in its non-merged form broken out per source. The second gives the final, merged values for all configuration parameters.

Conda's configuration engine gives power users tools for ultimate control. If you've read to this point, that's probably you. And as a conda power user, please consider participating in the conda canary program. Be on the cutting edge, and also help influence new conda features and behaviors before they're solidified in general availability releases.

by swebster at March 23, 2017 02:13 PM

Matthew Rocklin

Dask Release 0.14.1

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.14.1. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on February 27th.

As always you can conda install from conda-forge

conda install -c conda-forge dask distributed

or you can pip install from PyPI

pip install dask[complete] --upgrade

Arrays

Recent work in distributed computing and machine learning have motivated new performance-oriented and usability changes to how we handle arrays.

Automatic chunking and operation on NumPy arrays

Many interactions between Dask arrays and NumPy arrays work smoothly. NumPy arrays are made lazy and are appropriately chunked to match the operation and the Dask array.

>>> x = np.ones(10)                 # a numpy array
>>> y = da.arange(10, chunks=(5,))  # a dask array
>>> z = x + y                       # combined become a dask.array
>>> z
dask.array<add, shape=(10,), dtype=float64, chunksize=(5,)>

>>> z.compute()
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

Reshape

Reshaping distributed arrays is simple in simple cases, and can be quite complex in complex cases. Reshape now supports a much broader set of shape transformations where any dimension is collapsed or merged to other dimensions.

>>> x = da.ones((2, 3, 4, 5, 6), chunks=(2, 2, 2, 2, 2))
>>> x.reshape((6, 2, 2, 30, 1))
dask.array<reshape, shape=(6, 2, 2, 30, 1), dtype=float64, chunksize=(3, 1, 2, 6, 1)>

This operation ends up being quite useful in a number of distributed array cases.

Optimize Slicing to Minimize Communication

Dask.array slicing optimizations are now careful to produce graphs that avoid situations that could cause excess inter-worker communication. The details of how they do this is a bit out of scope for a short blogpost, but the history here is interesting.

Historically dask.arrays were used almost exclusively by researchers with large on-disk arrays stored as HDF5 or NetCDF files. These users primarily used the single machine multi-threaded scheduler. We heavily tailored Dask array optimizations to this situation and made that community pretty happy. Now as some of that community switches to cluster computing on larger datasets the optimization goals shift a bit. We have tons of distributed disk bandwidth but really want to avoid communicating large results between workers. Supporting both use cases is possible and I think that we’ve achieved that in this release so far, but it’s starting to require increasing levels of care.

Micro-optimizations

With distributed computing also comes larger graphs and a growing importance of graph-creation overhead. This has been optimized somewhat in this release. We expect this to be a focus going forward.

DataFrames

Set_index

Set_index is smarter in two ways:

  1. If you set_index on a column that happens to be sorted then we’ll identify that and avoid a costly shuffle. This was always possible with the sorted= keyword but users rarely used this feature. Now this is automatic (see the sketch after this list).
  2. Similarly when setting the index we can look at the size of the data and determine if there are too many or too few partitions and rechunk the data while shuffling. This can significantly improve performance if there are too many partitions (a common case).
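
A minimal sketch of the first point; the file pattern and column name below are hypothetical:

import dask.dataframe as dd

df = dd.read_csv('data-*.csv')   # hypothetical CSV files with a pre-sorted 'timestamp' column

# Before: you had to assert sortedness yourself to avoid a full shuffle
indexed = df.set_index('timestamp', sorted=True)

# Now: Dask detects that the column is already sorted and skips the shuffle
indexed = df.set_index('timestamp')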

Shuffle performance

We’ve micro-optimized some parts of dataframe shuffles. Big thanks to the Pandas developers for the help here. This accelerates set_index, joins, groupby-applies, and so on.

Fastparquet

The fastparquet library has seen a lot of use lately and has undergone a number of community bugfixes.

Importantly, Fastparquet now supports Python 2.

We strongly recommend Parquet as the standard data storage format for Dask dataframes (and Pandas DataFrames).

dask/fastparquet #87

Distributed Scheduler

Replay remote exceptions

Debugging is hard in part because exceptions happen on remote machines where normal debugging tools like pdb can’t reach. Previously we were able to bring back the traceback and exception, but you couldn’t dive into the stack trace to investigate what went wrong:

def div(x, y):
    return x / y

>>> future = client.submit(div, 1, 0)
>>> future
<Future: status: error, key: div-4a34907f5384bcf9161498a635311aeb>

>>> future.result()  # getting result re-raises exception locally
<ipython-input-3-398a43a7781e> in div()
      1 def div(x, y):
----> 2     return x / y

ZeroDivisionError: division by zero

Now Dask can bring a failing task and all necessary data back to the local machine and rerun it so that users can leverage the normal Python debugging toolchain.

>>> client.recreate_error_locally(future)
<ipython-input-3-398a43a7781e> in div(x, y)
      1 def div(x, y):
----> 2     return x / y
ZeroDivisionError: division by zero

Now if you’re in IPython or a Jupyter notebook you can use the %debug magic to jump into the stacktrace, investigate local variables, and so on.

In [8]: %debug
> <ipython-input-3-398a43a7781e>(2)div()
      1 def div(x, y):
----> 2     return x / y

ipdb> pp x
1
ipdb> pp y
0

dask/distributed #894

Async/await syntax

Dask.distributed uses Tornado for network communication and Tornado coroutines for concurrency. Normal users rarely interact with Tornado coroutines; they aren’t familiar to most people so we opted instead to copy the concurrent.futures API. However some complex situations are much easier to solve if you know a little bit of async programming.

Fortunately, the Python ecosystem seems to be embracing this change towards native async code with the async/await syntax in Python 3. In an effort to motivate people to learn async programming and to gently nudge them towards Python 3, Dask.distributed now supports async/await in a few cases.

You can wait on a dask Future

async def f():
    future = client.submit(func, *args, **kwargs)
    result = await future

You can put the as_completed iterator into an async for loop

async for future in as_completed(futures):
    result = await future
    ... do stuff with result ...

And, because Tornado supports the await protocols you can also use the existing shadow concurrency API (everything prepended with an underscore) with await. (This was doable before.)

results = client.gather(futures)         # synchronous
...
results = await client._gather(futures)  # asynchronous

If you’re in Python 2 you can always do this with normal yield and the tornado.gen.coroutine decorator.

dask/distributed #952

Inproc transport

In the last release we enabled Dask to communicate over more things than just TCP. In practice this doesn’t come up (TCP is pretty useful). However in this release we now support single-machine “clusters” where the clients, scheduler, and workers are all in the same process and transfer data cost-free over in-memory queues.

This allows the in-memory user community to use some of the more advanced features (asynchronous computation, spill-to-disk support, web-diagnostics) that are only available in the distributed scheduler.

This is on by default if you create a cluster with LocalCluster without using Nanny processes.

>>> from dask.distributed import LocalCluster, Client

>>> cluster = LocalCluster(nanny=False)

>>> client = Client(cluster)

>>> client
<Client: scheduler='inproc://192.168.1.115/8437/1' processes=1 cores=4>

>>> from threading import Lock         # Not serializable
>>> lock = Lock()                      # Won't survive going over a socket
>>> [future] = client.scatter([lock])  # Yet we can send to a worker
>>> future.result()                    # ... and back
<unlocked _thread.lock object at 0x7fb7f12d08a0>

dask/distributed #919

Connection pooling for inter-worker communications

Workers now maintain a pool of sustained connections between each other. This pool is of a fixed size and removes connections with a least-recently-used policy. It avoids re-connection delays when transferring data between workers. In practice this shaves off a millisecond or two from every communication.

This is actually a revival of an old feature that we had turned off last year when it became clear that the performance here wasn’t a problem.

Along with other enhancements, this takes our round-trip latency down to 11ms on my laptop.

In [10]: %%time
    ...: for i in range(1000):
    ...:     future = client.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 4.96 s, sys: 348 ms, total: 5.31 s
Wall time: 11.1 s

There may be room for improvement here though. For comparison here is the same test with concurrent.futures.ProcessPoolExecutor.

In [14]: e = ProcessPoolExecutor(8)

In [15]: %%time
    ...: for i in range(1000):
    ...:     future = e.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 320 ms, sys: 56 ms, total: 376 ms
Wall time: 442 ms

Also, just to be clear, this measures total roundtrip latency, not overhead. Dask’s distributed scheduler overhead remains in the low hundreds of microseconds.

dask/distributed #935

There has been activity around Dask and machine learning:

  • dask-learn is undergoing some performance enhancements. It turns out that when you offer distributed grid search people quickly want to scale up their computations to hundreds of thousands of trials.
  • dask-glm now has a few decent algorithms for convex optimization. The authors of this wrote a blogpost very recently if you’re interested: Developing Convex Optimization Algorithms in Dask
  • dask-xgboost lets you hand off distributed data in Dask dataframes or arrays and hand it directly to a distributed XGBoost system (that Dask will nicely set up and tear down for you). This was a nice example of easy hand-off between two distributed services running in the same processes.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.0 release on February 27th

  • Antoine Pitrou
  • Brian Martin
  • Elliott Sales de Andrade
  • Erik Welch
  • Francisco de la Peña
  • jakirkham
  • Jim Crist
  • Jitesh Kumar Jha
  • Julien Lhermitte
  • Martin Durant
  • Matthew Rocklin
  • Markus Gonser
  • Talmaj

The following people contributed to the dask/distributed repository since the 1.16.0 release on February 27th

  • Antoine Pitrou
  • Ben Schreck
  • Elliott Sales de Andrade
  • Martin Durant
  • Matthew Rocklin
  • Phil Elson

March 23, 2017 12:00 AM

numfocus

PyData Atlanta Meetup Celebrates 1 Year and over 1,000 members

PyData Atlanta holds a meetup at MailChimp, where Jim Crozier spoke about analyzing NFL data with PySpark. Atlanta tells a new story about data by Rob Clewley In late 2015, the three of us (Tony Fast, Neel Shivdasani, and myself) had been regularly  nerding out about data over beers and becoming fast friends. We were […]

by NumFOCUS Staff at March 23, 2017 12:00 AM

March 22, 2017

Matthew Rocklin

Developing Convex Optimization Algorithms in Dask

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

We build distributed optimization algorithms with Dask.  We show both simple examples and also benchmarks from a nascent dask-glm library for generalized linear models.  We also talk about the experience of learning Dask to do this kind of work.

This blogpost is co-authored by Chris White (Capital One) who knows optimization and Matthew Rocklin (Continuum Analytics) who knows distributed computing.

Introduction

Many machine learning and statistics models (such as logistic regression) depend on convex optimization algorithms like Newton’s method, stochastic gradient descent, and others.  These optimization algorithms are both pragmatic (they’re used in many applications) and mathematically interesting.  As a result these algorithms have been the subject of study by researchers and graduate students around the world for years both in academia and in industry.

Things got interesting about five or ten years ago when datasets grew beyond the size of working memory and “Big Data” became a buzzword.  Parallel and distributed solutions for these algorithms have become the norm, and a researcher’s skillset now has to extend beyond linear algebra and optimization theory to include parallel algorithms and possibly even network programming, especially if you want to explore and create more interesting algorithms.

However, relatively few people understand both mathematical optimization theory and the details of distributed systems. Typically algorithmic researchers depend on the APIs of distributed computing libraries like Spark or Flink to implement their algorithms. In this blogpost we explore the extent to which Dask can be helpful in these applications. We approach this from two perspectives:

  1. Algorithmic researcher (Chris): someone who knows optimization and iterative algorithms like Conjugate Gradient, Dual Ascent, or GMRES but isn’t so hot on distributed computing topics like sockets, MPI, load balancing, and so on
  2. Distributed systems developer (Matt): someone who knows how to move bytes around and keep machines busy but doesn’t know the right way to do a line search or handle a poorly conditioned matrix

Prototyping Algorithms in Dask

Given knowledge of algorithms and of NumPy array computing it is easy to write parallel algorithms with Dask. For a range of complicated algorithmic structures we have two straightforward choices:

  1. Use parallel multi-dimensional arrays to construct algorithms from common operations like matrix multiplication, SVD, and so on. This mirrors mathematical algorithms well but lacks some flexibility.
  2. Create algorithms by hand that track operations on individual chunks of in-memory data and dependencies between them. This is very flexible but requires a bit more care.

Coding up either of these options from scratch can be a daunting task, but with Dask it can be as simple as writing NumPy code.

Let’s build up an example of fitting a large linear regression model using both built-in array parallelism and fancier, more customized parallelization features that Dask offers. The dask.array module helps us to easily parallelize standard NumPy functionality using the same syntax – we’ll start there.

Data Creation

Dask has many ways to create dask arrays; to get us started quickly prototyping let’s create some random data in a way that should look familiar to NumPy users.

import dask
import dask.array as da
import numpy as np

from dask.distributed import Client

client = Client()

## create inputs with a bunch of independent normals
beta = np.random.random(100)  # random beta coefficients, no intercept
X = da.random.normal(0, 1, size=(1000000, 100), chunks=(100000, 100))
y = X.dot(beta) + da.random.normal(0, 1, size=1000000, chunks=(100000,))

## make sure all chunks are ~equally sized
X, y = dask.persist(X, y)
client.rebalance([X, y])

Observe that X is a dask array stored in 10 chunks, each of size (100000, 100). Also note that X.dot(beta) runs smoothly for both numpy and dask arrays, so we can write code that basically works in either world.

Caveat: If X is a numpy array and beta is a dask array, X.dot(beta) will output an in-memory numpy array. This is usually not desirable as you want to carefully choose when to load something into memory. One fix is to use multipledispatch to handle odd edge cases; for a starting example, check out the dot code here.

Dask also has convenient visualization features built in that we will leverage; below we visualize our data in its 10 independent chunks:

Create data for dask-glm computations

Array Programming

If you can write iterative array-based algorithms in NumPy, then you can write iterative parallel algorithms in Dask

As we’ve already seen, Dask inherits much of the NumPy API that we are familiar with, so we can write simple NumPy-style iterative optimization algorithms that will leverage the parallelism dask.array has built-in already. For example, if we want to naively fit a linear regression model on the data above, we are trying to solve the following convex optimization problem:
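
$$\min_{\beta} \; \left\| y - X\beta \right\|_2^2$$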

Recall that in non-degenerate situations this problem has a closed-form solution that is given by:
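
$$\beta^* = \left(X^T X\right)^{-1} X^T y$$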

We can compute $\beta^*$ using the above formula with Dask:

## naive solution
beta_star = da.linalg.solve(X.T.dot(X), X.T.dot(y))

>>> abs(beta_star.compute() - beta).max()
0.0024817567237768179

Sometimes a direct solve is too costly, and we want to solve the above problem using only simple matrix-vector multiplications. To this end, let’s take this one step further and actually implement a gradient descent algorithm which exploits parallel matrix operations. Recall that gradient descent iteratively refines an initial estimate of beta via the update:
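
$$\beta^{(k+1)} = \beta^{(k)} - \alpha \, \nabla f\left(\beta^{(k)}\right) = \beta^{(k)} - 2\alpha\, X^T \left(X\beta^{(k)} - y\right)$$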

where $\alpha$ can be chosen based on a number of different “step-size” rules; for the purposes of exposition, we will stick with a constant step-size:
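
$$\alpha \approx \frac{1}{\lambda_{\max}\left(2 X^T X\right)}$$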

## quick step-size calculation to guarantee convergence
_, s, _ = np.linalg.svd(2 * X.T.dot(X))
step_size = 1 / s.max() - 1e-8  # stay just below 1 / (largest singular value)

## define some parameters
max_steps = 100
tol = 1e-8
beta_hat = np.zeros(100) # initial guess

for k in range(max_steps):
    Xbeta = X.dot(beta_hat)
    func = ((y - Xbeta)**2).sum()
    gradient = 2 * X.T.dot(Xbeta - y)

    ## Update
    obeta = beta_hat
    beta_hat = beta_hat - step_size * gradient
    new_func = ((y - X.dot(beta_hat))**2).sum()
    beta_hat, func, new_func = dask.compute(beta_hat, func, new_func)  # <--- Dask code

    ## Check for convergence
    change = np.absolute(beta_hat - obeta).max()

    if change < tol:
        break

>>> abs(beta_hat - beta).max()
0.0024817567259038942

It’s worth noting that almost all of this code is exactly the same as the equivalent NumPy code. Because Dask.array and NumPy share the same API it’s pretty easy for people who are already comfortable with NumPy to get started with distributed algorithms right away. The only thing we had to change was how we produce our original data (da.random.normal instead of np.random.normal) and the call to dask.compute at the end of the update step. The dask.compute call tells Dask to go ahead and actually evaluate everything we’ve told it to do so far (Dask is lazy by default). Otherwise, all of the mathematical operations, matrix multiplies, slicing, and so on are exactly the same as with Numpy, except that Dask.array builds up a chunk-wise parallel computation for us and Dask.distributed can execute that computation in parallel.

To better appreciate all the scheduling that is happening in one update step of the above algorithm, here is a visualization of the computation necessary to compute beta_hat and the new function value new_func:

Gradient descent step Dask graph

Each rectangle is an in-memory chunk of our distributed array and every circle is a numpy function call on those in-memory chunks. The Dask scheduler determines where and when to run all of these computations on our cluster of machines (or just on the cores of our laptop).

Array Programming + dask.delayed

Now that we’ve seen how to use the built-in parallel algorithms offered by Dask.array, let’s go one step further and talk about writing more customized parallel algorithms. Many distributed “consensus” based algorithms in machine learning are based on the idea that each chunk of data can be processed independently in parallel, with each worker sending its guess for the optimal parameter value to some master node. The master then computes a consensus estimate for the optimal parameters and reports that back to all of the workers. Each worker then processes its chunk of data given this new information, and the process continues until convergence.

From a parallel computing perspective this is a pretty simple map-reduce procedure. Any distributed computing framework should be able to handle this easily. We’ll use this as a very simple example for how to use Dask’s more customizable parallel options.

One such algorithm is the Alternating Direction Method of Multipliers, or ADMM for short. For the sake of this post, we will consider the work done by each worker to be a black box.

We will also be considering a regularized version of the problem above, namely:
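
$$\min_{\beta} \; \left\| y - X\beta \right\|_2^2 + \lambda \left\| \beta \right\|_1$$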

At the end of the day, all we will do is:

  • create NumPy functions which define how each chunk updates its parameter estimates
  • wrap those functions in dask.delayed
  • call dask.compute and process the individual estimates, again using NumPy

First we need to define some local functions that the chunks will use to update their individual parameter estimates, and import the black box local_update step from dask_glm; also, we will need the so-called shrinkage operator (which is the proximal operator for the $l1$-norm in our problem):

from dask_glm.algorithms import local_update

def local_f(beta, X, y, z, u, rho):
    return ((y - X.dot(beta)) **2).sum() + (rho / 2) * np.dot(beta - z + u,
                                                                  beta - z + u)

def local_grad(beta, X, y, z, u, rho):
    return 2 * X.T.dot(X.dot(beta) - y) + rho * (beta - z + u)


def shrinkage(beta, t):
    return np.maximum(0, beta - t) - np.maximum(0, -beta - t)

## set some algorithm parameters
max_steps = 10
lamduh = 7.2
rho = 1.0

(n, p) = X.shape
nchunks = X.npartitions

XD = X.to_delayed().flatten().tolist()  # A list of pointers to remote numpy arrays
yD = y.to_delayed().flatten().tolist()  # ... one for each chunk

# the initial consensus estimate
z = np.zeros(p)

# an array of the individual "dual variables" and parameter estimates,
# one for each chunk of data
u = np.array([np.zeros(p) for i in range(nchunks)])
betas = np.array([np.zeros(p) for i in range(nchunks)])

for k in range(max_steps):

    # process each chunk in parallel, using the black-box 'local_update' magic
    new_betas = [dask.delayed(local_update)(xx, yy, bb, z, uu, rho,
                                            f=local_f,
                                            fprime=local_grad)
                 for xx, yy, bb, uu in zip(XD, yD, betas, u)]
    new_betas = np.array(dask.compute(*new_betas))

    # everything else is NumPy code occurring at "master"
    beta_hat = 0.9 * new_betas + 0.1 * z

    # create consensus estimate
    zold = z.copy()
    ztilde = np.mean(beta_hat + np.array(u), axis=0)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))

    # update dual variables
    u += beta_hat - z

>>> # Number of coefficients zeroed out due to L1 regularization
>>> print((z == 0).sum())
12

There is of course a little bit more work occurring in the above algorithm, but it should be clear that the distributed operations are not one of the difficult pieces. Using dask.delayed we were able to express a simple map-reduce algorithm like ADMM with similarly simple Python for loops and delayed function calls. Dask.delayed is keeping track of all of the function calls we wanted to make and what other function calls they depend on. For example all of the local_update calls can happen independent of each other, but the consensus computation blocks on all of them.

We hope that both parallel algorithms shown above (gradient descent, ADMM) were straightforward to someone reading with an optimization background. These implementations run well on a laptop, a single multi-core workstation, or a thousand-node cluster if necessary. We’ve been building somewhat more sophisticated implementations of these algorithms (and others) in dask-glm. They are more sophisticated from an optimization perspective (stopping criteria, step size, asynchronicity, and so on) but remain as simple from a distributed computing perspective.

Experiment

We compare dask-glm implementations against Scikit-learn on a laptop, and then show them running on a cluster.

Reproducible notebook is available here

We’re building more sophisticated versions of the algorithms above in dask-glm.  This project has convex optimization algorithms for gradient descent, proximal gradient descent, Newton’s method, and ADMM.  These implementations extend the implementations above by also thinking about stopping criteria, step sizes, and other niceties that we avoided above for simplicity.

In this section we show off these algorithms by performing a simple numerical experiment that compares the numerical performance of proximal gradient descent and ADMM alongside Scikit-Learn’s LogisticRegression and SGD implementations on a single machine (a personal laptop) and then follows up by scaling the dask-glm options to a moderate cluster.  

Disclaimer: These experiments are crude. We’re using artificial data, we’re not tuning parameters or even finding parameters at which these algorithms are producing results of the same accuracy. The goal of this section is just to give a general feeling of how things compare.

We create data

## size of problem (no. observations)
N = 8e6
chunks = 1e6
seed = 20009
beta = (np.random.random(15) - 0.5) * 3

X = da.random.random((N,len(beta)), chunks=chunks)
y = make_y(X, beta=np.array(beta), chunks=chunks)

X, y = dask.persist(X, y)
client.rebalance([X, y])

And run each of our algorithms as follows:

# Dask-GLM Proximal Gradient
result = proximal_grad(X, y, lamduh=alpha)

# Dask-GLM ADMM
X2 = X.rechunk((1e5, None)).persist()  # ADMM prefers smaller chunks
y2 = y.rechunk(1e5).persist()
result = admm(X2, y2, lamduh=alpha)

# Scikit-Learn LogisticRegression
nX, ny = dask.compute(X, y)  # sklearn wants numpy arrays
result = LogisticRegression(penalty='l1', C=1).fit(nX, ny).coef_

# Scikit-Learn Stochastic Gradient Descent
result = SGDClassifier(loss='log',
                       penalty='l1',
                       l1_ratio=1,
                       n_iter=10,
                       fit_intercept=False).fit(nX, ny).coef_

We then compare with the $L_{\infty}$ norm (largest absolute difference).

abs(result - beta).max()

Times and $L_\infty$ distance from the true “generative beta” for these parameters are shown in the table below:

Algorithm            Error     Duration (s)
Proximal Gradient    0.0227    128
ADMM                 0.0125    34.7
LogisticRegression   0.0132    79
SGDClassifier        0.0456    29.4

Again, please don’t take these numbers too seriously: these algorithms all solve regularized problems, so we don’t expect the results to necessarily be close to the underlying generative beta (even asymptotically). The numbers above are meant to demonstrate that they all return results which were roughly the same distance from the beta above. Also, Dask-glm is using a full four-core laptop while SKLearn is restricted to use a single core.

In the sections below we include profile plots for proximal gradient and ADMM. These show the operations that each of eight threads was doing over time. You can mouse-over rectangles/tasks and zoom in using the zoom tools in the upper right. You can see the difference in complexity of the algorithms. ADMM is much simpler from Dask’s perspective but also saturates hardware better for this chunksize.

Profile Plot for Proximal Gradient Descent

Profile Plot for ADMM

The general takeaway here is that dask-glm performs comparably to Scikit-Learn on a single machine. If your problem fits in memory on a single machine you should continue to use Scikit-Learn and Statsmodels. The real benefit to the dask-glm algorithms is that they scale and can run efficiently on data that is larger-than-memory by operating from disk on a single computer or on a cluster of computers working together.

Cluster Computing

As a demonstration, we run a larger version of the data above on a cluster of eight m4.2xlarges on EC2 (8 cores and 30GB of RAM each.)

We create a larger dataset with 800,000,000 rows and 15 columns across eight processes.

N = 8e8
chunks = 1e7
seed = 20009
beta = (np.random.random(15) - 0.5) * 3

X = da.random.random((N,len(beta)), chunks=chunks)
y = make_y(X, beta=np.array(beta), chunks=chunks)

X, y = dask.persist(X, y)

We then run the same proximal_grad and admm operations from before:

# Dask-GLM Proximal Gradient
result = proximal_grad(X, y, lamduh=alpha)

# Dask-GLM ADMM
X2 = X.rechunk((1e6, None)).persist()  # ADMM prefers smaller chunks
y2 = y.rechunk(1e6).persist()
result = admm(X2, y2, lamduh=alpha)

Proximal grad completes in around seventeen minutes while ADMM completes in around four minutes. Profiles for the two computations are included below:

Profile Plot for Proximal Gradient Descent

We include only the first few iterations here. Otherwise this plot is several megabytes.

Link to fullscreen plot

Profile Plot for ADMM

Link to fullscreen plot

These both obtained similar $L_{\infty}$ errors to what we observed before.

Algorithm            Error      Duration (s)
Proximal Gradient    0.0306     1020
ADMM                 0.00159    270

Although this time we had to be careful about a couple of things:

  1. We explicitly deleted the old data after rechunking (ADMM prefers different chunksizes than proximal_gradient) because our full dataset, 100GB, is close enough to our total distributed RAM (240GB) that it’s a good idea to avoid keeping replicas around needlessly. Things would have run fine, but spilling excess data to disk would have negatively affected performance.
  2. We set the OMP_NUM_THREADS=1 environment variable to avoid over-subscribing our CPUs. Surprisingly, not doing so led both to worse performance and to non-deterministic results, an issue that we’re still tracking down.

Analysis

The algorithms in Dask-GLM are new and need development, but are in a usable state for people comfortable operating at this technical level. Additionally, we would like to attract other mathematical and algorithmic developers to this work. We’ve found that Dask provides a nice balance between being flexible enough to support interesting algorithms, while being managed enough to be usable by researchers without a strong background in distributed systems. In this section we’re going to discuss the things that we learned from both Chris’ (mathematical algorithms) and Matt’s (distributed systems) perspective and then talk about possible future work. We encourage people to pay attention to future work; we’re open to collaboration and think that this is a good opportunity for new researchers to meaningfully engage.

Chris’s perspective

  1. Creating distributed algorithms with Dask was surprisingly easy; there is still a small learning curve around when to call things like persist, compute, rebalance, and so on, but that can’t be avoided. Using Dask for algorithm development has been a great learning environment for understanding the unique challenges associated with distributed algorithms (including communication costs, among others).
  2. Getting the particulars of algorithms correct is non-trivial; there is still work to be done in better understanding the tolerance settings vs. accuracy tradeoffs that are occurring in many of these algorithms, as well as fine-tuning the convergence criteria for increased precision.
  3. On the software development side, reliably testing optimization algorithms is hard. Finding provably correct optimality conditions that should be satisfied which are also numerically stable has been a challenge for me.
  4. Working on algorithms in isolation is not nearly as fun as collaborating on them; please join the conversation and contribute!
  5. Most importantly from my perspective, I’ve found there is a surprisingly large amount of misunderstanding in “the community” surrounding what optimization algorithms do in the world of predictive modeling, what problems they each individually solve, and whether or not they are interchangeable for a given problem. For example, Newton’s method can’t be used to optimize an l1-regularized problem, and the coefficient estimates from an l1-regularized problem are fundamentally (and numerically) different from those of an l2-regularized problem (and from those of an unregularized problem). My own personal goal is that the API for dask-glm exposes these subtle distinctions more transparently and leads to more thoughtful modeling decisions “in the wild”.

Matt’s perspective

This work triggered a number of concrete changes within the Dask library:

  1. We can convert Dask.dataframes to Dask.arrays. This is particularly important because people want to do pre-processing with dataframes but then switch to efficient multi-dimensional arrays for algorithms.
  2. We had to unify the single-machine scheduler and distributed scheduler APIs a bit, notably adding a persist function to the single machine scheduler. This was particularly important because Chris generally prototyped on his laptop but we wanted to write code that was effective on clusters.
  3. Scheduler overhead can be a problem for the iterative dask-array algorithms (gradient descent, proximal gradient descent, BFGS). This is particularly a problem because NumPy is very fast. Often our tasks take only a few milliseconds, which makes Dask’s overhead of 200us per task become very relevant (this is why you see whitespace in the profile plots above). We’ve started resolving this problem in a few ways like more aggressive task fusion and lower overheads generally, but this will be a medium-term challenge. In practice for dask-glm we’ve started handling this just by choosing chunksizes well. I suspect that for the dask-glm in particular we’ll just develop auto-chunksize heuristics that will mostly solve this problem. However we expect this problem to recur in other work with scientists on HPC systems who have similar situations.
  4. A couple of things can be tricky for algorithmic users:
    1. Placing the calls to asynchronously start computation (persist, compute). In practice Chris did a good job here and then I came through and tweaked things afterwards. The web diagnostics ended up being crucial to identify issues.
    2. Avoiding accidentally calling NumPy functions on dask.arrays and vice versa. We’ve improved this on the dask.array side, and they now operate intelligently when given numpy arrays. Changing this on the NumPy side is harder until NumPy protocols change (which is planned).

Future work

There are a number of things we would like to do, both in terms of measurement and for the dask-glm project itself. We welcome people to voice their opinions (and join development) on the following issues:

  1. Asynchronous Algorithms
  2. User APIs
  3. Extend GLM families
  4. Write more extensive, rigorous algorithm tests: checks of provable optimality criteria and of robustness to various input data (a small sketch of one such check follows this list)
  5. Begin work on smart initialization routines
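As an example of the kind of optimality check item 4 refers to, here is a small sketch in plain NumPy/SciPy (not the dask-glm test suite): for an unregularized logistic regression, the gradient of the negative log-likelihood should vanish at the fitted coefficients, which gives a provable and numerically checkable optimality criterion.

import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
beta_true = np.array([1.0, -2.0, 0.5])
y = (rng.rand(200) < 1 / (1 + np.exp(-X.dot(beta_true)))).astype(float)

def negloglike(beta):
    z = X.dot(beta)
    return np.sum(np.logaddexp(0, z) - y * z)

def gradient(beta):
    p = 1 / (1 + np.exp(-X.dot(beta)))
    return X.T.dot(p - y)

beta_hat = minimize(negloglike, np.zeros(3), jac=gradient).x

# First-order optimality: the gradient should be (numerically) zero at the optimum
assert np.linalg.norm(gradient(beta_hat)) < 1e-4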

What is your perspective here, gentle reader? Both Matt and Chris can use help on this project. We hope that some of the issues above provide seeds for community engagement. We welcome other questions, comments, and contributions either as github issues or comments below.

Acknowledgements

Thanks also go to Hussain Sultan (Capital One) and Tom Augspurger for collaboration on Dask-GLM and to Will Warner (Continuum) for reviewing and editing this post.

March 22, 2017 12:00 AM

numfocus

nteract: Building on top of Jupyter (from a rich REPL toolkit to interactive notebooks)

This post originally appeared on the nteract blog. nteract builds upon the very successful foundations of Jupyter. I think of Jupyter as a brilliantly rich REPL toolkit. A typical REPL (Read-Eval-Print-Loop) is an interpreter that takes input from the user and prints results (on stdout and stderr). Here’s the standard Python interpreter; a REPL many […]

by NumFOCUS Staff at March 22, 2017 12:00 AM

March 21, 2017

Matthieu Brucher

Announcement: Audio TK 1.5.0

ATK is updated to 1.5.0 with new features oriented around preamplifiers and optimizations. It is also now compiled on Appveyor: https://ci.appveyor.com/project/mbrucher/audiotk.

Thanks to Travis and Appveyor, binaries for the releases are now uploaded to GitHub. On all platforms we compile static and shared libraries. On Linux, builds with gcc 5, gcc 6, clang 3.8 and clang 3.9 are generated; on OS X, XCode 7 and XCode 8 builds are available as universal binaries; and on Windows, 32-bit and 64-bit builds with either dynamic or static runtime (no shared libraries in the static case) are also generated.

Download link: ATK 1.5.0

Changelog:
1.5.0
* Adding a follower class solid state preamplifier with Python wrappers
* Adding a Dempwolf model for tube filters with Python wrappers
* Adding a Munro-Piazza model for tube filters with Python wrappers
* Optimized distortion and preamplifier filters by using fmath exp calls

1.4.1
* Vectorized x4 the IIR part of the IIR filter
* Vectorized delay filters
* Fixed bug in gain filters


by Matt at March 21, 2017 08:24 AM

March 20, 2017

Continuum Analytics news

​Announcing Anaconda Project: Data Science Project Encapsulation and Deployment, the Easy Way!

Monday, March 20, 2017

Christine Doig, Sr. Data Scientist, Product Manager
Kristopher Overholt, Product Manager

One year ago, we presented Anaconda and Docker: Better Together for Reproducible Data Science. In that blog post, we described our vision and a foundational approach to portable and reproducible data science using Anaconda and Docker.

This approach embraced the philosophy of Open Data Science in which data scientists can connect the powerful data science experience of Anaconda with the tools that they know and love, which today includes Jupyter notebooks, machine learning frameworks, data analysis libraries, big data computations and connectivity, visualization toolkits, high-performance numerical libraries and more.

We also discussed how data scientists could use Anaconda to develop data science analyses on their local machine, then use Docker to deploy those same data science analyses into production. This was the state of data science encapsulation and deployment that we presented last year:

In this blog post, we’ll be diving deeper into how we’ve created a standard data science project encapsulation approach that helps data scientists deploy secure, scalable and reproducible projects across an entire team with Anaconda.

This blog post also provides more details about how we’re using Anaconda and Docker for encapsulation and containerization of data science projects to power the data science deployment functionality in the next generation of Anaconda Enterprise, which augments our truly end-to-end data science platform.

Supercharge Your Data Science with More Than Just Dockerfiles!

The reality is, as much as Docker is loved and used by the DevOps community, it is not the preferred tool or entry point for data scientists looking to deploy their applications. Using Docker alone as a data science encapsulation strategy still requires data scientists to coordinate with their IT and DevOps teams to write Dockerfiles, install the required system libraries in containers, and orchestrate and deploy the Docker containers into production.

Having data scientists worry about infrastructure details and DevOps tooling takes away time from their most valuable skills: finding insights in data, modeling and running experiments, and delivering consumable data-driven applications to their team and end-users.

Data scientists enjoy using the packages they know and love in Anaconda along with conda environments, and wish it were as easy to deploy data science projects as it is to get Anaconda running on their laptop.

By working directly with our amazing customers and users and listening to the needs of their data science teams over the last five years, we have clearly identified how Anaconda and Docker can be used together for data science project encapsulation and as a more useful abstraction layer for data scientists: Anaconda Projects.

The Next Generation of Portable and Reproducible Data Science with Anaconda

As part of the next generation of data science encapsulation, reproducibility and deployment, we are happy to announce the release of Anaconda Project with the latest release of Anaconda! Download the latest version of Anaconda 4.3.1 to get started with Anaconda Project today.

Or, if you already have Anaconda, you can install Anaconda Project using the following command:

conda install anaconda-project

Anaconda Project makes it easy to encapsulate data science projects and makes them fully portable and deployment-ready. It automates the configuration and setup of data science projects, such as installing the necessary packages and dependencies, downloading data sets and required files, setting environment variables for credentials or runtime configuration, and running commands.

Anaconda Project is an open source tool created by Continuum Analytics that delivers light-weight, efficient encapsulation and portability of data science projects. Learn more by checking out the Anaconda Project documentation.

Anaconda Project makes it easy to reproduce your data science analyses, share data science projects with others, run projects across different platforms, or deploy data science applications with a single-click in Anaconda Enterprise.
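As a concrete, hypothetical example of what that encapsulation can look like from the command line, a session along these lines initializes a project, records its package requirements, a data download and a runtime variable, and then runs it. The subcommand names reflect my reading of the Anaconda Project documentation, and the URL and variable name are made up, so treat this as a sketch rather than a verbatim session:

anaconda-project init
anaconda-project add-packages pandas bokeh
anaconda-project add-download TRAINING_DATA https://example.com/training-data.csv
anaconda-project add-variable DB_PASSWORD
anaconda-project run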

Whether you’re running a project locally or deploying a project with Anaconda Enterprise, you are using the same project encapsulation standard: an Anaconda Project. We’re bringing you the next generation of true Open Data Science deployment in 2017 with Anaconda:

New Release of Anaconda Navigator with Support for Anaconda Projects

As part of this release of Anaconda Project, we’ve integrated easy data science project creation and encapsulation into the familiar Anaconda Navigator experience, which is a graphical interface for your Anaconda environments and data science tools. You can easily create, edit, and upload Anaconda Projects to Anaconda Cloud through a graphical interface:

Download the latest version of Anaconda 4.3.1 to get started with Anaconda Navigator and Anaconda Project today.

Or, if you already have Anaconda, you can install the latest version of Anaconda Navigator using the following command:

conda install anaconda-navigator

When you’re using Anaconda Project with Navigator, you can create a new project and specify its dependencies, or you can import an existing conda environment file (environment.yaml) or pip requirements file (requirements.txt).

Anaconda Project examples:

  • Image classifier web application using Tensorflow and Flask
  • Live Python and R notebooks that retrieve the latest stock market data
  • Interactive Bokeh and Shiny applications for data clustering, cross filtering, and data exploration
  • Interactive visualizations of data sets with Bokeh, including streaming data
  • Machine learning models with REST APIs

To get started even quicker with portable data science projects, refer to the example Anaconda Projects on Anaconda Cloud.

Deploying Secure and Scalable Data Science Projects with Anaconda Enterprise

The new data science deployment and collaboration functionality in Anaconda Enterprise leverages Anaconda Project plus industry-standard containerization with Docker and enterprise-ready container orchestration technology with Kubernetes.

This productionization and deployment strategy makes it easy to create and deploy data science projects with a single click for projects that use Python 2, Python 3 or R (including their dependencies in C++, Fortran, Java, etc.), or anything else you can build with the 730+ packages in Anaconda.

From Data Science Development to Deployment with Anaconda Projects and Anaconda Enterprise

All of this is possible without having to edit Dockerfiles directly, install system packages in your Docker containers, or manually deploy Docker containers into production. Anaconda Enterprise handles all of that for you, so you can get back to doing data science analysis.

The result is that any project that a data scientist can create on their machine with Anaconda can be deployed to an Anaconda Enterprise cluster in a secure, scalable, and highly-available manner with just a single click, including live notebooks, interactive applications, machine learning models with REST APIs, or any other projects that leverage the 730+ packages in Anaconda.

Anaconda is such a foundational and ubiquitous data science platform that other lightweight data science workspaces and workbenches are using Anaconda as a necessary core component for their portable and reproducible data science. Anaconda is the leading Open Data Science platform powered by Python and empowers data scientists with a truly integrated experience and support for end-to-end workflows. Why would you want your data science team using Anaconda in production with anything other than Anaconda Enterprise?

Anaconda Enterprise is a true end-to-end data science platform that integrates with all of the most popular tools and platforms and provides your data science team with an on-premises package repository, secure enterprise notebook collaboration, data science and analytics on Hadoop/Spark, and secure and scalable data science deployment.

Anaconda Enterprise also includes support for all of the 730+ Open Data Science packages in Anaconda. Finally, Anaconda Scale is the only recommended and certified method for deploying Anaconda to a Hadoop cluster for PySpark or SparkR jobs.

Getting Started with Anaconda Enterprise and Anaconda Projects

Anaconda Enterprise uses Anaconda Project and Docker as its standard project encapsulation and deployment format to enable simple one-click deployments of secure and scalable data science applications for your entire data science team.

Are you interested in using Anaconda Enterprise in your organization to deploy data science projects, including live notebooks, machine learning models, dashboards, and interactive applications?

Access to the next generation of Anaconda Enterprise v5, which features one-click secure and scalable data science deployments, is now available as a technical preview as part of the Anaconda Enterprise Innovator Program.

Join the Anaconda Enterprise v5 Innovator Program today to discover the powerful data science deployment capabilities for yourself. Anaconda Enterprise handles your secure and scalable data science project encapsulation and deployment requirements so that your data science team can focus on data exploration and analysis workflows and spend less time worrying about infrastructure and DevOps tooling.

by swebster at March 20, 2017 05:30 PM

March 16, 2017

Titus Brown

Registration reminder for our two-week summer workshop on high-throughput sequencing data analysis!

Our two-week summer workshop (announcement, direct link) is shaping up quite well, but the application deadline is today! So if you're interested, you should apply sometime before the end of the day. (We'll leave applications open as long as it's March 17th somewhere in the world.)

Some updates and expansions on the original announcement --

  • we'll be training attendees in high-performance computing, in the service of doing bioinformatics analyses. To that end, we've received a large grant from NSF XSEDE, and we'll be using JetStream for our analyses.
  • we have limited financial support that will be awarded after acceptances are issued in a week.

Here's the original announcement below:

ANGUS: Analyzing High Throughput Sequencing Data

June 26-July 8, 2017

University of California, Davis

  • Zero-entry - no experience required or expected!
  • Hands-on training in using the UNIX command line to analyze your sequencing data.
  • Friendly, helpful instructors and TAs!
  • Summer sequencing camp - meet and talk science with great people!
  • Now in its eighth year!

The workshop fee will be $500 for the two weeks, and on-campus room and board is available for $500/week. Applications will close March 17th. International and industry applicants are more than welcome!

Please see http://ivory.idyll.org/dibsi/ANGUS.html for more information, and contact dibsi.training@gmail.com if you have questions or suggestions.


--titus

by C. Titus Brown at March 16, 2017 11:00 PM

numfocus

Facebook Makes Sophisticated Forecasting Techniques Available to Non-Experts Thanks to Stan, a NumFOCUS Sponsored Project

Facebook’s forecasting tool, Prophet, is built on Stan. Facebook recently announced that they have made Prophet open source. This is great news for data scientists and business analysts alike: forecasting is an important but tricky process that is critical to many organizations, both for-profit and non-profit. The Prophet forecasting tool is able […]

by NumFOCUS Staff at March 16, 2017 12:00 AM

March 14, 2017

Thomas Wiecki

Random-Walk Bayesian Deep Networks: Dealing with Non-Stationary Data

Download the NB: https://github.com/twiecki/WhileMyMCMCGentlySamples/blob/master/content/downloads/notebooks/random_walk_deep_net.ipynb

(c) 2017 by Thomas Wiecki -- Quantopian Inc.

Most problems solved by Deep Learning are stationary. A cat is always a cat. The rules of Go have remained stable for 2,500 years, and will likely stay that way. However, what if the world around you is changing? This is common, for example when applying Machine Learning in Quantitative Finance. Markets are constantly evolving, so features that are predictive in some time-period might lose their edge while other patterns emerge. Usually, quants would just retrain their classifiers every once in a while. This approach of just re-estimating the same model on more recent data is very common. I find that to be a pretty unsatisfying way of modeling, as there are certain shortfalls:

  • The estimation window should be long so as to incorporate as much training data as possible.
  • The estimation window should be short so as to incorporate only the most recent data, as old data might be obsolete.
  • When you have no estimate of how fast the world around you is changing, there is no principled way of setting the window length to balance these two objectives.

Certainly there is something to be learned even from past data; we just need to instill our models with a sense of time and recency.

Enter random-walk processes. Ever since I learned about them in the stochastic volatility model they have become one of my favorite modeling tricks. Basically, they allow you to turn every static model into a time-sensitive one.

You can read more about the details of random-walk priors here, but the central idea is that, in any time-series model, rather than assuming a parameter to be constant over time, we allow it to change gradually, following a random walk. For example, take a logistic regression:

$$ Y_i = f(\beta X_i) $$

Where $f$ is the logistic function and $\beta$ is our learnable parameter. If we assume that our data is not iid and that $\beta$ is changing over time, we need a different $\beta$ for every $i$:

$$ Y_i = f(\beta_i X_i) $$

Of course, this will just overfit, so we need to constrain our $\beta_i$ somehow. We will assume that while $\beta_i$ is changing over time, it will do so rather gradually by placing a random-walk prior on it:

$$ \beta_t \sim \mathcal{N}(\beta_{t-1}, s^2) $$

So $\beta_t$ is only allowed to deviate a little bit (determined by the step-width $s$) from its previous value $\beta_{t-1}$. $s$ can be thought of as a stability parameter -- how fast the world around you is changing.

Let's first generate some toy data and then implement this model in PyMC3. We will then use this same trick in a Neural Network with hidden layers.

If you would like a more complete introduction to Bayesian Deep Learning, see my recent ODSC London talk. This blog post takes things one step further, so definitely keep reading.

In [1]:
%matplotlib inline
import pymc3 as pm
import theano.tensor as T
import theano
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
from sklearn import datasets
from sklearn.preprocessing import scale


import warnings
from scipy import VisibleDeprecationWarning
warnings.filterwarnings("ignore", category=VisibleDeprecationWarning) 

sns.set_context('notebook')

Generating data

First, let's generate some toy data -- a simple binary classification problem that's linearly separable. To introduce non-stationarity, we will rotate this data around its center across time. You can safely skip over the next few code cells.

In [2]:
X, Y = sklearn.datasets.make_blobs(n_samples=1000, centers=2, random_state=1)
X = scale(X)
colors = Y.astype(str)
colors[Y == 0] = 'r'
colors[Y == 1] = 'b'

interval = 20
subsample = X.shape[0] // interval
chunk = np.arange(0, X.shape[0]+1, subsample)
degs = np.linspace(0, 360, len(chunk))

sep_lines = []

for ii, (i, j, deg) in enumerate(list(zip(np.roll(chunk, 1), chunk, degs))[1:]):
    theta = np.radians(deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.matrix([[c, -s], [s, c]])

    X[i:j, :] = X[i:j, :].dot(R)
In [4]:
import base64
from tempfile import NamedTemporaryFile

VIDEO_TAG = """<video controls>
 <source src="data:video/x-m4v;base64,{0}" type="video/mp4">
 Your browser does not support the video tag.
</video>"""


def anim_to_html(anim):
    # Encode the animation as an mp4 only once, then cache it on the object
    if not hasattr(anim, '_encoded_video'):
        anim.save("test.mp4", fps=20, extra_args=['-vcodec', 'libx264'])
        video = open("test.mp4", "rb").read()
        anim._encoded_video = base64.b64encode(video).decode('utf-8')
    return VIDEO_TAG.format(anim._encoded_video)

from IPython.display import HTML

def display_animation(anim):
    plt.close(anim._fig)
    return HTML(anim_to_html(anim))
from matplotlib import animation

# First set up the figure, the axis, and the plot element we want to animate
fig, ax = plt.subplots()
ims = []  # collect the scatter artists for each animation frame
for i in np.arange(0, len(X), 10):
    ims.append([(ax.scatter(X[:i, 0], X[:i, 1], color=colors[:i]))])

ax.set(xlabel='X1', ylabel='X2')
# call the animator.  blit=True means only re-draw the parts that have changed.
anim = animation.ArtistAnimation(fig, ims,
                                 interval=500, 
                                 blit=True);

display_animation(anim)
Out[4]: [animation: the two classes being rotated around the center over time]

The last frame of the video, where all the data is plotted, is what a classifier with no sense of time would see. Thus, the problem we set up is impossible to solve when ignoring time, but trivial once you take it into account.

How would we classically solve this? You could just train a different classifier on each subset, but as I wrote above, you need to get the re-training frequency right and each model uses less data overall (see the sketch below).
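For contrast, here is a rough sketch of that classical approach (not from the original notebook), reusing the X, Y and chunk arrays defined above: one static classifier per time window, each seeing only its own slice of the data.

from sklearn.linear_model import LogisticRegression

window_models = []
for i, j in zip(chunk[:-1], chunk[1:]):
    clf = LogisticRegression().fit(X[i:j], Y[i:j])   # each model sees only one window
    window_models.append(clf)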

Random-Walk Logistic Regression in PyMC3

In [5]:
from pymc3 import HalfNormal, GaussianRandomWalk, Bernoulli
from pymc3.math import sigmoid
import theano.tensor as tt


X_shared = theano.shared(X)
Y_shared = theano.shared(Y)

n_dim = X.shape[1] # 2

with pm.Model() as random_walk_perceptron:
    step_size = pm.HalfNormal('step_size', sd=np.ones(n_dim), 
                              shape=n_dim)
    
    # This is the central trick, PyMC3 already comes with this distribution
    w = pm.GaussianRandomWalk('w', sd=step_size, 
                              shape=(interval, 2))
    
    weights = tt.repeat(w, X_shared.shape[0] // interval, axis=0)
    
    class_prob = sigmoid(tt.batched_dot(X_shared, weights))
    
    # Binary classification -> Bernoulli likelihood
    pm.Bernoulli('out', class_prob, observed=Y_shared)

OK, if you understand the stochastic volatility model, the first two lines should look fairly familiar. We are creating 2 random-walk processes. As allowing the weights to change on every new data point is overkill, we subsample. The repeat turns the vector [t, t+1, t+2] into [t, t, t, t+1, t+1, ...] so that it matches the number of data points.

Next, we would usually just apply a single dot-product, but here we have many weight vectors to apply to the input data, so we need to call dot in a loop. That is what tt.batched_dot does. In the end, we just get probabilities (predictions) for our Bernoulli likelihood.
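To see what the repeat plus batched-dot pattern is doing, here is a tiny NumPy-only illustration (separate from the PyMC3 model; the shapes are made up for clarity):

import numpy as np

interval, per_chunk, n_dim = 3, 4, 2
w = np.arange(interval * n_dim, dtype=float).reshape(interval, n_dim)  # one weight vector per chunk
W = np.repeat(w, per_chunk, axis=0)            # [w_t, w_t, ..., w_{t+1}, w_{t+1}, ...]
X = np.random.randn(interval * per_chunk, n_dim)

logits = np.einsum('ij,ij->i', X, W)           # row-wise dot products, like tt.batched_dot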

On to the inference. In PyMC3 we recently improved NUTS in many different places. One of those is automatic initialization. If you just call pm.sample(n_iter), we will first run ADVI to estimate the diagonal mass matrix and find a starting point. This usually makes NUTS run quite robustly.

In [6]:
with random_walk_perceptron:
    trace_perceptron = pm.sample(2000)
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -90.867: 100%|██████████| 200000/200000 [01:13<00:00, 2739.70it/s]
Finished [100%]: Average ELBO = -90.869
100%|██████████| 2000/2000 [00:39<00:00, 50.58it/s]

Let's look at the learned weights over time:

In [7]:
plt.plot(trace_perceptron['w'][:, :, 0].T, alpha=.05, color='r');
plt.plot(trace_perceptron['w'][:, :, 1].T, alpha=.05, color='b');
plt.xlabel('time'); plt.ylabel('weights'); plt.title('Optimal weights change over time'); sns.despine();

As you can see, the weights are slowly changing over time. What does the learned hyperplane look like? In the plot below, the points are still the training data but the background color codes the class probability learned by the model.

In [8]:
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
grid_2d = np.tile(grid_2d, (interval, 1))
dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)

X_shared.set_value(grid_2d)
Y_shared.set_value(dummy_out)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace_perceptron, model=random_walk_perceptron, samples=500)

def create_surface(X, Y, grid, ppc, fig=None, ax=None):
    artists = []
    cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
    contour = ax.contourf(*grid, ppc, cmap=cmap)
    artists.extend(contour.collections)
    artists.append(ax.scatter(X[Y==0, 0], X[Y==0, 1], color='b'))
    artists.append(ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r'))
    _ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X1', ylabel='X2');
    return artists

fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip((list(zip(np.roll(chunk, 1), chunk))[1:]), (list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:])):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].mean(axis=0).reshape(100, 100), fig=fig, ax=ax)
    axs.append(a)
    
anim2 = animation.ArtistAnimation(fig, axs,
                                 interval=1000);
display_animation(anim2)
100%|██████████| 500/500 [00:23<00:00, 24.47it/s]
Out[8]: [animation: the learned decision surface tracking the rotating point clouds over time]

Nice, we can see that the random-walk logistic regression adapts its weights to perfectly separate the two point clouds.

Random-Walk Neural Network

In the previous example, we had a very simple linearly classifiable problem. Can we extend this same idea to non-linear problems and build a Bayesian Neural Network with weights adapting over time?

If you haven't, I recommend you read my original post on Bayesian Deep Learning where I more thoroughly explain how a Neural Network can be implemented and fit in PyMC3.

Let's generate some toy data that is not linearly separable and again rotate it around its center.

In [9]:
from sklearn.datasets import make_moons
X, Y = make_moons(noise=0.2, random_state=0, n_samples=5000)
X = scale(X)

colors = Y.astype(str)
colors[Y == 0] = 'r'
colors[Y == 1] = 'b'

interval = 20
subsample = X.shape[0] // interval
chunk = np.arange(0, X.shape[0]+1, subsample)
degs = np.linspace(0, 360, len(chunk))

sep_lines = []

for ii, (i, j, deg) in enumerate(list(zip(np.roll(chunk, 1), chunk, degs))[1:]):
    theta = np.radians(deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.matrix([[c, -s], [s, c]])

    X[i:j, :] = X[i:j, :].dot(R)
In [28]:
fig, ax = plt.subplots()
ims = []
for i in np.arange(0, len(X), 10):
    ims.append((ax.scatter(X[:i, 0], X[:i, 1], color=colors[:i]),))

ax.set(xlabel='X1', ylabel='X2')
anim = animation.ArtistAnimation(fig, ims,
                                 interval=500, 
                                 blit=True);

display_animation(anim)
Out[28]: [animation: the two half-moons being rotated around their center over time]

Looks a bit like Yin and Yang -- who knew we'd be creating art in the process.

On to the model. Rather than have all the weights in the network follow random-walks, we will just have the first hidden layer change its weights. The idea is that the higher layers learn stable higher-order representations while the first layer is transforming the raw data so that it appears stationary to the higher layers. We can of course also place random-walk priors on all weights, or only on those of higher layers, whatever assumptions you want to build into the model.

In [11]:
np.random.seed(123)

ann_input = theano.shared(X)
ann_output = theano.shared(Y)

n_hidden = [2, 5]

# Initialize random weights between each layer
init_1 = np.random.randn(X.shape[1], n_hidden[0]).astype(theano.config.floatX)
init_2 = np.random.randn(n_hidden[0], n_hidden[1]).astype(theano.config.floatX)
init_out = np.random.randn(n_hidden[1]).astype(theano.config.floatX)
    
with pm.Model() as neural_network:
    # Weights from input to hidden layer
    step_size = pm.HalfNormal('step_size', sd=np.ones(n_hidden[0]), 
                              shape=n_hidden[0])
    
    weights_in_1 = pm.GaussianRandomWalk('w1', sd=step_size, 
                                         shape=(interval, X.shape[1], n_hidden[0]),
                                         testval=np.tile(init_1, (interval, 1, 1))
                                        )
    
    weights_in_1_rep = tt.repeat(weights_in_1, 
                                 ann_input.shape[0] // interval, axis=0)
    
    weights_1_2 = pm.Normal('w2', mu=0, sd=1., 
                            shape=(1, n_hidden[0], n_hidden[1]),
                            testval=init_2)
    
    weights_1_2_rep = tt.repeat(weights_1_2, 
                                ann_input.shape[0], axis=0)
    
    weights_2_out = pm.Normal('w3', mu=0, sd=1.,
                              shape=(1, n_hidden[1]),
                              testval=init_out)
    
    weights_2_out_rep = tt.repeat(weights_2_out, 
                                  ann_input.shape[0], axis=0)
      

    # Build neural-network using tanh activation function
    act_1 = tt.tanh(tt.batched_dot(ann_input, 
                         weights_in_1_rep))
    act_2 = tt.tanh(tt.batched_dot(act_1, 
                         weights_1_2_rep))
    act_out = tt.nnet.sigmoid(tt.batched_dot(act_2, 
                                             weights_2_out_rep))
        
    # Binary classification -> Bernoulli likelihood
    out = pm.Bernoulli('out', 
                       act_out,
                       observed=ann_output)

Hopefully that's not too incomprehensible. It is basically applying the principles from the random-walk logistic regression but adding another hidden layer.

I also want to take the opportunity to look at what the Bayesian approach to Deep Learning offers. Usually, we fit these models using point-estimates like the MLE or the MAP. Let's see how well that works on a structurally more complex model like this one:

In [12]:
import scipy.optimize
with neural_network:
    map_est = pm.find_MAP(fmin=scipy.optimize.fmin_l_bfgs_b)
In [13]:
plt.plot(map_est['w1'].reshape(20, 4));

Some of the weights are changing, maybe it worked? How well does it fit the training data:

In [14]:
ppc = pm.sample_ppc([map_est], model=neural_network, samples=1)
print('Accuracy on train data = {:.2f}%'.format((ppc['out'] == Y).mean() * 100))
100%|██████████| 1/1 [00:00<00:00,  6.32it/s]
Accuracy on train data = 76.64%

Now on to estimating the full posterior, as a proper Bayesian would:

In [15]:
with neural_network:
    trace = pm.sample(1000, tune=200)
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -538.86: 100%|██████████| 200000/200000 [13:06<00:00, 254.43it/s]
Finished [100%]: Average ELBO = -538.69
100%|██████████| 1000/1000 [1:22:05<00:00,  4.97s/it]
In [16]:
plt.plot(trace['w1'][200:, :, 0, 0].T, alpha=.05, color='r');
plt.plot(trace['w1'][200:, :, 0, 1].T, alpha=.05, color='b');
plt.plot(trace['w1'][200:, :, 1, 0].T, alpha=.05, color='g');
plt.plot(trace['w1'][200:, :, 1, 1].T, alpha=.05, color='c');

plt.xlabel('time'); plt.ylabel('weights'); plt.title('Optimal weights change over time'); sns.despine();

That already looks quite different. What about the accuracy:

In [17]:
ppc = pm.sample_ppc(trace, model=neural_network, samples=100)
print('Accuracy on train data = {:.2f}%'.format(((ppc['out'].mean(axis=0) > .5) == Y).mean() * 100))
100%|██████████| 100/100 [00:00<00:00, 112.04it/s]
Accuracy on train data = 96.72%

I think this is worth highlighting. The point-estimate did not do well at all, but by estimating the whole posterior we were able to model the data much more accurately. I'm not quite sure why that is the case. It's possible that we either did not find the true MAP because the optimizer can't deal with the correlations in the posterior as well as NUTS can, or that the MAP is just not a good point estimate. See my other blog post on hierarchical models for why the MAP is a terrible choice for some models.

On to the fireworks. What does this actually look like:

In [18]:
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
grid_2d = np.tile(grid_2d, (interval, 1))
dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)

ann_input.set_value(grid_2d)
ann_output.set_value(dummy_out)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)

fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip((list(zip(np.roll(chunk, 1), chunk))[1:]), (list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:])):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].mean(axis=0).reshape(100, 100), fig=fig, ax=ax)
    axs.append(a)
    
anim2 = animation.ArtistAnimation(fig, axs,
                                  interval=1000);
display_animation(anim2)
100%|██████████| 500/500 [00:58<00:00,  7.82it/s]
Out[18]: [animation: the network's decision surface tracking the rotating half-moons over time]

Holy shit! I can't believe that actually worked. Just for fun, let's also make use of the fact that we have the full posterior and plot the uncertainty of our predictions (the background now encodes the posterior standard deviation, where red means high uncertainty).

In [19]:
fig, ax = plt.subplots()
chunk = np.arange(0, X.shape[0]+1, subsample)
chunk_grid = np.arange(0, grid_2d.shape[0]+1, 10000)
axs = []
for (i, j), (i_grid, j_grid) in zip((list(zip(np.roll(chunk, 1), chunk))[1:]), (list(zip(np.roll(chunk_grid, 1), chunk_grid))[1:])):
    a = create_surface(X[i:j], Y[i:j], grid, ppc['out'][:, i_grid:j_grid].std(axis=0).reshape(100, 100), 
                       fig=fig, ax=ax)
    axs.append(a)

anim2 = animation.ArtistAnimation(fig, axs,
                                  interval=1000);
display_animation(anim2)
Out[19]: [animation: posterior predictive standard deviation over time; red regions indicate high uncertainty]

Conclusions

In this blog post I explored the possibility of extending Neural Networks in ways that are (to my knowledge) new, enabled by expressing them in a Probabilistic Programming framework. Using a classic point-estimate did not provide a good fit for the data; only full posterior inference using MCMC allowed us to fit this model adequately. What is quite nice is that we did not have to do anything special for the inference in PyMC3: just calling pymc3.sample() gave stable results on this complex model.

Initially I built the model allowing all parameters to change, but realizing that we can selectively choose which layers to change felt like a profound insight. If you expect the raw data to change, but the higher-level representations to remain stable, as was the case here, we allow the bottom hidden layers to change. If we instead imagine e.g. handwriting recognition, where your handwriting might change over time, we would expect lower level features (lines, curves) to remain stable but allow changes in how we combine them. Finally, if the world remains stable but the labels change, we would place a random-walk process on the output layer. Of course, if you don't know, you can have every layer change its weights over time and give each one a separate step-size parameter which would allow the model to figure out which layers change (high step-size), and which remain stable (low step-size).

In terms of quantitative finance, this type of model allows us to train on much larger data sets reaching back a long time. A lot of that data is still useful for building up stable hidden representations, even if for prediction you still want your model to use its most up-to-date state of the world. No need to define a window length or discard valuable training data.

In [24]:
%load_ext watermark
%watermark -v -m -p numpy,scipy,sklearn,theano,pymc3,matplotlib
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
CPython 3.6.0
IPython 5.1.0

numpy 1.11.3
scipy 0.18.1
sklearn 0.18.1
theano 0.9.0beta1.dev-9f1aaacb6e884ebcff9e249f19848db8aa6cb1b2
pymc3 3.0
matplotlib 2.0.0

compiler   : GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)
system     : Darwin
release    : 16.4.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit

by Thomas Wiecki at March 14, 2017 02:00 PM

Matthieu Brucher

AudioToolkit: creating a simple plugin with WDL-OL

Audio Toolkit was started several years ago; there are now more than a dozen plugins based on the platform, as well as applications using it, but I never wrote a tutorial explaining how to use it. Users had to find out for themselves. This changes today.

Building Audio Toolkit

Let’s start with building Audio ToolKit. It uses CMake to ease the pain of supporting several platforms, although you can build it without CMake if you generate config.h yourself.

You will require Boost, Eigen and FFTW if you want to test the library and ensure that everything is all right.

Windows

Windows may be the most complicated platform. This stems from the fact that the runtime is different for each version of the Microsoft compiler (except after 2015), and usually that’s not the one you have with your DAW (and thus probably not the one you have with your users’ DAW).

So the first question is which kind of build you need. For a plugin, I think it is clearly the static runtime that you require; for an app, I would suggest the dynamic runtime. For this, in the CMake GUI, set MSVC_RUNTIME to Static or Dynamic. Enable the matching output as well: static libraries for a plugin and shared libraries for an application.

Note that tests require the shared libraries.

macOS Sierra/OS X

On OS X, just create the default Xcode project; you may also want to generate ATK with CMAKE_OSX_ARCHITECTURES set to i386 to get a 32-bit version, or to x86_64, for a universal binary (I’ll use i386 in this tutorial).

The same rules for static/shared apply here.

Linux

For Linux, I don’t have plugin support in WDL-OL, but suffice it to say that the ideas in the next section are the relevant ones.

Building a plugin with WDL-OL

I’ll use the same simple code to generate a plugin that does more or less nothing except copy data from the input to the output.

Common code

Start by using the duplicate.py script to create your own plugin. Use a “1-1” PLUG_CHANNEL_IO value to create a mono plugin (this is in resource.h). More advanced configurations can be seen on the ATK plugin repository.

Now, we need an input and an output filter for our pipeline. Let’s add them to our plugin class:

#include <ATK/Core/InPointerFilter.h>
#include <ATK/Core/OutPointerFilter.h>

and new members:

  ATK::InPointerFilter<double> inFilter;
  ATK::OutPointerFilter<double> outFilter;

Now, in the initialization list of the plugin’s constructor, add the following:

inFilter(nullptr, 1, 0, false), outFilter(nullptr, 1, 0, false)

Then, in the constructor body, connect the two filters and reset the plugin:

  outFilter.set_input_port(0, &inFilter, 0);
  Reset();

This is required to set up the pipeline and initialize the internal variables.
In Reset(), put the following:

  int sampling_rate = GetSampleRate();
 
  if(sampling_rate != outFilter.get_output_sampling_rate())
  {
    inFilter.set_input_sampling_rate(sampling_rate);
    inFilter.set_output_sampling_rate(sampling_rate);
    outFilter.set_input_sampling_rate(sampling_rate);
    outFilter.set_output_sampling_rate(sampling_rate);
  }

This ensures that all the sampling rates are consistent. Even if this is not strictly required for a copy pipeline, for EQs and modeling filters it is mandatory. ATK also requires the pipeline to be consistent, so you can’t connect filters that don’t have matching input/output sampling rates. Some filters can change rates, like the oversampling and undersampling ones, but they are the exception, not the rule.

And now, the only thing that remains is to actually trigger the pipeline:

  inFilter.set_pointer(inputs[0], nFrames);
  outFilter.set_pointer(outputs[0], nFrames);
  outFilter.process(nFrames);

Now, the WDL-OL projects must be adapted.

Windows

In both cases, it is quite straightforward: set include paths and libraries for the link stage.

For Windows, you need to have a matching ATK build for Debug/Release. In the project properties, add the ATK include folder under Project->Properties->C++->Preprocessor->AdditionalIncludeDirectories.

Then, on the linker page, add the ATK libraries you require (Project->Properties->Link->AdditionalDependencies) for each configuration.

macOS Sierra/OS X

On OS X, it is easier: add the include/library folders to ADDITIONAL_INCLUDES and ADDITIONAL_LIBRARY_PATHS.

The second step is to add the libraries to the project by adding them to the Link Binary With Libraries list for each target you want to build.

Conclusion

That’s it!

In the end, I hope that I showed that it is easy to build something with Audio ToolKit.


by Matt at March 14, 2017 08:49 AM

Enthought

Webinar: Using Python and LabVIEW Together to Rapidly Solve Engineering Problems


When: On-Demand (Live webcast took place March 28, 2017)
What: Presentation, demo, and Q&A with Collin Draughon, Software Product Manager, National Instruments, and Andrew Collette, Scientific Software Developer, Enthought

View Now  If you missed the live session, fill out the form to view the recording!


Engineers and scientists all over the world are using Python and LabVIEW to solve hard problems in manufacturing and test automation, by taking advantage of the vast ecosystem of Python software.  But going from an engineer’s proof-of-concept to a stable, production-ready version of Python, smoothly integrated with LabVIEW, has long been elusive.

In this on-demand webinar and demo, we take a LabVIEW data acquisition app and extend it with Python’s machine learning capabilities, to automatically detect and classify equipment vibration.  Using a modern Python platform and the Python Integration Toolkit for LabVIEW, we show how easy and fast it is to install heavy-hitting Python analysis libraries, take advantage of them from live LabVIEW code, and finally deploy the entire solution, Python included, using LabVIEW Application Builder.
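For a sense of what the Python side of such a workflow can look like (purely illustrative, and not the code from the webinar), a plain function like the following is the kind of thing the Toolkit can call from LabVIEW: it is trained on synthetic "normal" and "faulty" spectra and classifies a waveform passed in from LabVIEW. All names and data here are made up.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
t = np.linspace(0, 1, 1024)

def features(waveform):
    return np.abs(np.fft.rfft(waveform))[:64]          # crude fixed-length spectral features

normal = [np.sin(2 * np.pi * 50 * t) + 0.1 * rng.randn(t.size) for _ in range(50)]
faulty = [np.sin(2 * np.pi * 120 * t) + 0.1 * rng.randn(t.size) for _ in range(50)]
X = np.array([features(w) for w in normal + faulty])
y = np.array([0] * 50 + [1] * 50)
clf = RandomForestClassifier(random_state=0).fit(X, y)

def classify_vibration(waveform):
    """Callable from LabVIEW via the Toolkit: returns 0 (normal) or 1 (faulty)."""
    return int(clf.predict(features(np.asarray(waveform)).reshape(1, -1))[0])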



In this webinar, you’ll see how easy it is to solve an engineering problem by using LabVIEW and Python together.

What You’ll Learn:

  • How Python’s machine learning libraries can simplify a hard engineering problem
  • How to extend an existing LabVIEW VI using Python analysis libraries
  • How to quickly bundle Python and LabVIEW code into an installable app

Who Should Watch:

  • Engineers and managers interested in extending LabVIEW with Python’s ecosystem
  • People who need to easily share and deploy software within their organization
  • Current LabVIEW users who are curious what Python brings to the table
  • Current Python users in organizations where LabVIEW is used

How LabVIEW users can benefit from Python:

  • High-level, general purpose programming language ideally suited to the needs of engineers, scientists, and analysts
  • Huge, international user base representing industries such as aerospace, automotive, manufacturing, military and defense, research and development, biotechnology, geoscience, electronics, and many more
  • Tens of thousands of available packages, ranging from advanced 3D visualization frameworks to nonlinear equation solvers
  • Simple, beginner-friendly syntax and fast learning curve

View Now  If you missed the live webcast, fill out the form to view the recording

Presenters:

Collin Draughon, National Instruments, Software Product Manager
Andrew Collette, Enthought, Scientific Software Developer and Python Integration Toolkit for LabVIEW core developer

FAQs and Additional Resources

Python Integration Toolkit for LabVIEW

Quickly and efficiently access scientific and engineering tools for signal processing, machine learning, image and array processing, web and cloud connectivity, and much more. With only minimal coding on the Python side, this extraordinarily simple interface provides access to all of Python’s capabilities.

  • What is the Python Integration Toolkit for LabVIEW?

The Python Integration Toolkit for LabVIEW provides a seamless bridge between Python and LabVIEW. With fast two-way communication between environments, your LabVIEW project can benefit from thousands of mature, well-tested software packages in the Python ecosystem.

Run Python and LabVIEW side by side, and exchange data live. Call Python functions directly from LabVIEW, and pass arrays and other numerical data natively. Automatic type conversion virtually eliminates the “boilerplate” code usually needed to communicate with non-LabVIEW components.

Develop and test your code quickly with Enthought Canopy, a complete integrated development environment and supported Python distribution included with the Toolkit.

  • What is LabVIEW?

LabVIEW is a software platform made by National Instruments, used widely in industries such as semiconductors, telecommunications, aerospace, manufacturing, electronics, and automotive for test and measurement applications. In August 2016, Enthought released the Python Integration Toolkit for LabVIEW, which is a “bridge” between the LabVIEW and Python environments.

  • Who is Enthought?

Enthought is a global leader in software, training, and consulting solutions using the Python programming language.

The post Webinar: Using Python and LabVIEW Together to Rapidly Solve Engineering Problems appeared first on Enthought Blog.

by admin at March 14, 2017 05:00 AM

numfocus

Technical preview: Native GPU programming with CUDAnative.jl (Julia)

This post originally appeared on the Julialang.org blog (14 Mar 2017, by Tim Besard). After 2 years of slow but steady development, we would like to announce the first preview release of native GPU programming capabilities for Julia. You can now write your CUDA kernels in Julia, albeit with some restrictions, making it possible to use […]

by NumFOCUS Staff at March 14, 2017 12:00 AM

Some fun with π in Julia

This post originally appeared on the Julialang.org blog (14 Mar 2017, by Simon Byrne, Luis Benet and David Sanders) and is also available as a Jupyter notebook here. π in Julia (Simon Byrne): Like most technical languages, Julia provides a variable constant for π. However Julia’s handling is a […]

by NumFOCUS Staff at March 14, 2017 12:00 AM