December 12, 2017

Continuum Analytics

Anaconda Welcomes Lars Ewe as SVP of Engineering

Anaconda, Inc., provider of the most popular Python data science platform, today announced Lars Ewe as the company’s new senior vice president (SVP) of engineering. With more than 20 years of enterprise engineering experience, Ewe brings a strong foundation in big data, real-time analytics and security. He will lead the Anaconda Enterprise and Anaconda Distribution engineering teams.

by Team Anaconda at December 12, 2017 02:00 PM

December 11, 2017

Jake Vanderplas

Optimization of Scientific Code with Cython: Ising Model

Python is quick and easy to code, but can be slow when doing intensive numerical operations. Translating code to Cython can be helpful, but in most cases requires a bit of trial and error to achieve the optimal result. Cython's tutorials contain a lot of information, but for iterative workflows like optimization with Cython, it's often useful to see it done "live".

For that reason, I decided to record some screencasts showing this iterative optimization process, using an Ising Model, as an example application.
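For readers who haven't seen the model before, here is a minimal pure-Python/NumPy sketch of a single Metropolis sweep over a 2D spin field. It is a generic illustration of the kind of loop-heavy code that benefits from Cython, not the notebook's actual code:

import numpy as np

def ising_sweep(field, beta=0.4):
    """One Metropolis sweep over a 2D array of +/-1 spins, in pure Python loops."""
    N, M = field.shape
    for n in range(N):
        for m in range(M):
            # sum of the four nearest neighbours, with periodic boundaries
            total = (field[(n + 1) % N, m] + field[(n - 1) % N, m] +
                     field[n, (m + 1) % M] + field[n, (m - 1) % M])
            dE = 2 * field[n, m] * total
            # flip the spin if it lowers the energy, or with Boltzmann probability
            if dE <= 0 or np.exp(-beta * dE) > np.random.rand():
                field[n, m] *= -1
    return field

spins = np.random.choice([-1, 1], size=(50, 50))
ising_sweep(spins)

The doubly nested Python loop over the grid is exactly the kind of hot spot where Cython's static typing pays off.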

by Jake VanderPlas at December 11, 2017 08:00 PM

Filipe Saraiva

KDE Edu Sprint 2017

Two months ago I attended the KDE Edu Sprint 2017 in Berlin. It was my first KDE sprint (really, I have been contributing code to KDE software since 2010 and had never been to a sprint!), so I was really excited about the event.

KDE Edu is the umbrella for KDE's educational software. There are a lot of applications under it, and it is the main educational software suite in the free software world. Despite this, KDE Edu has received little attention on the organizational side: for instance, the previous KDE Edu sprint took place several years ago, our website has some problems, and more.

Therefore, this sprint was an opportunity not only for developers to work on software development, but also to work on the organizational side.

On the organizational side, we discussed the rebranding of some software that is more related to university work than to “education” itself, like Cantor and LabPlot. There was a wish to create something like KDE Research/Science in order to put software like them, and others such as Kile and KBibTeX, under the same umbrella. There is an ongoing discussion about this theme.

Another topic was the discussion of a new website, oriented more towards teaching how to use KDE software in an educational context than towards presenting a set of applications. In fact, I think we need to do this and strengthen the “KDE Edu” brand in order to have a specific icon and link on the KDE products page.

Next, the developers at the sprint agreed on a multi-operating-system policy for KDE Edu: KDE software can be built and distributed to users of several operating systems, not only Linux. During the sprint some developers worked on installers for Windows and Mac OS, on porting applications to Android, and on creating distribution-independent installers for Linux using Flatpak.

Besides these discussions, I worked on a rule to send an e-mail to the KDE Edu mailing list for each new Differential Revision of KDE Edu software in Phabricator. Sorry devs, our mailboxes are full of e-mails because of me.

On the development side, my focus was working hard on Cantor. First, I did some task triage on our workboard: closing, opening, and adding more information to some tasks. Secondly, I reviewed some work done by Rishabh Gupta, my student during GSoC 2017. He ported the Lua and R backends to QProcess and it will be available soon.

After that I worked on porting the Python 3 backend to the Python/C API. This work is in progress and I expect to finish it in time for the 18.04 release.

Of course, besides all this work we had fun with some beers and German food (and some American, Chinese, Arab, and Italian food as well)! I was happy because my 31st birthday fell on the first day of the sprint, so thank you KDE for coming to my birthday party full of code, good beers, and pork dishes. 🙂

To finish, it is always a pleasure to meet the gearheads: my Spanish friends Albert and Aleix, Timothée (the only other Mageia user I have ever met in person), my GSoC student Rishabh, my Brazilian brother Sandro, and new friends Sanjiban and David.

Thank you KDE e.V. for providing resources for the sprint, and thank you Endocode for hosting it.

by Filipe Saraiva at December 11, 2017 03:22 PM

December 10, 2017

Titus Brown

The #CommonsPilot kicks off!!

(Just in case it's not clear, I do not speak for the NIH or for the Data Commons Pilot Phase Consortium in this blog post! These are my own views and perspectives, as always.)

I'm just coming back from the #CommonsPilot kickoff meeting. This was our first face-to-face meeting on the Data Commons Pilot effort, which is a new trans-NIH effort.

The Data Commons Pilot started with the posting of a funding call to assemble a Pilot Phase Consortium using a little-known NIH funding mechanism called an "Other Transactions" agreement. This is a fundamentally different award system from grants, contracts, and cooperative agreements: it lets the NIH interact closely with awardees, adjust funding as needed on a very short time scale, and otherwise gives the NIH a level of engagement with the actual work that I've never seen before.

I, along with many others, applied to this funding call (I'll be posting our initial application soon!) and after many trials and tribbleations I ended up being selected to work on training and outreach. Since then I've also become involved in internal coordination, which dovetails nicely with the training/outreach role.

The overall structure of the Data Commons Pilot Phase Consortium is hard to explain and not fully worked out yet, but we have a bunch of things that kind of resemble focus groups, called "Key Capabilities", that correspond to elements of the funding call -- we've put together a draft Web site that lists them all. For example, Key Capability 2 is "GUIDs" - this group of nice people is going to be concerned with identification of "objects" in Data Commonses. Likewise, there's a Scientific Use Cases KC (KC8) that is focused on what researchers and clinicians actually want to do.

(The complete list of awardees is here.)

This kickoff meeting was ...interesting. There were about 100 people (NIH folk, data stewards, OT awardees, cloud providers, and others) at the meeting, and the goal was to dig in to what we actually needed to do during the first 180 days of this effort - aka Pilot Phase I. (Stay tuned on that front.) We managed to put together something that was more "Unconference style" than the typical NIH organizational meeting, and this resulted in what I would call "chaos lite", which was not uniformly enjoyable but also not uniformly miserable. I'm not sure how close we came to actually nailing down what we needed to do, but we are certainly closer to it than we were before!

So... really, what is a Data Commons?

No one really knows, in detail. Let's start there!

(I recalled that Cameron Neylon had written about this, and a quick google search found this post from 2008. (I find it grimly amusing how many of the links in his blog post no longer work...) Some pretty good stuff in there!) I don't know of earlier mentions of the Commons, but a research commons has been under discussion for about a decade.

What is clear from my 2017 vantage point is that a data commons should provide some combination of tools, data, and compute infrastructure, so that people can bring their own tools and bring their own data and combine it with other tools and other data to do data analysis. In the context of a biomedical data commons we have to also be cognizant of legal and ethical issues surrounding access to and use of controlled data, which was a pretty big topic for us (there's a whole Key Capability devoted just to that - see KC6!)

There are, in fact, many biomedical data commons efforts - e.g. the NCI Genomic Data Commons, which shares a number of participants with the #CommonsPilot, and others that I discovered just this week (e.g. the Analysis Commons). So this Data Commons (#CommonsPilot, to be clear) is just one of many. And I think that has interesting implications that I'm only beginning to appreciate.

Something else that has changed since Cameron's 2008 blog post is the power and ubiquity of the cloud platforms. "Cloud" is now an everyday word, and many researchers, academic and industry and nonprofit, are using it every day. So it has become much clearer that cloud is one future of biomedical compute, if not the only one.

(I would like to make it clear that Bitcoin is not part of the #CommonsPilot effort. Just in case anyone was wondering how buzzword compliant we were going to try to be.)

But this still leaves us at a bit of an impasse. OK, so we're talking about tools, data, and compute infrastructure... in the cloud... that leaves a lot of room :).

Things that I haven't mentioned yet, but that are explicitly or implicitly part of the Commons Pilot effort as I see it:

  • openness. We must build an open platform to enable a true commons that is accessible to everyone, setting aside issues of controlled data access. See: Commons.

  • eventual community governance, in some shape or form. (Geoff Bilder, Jennifer Lin, and Cameron Neylon cover this brilliantly in their Principles for Open Scholarly Infrastructure.)

  • multi-tenant. This isn't going to run just on one cloud provider, or one HPC.

  • platform mentality. This is gonna have to be a platform, folks, and we're gonna have to dogfood it. (obligatory link to Yegge rant)

  • larger than any one funding organization. This is necessary for long-term sustainability reasons, but also is an important requirement for a Commons in the first place. There may be disproportionate inputs from certain funders at the beginning, but ultimately I suspect that any Commons will need to be a meeting place for research writ large - which inevitably means not only NIH funded researchers, not just US researchers, but researchers world wide.

I haven't quite wrapped my head around the scope that these various requirements imply, but I think it becomes quite interesting in its implications. More on that as I noodle.

Why are Commonses needed, and what would a successful #CommonsPilot enable?

Perhaps my favorite section of the #CommonsPilot meeting was the brainstorming bit around why we needed a Commons, and what this effort could enable (as part of the larger Commons ecosystem). Here, in no particular order, is what we collectively came up with. (David Siedzik ran the session very capably!)

(As I write up the list below, I'd like to point out that this is really very incomplete. We only did this exercise for about 30 minutes, and many important issues were raised afterwards that weren't captured by this exercise. So this is definitely incomplete and moreover only reflects my memory and notes. Riff on it as you will in comments!)

  • The current scale of data overwhelms naive/simple platforms.
  • The #CommonsPilot must enable access to restricted data in a more uniform way, such that e.g. cross-data set integration becomes more possible.
  • The #CommonsPilot must have a user interface for browsing and exploratory investigation.
  • The #CommonsPilot will enable alignments of approaches across data sets.
  • Integration of tools and data is much easier in a #Commons.
  • Distribution and standardization of tools, data formats, and metadata will enhance robustness of analyses
  • There will be a community of users that will drive extensions to and enhancement of the platform over time.
  • Time to results will decrease as we more and more effectively employ compute across large clouds, and reuse previous results.
  • We expect standardization around formats and approaches, writ large (that is, the #CommonsPilot will contribute significantly to the refinement and deployment of standards and conventions).
  • The #CommonsPilot will expand accessibility of tools, compute, and data to many more scientists.
  • We hope to reduce redundant and repeated analyses where it makes sense.
  • Methods sharing will happen more, if we are successful!
  • Lower costs to data analysis, and lower barriers to doing research as a result.
  • An enhanced ability to share in new ways that we can't fully appreciate yet.
  • A resulting encouragement and development of new types of questions and inquiry.
  • Enhanced sustainability of data as a currency in research.
  • We hope to enhance and extend the life cycle of data.
  • We hope to enable comparison and benchmarking of approaches on a common platform.
  • We hope to help shape policy by demonstrating the value of cloud, and the value of open.
  • More ethical and effective use of more (all!?) data
  • More robust security/auditing of data access and tool use.
  • Enhanced training and documentation around responsible conduct of computational research.

So as you can see it's a pretty unambitious effort and I wouldn't be at all surprised if we were done in a year.

I'd love to explore these issues in comments or in blog posts that other people write about why we're wrong, or incomplete, or short-sighted, or too visionary. Fair game, folks - go for it!

How open are we gonna be about all of this?

That's a good question and I have two answers:

  1. We are hoping to be more open than ever before. As a sign of this, Vivien Bonazzi claims that at least one loud-mouthed open science advocate is involved in the effort. I'll let you know who it is when I find out myself.

  2. Not as open as I'd like to be, and for good reasons. While this effort is partly about building a platform for community, and community engagement will be an intrinsic part of this effort (more on that, sooner or later!), there are contractual issues and NIH requirements that need to be met. Moreover, we need to thread the needle of permitting internal frank discussions while promoting external engagement.

So we'll see!

--titus

by C. Titus Brown at December 10, 2017 11:00 PM

December 06, 2017

Matthew Rocklin

Dask Development Log

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Current development in Dask and Dask-related projects includes the following efforts:

  1. A possible change to our community communication model
  2. A rewrite of the distributed scheduler
  3. Kubernetes and Helm Charts for Dask
  4. Adaptive deployment fixes
  5. Continued support for NumPy and Pandas’ growth
  6. Spectral Clustering in Dask-ML

Community Communication

Dask community communication generally happens in Github issues for bug and feature tracking, the Stack Overflow #dask tag for user questions, and an infrequently used Gitter chat.

Separately, Dask developers who work for Anaconda Inc (there are about five of us part-time) use an internal company chat and a closed weekly video meeting. We’re now trying to migrate away from closed systems when possible.

Details about future directions are in dask/dask #2945. Thoughts and comments on that issue would be welcome.

Scheduler Rewrite

When you start building clusters with 1000 workers the distributed scheduler can become a bottleneck on some workloads. After working with PyPy and Cython development teams we’ve decided to rewrite parts of the scheduler to make it more amenable to acceleration by those technologies. Note that no actual acceleration has occurred yet, just a refactor of internal state.

Previously the distributed scheduler was focused around a large set of Python dictionaries, sets, and lists that indexed into each other heavily. This was done both to keep the code low-tech and for performance reasons (Python core data structures are fast). However, compiler technologies like PyPy and Cython can optimize Python object access down to C speeds, so we’re experimenting with switching away from Python data structures to Python objects to see how much this helps.
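As a rough illustration of the refactor (the names below are made up for this sketch and are not the actual scheduler classes), the change looks something like this:

# before: scheduler state scattered across plain dicts keyed by task name
task_state = {}      # key -> 'waiting' | 'processing' | 'memory'
dependencies = {}    # key -> set of keys this task depends on
nbytes = {}          # key -> size of the result, if known

# after: one Python object per task, so compilers like Cython or PyPy can
# turn attribute access into fast C-level field access
class TaskState(object):
    __slots__ = ('key', 'state', 'dependencies', 'nbytes')

    def __init__(self, key):
        self.key = key
        self.state = 'waiting'
        self.dependencies = set()
        self.nbytes = None

tasks = {'x': TaskState('x')}  # a single dict of richer objects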

This change will be invisible operationally (the full test suite remains virtually unchanged), but will be a significant change to the scheduler’s internal state. We’re keeping around a compatibility layer, but people who were building their own diagnostics around the internal state should check out the new changes.

Ongoing work by Antoine Pitrou in dask/distributed #1594

Kubernetes and Helm Charts for Dask

In service of the Pangeo project to enable scalable data analysis of atmospheric and oceanographic data we’ve been improving the tooling around launching Dask on Cloud infrastructure, particularly leveraging Kubernetes.

To that end we’re making some flexible Docker containers and Helm Charts for Dask, and hope to combine them with JupyterHub in the coming weeks.

Work done by myself in the following repositories. Feedback would be very welcome. I am learning on the job with Helm here.

If you use Helm on Kubernetes then you might want to try the following:

helm repo add dask https://dask.github.io/helm-chart
helm update
helm install dask/dask

This installs a full Dask cluster and a Jupyter server. The Docker containers contain entry points that allow their environments to be updated with custom packages easily.

This work extends prior work on the previous package, dask-kubernetes, but is slightly more modular for use alongside other systems.

Adaptive deployment fixes

Adaptive deployments, where a cluster manager scales a Dask cluster up or down based on the current workload, recently got a makeover, including a number of bug fixes around odd or infrequent behavior.

Work done by Russ Bubley here:

Keeping up with NumPy and Pandas

NumPy 1.14 is due to release soon. Dask.array had to update how it handled structured dtypes in dask/dask #2694 (Work by Tom Augspurger).

Dask.dataframe is gaining the ability to merge/join simultaneously on columns and indices, following a similar feature released in Pandas 0.22. Work done by Jon Mease in dask/dask #2960

Spectral Clustering in Dask-ML

Dask-ML recently added an approximate and scalable Spectral Clustering algorithm in dask/dask-ml #91 (gallery example).

December 06, 2017 12:00 AM

December 05, 2017

Continuum Analytics

How to Get Ready for the Release of conda 4.4

As the year winds down it’s time to say out with the old and in with the new. Well, conda is no different. What does conda 4.4 have in store for you? Say goodbye to “source activate” in conda. That is so 2017. With conda 4.4 you can snappily “conda activate” and “conda deactivate” your …
Read more →

by Rory Merritt at December 05, 2017 10:06 PM

Jake Vanderplas

Installing Python Packages from a Jupyter Notebook

In software, it's said that all abstractions are leaky, and this is true for the Jupyter notebook as it is for any other software. I most often see this manifest itself with the following issue:

I installed package X and now I can't import it in the notebook. Help!

This issue is a perennial source of StackOverflow questions (e.g. this, that, here, there, another, this one, that one, and this... etc.).

Fundamentally the problem is usually rooted in the fact that the Jupyter kernels are disconnected from Jupyter's shell; in other words, the installer points to a different Python version than is being used in the notebook. In the simplest contexts this issue does not arise, but when it does, debugging the problem requires knowledge of the intricacies of the operating system, the intricacies of Python package installation, and the intricacies of Jupyter itself. In other words, the Jupyter notebook, like all abstractions, is leaky.
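A quick way to see the mismatch for yourself is to compare the interpreter running the kernel with the interpreter your installer targets. The snippet below is a diagnostic sketch along those lines, and the pip invocation at the end is the usual workaround of pointing pip at the kernel's own Python:

# run this in a notebook cell
import sys
print(sys.executable)   # the Python actually running the kernel

# compare with `which pip` / `which conda` in your terminal; if they point
# at a different environment, a plain `pip install foo` won't be visible
# from the notebook.  One workaround is to target the kernel's interpreter:
#
#     !{sys.executable} -m pip install <package>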

In the wake of several discussions on this topic with colleagues, some online (exhibit A, exhibit B) and some off, I decided to treat this issue in depth here. This post will address a couple things:

  • First, I'll provide a quick, bare-bones answer to the general question, how can I install a Python package so it works with my jupyter notebook, using pip and/or conda?

  • Second, I'll dive into some of the background of exactly what the Jupyter notebook abstraction is doing, how it interacts with the complexities of the operating system, and how you can think about where the "leaks" are, and thus better understand what's happening when things stop working.

  • Third, I'll talk about some ideas the community might consider to help smooth over these issues, including some changes that the Jupyter, Pip, and Conda developers might consider to ease the cognitive load on users.

This post will focus on two approaches to installing Python packages: pip and conda. Other package managers exist (including platform-specific tools like yum, apt, homebrew, etc., as well as cross-platform tools like enstaller), but I'm less familiar with them and won't be remarking on them further.

by Jake VanderPlas at December 05, 2017 05:00 PM

Leonardo Uieda

GMT and open-source at #AGU17 and a GMT/Python online demo


The AGU Fall Meeting is happening next week in New Orleans, potentially gathering more than 20,000 geoscientists in a single place. Me and Paul will be there to talk about the next version of the Generic Mapping Tools, my work on GMT/Python, and the role of open-source software in the Geosciences.

There is so much going on at AGU that it can be daunting just to browse the scientific program. I haven't even started and my calendar is already packed. For now, I'll just share the sessions and events in which I'm taking part.

Earth ArXiv meetup

Thursday evening - TBD

The Earth ArXiv logo

The Earth ArXiv is a brand new community developed preprint server for the Earth and Planetary Sciences. Me and some other folks who are involved will get together for dinner/drinks on Thursday to nerd-out offline for a change.

If you are interested in getting involved in Earth ArXiv, join the Loomio group and the ESIP Slack channel and say "Hi". The community is very welcoming and it needs all the help it can get to grow.

We don't know where we'll meet yet but keep posted on Slack and Loomio if you're interested in joining us.

Panel session in the AGU Data Fair

Wednesday 12:30pm - Room 203

I was invited to be a panelist on the Data Capacity Building session of the AGU Data Fair. The fair has other very interesting panels happening throughout the week. They all center around "data": where to get it, what to do with it, how to preserve it, and how to give and receive credit for it.

We'll be discussing what to do with the data once you acquire it. From the panel description:

The panel will discuss the challenges the researcher faces and how methods for managing data are currently available or are expected in the future that will help the researcher build value and capacity in the research data lifecycle.

The discussion will be in an Ask-Me-Anything style (AMA) with moderated questions from the audience (on and offline). If you have any questions that you want us to tackle, tweet them using the hashtag #AGUDataCapacity. They'll be added to a list for the moderators.

I'm really looking forward to this panel and getting to meet some new people in the process.

Paul's talk about GMT6

Thursday 4:15pm - Room 228

Paul is giving the talk The Generic Mapping Tools 6: Classic versus Modern Mode at the Challenges and Benefits of Open-Source Software and Open Data session. He'll be showcasing the new changes that are coming to GMT6, including "modern mode" and a new gmt subplot command. These are awesome new features of GMT aimed at making it more accessible to new users. For all the GMT gurus out there: don't worry, they're also a huge time saver, eliminating many repeated command-line options and boilerplate code.

Panel session about open-source software

Thursday 4-6pm - Room 238

I'll also be a panelist on the session Open-Source Software in the Geosciences. The lineup of panelists is amazing and I'm honored to be included among them. It'll be hard to contain the fan-boy in me. I wonder if geophysicists are used to getting asked for autographs.

The discussion will center around the role of open-source software in our science, how it's affected the careers of those who make it, and what we can do to make it a viable career path for new geoscientists.

My contribution is the abstract "Nurturing reliable and robust open-source scientific software".

Many thanks to the chairs and conveners for putting it together. I'll surely have a lot more to say after the panel.

Poster about GMT/Python

Friday morning - Poster Hall D-F

Last but not least, I'll be presenting the poster "A modern Python interface for the Generic Mapping Tools" about my work on GMT/Python. Come see the poster and chat with me and Paul! I'd love to hear what you want to see in this software. I'll also have a laptop and tablets for you to play around with a demo.

My AGU 2017 poster

You can download a PDF of the poster from figshare at doi:10.6084/m9.figshare.5662411.

A lot has happened since my last update after Scipy2017. Much of the infrastructure work to interface with the C API is done, but there is still a lot to do. Luckily, we just got our first code contributor last week, so it looks like I'll have some help!

You can try out the latest features in an online demo Jupyter notebook by visiting agu2017demo.gmtpython.xyz

The notebook is running on the newly re-released mybinder.org service. The Jupyter team did an amazing job!

Come say "Hi"

If you'll be at AGU next week, stop by the poster on Friday or join the panel sessions if you want to chat or have any questions/suggestions. If you won't, there is always Twitter and the Software Underground Slack group.

See you in New Orleans!


The photo of Bourbon Street in the thumbnail is copyright Chris Litherland and licensed CC-BY-SA.


Comments? Leave one below or let me know on Twitter @leouieda.

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!


December 05, 2017 12:00 PM

December 04, 2017

Continuum Analytics

Anaconda Training: A Learning Path for Data Scientists

Here at Anaconda, our mission has always been to make the art of data science accessible to all. We strive to empower people to overcome technical obstacles to data analysis so they can focus on asking better questions of their data and solving actual, real-world problems. With this goal in mind, we’re excited to announce …
Read more →

by Rory Merritt at December 04, 2017 10:00 PM

December 01, 2017

Titus Brown

Four steps in five minutes to deploy a Carpentry lesson for a class of 30

binder is an awesome technology for making GitHub repositories "executable". With binder 2.0, this is now newly stable and feature-full!

Yesterday I gave a two hour class, introducing graduate students to things like Jupyter Notebook and Pandas, via the Data Carpentry Python for Ecologists lesson. Rather than having everyone install their own software on their laptop (hah!), I decided to give them a binder!

For this purpose, I needed a repository that contained the lesson data and told binder how to install pandas and matplotlib.

Since I'm familiar with GitHub and Python requirements.txt files, it took me about 5 minutes. And the class deployment was flawless!

Building the binder repo

  1. Create a github repo (https://github.com/ngs-docs/2017-davis-ggg201a-day1), optionally with a README.md file.

  2. Upload surveys.csv into the github repository (I got this file as per Data Carpentry's Python for Ecology lesson).

  3. Create a requirements.txt file containing:

pandas
numpy
matplotlib

-- this tells binder to install those things when running this repository.

  4. Paste the GitHub URL into the 'URL' entry box at mybinder.org and click 'launch'.

Two optional steps

These steps aren't required but make life nicer for users.

  1. Upload an index.ipynb notebook so that people will be sent to a default notebook rather than being dropped into the Jupyter Console; note, you'll have to put 'index.ipynb' into the 'Path to a notebook file' field at mybinder.org for the redirect to happen.

  2. Grab the 'launch mybinder' Markdown text from the little dropdown menu to the right of 'launch', and paste it into the README.md in your github repo. This lets people click on a cute little 'launch binder' button to launch a binder from the repo.

Try it yourself!

Click on the button below,

Binder

or visit this URL.

-- in either case you should be sent ~instantly to a running Jupyter Notebook with the surveys.csv data already present.

Magic!!

What use is this?

This is an excellent way to do a quick demo in a classroom!

It could serve as a quickfix for people attending a Carpentry workshop who are having trouble installing the software.

(I've used it for both - since you can get a command line terminal as well as Python Notebooks and RStudio environments, it's ideal for short R, Python, and shell workshops.)

The big downside so far is that the environment produced by mybinder.org is temporary and expires after some amount of inactivity, so it's not ideal for workshops with lots of discussion - the repo may go away! No good way to deal with that currently; that's something that a custom JupyterHub deployment would fix, but that is too heavyweight for me at the moment.

(We came up with a lot of additional use cases for binder, here.)

Thoughts & comments welcome, as always!

--titus

by C. Titus Brown at December 01, 2017 11:00 PM

November 28, 2017

Matthieu Brucher

Announcement: ATKColoredExpander 2.0.0

I’m happy to announce the update of ATK Colored Expander, based on the Audio Toolkit and JUCE. It is available on Windows (AVX-compatible processors) and OS X (min. 10.9, SSE4.2) in different formats.

This plugin requires the universal runtime on Windows, which is automatically deployed with Windows Update (see this discussion on the JUCE forum). If you don’t have it installed, please check the Microsoft website.

ATK Colored Expander 2.0.0

The supported formats are:

  • VST2 (32bits/64bits on Windows, 32/64bits on OS X)
  • VST3 (32bits/64bits on Windows, 32/64bits on OS X)
  • Audio Unit (32/64bits, OS X)

Direct link for ATKGuitarPreamp.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.


by Matt at November 28, 2017 08:09 AM

November 25, 2017

numfocus

My Favourite Tool: IPython

re-posted with permission from Software Carpentry My favorite tool is … IPython. IPython is a Python interpreter with added features that make it an invaluable tool for interactive coding and data exploration. IPython is most commonly taught via the Jupyter notebook, an interactive web-based tool for evaluating code, but IPython can be used on its own directly […]

by NumFOCUS Staff at November 25, 2017 04:48 PM

November 23, 2017

Titus Brown

Why are taxonomic assignments so different for Tara bins? (Black Friday Morning Bioinformatics)

Happy (day after) Thanksgiving!

Now that we can parse custom taxonomies in sourmash and use them for genome classification (tutorial) I thought I'd revisit the Tara ocean genome bins produced by Delmont et al. and Tully et al. (see this blog post for details).

Back when I first looked at the Tully and Delmont bins, my tools for parsing taxonomy were quite poor, and I was limited to using the Genbank taxonomy. This meant that I couldn't deal properly with places where the imputed taxonomies assigned by the authors extended beyond Genbank.

Due to requests from Trina McMahon and Sarah Stevens, this is no longer a constraint! We can now load in completely custom taxonomies and mix and match them as we wish.

How does this change things?

A new sourmash lca command, compare_csv

For the purposes of today's blog post, I added a new command to sourmash lca. compare_csv takes in two taxonomy spreadsheets and compares the classifications, generating a list of compatible and incompatible classifications.
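"Compatible" here means that one lineage is an ancestor (prefix) of the other. Conceptually the comparison boils down to something like the sketch below; this is an illustration of the idea, not the sourmash implementation:

def compare_lineages(lin_a, lin_b):
    """Compare two lineages, given as tuples ordered from superkingdom down."""
    shorter, longer = sorted([lin_a, lin_b], key=len)
    if longer[:len(shorter)] == shorter:
        return 'compatible'      # one lineage is an ancestor of the other
    return 'incompatible'        # the two trees disagree somewhere

print(compare_lineages(('Eukaryota',),
                       ('Eukaryota', 'Haptophyta', 'Prymnesiophyceae')))
# compatible
print(compare_lineages(('Bacteria', 'Actinobacteria'),
                       ('Archaea', 'Euryarchaeota')))
# incompatible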

Some quality control and evaluation

(Let's start by asking if our scripts work in the first place!)

When I was working on the sourmash lca stuff I noticed something curious: when I read in the delmont classification spreadsheet and re-classified the delmont genomes, I found more classifications for the genomes than were in the input spreadsheet.

So, for example, when I do:

sourmash lca classify --db delmont-MAGs-k31.lca.json.gz \
    --query delmont-genome-sigs --traverse-directory \
    -o delmont-genome-sigs.classify.csv

and then compare the output CSV with the original table from the Delmont et al. paper,

sourmash lca compare_csv delmont-genome-sigs.classify.csv \
    tara-delmont-SuppTable3.csv

I get the following:

957 total assignments, 24 differ between spreadsheets.
24 are compatible (one lineage is ancestor of another.
0 are incompatible (there is a disagreement in the trees).

What!? Why would we be able to classify new things?

Looking into it, it turns out that these differences arise because one input genome's classification can inform another's, and the way that Delmont et al. did their classifications did not take their own genome bins into account.

For example, TARA_ASW_MAG_00041 is classified as genus Emiliania by sourmash, but is simply Eukaryota in Delmont et al.'s paper. The new classification for 00041 comes from the other genome bin TARA_ASW_MAG_00032, which was firmly classified as Emiliania and shares approximately 1.6% of its k-mers with 00041.

If this holds up, it provides some nice context for Trina's original request for a quick way to classify new genomes against previously classified bins. Quickly feeding custom classifications into new classifications seems quite useful!

We see the same thing when I reclassify the Tully et al. genome sigs against themselves. If I do:

sourmash lca classify \
    --db tully-MAGs-k31.lca.json.gz \
    --query tully-genome-sigs --traverse-directory \
    -o tully-genome-sigs.classify.csv
sourmash lca compare_csv tully-genome-sigs.classify.csv \
    tara-tully-Table4.csv

then I get:

2009 total assignments, 7 differ between spreadsheets.
7 are compatible (one lineage is ancestor of another.
0 are incompatible (there is a disagreement in the trees).
So: no incompatibilities, but a few "extensions".

What about incompatibilities?

The above was really just internal validation - can we classify genomes against themselves and get consistent answers? It was unexpectedly interesting but not terribly so.

But what if we take the collections of genome bins from tully and reclassify them based on the delmont classifications? And vice versa?

Reclassifying tully with delmont

Let's give it a try!

First, classify the tully genome signatures with an LCA database built from the delmont data:

sourmash lca classify \
    --db delmont-MAGs-k31.lca.json.gz \
    --query tully-genome-sigs --traverse-directory \
    -o tully-query.delmont-db.sigs.classify.csv

Then, compare:

sourmash lca compare_csv \
    tully-genome-sigs.classify.csv \
    tully-query.delmont-db.sigs.classify.csv \
    --start-column=3

and we get:

987 total assignments, 889 differ between spreadsheets.
296 are compatible (one lineage is ancestor of another.
593 are incompatible (there is a disagreement in the trees).
164 incompatible at rank superkingdom
255 incompatible at rank phylum
107 incompatible at rank class
54 incompatible at rank order
13 incompatible at rank family
0 incompatible at rank genus
0 incompatible at rank species

Ouch: almost two thirds are incompatible, 164 of them at the superkingdom level!

For example, in the tully data set, TOBG_MED-875 is classified as a Euryarchaeota, novelFamily_I, but using the delmont data set, it gets classified as Actinobacteria! Digging a bit deeper, this is based on approximately 290kb of sequence, much of it from TARA_MED_MAG_00029, which is classified as Actinobacteria and shares about 8.6% of its k-mers with TOBG_MED-875. So that's the source of that disagreement.

(Some provisional digging suggests that there's a lot of Actinobacterial proteins in TOBG_MED-875, but this would need to be verified by someone more skilled in protein-based taxonomic analysis than me.)

Reclassifying delmont with tully

What happens in the other direction?

First, classify the delmont signatures with the tully database:

sourmash lca classify \
    --db tully-MAGs-k31.lca.json.gz \
    --query delmont-genome-sigs --traverse-directory \
    -o delmont-query.tully-db.sigs.classify.csv

Then, compare:

sourmash lca compare_csv delmont-genome-sigs.classify.csv \
    delmont-query.tully-db.sigs.classify.csv \
    --start-column=3

And see:

604 total assignments, 537 differ between spreadsheets.
193 are compatible (one lineage is ancestor of another.
344 are incompatible (there is a disagreement in the trees).
95 incompatible at rank superkingdom
151 incompatible at rank phylum
66 incompatible at rank class
25 incompatible at rank order
7 incompatible at rank family
0 incompatible at rank genus
0 incompatible at rank species

As you'd expect, this more or less agrees with the results above - lots of incompatibilities, with fully 1/6th incompatible at the rank of superkingdom (!!).

Why are things classified so differently!?

First, a big caveat: my code may be completely wrong. If so, well, best to find out now! I've done only the lightest of spot checks and I welcome further investigation. (TBH, I'm actually kind of hoping that Meren, the senior author on the Delmont et al. study, dives into the Tully data sets and does a more robust reclassification using his methods - he has an inspiring history of doing things like that. ;)

But, assuming my code isn't completely wrong...

On first blush, there are three other possibilities. For each classification, the tully classification could be wrong, the delmont classification could be wrong, or both classifications could be wrong. Either way, they're inconsistent!

On second blush, this all strikes me as a bit of a disaster. Were the taxonomic classification methods used by the Delmont and Tully papers really so different!? How do we trust our own classifications, much less anyone else's?

I will fall back on my usual refrain: we need tools that let us detect and resolve such disagreements quickly and reliably. Maybe sourmash can provide the former, but I'm pretty sure k-mers are too specific to do a good job of resolving disagreements above the genus level.

Anyhoo, I'm out of time for today, so I'll just end with some thoughts for What Next.

What next?

Other than untangling disagreements, what other things could we do? Well, we've just added 60,000 genomes from the JGI IMG database to our previous collection of 100,000 genomes from Genbank, so we can do a classification against all available genomes! And, if we're feeling ambitious, we could reclassify all the genomes against themselves. That might be interesting...

Appendix: Building the databases

Install sourmash lca as in the tutorial.

Grab and unpack the genome signatures for the tully and delmont studies:

curl -L https://osf.io/vngdz/download -o delmont-genome-sigs.tar.gz
tar xzf delmont-genome-sigs.tar.gz
curl -L https://osf.io/28r6m/download -o tully-genome-sigs.tar.gz
tar xzf tully-genome-sigs.tar.gz

Grab the classifications, too:

curl -L -O https://github.com/ctb/2017-sourmash-lca/raw/master/tara-delmont-SuppTable3.csv
curl -L -O https://github.com/ctb/2017-sourmash-lca/raw/master/tara-tully-Table4.csv

Then, build the databases:

sourmash lca index -k 31 --scaled=10000 \
    tara-tully-Table4.csv tully-MAGs-k31.lca.json.gz \
    tully-genome-sigs --traverse-directory
sourmash lca index -k 31 --scaled=10000 \
    tara-delmont-SuppTable3.csv delmont-MAGs-k31.lca.json.gz \
    delmont-genome-sigs --traverse-directory

and now all of the commands above should work.

The whole thing takes about 5 minutes on my laptop, and requires less than 1 GB of RAM and < 100 MB of disk space for the data.

by C. Titus Brown at November 23, 2017 11:00 PM

November 21, 2017

numfocus

My Favourite Tool: Jupyter Notebook

re-posted with permission from Software Carpentry My favourite tool is … the Jupyter Notebook. One of my favourite tools is the Jupyter notebook. I use it for teaching my students scientific computing with Python. Why I like it: Using Jupyter with the plugin RISE, I can create presentations including code cells that I can edit and execute live during […]

by NumFOCUS Staff at November 21, 2017 02:22 PM

Matthieu Brucher

Announcement: ATKColoredCompressor 2.0.0

I’m happy to announce the update of ATK Colored Compressor, based on the Audio Toolkit and JUCE. It is available on Windows (AVX-compatible processors) and OS X (min. 10.9, SSE4.2) in different formats.

ATK Colored Compressor 2.0.0

The supported formats are:

  • VST2 (32bits/64bits on Windows, 32/64bits on OS X)
  • VST3 (32bits/64bits on Windows, 32/64bits on OS X)
  • Audio Unit (32/64bits, OS X)

Direct link for ATKColoredCompressor.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.


by Matt at November 21, 2017 08:35 AM

Matthew Rocklin

Dask Release 0.16.0

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.16.0. This is a major release with new features, breaking changes, and stability improvements. This blogpost outlines notable changes since the 0.15.3 release on September 24th.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Conda packages are available on both conda-forge and default channels.

Full changelogs are available here:

Some notable changes follow.

Breaking Changes

  • The dask.async module was moved to dask.local for Python 3.7 compatibility. This was previously deprecated and is now fully removed.
  • The distributed scheduler’s diagnostic JSON pages have been removed and replaced by more informative templated HTML.
  • The commonly used private methods _keys and _optimize have been replaced with the Dask collection interface (see below).

Dask collection interface

It is now easier to implement custom collections using the Dask collection interface.

Dask collections (arrays, dataframes, bags, delayed) interact with Dask schedulers (single-machine, distributed) with a few internal methods. We formalized this interface into protocols like .__dask_graph__() and .__dask_keys__() and have published that interface. Any object that implements the methods described in that document will interact with all Dask scheduler features as a first-class Dask object.

class MyDaskCollection(object):
    def __dask_graph__(self):
        ...

    def __dask_keys__(self):
        ...

    def __dask_optimize__(self, ...):
        ...

    ...

This interface has already been implemented within the XArray project for labeled and indexed arrays. Now all XArray classes (DataSet, DataArray, Variable) are fully understood by all Dask schedulers. They are as first-class as dask.arrays or dask.dataframes.

import xarray as xa
from dask.distributed import Client

client = Client()

ds = xa.open_mfdataset('*.nc', ...)

ds = client.persist(ds)  # XArray objects integrate seamlessly with Dask schedulers

Work on Dask’s collection interfaces was primarily done by Jim Crist.

Bandwidth and Tornado 5 compatibility

Dask is built on the Tornado library for concurrent network programming. In an effort to improve inter-worker bandwidth on exotic hardware (Infiniband), Dask developers are proposing changes to Tornado’s network infrastructure.

However, in order to use these changes Dask itself needs to run on the next version of Tornado in development, Tornado 5.0.0, which breaks a number of interfaces on which Dask has relied. Dask developers have been resolving these and we encourage other PyData developers to do the same. For example, neither Bokeh nor Jupyter works on Tornado 5.0.0-dev.

Dask inter-worker bandwidth is peaking at around 1.5-2GB/s on a network theoretically capable of 3GB/s. GitHub issue: pangeo #6

Dask worker bandwidth

Network performance and Tornado compatibility are primarily being handled by Antoine Pitrou.

Parquet Compatibility

Dask.dataframe can use either of the two common Parquet libraries in Python, Apache Arrow and Fastparquet. Each has its own strengths and its own base of users who prefer it. We’ve significantly extended Dask’s parquet test suite to cover each library, extending roundtrip compatibility. Notably, you can now both read and write with PyArrow.

df.to_parquet('...', engine='fastparquet')
df = dd.read_parquet('...', engine='pyarrow')

There is still work to be done here. The variety of parquet reader/writers and conventions out there makes completely solving this problem difficult. It’s nice seeing the various projects slowly converge on common functionality.

This work was jointly done by Uwe Korn, Jim Crist, and Martin Durant.

Retrying Tasks

One of the most requested features for the Dask.distributed scheduler is the ability to retry failed tasks. This is particularly useful to people using Dask as a task queue, rather than as a big dataframe or array.

future = client.submit(func, *args, retries=5)

Task retries were primarily built by Antoine Pitrou.

Transactional Work Stealing

The Dask.distributed task scheduler performs load balancing through work stealing. Previously this would sometimes result in the same task running simultaneously in two locations. Now stealing is transactional, meaning that it will avoid accidentally running the same task twice. This behavior is especially important for people using Dask tasks for side effects.

It is still possible for the same task to run twice, but now this only happens in more extreme situations, such as when a worker dies or a TCP connection is severed, neither of which are common on standard hardware.

Transactional work stealing was primarily implemented by Matthew Rocklin.

New Diagnostic Pages

There is a new set of diagnostic web pages available in the Info tab of the dashboard. These pages provide more in-depth information about each worker and task, but are not dynamic in any way. They use Tornado templates rather than Bokeh plots, which means that they are less responsive but are much easier to build. This is an easy and cheap way to expose more scheduler state.

Task page of Dask's scheduler info dashboard

Nested compute calls

Calling .compute() within a task now invokes the same distributed scheduler. This enables writing more complex workloads with less thought to starting worker clients.

import dask
from dask.distributed import Client
client = Client()  # only works for the newer scheduler

@dask.delayed
def f(x):
    ...
    return dask.compute(...)  # can call dask.compute within delayed task

dask.compute([f(i) for ...])

Nested compute calls were primarily developed by Matthew Rocklin and Olivier Grisel.

More aggressive Garbage Collection

The workers now explicitly call gc.collect() at various times when under memory pressure and when releasing data. This helps to avoid some memory leaks, especially when using Pandas dataframes. Doing this carefully proved to require a surprising degree of detail.
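In outline the pattern is simple; the sketch below is only a schematic (the actual worker logic is more careful about how often and when it collects, and need not use psutil in this way):

import gc
import psutil

def maybe_collect(memory_limit, fraction=0.7):
    """Force a collection once the process uses more than a fraction of its limit."""
    rss = psutil.Process().memory_info().rss
    if rss > fraction * memory_limit:
        gc.collect()

Calling gc.collect() is itself expensive, which is part of why doing this carefully required so much attention to detail.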

Improved garbage collection was primarily implemented and tested by Fabian Keller and Olivier Grisel, with recommendations by Antoine Pitrou.

Dask-ML

A variety of Dask Machine Learning projects are now being assembled under one unified repository, dask-ml. We encourage users and researchers alike to read through that project. We believe there are many useful and interesting approaches contained within.

The work to assemble and curate these algorithms is primarily being handled by Tom Augspurger.

XArray

The XArray project for indexed and labeled arrays is also making its major 0.10.0 release this week, which includes many performance improvements, particularly for using Dask on larger datasets.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.3 release on September 24th:

  • Ced4
  • Christopher Prohm
  • fjetter
  • Hai Nguyen Mau
  • Ian Hopkinson
  • James Bourbeau
  • James Munroe
  • Jesse Vogt
  • Jim Crist
  • John Kirkham
  • Keisuke Fujii
  • Matthias Bussonnier
  • Matthew Rocklin
  • mayl
  • Martin Durant
  • Olivier Grisel
  • severo
  • Simon Perkins
  • Stephan Hoyer
  • Thomas A Caswell
  • Tom Augspurger
  • Uwe L. Korn
  • Wei Ji
  • xwang777

The following people contributed to the dask/distributed repository since the 1.19.1 release on September 24th:

  • Alvaro Ulloa
  • Antoine Pitrou
  • chkoar
  • Fabian Keller
  • Ian Hopkinson
  • Jim Crist
  • Kelvin Yang
  • Krisztián Szűcs
  • Matthew Rocklin
  • Mike DePalatis
  • Olivier Grisel
  • rbubley
  • Tom Augspurger

The following people contributed to the dask/dask-ml repository

  • Evan Welch
  • Matthew Rocklin
  • severo
  • Tom Augspurger
  • Trey Causey

In addition, we are proud to announce that Olivier Grisel has accepted commit rights to the Dask projects. Olivier has been particularly active on the distributed scheduler, and on related projects like Joblib, SKLearn, and Cloudpickle.

November 21, 2017 12:00 AM

November 20, 2017

numfocus

Hackseq17: Canada’s Genomics Hackathon for Open Science

This post was contributed by Jake Lever, a PhD student in the University of British Columbia’s Bioinformatics program. NumFOCUS was pleased to provide funding support to the hackseq17 hackathon. hackseq17: Canada’s genomics hackathon was held at the University of British Columbia (UBC) in late October. This event brought together a diverse set […]

by NumFOCUS Staff at November 20, 2017 06:00 PM

November 17, 2017

numfocus

Theano and the Future of PyMC

This is a guest post by Christopher Fonnesbeck of PyMC3. — PyMC, now in its third iteration as PyMC3, is a project whose goal is to provide a performant, flexible, and friendly Python interface to Bayesian inference.  The project relies heavily on Theano, a deep learning framework, which has just announced that development will not be […]

by NumFOCUS Staff at November 17, 2017 05:25 PM

November 15, 2017

numfocus

Quantopian commits to fund pandas as a new NumFOCUS Corporate Partner

NumFOCUS welcomes Quantopian as our first Emerging Leader Corporate Partner, a partnership for small but growing companies who are leading by providing fiscal support to our open source projects. — Quantopian Supports Open Source by John Fawcett, CEO and founder of Quantopian       It all started with a single tweet. While scrolling through […]

by NumFOCUS Staff at November 15, 2017 12:03 AM

November 14, 2017

Matthieu Brucher

Audio Toolkit: Performance on the IIR SIMD filters

In release 2.2.0, ATK gained new EQ filters that are vectorized. These filters cannot be used to filter different bands from the same input signal (yet), but they can be used to filter several channels in the same way.

The question is whether this is really faster than processing several independent channels separately, so I’ve set up a test case with a SIMD solution.

The first file has 5 test cases. Four of them use a vectorized DF1 with different inputs. The fifth one is the TDF2 equivalent of one of these cases. The second file has fully vectorized DF1 and TDF2 test cases. It uses a dispatcher to select the best platform according to the CPU it runs on.

So the question is: which one of these is the fastest?

I ran the application on Linux through valgrind, after compiling it with gcc 7 and no specific instruction set. Here are the results:

EQ with no specific instruction set

I’ve only selected the process calls, and it is obvious the timings are dominated by two things:

  • The conversion of the individual signals to the SIMD signal
  • The actual EQ processing

Let’s first compare the non-SIMD versions. The DF1 versions spend 13.9 million cycles, while the TDF2 spends only 3.6 million, but that is for a single channel, so four channels would cost about 14.4 million cycles. The DF1 has dedicated SIMD lines that make it faster than the TDF2 version.

On the SIMD side, things are different. The DF1 version is almost twice as slow as the TDF2 version, and the TDF2 version itself is only slightly slower than the non-SIMD version when taking the conversion times into account (there are probably things I need to do to optimize there!).

When using AVX2 for the non-SIMD filters, some get faster:

EQ with AVX2 instruction set

The SIMD filters were not supposed to get faster; their code is still exactly the same. The non-SIMD DF1 is now 20% faster and the TDF2 stays the same.

The conclusions are simple: the SIMD TDF2 is good, but the framework around it still needs to get better. The non-SIMD TDF2 filters will require care so that they get faster. By making this one faster, perhaps I’ll be able to make the SIMD version faster as well!

There are more and more SIMD filters in ATK; let me know what you think of this effort.


by Matt at November 14, 2017 08:06 AM

November 09, 2017

Jake Vanderplas

Exploring Line Lengths in Python Packages

This week, Twitter upped their single-tweet character limit from 140 to 280, purportedly based on this interesting analysis of tweet lengths published on Twitter's engineering blog. The gist of the analysis is this: English language tweets display a roughly log-normal distribution of character counts, except near the 140-character limit, at which the distribution spikes.

The analysis takes this as evidence that twitter users often "cram" their longer thoughts into the 140 character limit, and suggests that a 280-character limit would more naturally accommodate the distribution of people's desired tweet lengths.

This immediately brought to mind another character limit that many Python programmers face in their day-to-day lives: the 79-character line limit suggested by Python's PEP8 style guide:

Limit all lines to a maximum of 79 characters.

I began to wonder whether popular Python packages (e.g. NumPy, SciPy, Pandas, Scikit-Learn, Matplotlib, AstroPy) display anything similar to what is seen in the distribution of tweet lengths.

Spoiler alert: they do! And the details of the distribution reveal some insights into the programming habits and stylistic conventions of the communities who write them.
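The measurement itself is straightforward; a rough sketch (not the post's actual analysis code) looks like this:

import collections
import pathlib

def line_length_counts(package_dir):
    """Count how many lines of each length appear in a package's .py files."""
    counts = collections.Counter()
    for path in pathlib.Path(package_dir).rglob('*.py'):
        for line in path.read_text(errors='ignore').splitlines():
            counts[len(line.rstrip())] += 1
    return counts

# counts = line_length_counts('/path/to/site-packages/numpy')
# plotting counts against line length shows the spike just under 80 characters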

by Jake VanderPlas at November 09, 2017 10:00 PM

November 08, 2017

Titus Brown

How specific are k-mers for taxonomic assignment of microbes, anyway?

I've been on a bit of a k-mers and taxonomy kick lately, as readers may have seen. Now that I can parse "free" taxonomies from sources other than NCBI, I decided the code was well-enough baked to put it into sourmash. So, over the last week I started integrating the lowest common ancestor code and taxonomy parsing code into sourmash; once that pull request is merged, sourmash will have lca index, lca classify, lca summarize, and lca rankinfo commands - see the tutorial for more information on how to run this stuff.

To celebrate the addition of all this code, I thought I'd address the question, "how specific are k-mers?" That is, if we reach into a bucket of metagenome sequence and pick out a k-mer that is identified with a particular species, how certain can we be of that identification (vs. it actually being a different species in the same genus, family, order, etc.)?

This question comes up routinely in discussions and now I have the information to answer it, at least in the context of all known microbial genomes! (The answer will, of course, be very biased by what's not in our databases.)

After downloading the genbank LCA database, I ran the command:

sourmash lca rankinfo genbank-k31.lca.json.gz

and after a little data munging got this:

superkingdom: 38077 (0.4%)
phylum: 24627 (0.3%)
class: 39306 (0.4%)
order: 66423 (0.7%)
family: 97908 (1.1%)
genus: 1103223 (12.0%)
species: 7818876 (85.1%)

That is, 85% of the 31-mers in genbank are specific to the species level; 12% are specific to the genus level; and the remaining ~3% are family or above.
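The munging is nothing fancy; roughly (an illustrative sketch, not the exact script), it is just turning the per-rank counts into percentages of the total:

counts = {'superkingdom': 38077, 'phylum': 24627, 'class': 39306,
          'order': 66423, 'family': 97908, 'genus': 1103223,
          'species': 7818876}
total = sum(counts.values())
for rank, n in counts.items():
    print('%s: %d (%.1f%%)' % (rank, n, 100.0 * n / total))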

I would argue this makes genus level assignments pretty reliable, when using k-mers. (It doesn't say much about what you're missing, of course.)

--titus

by C. Titus Brown at November 08, 2017 11:00 PM

November 07, 2017

Continuum Analytics

Utilizing the New Compilers in Anaconda Distribution 5

Part of what made the recent release of Anaconda Distribution 5 so exciting was our switch from OS-provided compiler tools to our own Anaconda toolsets. This change has allowed us to make major steps forward in the capabilities of our compilers, specifically regarding security and performance. In this post, we’ll show you how to use …
Read more →

by Rory Merritt at November 07, 2017 04:45 PM

November 03, 2017

numfocus

SciPy 1.0 — 16 Years in the Making

Congratulations, SciPy! SciPy—a NumFOCUS Affiliated Project—recently crossed a major milestone for any open source project: version 1.0! NumFOCUS extends our hearty congratulations to all of the SciPy contributors and community members who helped get the project to this point. “SciPy the library has been a cornerstone for the scientific Python community. By providing a consistent […]

by NumFOCUS Staff at November 03, 2017 03:59 PM

Matthew Rocklin

Optimizing Data Structure Access in Python

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation

Last week at PyCon DE I had the good fortune to meet Stefan Behnel, one of the core developers of Cython. Together we worked to optimize a small benchmark that is representative of Dask’s central task scheduler, a pure-Python application that is primarily data structure bound.

Our benchmark is a toy problem that creates three data structures that index each other with dictionaries, lists, and sets, and then does some simple arithmetic. (You don’t need to understand this benchmark deeply to read this article.)

import random
import time

nA = 100
nB = 100
nC = 100

A = {'A-%d' % i: ['B-%d' % random.randint(0, nB - 1)
                  for i in range(random.randint(0, 5))]
     for i in range(nA)}

B = {'B-%d' % i: {'C-%d' % random.randint(0, nC - 1)
                  for i in range(random.randint(1, 3))}
     for i in range(nB)}

C = {'C-%d' % i: i for i in range(nC)}

data = ['A-%d' % i for i in range(nA)]


def f(A, B, C, data):
    for a_key in data:
        b_keys = A[a_key]
        for b_key in b_keys:
            for c_key in B[b_key]:
                C[c_key] += 1


start = time.time()

for i in range(10000):
    f(A, B, C, data)

end = time.time()

print("Duration: %0.3f seconds" % (end - start))
$ python benchmark.py
Duration: 1.12 seconds

This is an atypical Python optimization problem because it is primarily bound by data structure access (dicts, lists, sets), rather than the numerical operations commonly optimized by Cython (nested for loops over floating point arithmetic). Python is already decently fast here, typically within a factor of 2-5x of compiled languages like Java or C++, but still we’d like to improve this when possible.

In this post we combine two different methods to optimize data-structure bound workloads:

  1. Compiling Python code with Cython with no other annotations
  2. Interning strings for more rapid dict lookups

Finally at the end of the post we also run the benchmark under PyPy to compare performance.

Cython

First we compile our Python code with Cython. Normally when using Cython we annotate our variables with types, giving the compiler enough information to avoid using Python altogether. However in our case we don’t have many numeric operations and we’re going to be using Python data structures regardless, so this won’t help much. We compile our original Python code without alteration.

cythonize -i benchmark.py

And run

$ python -c "import benchmark"
Duration: 0.73 seconds

This gives us a decent speedup from 1.1 seconds to 0.73 seconds. This isn’t huge relative to typical Cython speedups (which are often 10-100x) but would be a very welcome change for our scheduler, where we’ve been chasing 5% optimizations for a while now.

Interning Strings

Our second trick is to intern strings. This means that we try to always have only one copy of every string. This improves performance when doing dictionary lookups because of the following:

  1. Python computes the hash of the string only once (strings cache their hash value once computed)
  2. Python checks for object identity (fast) before moving on to value equality (slow)

Or, anecdotally, text is text is faster in Python than text == text. If you ensure that there is only one copy of every string then you only need to do identity comparisons like text is text.

So, if any time we see a string "abc" it is exactly the same object as all other "abc" strings in our program, then string-dict lookups will only require a pointer/integer equality check, rather than having to do a full string comparison.
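(As a quick illustration of that identity fast path -- this little demo uses Python's built-in sys.intern rather than the hand-rolled intern function below, and isn't from the original benchmark:)

import sys
import timeit

a = 'A-%d' % 12345
b = 'A-%d' % 12345               # equal value, but a distinct object
c = sys.intern(a)
d = sys.intern('A-%d' % 12345)   # interning returns the same object as c

assert a == b and a is not b
assert c is d

# comparing distinct objects has to look at the characters; identical objects
# hit the pointer-equality fast path (the gap grows with longer keys)
print(timeit.timeit('a == b', globals={'a': a, 'b': b}))
print(timeit.timeit('c == d', globals={'c': c, 'd': d}))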

Adding string interning to our benchmark looks like the following:

inter = {}

def intern(x):
    try:
        return inter[x]
    except KeyError:
        inter[x] = x
        return x

A = {intern('A-%d' % i): [intern('B-%d' % random.randint(0, nB - 1))
                  for i in range(random.randint(0, 5))]
     for i in range(nA)}

B = {intern('B-%d' % i): {intern('C-%d' % random.randint(0, nC - 1))
                  for i in range(random.randint(1, 3))}
     for i in range(nB)}

C = {intern('C-%d' % i): i for i in range(nC)}

data = [intern('A-%d' % i) for i in range(nA)]

# The rest of the benchmark is as before

This brings our duration from 1.1s down to 0.75s. Note that this is without the separate Cython improvements described just above.

Cython + Interning

We can combine both optimizations. This brings us to around 0.45s, a 2-3x improvement over our original time.

cythonize -i benchmark2.py

$ python -c "import benchmark2"
Duration: 0.46 seconds

PyPy

Alternatively, we can just run everything in PyPy.

$ pypy3 benchmark1.py  # original
Duration: 0.25 seconds

$ pypy3 benchmark2.py  # includes interning
Duration: 0.20 seconds

So PyPy can be quite a bit faster than Cython on this sort of code (which is not a big surprise). Interning helps a bit, but not quite as much.

This is fairly encouraging. The Dask scheduler can run under PyPy even while Dask clients and workers run under normal CPython (for use with the full PyData stack).

Preliminary Results on Dask Benchmark

We started this experiment with the assumption that our toy benchmark somehow represented the Dask scheduler in terms of performance characteristics. This assumption, of course, is false. The Dask scheduler is significantly more complex and it is difficult to build a single toy example to represent its performance.

When we try these tricks on a slightly more complex benchmark that actually uses the Dask scheduler we find the following results:

  • Cython: almost no effect
  • String Interning: almost no effect
  • PyPy: almost no effect

However I have only spent a brief amount of time on this (twenty minutes?) and so I hope that the lack of a performance gain here is due to lack of effort.

If anyone is interested in this, I hope this blogpost contains enough information to get started investigating further.

November 03, 2017 12:00 AM

November 02, 2017

Continuum Analytics

AnacondaCON 2018: Anaconda Opens Registration and Call for Speakers for Second Annual User Conference

Four-day event brings together data science community to discover trends in data science, connect with thought leaders and learn all things Python AUSTIN, TX—November 2, 2017—Anaconda, Inc., the most popular Python data science platform provider, today announced that registration is now open for AnacondaCON 2018, taking place April 8-11, 2018 in Austin, Texas. The company …
Read more →

by Team Anaconda at November 02, 2017 03:07 PM

November 01, 2017

Enthought

Webinar: Machine Learning Mastery Workshop: An Exclusive Peek “Under the Hood” of Enthought Training

What: A guided walkthrough and live Q&A about Enthought’s new “Machine Learning Mastery Workshop” training course.

Who Should Watch: If predictive modeling and analytics would be valuable in your work, come to the webinar to find out what all the fuss is about and what there is to know. Whether you are looking to get started with machine learning, interested in refining your machine learning skills, or want to transfer your skills from another toolset to Python, come to the webinar to find out if Enthought’s highly interactive, expertly taught Machine Learning Mastery Workshop might be a good fit for accelerating your development!

View


Why Has Machine Learning Become So Popular?

Artificial Intelligence and Machine Learning are a defining feature of the 21st century and are quickly becoming a key factor in gaining and maintaining competitive advantage in every industry that incorporates them. Why is machine learning so beneficial? Because it provides a fast and flexible way to build models that can surface signal, find patterns, and predict future behavior. These powerful models are used for:

  • Forecasting supply chain availability
  • Clustering product defects for QA
  • Anticipating movements in financial markets
  • Predicting chemical tolerances
  • Optimizing the placement of advertisements
  • Managing process engineering
  • Modeling reservoir production
  • and much more.

In response to growing demand for Machine Learning expertise, Enthought has developed an intensive 3-day guided practicum to bring you up to speed quickly on key concepts and skills in this exciting realm. Join us in this webinar for an in-depth overview of Enthought’s Machine Learning Mastery Workshop — a training course designed to accelerate the development of intuition, skill, and confidence in applying machine learning methods to solve real-world problems.

In the webinar we’ll describe how Enthought’s training course combines conceptual knowledge of machine learning models with intensive experience applying them to real-world data to develop skill in applying Python’s machine learning tools, such as the scikit-learn package, to make predictions about complicated phenomena by leveraging the information contained in numerical data, natural language, 2D images, and discrete categories.

The hands-on, interactive course was created from the ground up by our training experts to enable you to develop transferable skills in Machine Learning that you can apply back at work the next day.

In this webinar, we’ll give you the key information and insight you need to quickly evaluate whether Enthought’s Machine Learning Mastery Workshop course is the right solution for you to build skills in using Python for advanced analytics, including:

  • Who will benefit most from the course, and what pre-requisite knowledge is required
  • What topics the course covers – a guided tour
  • What new knowledge, skills, and capabilities you’ll take away, and how the course design supports those outcomes
  • What the (highly interactive) learning experience is like
  • Why this course is different from other training alternatives (with a preview of actual course materials!)
  • What previous workshop attendees say about our courses

View


Presenter: Dr. Dillon Niederhut,

Enthought Training Instructor

Ph.D., University of California at Berkeley

 


 

Additional Resources

Upcoming Open Machine Learning Mastery Workshop Sessions:

Austin, TX, Feb. 21-23, 2018
Houston, TX, Apr. 18-20, 2018
Cambridge, UK, May 9-11, 2018

Upcoming Open Python for Data Science Sessions:

New York City, NY, Dec. 4-8, 2018
London, UK, Feb. 19-23, 2018
Washington, DC, Apr. 23-27, 2018
San Jose, CA, May 14-18, 2018

Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.

Download Enthought’s Machine Learning with Python’s Scikit-Learn Cheat Sheets


Additional Webinars in the Training Series:

Python for MATLAB Users: What You Need to Know

Python for Scientists and Engineers: A Tour of Enthought’s Professional Technical Training Course

Python for Data Science: A Tour of Enthought’s Professional Technical Training Course

Python for Professionals: The Complete Guide to Enthought’s Technical Training Courses

An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop

The post Webinar: Machine Learning Mastery Workshop: An Exclusive Peek “Under the Hood” of Enthought Training appeared first on Enthought Blog.

by admin at November 01, 2017 07:01 PM

October 31, 2017

numfocus

NumFOCUS welcomes Jim Weiss, Events Coordinator

NumFOCUS is pleased to announce Jim Weiss has been hired as our new Events Coordinator, bringing over seven years of event management experience. Prior to joining NumFOCUS, Jim worked in Washington, D.C., coordinating congressional hearings for the U.S. House of Representatives and managing events for the U.S. Air Force. Jim hails from Pennsylvania and graduated […]

by NumFOCUS Staff at October 31, 2017 09:08 PM

October 30, 2017

Titus Brown

For want of $2.41... some background on reimbursements.

A week ago, I submitted a reimbursement request for about $2200 - I'd just come back from a pretty extensive east coast trip, visiting five different institutions (including the FBI!) on various grant-funded projects. The reimbursement ended up going on five different grants, and involved a fair amount of explanatory paperwork (e.g. I'd scheduled a meeting for project B because I had to be near there for grant A; how do you allocate reimbursements for the plane flight back? Angels on the heads of pins stuff.)

On Monday, that reimbursement got kicked back to me because I'd misidentified a $2.41 purchase of milk at a convenience store as "miscellaneous supplies" rather than "food". This occasioned a snarky tweet on my part about how annoying the reimbursement process is.

As these things do, my tweet got interpreted by some people as I intended - "ugh, paperwork is annoying and frustrating" - and by some others as me being an entitled git who thought $2.41 was worth reimbursing. While I would characterize some of the responses on the latter point as not very thoughtful, I think there are some interesting points to be discussed in response. So here are my thoughts!


Reimbursements are frustrating for most academics I know, and (as always) the burden falls more on junior people than on senior people. At least two people from my lab (i.e. junior to me) responded to my tweet with their own complaints about their reimbursements being kicked back, for similar reasons. So this issue of "blind bureaucracy", as someone put it, affects others - mostly people junior to me.

If I'd known the $2.41 charge was going to be a problem, I'd probably have just eaten (hah!) it. But you never know what bit of paperwork you'll fail to fill out properly - it could have just as easily been the $400 flight back to California. Regardless, the effect would be the same: until I fill the paperwork out right for the $2.41, I don't get any of my $2200 back.

Why even bother putting the $2.41 on the reimbursement? Honestly, I put all my receipts in my shoulder bag and then mechanically scan them in and upload them into the system. This is because anything I don't have a receipt for is essentially invisible and I usually end up forgoing the reimbursement process for it. I also can't claim reimbursement for late fees or interest charges incurred due to slow filing (or slow approval) of expenses. So on top of the stuff that can't be reimbursed (like cookies and apples for the lab) I've probably lost about $1000 this year alone, and conservatively estimate my overall burn for the last 10 years as between $5,000 and $10,000. Even small things add up! So I have A Process to minimize this loss.

In this case, all of this was on my personal card, so I didn't need to put the $2.41 into the system. But since I'm behind on other reimbursements from the summer institute I ran, my travel card is not valid at the moment. If I'd been using my travel card, the $2.41 could have gone on there and I would have had to put in the receipt because every transaction is entered into the system and demands a receipt. That's why I upload all the receipts reflexively as part of My Process :).

It turns out I can bear $2200 for a month (although this month was actually about $4000, if you include the Binder workshop), since I'm (a) reasonably well paid and (b) partnered with someone who is reasonably well paid and (c) neither of us has any lingering student loans. This is not necessarily true of people in my lab, which is one reason why I have another $1,200 sitting on my personal card: UC Davis can't pay for some things in advance, and some things can't be put on travel cards, so even people in my lab who have travel cards are occasionally asked to sit on thousands of dollars of credit card charges. In two recent situations I simply paid for it in advance myself and am waiting for the people in question to get reimbursed so they can pay me back.

I know many faculty who do variants on this - they outright pay for things, or have an in-lab slush fund, or otherwise help out with money for their students and postdocs.

Of course, many other faculty - especially the junior faculty, and/or the faculty with young kids, and/or the faculty with non-working spouses, and/or the multitude of reasons why people don't have any spare cash - cannot do this, and often there's no recourse for these people or their labs, as far as I can tell. Every institution I've been at has had different rules, all of them somewhat arbitrary and bureaucratic, for who gets travel or purchasing cards, what gets reimbursed when, and whether or not travel advances are allowed. It's often super hard to know what rules are there and how to exploit them in the service of your lab - an underappreciated aspect of professordom, I think.

One person brought up the massive unfairness inherent in the situation. Why am I even getting anything reimbursed when there are starving grad students to be fed?? Well, each of the five funding sources I'm using for reimbursement for this (and my students and postdocs) is money that I received in the service of accomplishing a specific goal. Much of that required this travel (e.g. I did the osfclient grant closeout on this trip, and that had to be done in person). I don't think it's reasonable to ask me to pay for that travel myself - or, at least, if the grant came with that expectation, I'd be a lot more hesitant about taking on the work. And it's certainly not ok for the grad students to pay for their own travel. So the work that the funding agencies wanted done wouldn't get done. And we wouldn't have the money in the first place if there wasn't a way to pay for the things that needed doing. So I'm definitely not taking money away from graduate students who might otherwise get reimbursed were it not for my Scrooge-like ways.

Why talk about this, anyway?

Travel and conference costs have been cited on my Twitter feed a few times in just the past few months as one of the big challenges for diversity in research. If you don't have much money, being asked to spot $1000 (or lots more) for plane flights and hotel rooms is essentially impossible. Even students who aren't massively in debt balk at this! And since many underrepresented minorities don't have lots of money to start with, this burden falls disproportionately on them. That's actually one of the problematic aspects of structural inequality, as I understand it - the small stuff that falls on the privileged hits the underprivileged with considerably more weight.

Interestingly, the message "look, just cover the small stuff yourself if you're a professor" is arguably the wrong one to send here, because it assumes a lot about the situation of the professor. (I know professors who are far more cash poor than their grad students.) I think a far better message is to encourage institutions to pay for more stuff up front.

I've also noticed that many people less fortunate than me (or simply more careful than me) are hesitant to spend money on good things because of the extra burden of reimbursement. For example: one of the unpublicized success stories of our workshops is that frequently we provide free food and coffee as part of the workshop, which really makes for a positive learning atmosphere. But for long-running or big workshops, this can really add up - this summer I was carrying a $7,000 tab for a few weeks until I got the big reimbursements through the system. Most faculty would simply not do this, I think, because of the time and cost involved (several have expressed amazement at my willingness to do this). But it's something that really helps our workshops be successful, thus it's worth it to me.

So overall I believe the academy is poorer - in diversity, and in activities - because there is this expectation of reimbursement for activities, rather than up-front payment.

I have to say that our bureaucracy doesn't really care about any of this. The rules are put in place and enforced by people who care about financial compliance, and not getting sued by the federal or state government for bad accounting. It takes champions - generally faculty champions - to advocate for any kind of human-centric change, and frankly driving change within large bureaucratic institutions is not often very effective or at least requires massive and focused effort. So I'm not currently charging the castle trying to make the reimbursement rules at UC Davis more humane; I've got other battles to fight that are more important to me (and that I think are more likely to serve my aspirations in outreach and diversity). The most I feel I can do is act within my sphere of influence to help those I'm directly responsible for. So that's what I do.

(Not to mention, the cost involved in spending so many person-hours on reviewing every jot and tittle of reimbursement requests must be significant. There's gotta be a better way!)


Awareness of the larger context and impact of a painful reimbursement process is important, though, and hopefully this is at least mildly informative :).

And the situation could definitely use some fixing.

I'd be really interested in hearing what academic institutions do well in this space. I know that some institutions are pretty flexible about travel advances, and others give travel cards to anyone who wants them. Some institutions pay for all big travel costs in advance (airfare, hotel rooms). I've heard of a few that even offer debit cards for job interviewees. I'd be interested in advocating for this kind of thing at UC Davis, but I'm not sure what actually works well for people and institutions. Pointers and tips welcome!

--titus

by C. Titus Brown at October 30, 2017 11:00 PM

Continuum Analytics

Getting Started with GPU Computing in Anaconda

Anaconda Distribution makes it easy to get started with GPU computing with several GPU-enabled packages that can be installed directly from our package repository. In this blog post, we’ll give you some pointers on where to get started with GPUs in Anaconda Distribution.

by Sheyna Webster at October 30, 2017 04:56 PM

October 29, 2017

Titus Brown

The 2017 binder workshop!

tl;dr? We ran a workshop on binder. It was fun!

workshop attendee photo

What is binder?

Imagine... that you are visiting the data repository for a preprint you are reviewing, and with the click of a button you are brought to a fully configured RStudio Server containing that data.

Imagine... you are running a workshop, and you want to introduce everyone in the workshop to a machine-learning approach. You give them all the same URL, and within seconds everyone in the room is looking at their own live environment, copied from your blueprint but individually modifiable and exportable.

Imagine... your lab has a collection of standard data analysis protocols in Jupyter Notebooks on your GitHub site, and anyone in your lab can, with a single click, bring them to life and run them on a new data set.

Binder is a concept and technology that makes all of the above, and more, tantalizingly close to everyday realization! The techie version is this: currently,

  • upon Web request, binder grabs a GitHub repository, inspects it, and builds a custom Docker image based on a variety of configuration detection;

  • then, binder spins up a Docker container and redirects the Web browser to that repo;

  • at some point, binder detects lack of activity and shuts down the container.

All of this is (currently) done without authentication or payment of any kind, which makes it a truly zero configuration/single-click experience for the user.
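(For a concrete flavor of that single click: a BinderHub deployment is driven by a URL of the form /v2/gh/<owner>/<repo>/<ref>. A tiny sketch, with beta.mybinder.org used purely as an example hub and the function name made up for illustration:)

def binder_url(owner, repo, ref='master', hub='https://beta.mybinder.org'):
    """Build the one-click launch URL for a GitHub repository."""
    return '{}/v2/gh/{}/{}/{}'.format(hub, owner, repo, ref)

print(binder_url('binder-examples', 'requirements'))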

Just as important, the binder infrastructure is meant to be widely distributed, reusable, and hackable open source tech that supports multiple deployments and customization!

The workshop!

In 2016, I wrote a proposal to fund a workshop on binder to the Sloan Foundation and it was funded!! We finally ran the workshop last week, with the following organizing committee:

Why a workshop?

Many people, including myself, see massive potential in binder, but it is still young. The workshop was intended to explore possible technical directions for binder's evolution, build community around the binder ecosystem, and explore issues of sustainability.

One particular item that came up early on in the workshop was that there are many possible integration points for binder into current data and compute infrastructure providers. That's great! But, in the long term, we also need to plan for the current set of endeavors failing or evolving, so we should be building a community around the core binder concepts and developing de facto standards and practice. This will allow us to evolve with endeavors as well as finding new partners.

So that's why we ran a workshop!

Who came to the workshop?

The workshop attendees were a collection of scientists, techies, librarians, and data people. For this first workshop I did my best to reach out to people from a variety of communities - researchers from a variety of disciplines, librarians, trainers, data scientists, programmers, HPC admins, and research infrastructure specialists. In this, we somewhat succeeded! We didn't advertise very widely, partly just because of a last minute time crunch, and also because too many people would have been a problem for the space we had.

As we figure out more of a framework and sales pitch for binder, I expect the set of possible attendees to expand. Still, for hackfest-like workshops, I'm a big fan of small diverse groups of people in a friendly environment.

What is the current state of binder?

The original mybinder.org Web site was created and supported by the Freeman Lab, but maintenance on the site suffered when Jeremy Freeman moved to the Chan-Zuckerberg Initiative and became even busier than before.

The Jupyter folk picked up the binder concept and reimplemented the Web site with somewhat enhanced functionality, building the new BinderHub software in Python around JupyterHub and splitting the repository-to-docker code out into repo2docker. This is now running on a day-to-day basis on a beta site.

A rough breakdown, and links to documentation, follow:

JupyterHub - JupyterHub manages multiple instances of the single-user Jupyter notebook server. JupyterHub can be used to serve notebooks to a class of students, a corporate data science group, or a scientific research group.

Zero-to-JupyterHub - Zero to JupyterHub with Kubernetes is a tutorial to help install and manage JupyterHub.

BinderHub - BinderHub builds "binders" containing data+code from GitHub repos and then serves the binders in a custom computing environment. beta.mybinder.org is a public BinderHub.

repo2docker - repo2docker builds, runs, and pushes Docker images from source code repositories.

Highlights of the binder workshop!

What did we do? We ran things as an unconference, and had a lot of discussions and brainstorming around use cases and the like, with some super cool results. The notes from those are linked below!

A few highlights of the meeting deserve, well, highlighting --

  • Amazingly, we got to the point where binder ran an RStudio Server instance, started from a Jupyter console!! Some tweets of this made the rounds, but it may take a few more weeks for this to make it into production. (This was based on Ryan Lovett's earlier work, which was then hacked on by Carl Boettiger, Yuvi Panda and Aaron Culich at the workshop. I have it on good authority that Adelaide Rhodes helped by asking lots of questions by way of encouragement ;).

  • Everyone who attended the workshop got to the point where we had our own BinderHub instance on Google!! (We used these JupyterHub and BinderHub instructions). w00000t! (Session notes)

  • Yuvi Panda gave us a rundown on the data8 / "Foundations of Data Science" course at UC Berkeley, which uses JupyterHub to host several thousand users, with up to 700 concurrent sessions!

We came up with lots of use cases - see a ~duplicate set of notes, here.

Other stuff we did at the workshop

(All the notes are on GitHub, here)

Here is a fairly comprehensive list of the other activities at the workshop --

Issues that we only barely touched on:

  • "I have a read only large dataset I want to provide access to for untrusted users, who can do whatever they want but in a safe way." What are good practices for this situation? How do we provide good access without downloading the whole thing?

  • It would be nice to initiate and control (?) Common Workflow Language workflows from binder - see nice Twitter conversation with Michael Crusoe.

  • How do we do continuous integration on notebooks??

  • We need some sort of introspection and badging framework for how reproducible a notebook is likely to be - what are best practices here? Is it "just" a matter of specifying software versions etc and bundling data, or ...??

Far reaching issues and questions --

  • it's likely that the future of binder involves many people running many different binderhub instances. What kind of clever things can we do with federation? Would it be possible for people to run a binder backend "close" to their data and then allow other binderhubs to connect to that, for example?

  • Many issues of publishing workflows, provenance, legality - notes

  • It would be super cool if realtime collaboration was supported by JupyterHub or BinderHub... it's coming, I hear. Soon, one hopes!

Topics we left almost completely untouched:

What's next?

I'm hoping to find money or time to run at least two more hackfests or conferences -- perhaps we can run one in Europe, too.

It would be good to run something with a focus on developing training materials (and/or exemplary notebooks) - see Use Cases, above.

I'm hoping to find support to do some demo integrations with scholarly infrastructure, as in the Imagine... section, above.

If (if) we ran a conference, I could see having some of the following sessions:

  • A hackfest building notebooks
  • A panel on deployment
  • A keynote on the roadmap for binder and JupyterHub
  • Some sort of community fest

If you're interested in any of this, please indicate your interest in future workshops!!

Where to get started with binder

There are lots of example repositories, here:

github.com/binder-examples

you can click "Launch Binder" in any of the READMEs to see examples!


There is a gitter chat channel that is pretty active and good for support: see gitter.im/jupyterhub/binder


And, finally, there is a google groups forum, binderhub-dev

Some other links worth mentioning:

  • nbflow - one-button reproducible workflows with Jupyter Notebook and Scons (see video).
  • papermill - parameterize, execute, and analyze notebooks.
  • dataflow - a kernel to support Python dataflows in the Jupyter Notebook environment.

  • an example of how to use the new postBuild functionality to install jupyter notebook extensions.

aaaand some notes from singularity:

One way to convert docker images to singularity images, using docker2singularity

docker run -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/image:/output \
    --privileged -t --rm singularityware/docker2singularity ubuntu:14.04  

Another way to simply run docker containers in singularity:

singularity exec docker://my/container <runcommand>

The End

I have no particular conclusion other than we'll have to do this again!

--titus

by C. Titus Brown at October 29, 2017 11:00 PM

October 28, 2017

Titus Brown

Classifying genome bins using a custom reference database - maybe this time it'll work?

The story so far:

Trina would like to taxonomically classify new genomes based on a custom classification of old genomes.

Titus wrote some trial code to do this.

Sarah (in Trina's lab) tried the code and realized that it was limited by the NCBI taxonomy, which defeated the whole point of the thing.

Titus retired from the field to consider his options.

My new attempt

We have a new script, classify-free-tax.py.

This script does the following:

  • parses a spreadsheet containing custom taxonomic lineages (i.e. this format);
  • builds a custom taxonomic classification based on the spreadsheet but rooted in NCBI taxonomies;
  • creates a k-mer classifier that combines the k-mers from the custom classification with all of Genbank;
  • applies that k-mer classifier to a set of query genomes.

It's a long, ugly script, but it seems to work OK on my test subsets. Let's give it a try!
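(Before running it, here is a rough sketch of just the first of those steps -- parsing the custom-lineage spreadsheet -- assuming one identifier column followed by one column per rank. This is illustrative only, not the actual classify-free-tax.py code, and the helper name is made up.)

import csv

RANKS = ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']

def load_custom_lineages(csvfile):
    """Map genome-bin identifier -> list of (rank, name) pairs."""
    lineages = {}
    with open(csvfile, newline='') as fp:
        for ident, *names in csv.reader(fp):
            lineages[ident] = [(rank, name)
                               for rank, name in zip(RANKS, names) if name]
    return lineages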

Running classify-free-tax.py

Let's reclassify 100 randomly chosen genomes from the Delmont et al., 2017 study, as per this blog post, but using our new code.

Here we're interested in validation, so we're going to use the 100 randomly chosen genomes to classify the 100 randomly chosen genomes and compare the results.

There are two new commands. The first indexes the custom genomes so we can use them in the next classification step:

2017-sourmash-revindex/extract-hashvals-by-sample.py delmont delmont/*.sig

The next command uses the indexed custom genomes, plus the Genbank LCA database, to classify all of the genomes under delmont/:

PYTHONPATH=$PYTHONPATH:2017-sourmash-revindex/ \
    ../classify-free-tax.py tara-delmont-SuppTable3.csv delmont \
    delmont/*.sig -o reclassify.csv

Validating the results

To see how well things worked, we can now run a comparison of the input spreadsheet vs the reclassified spreadsheet using cmp-csv.py:

../cmp-csv.py tara-delmont-SuppTable3.csv reclassify.csv

The output is below. The first lineage line is from the input spreadsheet, and the second lineage line is the re-classification using our software.

tl;dr? Out of 100 genome bins from Delmont et al., we get five disagreements, and it looks like the reclassification is fine.

TARA_ANW_MAG_00051
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Rickettsiales', 'Pelagibacteraceae', '', '']
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Pelagibacterales', 'Pelagibacteraceae', '', '']
--
TARA_IOS_MAG_00076
         ['Eukaryota', '', '', '', '', '', '']
         ['Eukaryota', 'Haptophyta', 'Prymnesiophyceae', 'Isochrysidales', 'Noelaerhabdaceae', 'Emiliania', '']
--
TARA_PSE_MAG_00092
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Rickettsiales', 'Pelagibacteraceae', '', '']
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Pelagibacterales', 'Pelagibacteraceae', '', '']
--
TARA_PSW_MAG_00133
         ['Eukaryota', '', '', '', '', '', '']
         ['Eukaryota', 'Haptophyta', 'Prymnesiophyceae', 'Isochrysidales', 'Noelaerhabdaceae', 'Emiliania', '']
--
TARA_RED_MAG_00001
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Sphingomonadales', 'Erythrobacteraceae', 'Citromicrobium', '']
         ['Bacteria', 'Proteobacteria', 'Alphaproteobacteria', 'Sphingomonadales', 'Sphingomonadaceae', 'Citromicrobium', '']
--

In three of the five, there is an internal disagreement in the taxonomies, e.g. Pelagibacteraceae has been rerooted under Pelagibacterales in our spreadsheet. This is probably due to changes in the NCBI taxonomy between when Delmont et al. did their classification and when I downloaded the NCBI taxonomy a few months ago.

In the other two of the five, Delmont et al. could only classify the bins down to Eukaryota, but our k-mer classifier is telling us that they belong to the genus Emiliania. On inspection (hint: pass the --debug flag to the classify script), approximately 470,000 of the k-mers in TARA_IOS_MAG_00076 belong to genus Emiliania, and 1.41 million of the k-mers in TARA_PSW_MAG_00133 belong to genus Emiliania, so I'm going to provisionally argue that our re-classification is correct.

What next?

The code is ugly and needs refactoring, tests, etc.

It'd be fun to repeat the taxonomic examination of the Tara ocean bins once I have it working more nicely. I should be able to do a much better (more thorough) job of comparison than before!

More broadly, this is a useful extension of the least-common-ancestor code that I posted a few weeks back and I need to revisit all the old code in light of this. I think I'll be retooling the LCA code substantially to allow the use of custom taxonomies/genome classifications.

I really want to dig into the 8,000 genomes that were posted a little while back (from this paper), and I should make sure my taxonomy code works with the Genome Taxonomy Database.

In the meantime, I'd love to get feedback on the above code if anyone is interested in trying it out!

Appendix: full set of commands

First, repeat (most of) the steps in this blog post to install and compute a bunch of things.

# create a virtualenv and install sourmash
python3.5 -m venv ~/py3
. ~/py3/bin/activate
pip install -U pip
pip install -U Cython
pip install -U jupyter jupyter_client ipython pandas matplotlib scipy scikit-learn khmer

pip install -U https://github.com/dib-lab/sourmash/archive/master.zip

# grab the 2017-sourmash-lca repository
git clone https://github.com/ctb/2017-sourmash-lca

# download subsample data
cd 2017-sourmash-lca/
curl -L https://osf.io/73yfz/download?version=1 -o subsample.tar.gz
tar xzf subsample.tar.gz
cd subsample

# compute signatures for delmont
cd delmont
sourmash compute -k 21,31,51 --scaled 10000 --name-from-first *.fa.gz

# fix names to match identifiers in spreadsheet (below)
python ../delmont-fix-sig-names.py *.sig

for i in *.sig
do
    mv $i.fixed $i
done
cd ../

# download NCBI LCA database
curl -L https://osf.io/zfmbd/download?version=1 -o genbank-lca-2017.08.26.tar.gz
mkdir -p db
cd db/
tar xzf ../genbank-lca-2017.08.26.tar.gz
cd ../

# grab delmont spreadsheet
curl -O -L https://github.com/ctb/2017-sourmash-lca/raw/master/tara-delmont-SuppTable3.csv

If you've already done the above, you'll need to also clone the 2017-sourmash-revindex repository:

git clone https://github.com/ctb/2017-sourmash-revindex.git

and now you're ready to run the commands above.

by C. Titus Brown at October 28, 2017 10:00 PM

October 25, 2017

Continuum Analytics

Announcing the Release of Anaconda Distribution 5.0

We’re thrilled to announce the release of Anaconda Distribution 5.0! With over 4.5 million active users, Anaconda Distribution is the world’s most popular and trusted distribution for data science. It allows you to easily install 1,000+ Python and R data science packages and manage your packages, dependencies, and environments—all with the single click of a …
Read more →

by Team Anaconda at October 25, 2017 08:06 PM

October 21, 2017

Titus Brown

Grokking "custom" taxonomies.

(Some more Saturday/Sunday Morning Bioinformatics...)

A while back I posted some hacky code to classify genome bins using a custom reference database. Much to my delight, Sarah Stevens (a grad student in Trina McMahon's lab) actually tried using it! And of course it didn't do what she wanted.

The problem is that all of my least-common-ancestor scripts rely utterly and completely on the taxonomic IDs from NCBI, but anyone who wants to do classification of new genome bins against their own old genome bins doesn't have NCBI taxonomic IDs. (This should have been obvious to me but I was avoiding thinking about it, because I have a whole nice pile of Python code that has to change. Ugh.)

So basically my code ends up doing a classification with custom genomes that is only useful insofar as the custom genomes match NCBI's taxonomy. NOT SO USEFUL.

I'd already noticed that my code for getting taxonomic IDs from the Tara binned genomes barfed on lots of assignments (see script links after "used the posted lineages to assign NCBI taxonomic IDs" in this blog post). This was because the Tully and Delmont genome bins contained a lot of novel species. But I didn't know what to do about that so I punted.

Unfortunately I can punt no more. So I'm working on fixing that, somehow. Today's blog post is about the fallout from some initial work.


Statement of Problem, revisited

We want to classify NEW genome bins using Genbank as well as a custom collection of OLD genome bins that contains both NCBI taxonomy (where available) and custom taxonomic elements.

My trial solution

The approach I took was to build custom extensions to the NCBI taxonomy by rooting the custom taxonomies to NCBI taxids.

Basically, given a custom spreadsheet like this that contains unique identifiers and a taxonomic lineage, I walk through each row, and for each row, find the least common ancestor that shares a name and rank with NCBI. Then the taxonomy for each row consists of triples (rank, name, taxid) where the taxid below the least common ancestor is set to None.

This yields assignments like this, one for each row.

[('superkingdom', 'Bacteria', 2),
 ('phylum', 'Proteobacteria', 1224),
 ('class', 'Alphaproteobacteria', 28211),
 ('order', 'unclassifiedAlphaproteobacteria', None),
 ('family', 'SAR116cluster', None)]

So far so good!

You can see the code here if you are interested.
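(Here is a toy sketch of that rooting logic -- not the parse-free-tax.py code -- where a small hard-coded dict stands in for the real names.dmp/nodes.dmp lookup:)

ncbi = {
    ('superkingdom', 'Bacteria'): 2,
    ('phylum', 'Proteobacteria'): 1224,
    ('class', 'Alphaproteobacteria'): 28211,
}

def root_lineage(lineage):
    """lineage: list of (rank, name) pairs from the custom spreadsheet."""
    rooted = []
    in_ncbi = True
    for rank, name in lineage:
        taxid = ncbi.get((rank, name)) if in_ncbi else None
        if taxid is None:
            in_ncbi = False              # everything below this point is custom
        rooted.append((rank, name, taxid))
    return rooted

print(root_lineage([('superkingdom', 'Bacteria'),
                    ('phylum', 'Proteobacteria'),
                    ('class', 'Alphaproteobacteria'),
                    ('order', 'unclassifiedAlphaproteobacteria'),
                    ('family', 'SAR116cluster')]))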

Problems and challenges with my trial solution

So I ran this on the Delmont and Tully spreadsheets, and by and large I could compute NCBI-rooted custom taxonomies for most of the rows. (You can try it for yourself if you like; install stuff, then run

python parse-free-tax.py db/genbank/{names,nodes}.dmp.gz tara-delmont-SuppTable3.csv > tara-delmont-ncbi-disagree.txt
python parse-free-tax.py db/genbank/{names,nodes}.dmp.gz tara-tully-Table4.csv > tara-tully-ncbi-disagree.txt

in the 2017-sourmash-lca directory.)

But! There were a few places where there were apparent disagreements between what was in NCBI and what was in the spreadsheet.

Other than typos and parsing errors, these disagreements came in two flavors.

First, there were things that looked like minor alterations in spelling - e.g., below, the custom taxonomy file had "Flavobacteria", while NCBI spells it "Flavobacteriia".

confusing lineage #15
    CSV:  Bacteria, Bacteroidetes, Flavobacteria, Flavobacteriales, Flavobacteriaceae, Xanthomarina
    NCBI: Bacteria, Bacteroidetes, Flavobacteriia, Flavobacteriales, Flavobacteriaceae, Xanthomarina
(2 rows in spreadsheet)

Second, there were just plain ol' disagreements, e.g. below the custom taxonomy file says that genus "Haliea" belongs under order "Alteromonadales_3" while NCBI claims that it belongs under order "Cellvibrionales". This could be a situation where the researchers reused the genus name "Haliea" under a new order, OR (and I think this is more likely) perhaps the researchers are correcting the NCBI taxonomy.

confusing lineage #12
    CSV:  Bacteria, Proteobacteria, Gammaproteobacteria, Alteromonadales_3, Alteromonadaceae, Haliea
    NCBI: Bacteria, Proteobacteria, Gammaproteobacteria, Cellvibrionales, Halieaceae, Haliea
(2 rows in spreadsheet)

See tara-tully-ncbi-disagree.txt and tara-delmont-ncbi-disagree.txt for the full list.

I have no conclusion.

I'm not sure what to do at this point. For now, I'll plan to override the NCBI taxonomy with the researcher-specific taxonomy, but these disagreements seem like useful diagnostic output. It would be interesting to know what the root cause of this is, though; do these names come from using CheckM to assign taxonomy, as in Delmont et al's workflow? And if so, can I resolve a lot of this by just using that taxonomy? :) Inquiring minds must know!!

Regardless, I'm enthusiastic about the bit where my internal error checking in the script catches these issues. (And hopefully I'll never have to write that code again. It's very unpleasant.)

Onwards! Upwards!

--titus

by C. Titus Brown at October 21, 2017 10:00 PM

Bruno Pinho

A simple guide to calculate the Hoek-Brown Failure Criteria in Python

The presence of discontinuities usually reduces the strength of a rock mass relative to the intact rock. Engineers and geologists must take this effect into account to predict failure in slope faces and tunnel excavations. In this post, I will take you through the process of calculating the Hoek-Brown Failure Criteria in Python.
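(For reference, a minimal sketch of the generalized Hoek-Brown criterion, 2002 edition, that the post works through -- the standard formulation in terms of GSI, mi, and disturbance factor D, not the author's actual code:)

import numpy as np

def hoek_brown(sigma3, sigma_ci, GSI, mi, D=0.0):
    """Major principal stress at failure for a given confining stress sigma3."""
    mb = mi * np.exp((GSI - 100.0) / (28.0 - 14.0 * D))
    s = np.exp((GSI - 100.0) / (9.0 - 3.0 * D))
    a = 0.5 + (np.exp(-GSI / 15.0) - np.exp(-20.0 / 3.0)) / 6.0
    return sigma3 + sigma_ci * (mb * sigma3 / sigma_ci + s) ** a

sigma3 = np.linspace(0, 10, 5)    # confining stresses in MPa
print(hoek_brown(sigma3, sigma_ci=50.0, GSI=60.0, mi=10.0))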

by Bruno Ruas de Pinho at October 21, 2017 02:30 AM

October 19, 2017

Continuum Analytics

InfoWorld: 5 essential Python tools for data science—now improved

If you want to master, or even just use, data analysis, Python is the place to do it. Python is easy to learn, it has vast and deep support, and most every data science library and machine learning framework out there has a Python interface.

by Sheyna Webster at October 19, 2017 04:30 PM

October 17, 2017

Matthieu Brucher

Announcement: Audio TK 2.2.0

ATK is updated to 2.2.0 with the major introduction of vectorized filters. This means that some filters (EQ for now) can use vectorization for maximum performance. More filters will be introduced later, as well as Python support. Vector lanes of size 4 and 8 are supported, as well as instruction sets from SSE2 to AVX512.

This is also the first major release that officially supports the JUCE framework. This means that ATK can be added as modules (directly from source code, without requiring any binaries) in the Projucer. The caveat is that the SIMD filters are not available in this configuration, since building them requires CMake support.

Download link: ATK 2.2.0

Changelog:
2.2.0
* Introduced SIMD filters with libsimdpp
* Refactored EQ filters to work with SIMD filters
* Added module files for JUCE Projucer
2.1.1
* Added a Gain Max Compressor filter with wrappers
* Added a dry run call on BaseFilter to setup maximum sizes on a pipeline
* Added a IIR TDF2 (Transposed Direct Form 2) filter implementation (no Python wrappers for now)
* Fixed max gain reduction in the expanders to use 20 log10 instead of 10 log10 (as it is applied to the amplitude and not power)
* Fix a bug in OutCircularPointerFilter with offset handling
* Fix a bug in RIAA inverse filters
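(Side note on the 20 log10 vs 10 log10 item above: decibels for an amplitude ratio use 20 log10 because power scales as the square of amplitude, e.g.:)

import math

amplitude_ratio = 0.5
print(20 * math.log10(amplitude_ratio))        # about -6.02 dB, from the amplitude
print(10 * math.log10(amplitude_ratio ** 2))   # same -6.02 dB, from the power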


by Matt at October 17, 2017 07:21 AM

October 16, 2017

Matthew Rocklin

Streaming Dataframes

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation

This post is about experimental software. This is not ready for public use. All code examples and API in this post are subject to change without warning.

Summary

This post describes a prototype project to handle continuous data sources of tabular data using Pandas and Streamz.

Introduction

Some data never stops. It arrives continuously in a constant, never-ending stream. This happens in financial time series, web server logs, scientific instruments, IoT telemetry, and more. Algorithms to handle this data are slightly different from what you find in libraries like NumPy and Pandas, which assume that they know all of the data up-front. It’s still possible to use NumPy and Pandas, but you need to combine them with some cleverness and keep enough intermediate data around to compute marginal updates when new data comes in.

Example: Streaming Mean

For example, imagine that we have a continuous stream of CSV files arriving and we want to print out the mean of our data over time. Whenever a new CSV file arrives we need to recompute the mean of the entire dataset. If we’re clever we keep around enough state so that we can compute this mean without looking back over the rest of our historical data. We can accomplish this by keeping running totals and running counts as follows:

total = 0
count = 0

for filename in filenames:  # filenames is an infinite iterator
    df = pd.read_csv(filename)
    total = total + df.sum()
    count = count + df.count()
    mean = total / count
    print(mean)

Now as we add new files to our filenames iterator our code prints out new means that are updated over time. We don’t have a single mean result; we have a continuous stream of mean results, each valid for the data up to that point. Our output data is an infinite stream, just like our input data.

When our computations are linear and straightforward like this a for loop suffices. However when our computations have several streams branching out or converging, possibly with rate limiting or buffering between them, this for-loop approach can grow complex and difficult to manage.

Streamz

A few months ago I pushed a small library called streamz, which handled control flow for pipelines, including linear map operations, operations that accumulated state, branching, joining, as well as back pressure, flow control, feedback, and so on. Streamz was designed to handle all of the movement of data and signaling of computation at the right time. This library was quietly used by a couple of groups and now feels fairly clean and useful.
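(For a flavor of what that looks like, here is a tiny pipeline based on the streamz documentation -- not code from this post -- combining a map, a stateful accumulation, and branching:)

from streamz import Stream

source = Stream()
source.map(lambda x: x * 2).sink(print)                  # branch 1: double each element
source.accumulate(lambda acc, x: acc + x).sink(print)    # branch 2: running total

for i in range(3):
    source.emit(i)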

Streamz was designed to handle the control flow of such a system, but did nothing to help you with streaming algorithms. Over the past week I’ve been building a dataframe module on top of streamz to help with common streaming tabular data situations. This module uses Pandas and implements a subset of the Pandas API, so hopefully it will be easy to use for programmers with existing Python knowledge.

Example: Streaming Mean

Our example above could be written as follows with streamz

source = Stream.filenames('path/to/dir/*.csv')  # stream of filenames
sdf = (source.map(pd.read_csv)                  # stream of Pandas dataframes
             .to_dataframe(example=...))        # logical streaming dataframe

sdf.mean().stream.sink(print)                   # printed stream of mean values

This example is no more clear than the for-loop version. On its own this is probably a worse solution than what we had before, just because it involves new technology. However it starts to become useful in two situations:

  1. You want to do more complex streaming algorithms

    sdf = sdf[sdf.name == 'Alice']
    sdf.x.groupby(sdf.y).mean().sink(print)
    
    # or
    
    sdf.x.rolling('300ms').mean()
    

    It would require more cleverness to build these algorithms with a for loop as above.

  2. You want to do multiple operations, deal with flow control, etc..

    sdf.mean().sink(print)
    sdf.x.sum().rate_limit(0.500).sink(write_to_database)
    ...
    

    Branching off computations, routing data correctly, and handling time can all be challenging to accomplish consistently.

Jupyter Integration and Streaming Outputs

During development we’ve found it very useful to have live updating outputs in Jupyter.

Usually when we evaluate code in Jupyter we have static inputs and static outputs:

However now both our inputs and our outputs are live:

We accomplish this using a combination of ipywidgets and Bokeh plots both of which provide nice hooks to change previous Jupyter outputs and work well with the Tornado IOLoop (streamz, Bokeh, Jupyter, and Dask all use Tornado for concurrency). We’re able to build nicely responsive feedback whenever things change.

In the following example we build our CSV to dataframe pipeline that updates whenever new files appear in a directory. Whenever we drag files to the data directory on the left we see that all of our outputs update on the right.

What is supported?

This project is very young and could use some help. There are plenty of holes in the API. That being said, the following works well:

Elementwise operations:

sdf['z'] = sdf.x + sdf.y
sdf = sdf[sdf.z > 2]

Simple reductions:

sdf.sum()
sdf.x.mean()

Groupby reductions:

sdf.groupby(sdf.x).y.mean()

Rolling reductions by number of rows or time window

sdf.rolling(20).x.mean()
sdf.rolling('100ms').x.quantile(0.9)

Real time plotting with Bokeh (one of my favorite features)

sdf.plot()

What’s missing?

  1. Parallel computing: The core streamz library has an optional Dask backend for parallel computing. I haven’t yet made any attempt to attach this to the dataframe implementation.
  2. Data ingestion from common streaming sources like Kafka. We’re in the process now of building asynchronous-aware wrappers around Kafka Python client libraries, so this is likely to come soon.
  3. Out-of-order data access: soon after parallel data ingestion (like reading from multiple Kafka partitions at once) we’ll need to figure out how to handle out-of-order data access. This is doable, but will take some effort. This is where more mature libraries like Flink are quite strong.
  4. Performance: Some of the operations above (particularly rolling operations) do involve non-trivial copying, especially with larger windows. We’re relying heavily on the Pandas library which wasn’t designed with rapidly changing data in mind. Hopefully future iterations of Pandas (Arrow/libpandas/Pandas 2.0?) will make this more efficient.
  5. Filled out API: Many common operations (like variance) haven’t yet been implemented. Some of this is due to laziness and some is due to wanting to find the right algorithm.
  6. Robust plotting: Currently this works well for numeric data with a timeseries index but not so well for other data.

But most importantly this needs use by people with real problems to help us understand what here is valuable and what is unpleasant.

Help would be welcome with any of this.

You can install this from github

pip install git+https://github.com/mrocklin/streamz.git

Documentation and code are here:

Current work

Current and upcoming work is focused on data ingestion from Kafka and parallelizing with Dask.

October 16, 2017 12:00 AM

October 15, 2017

Titus Brown

A brief introduction to osfclient, a command line client for the Open Science Framework

Over the last few months, Tim Head has been pushing forward the osfclient project, an effort to build a simple and friendly command-line interface to the Open Science Framework's file storage. This project was funded by a gift to my lab through the Center for Open Science (COS) to the tune of about $20k, given by an anonymous donor.

The original project was actually to write an OSF integration for Galaxy, but that project was first delayed by my move to UC Davis and then suffered from Michael Crusoe's move to work on the Common Workflow Language. After talking with the COS folk, we decided to repurpose the money to something that addresses a need in my lab - using the Open Science Framework to share files.

Our (Tim's) integration effort resulted in osfclient, a combination Python API and command-line program. The project is still in its early stages, but a few people have found it useful - in addition to increasing usage within my lab, @zkamvar has used it to transfer "tens of thousands of files", and @danudwary found it "just worked" for grabbing some big files. And new detailed use cases are emerging regularly.

Most exciting of all, we've had contributions from a number of other people already, and I'm looking forward to this project growing to meet the needs of the open science community!

Taking a step back: why OSF, and why a command-line client?

I talked a bit about "why OSF?" in a previous blog post, but the short version is that it's a globally accessible place to store files for science, and it works well for that! It fits a niche that we haven't found any other solutions for - free storage for medium size genomics files - and we're actively exploring its use in about a dozen different projects.

Our underlying motivations for building a command-line client for OSF were several:

  • we often need to retrieve full folder/directory hierarchies of files for research and training purposes;

  • frequently, we want to retrieve those file hierarchies on remote (cloud or HPC) systems;

  • we're often grabbing files that are larger than GitHub supports;

  • sometimes these files are from private projects that we cannot (or don't want to) publicize;

Here, the Open Science Framework was already an 80% solution (supporting folder hierarchies, large file storage, and a robust permissions system), but it didn't have a command-line client - we were reduced to using curl or wget on individual files, or (in theory) writing our own REST queries.

Enter osfclient!

Using osfclient, a quickstart

(See "Troubleshooting osfclient installs" at the bottom if you run into any troubles running these commands!)

In a Python 3 environment, do:

pip install osfclient

and then execute:

osf -p fuqsk clone

This will go to the osfclient test project on http://osf.io, and download all the files that are part of that project -- if you execute:

find fuqsk

you should see:

fuqsk
fuqsk/figshare
fuqsk/figshare/this is a test text file
fuqsk/figshare/this is a test text file/hello.txt
fuqsk/googledrive
fuqsk/googledrive/google test file.gdoc
fuqsk/googledrive/googledrive-hello.txt
fuqsk/osfstorage
fuqsk/osfstorage/hello.txt
fuqsk/osfstorage/test-subfolder
fuqsk/osfstorage/test-subfolder/hello-from-subfolder.txt

which showcases a particularly nice feature of the OSF that I'll talk about below.

A basic overview of what osfclient did

If you go to the project URL, http://osf.io/fuqsk, you will see a file storage hierarchy that looks like so:

OSF folders screenshot

What osfclient is doing is grabbing all of the different storage files and downloading them to your local machine. Et voila!

What's with the 'figshare' and 'googledrive' stuff? Introducing add-ons/integrations.

In the above, you'll notice that there are these subdirectories named figshare and googledrive. What are those?

The Open Science Framework can act as an umbrella integration for a variety of external storage services - see the docs. They support Amazon S3, Dropbox, Google Drive, Figshare, and a bunch of others.

In the above project, I linked in my Google Drive and Figshare accounts to OSF, and connected specific remote folders/projects into the OSF project (this one from Google Drive, and this one from figshare). This allows me (and others with permissions on the project) to access and manage those files from within a single Web UI on the OSF.

osfclient understands some of these integrations (and it's pretty trivial to add a new one to the client, at least), and it does the most obvious thing possible with them when you do an osfclient clone: it grabs the files and downloads them! (It should also be able to push to those remote storages, but I haven't tested that today.)

Interestingly, this appears to be a good simple way to layer OSF's project hierarchy and permission system on top of more complex and/or less flexible and/or non-command-line-friendly systems. For example, Luiz Irber recently uploaded a very large file to Google Drive via rclone and it showed up in his OSF project just fine.

This reasonably flexible imposition of an overall namespace on a disparate collection of storages is pretty nice, and could be a real benefit for large, complex projects.

Other things you can do with osfclient

osfclient also has file listing and file upload functionality, along with some configurability in terms of providing a default project and permissions within specific directories. The osfclient User Guide has some brief instructions along these lines.
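
For a rough flavor, the commands look something like the following (a sketch against the same fuqsk test project; the file names are made up, uploading requires credentials and write permission, and the User Guide is the authority on the exact subcommands and flags):

osf -p fuqsk ls                                      # list the remote files in the project
osf -p fuqsk upload notes.txt osfstorage/notes.txt   # push a local file into the project's osfstorage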

osfclient also contains a Python API for OSF, and you can see a bit more about that here, in Tim Head and Erin Braswell's webinar materials.
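
To give a flavor of that API, here is a minimal sketch for the public fuqsk project (anonymous access is enough for public projects; the attribute names below are taken from my reading of the docs and may drift as the project evolves):

from osfclient import OSF

osf = OSF()                       # no credentials needed for a public project
project = osf.project('fuqsk')

# walk every storage provider attached to the project and download its files
for storage in project.storages:
    for f in storage.files:
        print(storage.name, f.path)
        local_name = f.path.lstrip('/').replace('/', '_')
        with open(local_name, 'wb') as fp:
            f.write_to(fp)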

What's next?

There are a few inconveniences about the OSF that could usefully be worked around, and a lot of features to be added in osfclient. In no particular order, here are a few of the big ones that require significant refactoring or design decisions or even new REST API functionality on the OSF side --

  • we want to make osf behave a bit more like git - see the issue. This would make it easier to teach and use, we think. In particular we want to avoid having to specify the project name every time.
  • speaking of project names, I don't think the project UIDs on the OSF (fuqsk above) are particularly intuitive or typeable, and it would be great to have a command line way of discovering the project UID for your project of interest.
  • I'd also like to add project creation and maybe removal via the command line, as well as project registration - more on that later.
  • the file storage hierarchy above, with osfstorage/ and figshare/ as top level directories, isn't wonderful for command line folk - there are seemingly needless hierarchies in there. I'm not sure how to deal with this but there are a couple of possible solutions, including adding a per-project 'remapping' configuration that would move the files around.

Concluding thoughts

The OSF offers a simple, free, Web friendly, and convenient way to privately and publicly store collections of files under 5 GB in size on a Web site. osfclient provides a simple and reasonably functional way to download files from and upload files to the OSF via the command line. Give it a try!

Appendix: Troubleshooting osfclient installs

  1. If you can't run pip install on your system, you may need to either run the command as root, OR establish a virtual environment -- something like

python -m virtualenv -p python3.5 osftest
. osftest/bin/activate
pip install osfclient

will create a virtualenv, activate it, and install osfclient. (If you run into problems, see the notes below.)

  2. If you get a requests.exceptions.SSLError, you may be on a Mac and using an old version of openssl. You can try pip install -U pyopenssl. If that doesn't work, please add a comment to this issue.

  3. Note that a conda install for osfclient exists, and you should be able to do conda install -c conda-forge osfclient.

by C. Titus Brown at October 15, 2017 10:00 PM

October 10, 2017

Matthew Rocklin

Notes on Kafka in Python

Summary

I recently investigated the state of Python libraries for Kafka. This blogpost contains my findings.

Both PyKafka and confluent-kafka have mature implementations and are maintained by invested companies. Confluent-kafka is generally faster while PyKafka is arguably better designed and documented for Python usability.

Conda packages are now available for both. I hope to extend one or both to support asynchronous workloads with Tornado.

Disclaimer: I am not an expert in this space. I have no strong affiliation with any of these projects. This is a report based on my experience of the past few weeks. I don’t encourage anyone to draw conclusions from this work. I encourage people to investigate on their own.

Introduction

Apache Kafka is a common data system for streaming architectures. It manages rolling buffers of byte messages and provides a scalable mechanism to publish or subscribe to those buffers in real time. While Kafka was originally designed within the JVM space, the fact that it only manages bytes makes it easy to access from native code systems like C/C++ and Python.

Python Options

Today there are three independent Kafka implementations in Python, two of which are optionally backed by a C implementation, librdkafka, for speed:

  • kafka-python: The first on the scene, a Pure Python Kafka client with robust documentation and an API that is fairly faithful to the original Java API. This implementation has the most stars on GitHub and the most active development team (by number of committers), but it also lacks a connection to the fast C library. I’ll admit that I didn’t spend enough time on this project to judge it well because of this.

  • PyKafka: The second implementation chronologically. This library is maintained by Parse.ly, a web analytics company that heavily uses both streaming systems and Python. PyKafka’s API is more creative and designed to follow common Python idioms rather than the Java API. PyKafka has both a pure Python implementation and connections to the low-level librdkafka C library for increased performance.

  • Confluent-kafka: The final implementation chronologically. It is maintained by Confluent, the primary for-profit company that supports and maintains Kafka. This library is the fastest, but also the least accessible from a Python perspective. It is written as C extensions for CPython, and the documentation is minimal. However, if you are coming from the Java API then this is entirely consistent with that experience, so that documentation probably suffices.

Performance

Confluent-kafka message-consumption bandwidths are around 50% higher and message-production bandwidths are around 3x higher than PyKafka, both of which are significantly higher than kafka-python. I’m taking these numbers from this blogpost which gives benchmarks comparing the three libraries. The primary numeric results follow below:

Note: It’s worth noting that this blogpost was moving smallish 100 byte messages around. I would hope that Kafka would perform better (closer to network bandwidths) when messages are of a decent size.

Producer Throughput

                           time_in_seconds   MB/s   msgs/s
confluent_kafka_producer               5.4     17   183000
pykafka_producer_rdkafka                16    6.1    64000
pykafka_producer                        57    1.7    17000
python_kafka_producer                   68    1.4    15000

Consumer Throughput

                           time_in_seconds   MB/s   msgs/s
confluent_kafka_consumer               3.8     25   261000
pykafka_consumer_rdkafka               6.1     17   164000
pykafka_consumer                        29    3.2    34000
python_kafka_consumer                   26    3.6    38000

Note: I discovered this article on parsely/pykafka #559, which has good conversation about the three libraries.

I profiled PyKafka in these cases and it doesn’t appear that these code paths have yet been optimized. I expect that modest effort could close that gap considerably. This difference seems to be more from lack of interest than any hard design constraint.

It’s not clear how critical these speeds are. According to the PyKafka maintainers at Parse.ly they haven’t actually turned on the librdkafka optimizations in their internal pipelines, and are instead using the slow Pure Python implementation, which is apparently more than fast enough for common use. Getting messages out of Kafka just isn’t their bottleneck. It may be that these 250,000 messages/sec limits are not significant in most applications. I suspect that this matters more in bulk analysis workloads than in online applications.

Pythonic vs Java APIs

It took me a few tries to get confluent-kafka to work. It wasn’t clear what information I needed to pass to the constructor to connect to Kafka, and when I gave the wrong information I received no message that I had done anything incorrectly. Docstrings and documentation were both minimal. In contrast, PyKafka’s API and error messages quickly led me to correct behavior and I was up and running within a minute.

However, I persisted with confluent-kafka, found the right Java documentation, and eventually did get things up and running. Once this happened everything fell into place and I was able to easily build applications with Confluent-kafka that were both simple and fast.
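
For reference, “up and running” looked roughly like the following in each case. This is only a sketch, assuming a broker at localhost:9092 and a topic named 'test'; each project’s documentation is the authority on the full configuration surface:

# confluent-kafka: configuration is a dict of librdkafka/Java-style keys
from confluent_kafka import Consumer

c = Consumer({'bootstrap.servers': 'localhost:9092', 'group.id': 'demo'})
c.subscribe(['test'])
msg = c.poll(timeout=1.0)            # returns None if nothing arrives in time
if msg is not None and not msg.error():
    print(msg.value())

# PyKafka: a more Pythonic hierarchy of client, topic, and consumer objects
from pykafka import KafkaClient

client = KafkaClient(hosts='localhost:9092')
topic = client.topics[b'test']
consumer = topic.get_simple_consumer(consumer_timeout_ms=1000)
message = consumer.consume()         # blocks until a message or the timeout
if message is not None:
    print(message.value)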

Development experience

I would like to add asynchronous support to one or both of these libraries so that they can read or write data in a non-blocking fashion and play nicely with other asynchronous systems like Tornado or Asyncio. I started investigating this with both libraries on GitHub.
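
One low-tech way to keep an event loop responsive in the meantime is to poll with a zero timeout from a periodic callback. This is only a sketch of the idea using confluent-kafka and Tornado (not something either library ships today), assuming the same localhost broker and 'test' topic as above:

from tornado.ioloop import IOLoop, PeriodicCallback
from confluent_kafka import Consumer

consumer = Consumer({'bootstrap.servers': 'localhost:9092', 'group.id': 'tornado-demo'})
consumer.subscribe(['test'])

def poll_once():
    msg = consumer.poll(0)           # zero timeout: return immediately, never block the loop
    if msg is not None and not msg.error():
        print(msg.value())

PeriodicCallback(poll_once, 100).start()   # check for new messages every 100 ms
IOLoop.current().start()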

Developers

Both libraries have a maintainer who is somewhat responsive and whose time is funded by the parent company. Both maintainers seem active on a day-to-day basis and handle contributions from external developers.

Both libraries are fully active with a common pattern of a single main dev merging work from a number of less active developers. Distributions of commits over the last six months look similar:

confluent-kafka-python$ git shortlog -ns --since "six months ago"
38  Magnus Edenhill
5  Christos Trochalakis
4  Ewen Cheslack-Postava
1  Simon Wahlgren

pykafka$ git shortlog -ns --since "six months ago"
52  Emmett Butler
23  Emmett J. Butler
20  Marc-Antoine Parent
18  Tanay Soni
5  messense
1  Erik Stephens
1  Jeff Widman
1  Prateek Shrivastava
1  aleatha
1  zpcui

Codebase

With regard to the codebases, I found that PyKafka was easier to hack on, for a few reasons:

  1. Most of PyKafka is written in Python rather than C extensions, and so it is more accessible to a broader development base. I find that Python C extensions are not pleasant to work with, even if you are comfortable with C.
  2. PyKafka appears to be much more extensively tested. PyKafka actually spins up a local Kafka instance to do comprehensive integration tests, while Confluent-kafka seems to only test the API without actually running against a real Kafka instance.
  3. For what it’s worth, PyKafka maintainers responded quickly to an issue on Tornado. Confluent-kafka maintainers still have not responded to a comment on an existing Tornado issue, even though that comment had significantly more content (a working prototype).

To be clear, no maintainer has any responsibility to answer my questions on github. They are likely busy with other things that are of more relevance to their particular mandate.

Conda packages

I’ve pushed/updated recipes for both packages on conda-forge. You can install them as follows:

conda install -c conda-forge pykafka                 # Linux, Mac, Windows
conda install -c conda-forge python-confluent-kafka  # Linux, Mac

In both cases these are built against the fast librdkafka C library (except on Windows) and install that library as well.

Future plans

I’ve recently started work on streaming systems and pipelines for Dask, so I’ll probably continue to investigate this space. I’m still torn between the two implementations. There are strong reasons to use either of them.

Culturally I am drawn to Parse.ly’s PyKafka library. They’re clearly Python developers writing for Python users. However, the costs of using a non-Pythonic system here just aren’t that large (Kafka’s API is small), and Confluent’s interests are more closely aligned with investing in Kafka long term than Parse.ly’s are.

October 10, 2017 12:00 AM

October 09, 2017

Continuum Analytics

Strata Data Conference Grows Up

The Strata conference will always hold a place in my heart, as it’s one of the events that inspired Travis and me to found Anaconda. We listened to open source-driven talks about data lakes and low-cost storage and knew there would be a demand for tools to help organizations and data scientists derive value from these mountains of information.

by Team Anaconda at October 09, 2017 02:31 PM

October 05, 2017

Continuum Analytics

Database Trends & Applications: Machine Learning and Data Science are Top Trends at Strata Data

Data professionals and vendors converged at Strata Data in New York to trade tips and tricks for handling big data. Top of mind for most was the impact of machine learning and how it’s continuing to evolve as the “next big thing.”

by Sheyna Webster at October 05, 2017 04:29 PM

October 03, 2017

Continuum Analytics

Seven Things You Might Not Know About Numba

In this post, I want to dive deeper and demonstrate several aspects of using Numba on the GPU that are often overlooked. I’ll quickly breeze through a number of topics, but I’ll provide links throughout for additional reading.

by Team Anaconda at October 03, 2017 06:00 PM

Database Trends & Applications: Anaconda Partners with Microsoft to Provide Data Science Python Programs

Anaconda, Inc., a Python data science platform provider, is partnering with Microsoft to embed Anaconda into Azure Machine Learning, Visual Studio and SQL Server to deliver data insights in real time.

by Sheyna Webster at October 03, 2017 04:28 PM

October 02, 2017

Continuum Analytics

Intellyx: Anaconda: Delivering a Python Data Science Platform for the Enterprise

If you’re going down the data science road, there’s a pretty good chance that you’re using Python and Anaconda as part of your toolset. That’s because Anaconda is the most popular open source Python distribution for data science and is both downloaded directly and included in numerous data science platforms.

by Sheyna Webster at October 02, 2017 04:27 PM

App Developer Magazine: Python-powered machine learning with Anaconda and MS partnership

Anaconda, Inc. has announced it is partnering with Microsoft to embed Anaconda into Azure Machine Learning, Visual Studio and SQL Server to deliver data insights in real time.

by Sheyna Webster at October 02, 2017 04:24 PM

September 29, 2017

Continuum Analytics

ZDNet: Strata NYC 2017 to Hadoop: Go jump in a data lake

http://www.zdnet.com/article/strata-nyc-2017-to-hadoop-go-jump-in-a-data-lake/

by Sheyna Webster at September 29, 2017 04:37 PM

September 26, 2017

Continuum Analytics

Anaconda and Microsoft Partner to Deliver Python-Powered Machine Learning

Strata Data Conference, NEW YORK––September 26, 2017––Anaconda, Inc., the most popular Python data science platform provider, today announced it is partnering with Microsoft to embed Anaconda into Azure Machine Learning, Visual Studio and SQL Server to deliver data insights in real time.

by Sheyna Webster at September 26, 2017 01:00 PM

Matthieu Brucher

Book Review: Mastering The Challenges Of Leading Change: Inspire the People and Succeed Where Others Fail

I like change. More precisely, I like improving things. And as some of the people around me would say, I can be a bull in a china shop. So this book sounded interesting.

Discussion

The book is split into four parts that are explained in the introduction: priorities, politics, people and perseverance.
At the end of each chapter, there is a small questionnaire with actions you can take to improve your ability to change.

Let’s start with priorities. Obviously, this is all about how you want to start the change. If you stay put, you are not moving forward. So the first chapter is about looking for people who will help you change. The second chapter deals with setting up a team ready to lead the change, and the last one of this part is about the type of changes we are aiming for.

The second part tackles politics, or more precisely what happens behind the change. Communication is crucial, as the fourth chapter argues, but I think communication is also about getting the message that lies behind a situation. It’s also about being ready to listen to solutions other than yours to improve a situation. The bad part is there are also people who just want to mess with you. I do remember some cases where that happened, and I wish I had had in place all the tools from the first two parts; they would probably have solved my problems with the Machiavellis, as the author calls them!

Then we move on to handling people. We go back to communication and the importance of direct rather than indirect communication, then to how to handle a group and move it in the same direction. I also liked the last part, about not doing something just for your job, but also for making new friends.

The last part is about perseverance. There are going to be issues and fires to put out, and you need a proper team. Perseverance is also about keeping moving and being ready for the next change. And to do so, you also need to cultivate the people who may lead that next change!

Conclusion

Great book. Even if part of it is just common sense, it’s also about noticing that some of the things you would do for your friends, you need to do when leading change. For me, this proves that communication, in all its aspects, is everything.

So now, let’s apply this book to my professional life!

by Matt at September 26, 2017 07:31 AM

September 25, 2017

Enthought

Enthought at the 2017 Society of Exploration Geophysicists (SEG) Conference

2017 will be Enthought’s 11th year at the SEG (Society of Exploration Geophysicists) Annual Meeting, and we couldn’t be more excited to be at the leading edge of the digital transformation in oil & gas being driven by the capabilities provided by machine learning and artificial intelligence.

Now in its 87th year, the Annual SEG (Society of Exploration Geophysicists) Meeting will be held in Houston, Texas on September 24-27, 2017 at the George R. Brown Convention Center. The SEG Annual Meeting will be the first large conference to take place in Houston since Hurricane Harvey and its devastating floods, and we’re so pleased to be a small part of getting Houston back “open for business.”

Pre-Event Kickoff: The Machine Learning Geophysics Hackathon

We had such a great experience at the EAGE Subsurface Hackathon in Paris in June that when we heard our friends at Agile Geoscience were planning a machine learning in geophysics hackathon for the US, we had to join! Brendon Hall, Enthought’s Energy Solutions Group Director will be there as a participant and coach and Enthought CEO Eric Jones will be on the judging panel.

Come Meet Us on the SEG Expo Floor & Learn About Our AI-Enabled Solutions for Oil & Gas

Presentations in Enthought Booth #318 (just to the left from the main entrance before the main aisle):

  • Monday, Sept 25, 12-12:45 PM: Lessons Learned From the Front Line: Moving AI From Research to Application
  • Tues, Sept 26, 1-1:45 PM: Canopy Geoscience: Building Innovative, AI-Enabled Geoscience Applications
  • Wed, Sept 27, 12-12:45 PM: Applying Artificial Intelligence to CT, Photo, and Well Log Analysis with Virtual Core

Hart Energy’s E&P Magazine Features Canopy Geoscience

Canopy Geoscience, Enthought’s cross-domain AI platform for oil & gas, was featured in the September 2017 edition of E&P magazine. See the coverage in the online SEG Technology Showcase, in the September print edition, or in the online E&P Flipbook.


Enthought's Canopy Geoscience featured in E&P's September 2017 edition

The post Enthought at the 2017 Society of Exploration Geophysicists (SEG) Conference appeared first on Enthought Blog.

by admin at September 25, 2017 05:14 PM

September 24, 2017

Matthew Rocklin

Dask Release 0.15.3

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.15.3. This release contains stability enhancements and bug fixes. This blogpost outlines notable changes since the 0.15.2 release on August 30th.

You can conda install Dask:

conda install -c conda-forge dask

or pip install from PyPI

pip install dask[complete] --upgrade

Conda packages are available on the conda-forge channel. They will be on defaults in a few days.

Full changelogs are available here:

Some notable changes follow.

Masked Arrays

Dask.array now supports masked arrays, similar to NumPy’s.

In [1]: import dask.array as da

In [2]: x = da.arange(10, chunks=(5,))

In [3]: mask = x % 2 == 0

In [4]: m = da.ma.masked_array(x, mask)

In [5]: m
Out[5]: dask.array<masked_array, shape=(10,), dtype=int64, chunksize=(5,)>

In [6]: m.compute()
Out[6]:
masked_array(data = [-- 1 -- 3 -- 5 -- 7 -- 9],
             mask = [ True False  True False  True False  True False  True False],
       fill_value = 999999)

This work was primarily done by Jim Crist and partially funded by the UK Met office in support of the Iris project.

Constants in atop

Dask.array experts will be familiar with the atop function, which powers a non-trivial amount of dask.array and is commonly used by people building custom algorithms. This function now supports constants when the index given is None.

atop(func, 'ijk', x, 'ik', y, 'kj', CONSTANT, None)
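
Concretely, an argument whose index is None is handed to the user function as-is rather than being split into chunks. A minimal sketch (add_constant and the array here are made up for illustration):

import dask.array as da

x = da.ones((4, 4), chunks=(2, 2))

def add_constant(block, c):
    return block + c                 # c arrives as the plain constant, not a chunk

y = da.atop(add_constant, 'ij', x, 'ij', 10, None, dtype=x.dtype)
y.compute()                          # a 4x4 array of 11s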

Memory management for workers

Dask workers spill excess data to disk when they reach 60% of their allotted memory limit. Previously we only measured memory use by adding up the memory use of every piece of data produced by the worker. This could fail in a few situations:

  1. Our per-data estimates were faulty
  2. User code consumed a large amount of memory without our tracking it

To compensate we now also periodically check the memory use of the worker using system utilities with the psutil module. We dump data to disk if the process rises above 70% use, stop running new tasks if it rises above 80%, and restart the worker if it rises above 95% (assuming that the worker has a nanny process).

Breaking Change: Previously the --memory-limit keyword to the dask-worker process specified the 60% “start pushing to disk” limit. So if you had 100GB of RAM then you previously might have started a dask-worker as follows:

dask-worker ... --memory-limit 60e9  # before specify 60% target

And the worker would start pushing to disk once it had 60GB of data in memory. However, now we are changing this meaning to be the full amount of memory given to the process.

dask-worker ... --memory-limit 100e9  # now specify the full 100% target

Of course, you don’t have to specify this limit (many don’t). It will be chosen for you automatically. If you’ve never cared about this then you shouldn’t start caring now.

More about memory management here: http://distributed.readthedocs.io/en/latest/worker.html?highlight=memory-limit#memory-management

Statistical Profiling

Workers now poll their worker threads every 10ms and keep a running count of which functions are being used. This information is available on the diagnostic dashboard as a new “Profile” page. It provides information that is orthogonal to, and generally more detailed than, the typical task-stream plot.

These plots are available on each worker, and an aggregated view is available on the scheduler. The timeseries on the bottom allows you to select time windows of your computation to restrict the parallel profile.

More information about diagnosing performance available here: http://distributed.readthedocs.io/en/latest/diagnosing-performance.html

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.2 release on August 30th

  • Adonis
  • Christopher Prohm
  • Danilo Horta
  • jakirkham
  • Jim Crist
  • Jon Mease
  • jschendel
  • Keisuke Fujii
  • Martin Durant
  • Matthew Rocklin
  • Tom Augspurger
  • Will Warner

The following people contributed to the dask/distributed repository since the 1.18.3 release on September 2nd:

  • Casey Law
  • Edrian Irizarry
  • Matthew Rocklin
  • rbubley
  • Tom Augspurger
  • ywangd

September 24, 2017 12:00 AM

September 23, 2017

Bruno Pinho

Integrating & Exploring 3: Download and Process DEMs in Python

This tutorial shows how to automate downloading and processing DEM files. It shows a one-liner code to download SRTM (30 or 90 m) data and how to use rasterio to reproject the downloaded data into a desired CRS, spatial resolution or bounds.

by Bruno Ruas de Pinho at September 23, 2017 01:00 AM

September 22, 2017

numfocus

Open Journals joins NumFOCUS Sponsored Projects

NumFOCUS is pleased to welcome the Open Journals as a fiscally sponsored project. Open Journals is a collection of open source, open access journals. The team behind Open Journals believes that code review and high-quality, reusable software are a critical–but often overlooked–part of academic research. The primary goal of Open Journals is to provide venues […]

by NumFOCUS Staff at September 22, 2017 02:00 PM