## April 23, 2015

### Titus Brown

#### More on scientific software

So I wrote this thing that got an awful lot of comments, many telling me that I'm just plain wrong. I think it's impossible to respond comprehensively :). But here are some responses.

## What is, what could be, and what should be

In that blog post, I argued that software shouldn't be considered a primary output of scientific research. But I completely failed to articulate a distinction between what we do today with respect to scientific software, what we could be doing in the not-so-distant future, and what we should be doing. Worse, I mixed them all up!

Peer reviewed publications and grants are the current coin of the realm. When we submit papers and grants for peer review, we have to deal with what those reviewers think right now. In bioinformatics, this largely means papers get evaluated on their perceived novelty and impact (even in impact-blind journals). Software papers are generally evaluated poorly on these metrics, so it's hard to publish bioinformatics software papers in visible places, and it's hard to argue in grants to the NIH (and most of the biology-focused NSF) that pure software development efforts are worthwhile. This is what is, and it makes it hard for methods+software research to get publications and funding.

Assuming that you agree that methods+software research is important in bioinformatics, what could we be doing in the near distant future to boost the visibility of methods+software? Giving DOIs to software is one way to accrue credit to software that is highly used, but citations take a long time to pile up, reviewers won't know what to expect in terms of numbers (50 citations? is that a lot?), and my guess is that they will be poorly valued in important situations like funding and advancement. It's an honorable attempt to hack the system and software DOIs are great for other purposes, but I'm not optimistic about their near- or middle-term impact.

We could also start to articulate values and perspectives to guide reviewers and granting systems. And this is what I'd like to do. But first, let me rant a bit.

I think people underestimate the hidden mass in the scientific iceberg. Huge amounts of money are spent on research, and I would bet that there are at least twenty thousand PI-level researchers around the world in biology. In biology-related fields, any of these people may be called upon to review your grant or your paper, and their opinions will largely be, well, their own. To get published, funded, or promoted, you need to convince some committee containing these smart, and opinionated researchers that what you're doing is both novel and impactful. To do that, you have to appeal largely to values and beliefs that they already hold.

Moreover, this set of researchers - largely made of people who have reached tenured professor status - sits on editorial boards, funding agency panels, and tenure and promotion committees. None of these boards and funding panels exist in a vacuum, and while to some extent program managers can push in certain directions, they are ultimately beholden to the priorities of the funding agency, which are (in the best case) channeled from senior scientists.

If you wonder why open access took so damn long to happen, this is one reason - the cultural "mass" of researchers that needs to shift their opinions is huge and unwieldy and resistant to change. And they are largely invisible, and subject to only limited persuasion.

One of the most valuable efforts we can make is to explore what we should be doing, and place it on a logical and sensical footing, and put it out there. For example, check out the CRA's memo on best practices in Promotion and Tenure of Interdisciplinary Faculty - great and thoughtful stuff, IMO. We need a bunch of well thought out opinions in this vein. What guidelines do we want to put in place for evaluating methods+software? How should we evaluate methods+software researchers for impact? When we fund software projects, what should we be looking for?

And that brings me to what should be doing, which is ultimately what I am most interested in. For example, I must admit to deep confusion about what a maturity model for bioinformatics software should look like; this feeds into funding requests, which ultimately feeds into promotion and tenure. I don't know how to guide junior faculty in this area either; I have lots of opinions, but they're not well tested in the marketplace of ideas.

I and others are starting to have the opportunity to make the case for what we should be doing in review panels; what case should we make?

It is in this vein, then, that I am trying to figure out what value to place on software itself, and I'm interested in how to promote methods+software researchers and research. Neil Saunders had an interesting comment that I want to highlight here: he said,

My own feeling is that phrases like "significant intellectual contribution" are just unhelpful academic words,

I certainly agree that this is an imprecise concept, but I can guarantee that in the US, this is one of the three main questions for researchers at hiring, promotion, and tenure. (Funding opportunities and fit are my guesses for the other two.) So I would push on this point: researchers need to appear to have a clear intellectual contribution at every stage of the way, whatever that means. What it means is what I'm trying to explore.

## Software is a tremendously important and critical part of the research endeavor

...but it's not enough. That's my story, and I'm sticking to it :).

I feel like the conversation got a little bit sidetracked by discussions of Nobel Prizes (mea partly culpa), and I want to discuss PhD theses instead. To get a PhD, you need to do some research; if you're a bioinformatics or biology grad student who is focused on methods+software, how much of that research can be software, and what else needs to be there?

And here again I get to dip into my own personal history.

I spent 9 years in graduate school. About 6 years into my PhD, I had a conversation with my advisor that went something like this:

Me, age ~27 - "Hey, Eric, I've got ~two first-author papers, and another one or two coming, along with a bunch of papers. How about I defend my PhD on the basis of that work, and stick around to finish my experimental work as a postdoc?"

Eric - blank look "All your papers are on computational methods. None of them count for your PhD."

Me - "Uhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhmmmmmmmmmm..."

(I did eventually graduate, but only after three more years of experiments.)

In biology, we have to be able to defend our computational contributions in the face of an only slowly changing professoriate. And I'm OK with that, but I think we should make it clear up front.

Since then, I've graduated three (soon to be five, I hope!) graduate students, one in biology and two in CS. In every single case, they've done a lot of hacking. And in every single case they've been asked to defend their intellectual contribution. This isn't just people targeting my students - I've sat on committees where students have produced masses of experimental data, and if they weren't prepared to defend their experimental design, their data interpretation, and the impact and significance of their data interpretation, they weren't read to defend. This is a standard part of the PhD process at Caltech, at MSU, and presumably at UC Davis.

So: to successfully receive a PhD, you should have to clearly articulate the problem you're tackling, its place in the scientific literature, the methods and experiments you're going to use, the data you got, the interpretation you place on that data, and the impact of their results on current scientific thinking. It's a pretty high bar, and one that I'm ok with.

One of the several failure modes I see for graduate students is the one where graduate students spend a huge amount of time developing software and more or less assume that this work will lead to a PhD. Why would they be thinking that?

• Their advisor may not be particularly computational and may be giving poor guidance (which includes poorly explained criteria).
• Their advisor may be using them (intentionally or unintentionally) - effective programmers are hard to find.
• The grad student may be resistant to guidance.

I ticked all of these as a graduate student, but I had the advantage of being a 3rd-generation academic, so I knew the score. (And I still ran into problems.) In my previous blog post, I angered and upset some people by my blunt words (I honestly didn't think "grad student hacker fallacy" was so rude ;( but it's a real problem that I confront regularly.

Computational PhD students need to do what every scientific PhD student needs to do: clearly articulate their problem, place it in the scientific literature, define the computational methods and experiments they're going to do/have done, explain the data and their interpretation of it, and explore how it impacts science. Most of this involves things other than programming and running software! It's impossible to put down percent effort estimates that apply broadly, but my guess is that you shoul spend at least a year understanding your results and interpreting and explaining your work.

Conveniently, however, once you've done that for your PhD, you're ready to go in the academic world! These same criteria (expanded in scope) apply to getting a postdoc, publishing as a postdoc, getting a faculty position, applying for grants, and getting tenure. Moreover, I believe many of the same criteria apply broadly to research outside of academia (which is one reason I'm still strongly +1 on getting a PhD, no matter your ultimate goals).

(Kyle Cranmer's comment on grad student efforts here was perfect.)

## Software as...

As far as software being a primary product of research -- Konrad Hinsen nails it. It's not, but neither are papers, and I'm good with both statements :). Read his blog post for the full argument. The important bit is that very little stands on its own; there always needs to be communication effort around software, data, and methods.

Ultimately, I learned a lot by admitting confusion! Dan Katz and Konrad Hinsen pointed out that software is communication, and Kai Blin drew a great analogy between software and experimental design. These are perspectives that I hadn't seen said so clearly before and they've really made me think differently; both are interesting and provocative analogies and I'm hoping that we can develop them further as a community.

## How do we change things?

Kyle Cranmer and Rory Kirchner had a great comment chain on broken value systems and changing the system. I love the discussion, but I'm struggling with how to respond. My tentative and mildly unhappy conclusion is that I may have bought into the intellectual elitism of academia a bit too much (see: third generation academic), but this may also be how I've gotten where I am, so... mixed bag? (Rory made me feel old and dull, too, which is pretty cool in a masochistic kind of way.)

One observation is that, in software, novelty is cheap. It's very, very easy to tweak something minorly, and fairly easy to publish it without generating any new understanding. How do we distinguish a future Heng Li or an Aaron Quinlan (who have enabled new science by cleanly solving "whole classes of common problems that you don't even have to think about anymore") from humdrum increment, and reward them properly in the earlier stages of their career? I don't know, but the answer has to be tied to advancing science, which is hard to measure on any short timescale. (Sean Eddy's blog post has the clearest view on solutions that I've yet seen.)

Another observation (nicely articulated by Daisie Huang) is that (like open data) this is another game theoretic situation, where the authors of widely used software sink their time and energy into the community but don't necessarily gain wide recognition for their efforts. There's a fat middle ground of software that's reasonably well used but isn't samtools, and this ecosystem needs to be supported. This is much harder to argue - it's a larger body of software, it's less visible, and it's frankly much more expensive to support. (Carl Boettiger's comment is worth reading here.) The funding support isn't there, although that might change in the next decade. (This is the proximal challenge for me, since I place my own software, khmer, in this "fat middle ground"; how do I make a clear argument for funding?)

Kyle Cranmer and others pointed to some success in "major instrumentation" and methods-based funding and career paths in physics (help, can't find link/tweets!). This is great, but I think it's also worth discussing the overall scale of things. Physics has a few really big and expensive instruments, with a few big questions, and with thousands of engineers devoted to them. Just in sequencing, biology has thousands (soon millions) of rather cheap instruments, devoted to many thousands of questions. If my prediction that software will "eat" this part of the world becomes true, we will need tens of thousands of data intensive biologists at a minimum, most working to some large extent on data analysis and software. I think the scale of the need here is simply much, much larger than in physics.

I am supremely skeptical of the idea that universities as we currently conceive of them are the right home for stable, mature software development. We either need to change universities in the right way (super hard) or find other institutions (maybe easier). Here, the model to watch may well be the Center for Open Science, which produces the Open Science Framework (among others). My interpretation is that they are trying to merge scientific needs with the open source development model. (Tellingly, they are doing so largely with foundation funding; the federal funding agencies don't have good mechanisms for funding this kind of thing in biology, at least.) This may be the right model (or at least on the path towards one) for sustained software development in the biological sciences: have an institution focused on sustainability and quality, with a small diversity of missions, that can afford to spend the money to keep a number of good software engineers focused on those missions.

Thanks, all, for the comments and discussions!

--titus

## April 22, 2015

### Gaël Varoquaux

#### MLOSS: machine learning open source software workshop @ ICML 2015

Note

This year again we will have an exciting workshop on the leading-edge machine-learning open-source software. This subject is central to many, because software is how we propagate, reuse, and apply progress in machine learning.

Want to present a project? The deadline for the call for papers is Apr 28th, in a few days : http://mloss.org/workshop/icml15/

The workshop will be help at the ICML conference, in Lille France, on July 10th. ICML –International Conference in Machine Learning– is the leading venue for academic research in machine learning. It’s a fantastic place to hold such a workshop, as the actors of theoretical progress are all around. Software is the bridge that brings this progress beyond papers.

There is a long tradition of MLOSS workshop, with one every year and a half. Last time, at NIPS 2013, I could feel a bit of a turning point, as people started feeling that different software slotted together, to create an efficient and state-of-the art working environment. For this reason, we have entitled this year’s workshop ‘open ecosystems’, stressing that contributions in the scope of the workshop, that build a thriving work environment, are not only machine learning software, but also better statistics or numerical tools.

We have two keynotes with important contributions to such ecosystems:

• John Myles White (Facebook), lead developer of Julia statistics and machine learning: “Julia for machine learning: high-level syntax with compiled-code speed”
• Matthew Rocklin (Continuum Analytics), developer of Python computational tools, in particular Blaze (confirmed): “Blaze, a modern numerical engine with out-of-core and out-of-order computations”.

There will be also a practical presentation on how to set up an open-source project, discussing hosting, community development, quality assurance, license choice, by yours truly.

## April 21, 2015

### Titus Brown

#### Is software a primary product of science?

Update - I've written Yet Another blog post, More on scientific software on this topic. I think this blog post is a mess so you should read that one first ;).

This blog post was spurred by a simple question from Pauline Barmby on Twitter. My response didn't, ahem, quite fit in 144 characters :).

First, a little story. (To paraphrase Greg Wilson, "I tell a lot of stories. Some of them aren't true. But this one is!")

When we were done writing Best Practices for Scientific Computing, we tried submitting it to a different high-profile journal than the one that ultimately accepted it (PLoS Biology, where it went on to become the most highly read article of 2014 in PLoS Biology). The response from the editor went something like this: "We recognize the importance of good engineering, but we regard writing software as equivalent to building a telescope - it's important to do it right, but we don't regard a process paper on how to build telescopes better as an intellectual contribution." (Disclaimer: I can't find the actual response, so this is a paraphrase, but it was definitely a "no" and for about that reason.)

## Is scientific software like instrumentation?

When I think about scientific software as a part of science, I inevitably start with its similarities to building scientific instruments. New instrumentation and methods are absolutely essential to scientific progress, and it is clear that good engineering and methods development skills are incredibly helpful in research.

So, why did the editors at High Profile Journal bounce our paper? I infer that they drew exactly this parallel and thought no further.

But scientific software is only somewhat like new methods or instrumentation.

First, software can spread much faster and be used much more like a black box than most methods, and instrumentation inevitably involves either construction or companies that act as middlemen. With software, it's like you're shipping kits or plans for 3-D printing - something that is as close to immediately usable as it comes. If you're going to hand someone an immediately usable black box (and pitch it as such), I would argue that you should take a bit more care in building said black box.

Second, complexity in software scales much faster than in hardware (citation needed). This is partly due to human nature & a failure to think long-term, and partly due to the nature of software - software can quickly have many more moving parts than hardware, and at much less (short term) cost. Frankly, most software stacks resemble massive Rube Goldberg machines (read that link!) This means that different processes are needed here.

Third, at least in my field (biology), we are undergoing a transition to data intensive research, and software methods are becoming ever more important. There's no question that software is going to eat biology just like it's eating the rest of the world, and an increasingly large part of our primary scientific output in biology is going to hinge directly on computation (think: annotations. 'nuff said).

If we're going to build massively complex black boxes that under-pin all of our science, surely that means that the process is worth studying intellectually?

## Is scientific software a primary intellectual output of science?

No.

I think concluding that it is is an example of the logical fallacy "affirming the consequent" - or, "confusion of necessity and sufficiency". I'm not a logician, but I would phrase it like this (better phrasing welcome!) --

Good software is necessary for good science. Good science is an intellectual contribution. Therefore good software is an intellectual contribution.

Hopefully when phrased that way it's clear that it's nonsense.

I'm naming this "the fallacy of grad student hackers", because I feel like it's a common failure mode of grad students that are good at programming. I actually think it's a tremendously dangerous idea that is confounding a lot of the discussion around software contributions in science.

To illustrate this, I'll draw the analog to experimental labs: you may have people who are tremendously good at doing certain kinds of experiments (e.g. expert cloners, or PCR wizards, or micro-injection aficionados, or WMISH bravados) and with whom you can collaborate to rapidly advance your research. They can do things that you can't, and they can do them quickly and well! But these people often face dead ends in academia and end up as eterna-postdocs, because (for better or for worse) what is valued for first authorship and career progression is intellectual contribution, and doing experiments well is not sufficient to demonstrate an intellectual contribution. Very few people get career advancement in science by simply being very good at a technique, and I believe that this is OK.

Back to software - writing software may become necessary for much of science but I don't think it should ever be sufficient as a primary contribution. Worse, it can become (often becomes?) an engine of procrastination. Admittedly, that procrastination leads to things like IPython Notebook, so I don't want to ding it, but neither are all (or even most ;) grad students like Fernando Perez, either.

## Let's admit it, I'm just confused

This leaves us with a conundrum.

Software is clearly a force multiplier - "better software, better research!.

However, I don't think it can be considered a primary output of science. Dan Katz said, "Nobel prizes have been given for inventing instruments. I'm eagerly awaiting for one for inventing software [sic]" -- but I think he's wrong. Nobels have been given because of the insight enabled by inventing instruments, not for inventing instruments. (Corrections welcome!) So while I, too, eagerly await the explicit recognition that software can push scientific insight forward in biology, I am not holding my breath - I think it's going to look much more like the 2013 Chemistry Nobel, which is about general computational methodology. (My money here would be on a Nobel in Medicine for genome assembly methods, which should follow on separately from massively parallel sequencing methods and shotgun sequencing - maybe Venter, Church, and Myers/Pevzner deserve three different Nobels?)

Despite that, we do need to incentivize it, especially in biology but also more generally. Sean Eddy wrote AN AWESOME BLOG POST ON THIS TOPIC in 2010 (all caps because IT'S AWESOME AND WHY HAVEN'T WE MOVED FURTHER ON THIS <sob>). This is where DOIs for software usually come into play - hey, maybe we can make an analogy between software and papers! But I worry that this is a flawed analogy (for reasons outlined above) and will simply support the wrong idea that doing good hacking is sufficient for good science.

We also have a new problem - the so-called Big Data Brain Drain, in which it turns out that the skills that are needed for advancing science are also tremendously valuable in much more highly paid jobs -- much like physics number crunchers moving to finance, research professors in biology face a future where all our grad students go on to make more than us in tech. (Admittedly, this is only a problem if we think that more people clicking on ads is more important than basic research.) Jake Vanderplas (the author of the Big Data Brain Drain post) addressed potential solutions to this in Hacking Academia, about which I have mixed feelings. While I love both Jake and his blog post (platonically), there's a bit too much magical thinking in that post -- I don't see (m)any of those solutions getting much traction in academia.

The bottom line for me is that we need to figure it out, but I'm a bit stuck on practical suggestions. Natural selection may apply -- whoever figures this out in biology (basic research institutions and/or funding bodies) will have quite an edge in advancing biomedicine -- but natural selection works across multiple generations, and I could wish for something a bit faster. But I don't know. Maybe I'll bring it up at SciFoo this year - "Q: how can we kill off the old academic system faster?" :)

I'll leave you with two little stories.

## The problem, illustrated

In 2009, we started working on what would ultimately become Pell et al., 2012. We developed a metric shit-ton of software (that's a scientific measure, folks) that included some pretty awesomely scalable sparse graph labeling approaches. The software worked OK for our problem, but was pretty brittle; I'm not sure whether or not our implementation of this partitioning approach is being used by anyone else, nor am I sure if it should be :).

However, the paper has been a pretty big hit by traditional scientific metrics! We got it into PNAS by talking about the data structure properties and linking physics, computer science, and biology together. It helped lead directly to Chikhi and Rizk (2013), and it has been cited a whole bunch of times for (I think) its theoretical contributions. Yay!

Nonetheless, the incredibly important and tricky details of scalably partitioning 10 bn node graphs were lost from that paper, and the software was not a big player, either. Meanwhile, Dr. Pell left academia and moved on to a big software company where (on his first day) he was earning quite a bit more than me (good on him! I'd like a 5% tithe, though, in the future :) :). Trust me when I say that this is a net loss to academia.

Summary: good theory, useful ideas, lousy software. Traditional success. Lousy outcomes.

## A contrapositive

In 2011, we figured out that linear compression ratios for sequence data simply weren't going to cut it in the face of the continued rate of data generation, and we developed digital normalization, a deceptively simple idea that hasn't really been picked up by the theoreticians. Unlike the Pell work above, it's not theoretically well studied at all. Nonetheless, the preprint has a few dozen citations (because it's so darn useful) and the work is proving to be a good foundation for further research for our lab. Perhaps the truest measure of its memetic success is that it's been reimplemented by at least three different sequencing centers.

The software is highly used, I think, and many of our efforts on the khmer software have been aimed at making diginorm and downstream concepts more robust.

Summary: lousy theory, useful ideas, good software. Nontraditional success. Awesome outcomes.

## Ways forward?

I simply don't know how to chart a course forward. My current instinct (see below) is to shift our current focus much more to theory and ideas and further away from software, largely because I simply don't see how to publish or fund "boring" things like software development. (Josh Bloom has an excellent blog post that relates to this particular issue: Novelty Squared)

I've been obsessing over these topics of software and scientific focus recently (see The three porridge bowls of scientific software development and Please destroy this software after publication. kthxbye) because I'm starting to write a renewal for khmer's funding. My preliminary specific aims look something like this:

Aim 1: Expand low memory and streaming approaches for biological sequence analysis.

Aim 2: Develop graph-based approaches for analyzing genomic variation.

Aim 3: Optimize and extend a general purpose graph analysis library

Importantly, everything to do with software maintenance, support, and optimization is in Aim 3 and is in fact only a part of that aim. I'm not actually saddened by that, because I believe that software is only interesting because of the new science it enables. So I need to sell that to the NIH, and there software quality is (at best) a secondary consideration.

On the flip side, by my estimate 75% of our khmer funding is going to software maintenance, most significantly in paying down our technical debt. (In the grant I am proposing to decrease this to ~50%.)

I'm having trouble justifying this dichotomy mentally myself, and I can only imagine what the reviewers might think (although hopefully they will only glance at the budget ;).

So this highlights one conundrum: given my estimates and my priorities, how would you suggest I square these stated priorities with my funding allocations? And, in these matters, have I been wrong to focus on software quality, or should I have focused instead on accruing technical debt in the service of novel ideas and functionality? Inquiring minds want to know.

--titus

### Matthew Rocklin

#### Profiling Data Throughput

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Disclaimer: This post is on experimental/buggy code.

tl;dr We measure the costs of processing semi-structured data like JSON blobs.

## Semi-structured Data

Semi-structured data is ubiquitous and computationally painful. Consider the following JSON blobs:

{'name': 'Alice',   'payments': [1, 2, 3]}
{'name': 'Bob',     'payments': [4, 5]}
{'name': 'Charlie', 'payments': None}


This data doesn’t fit nicely into NumPy or Pandas and so we fall back to dynamic pure-Python data structures like dicts and lists. Python’s core data structures are surprisingly good, about as good as compiled languages like Java, but dynamic data structures present some challenges for efficient parallel computation.

## Volume

Semi-structured data is often at the beginning of our data pipeline and so often has the greatest size. We may start with 100GB of raw data, reduce to 10GB to load into a database, and finally aggregate down to 1GB for analysis, machine learning, etc., 1kB of which becomes a plot or table.

Data Bandwidth (MB/s) In Parallel (MB/s)
Disk I/O500500
Decompression100500
Deserialization50250
In-memory computation2000oo
Shuffle930

Common solutions for large semi-structured data include Python iterators, multiprocessing, Hadoop, and Spark as well as proper databases like MongoDB and ElasticSearch. Two months ago we built dask.bag, a toy dask experiment for semi-structured data. Today we’ll strengthen the dask.bag project and look more deeply at performance in this space.

We measure performance with data bandwidth, usually in megabytes per second (MB/s). We’ll build intuition for why dealing with this data is costly.

## Dataset

As a test dataset we play with a dump of GitHub data from https://www.githubarchive.org/. This data records every public github event (commit, comment, pull request, etc.) in the form of a JSON blob. This data is representative fairly representative of a broader class of problems. Often people want to do fairly simple analytics, like find the top ten committers to a particular repository, or clean the data before they load it into a database.

We’ll play around with this data using dask.bag. This is both to get a feel for what is expensive and to provide a cohesive set of examples. In truth we won’t do any real analytics on the github dataset, we’ll find that the expensive parts come well before analytic computation.

Items in our data look like this:

>>> import json
>>> path = '/home/mrocklin/data/github/2013-05-0*.json.gz'
({u'actor': u'mjcramer',
u'actor_attributes': {u'gravatar_id': u'603762b7a39807503a2ee7fe4966acd1',
u'type': u'User'},
u'created_at': u'2013-05-01T00:01:28-07:00',
u'master_branch': u'master',
u'ref': None,
u'ref_type': u'repository'},
u'public': True,
u'repository': {u'created_at': u'2013-05-01T00:01:28-07:00',
u'description': u'',
u'fork': False,
u'forks': 0,
u'has_issues': True,
u'has_wiki': True,
u'id': 9787210,
u'master_branch': u'master',
u'name': u'settings',
u'open_issues': 0,
u'owner': u'mjcramer',
u'private': False,
u'pushed_at': u'2013-05-01T00:01:28-07:00',
u'size': 0,
u'stargazers': 0,
u'url': u'https://github.com/mjcramer/settings',
u'watchers': 0},
u'type': u'CreateEvent',
u'url': u'https://github.com/mjcramer/settings'},)

## Disk I/O and Decompression – 100-500 MB/s

Data Bandwidth (MB/s)
Parallel Read from disk with gzip.open500

A modern laptop hard drive can theoretically read data from disk to memory at 800 MB/s. So we could burn through a 10GB dataset in fifteen seconds on our laptop. Workstations with RAID arrays can do a couple GB/s. In practice I get around 500 MB/s on my personal laptop.

In [1]: import json
In [2]: import dask.bag as db
In [3]: from glob import glob
In [4]: path = '/home/mrocklin/data/github/2013-05-0*.json.gz'

In [5]: %time compressed = '\n'.join(open(fn).read() for fn in glob(path))
CPU times: user 75.1 ms, sys: 1.07 s, total: 1.14 s
Wall time: 1.14 s

In [6]: len(compressed) / 0.194 / 1e6  # MB/s
508.5912175438597

To reduce storage and transfer costs we often compress data. This requires CPU effort whenever we want to operate on the stored values. This can limit data bandwidth.

In [7]: import gzip
In [8]: %time total = '\n'.join(gzip.open(fn).read() for fn in glob(path))
CPU times: user 12.2 s, sys: 18.7 s, total: 30.9 s
Wall time: 30.9 s

In [9]: len(total) / 30.9 / 1e6         # MB/s  total bandwidth
Out[9]: 102.16563844660195

In [10]: len(compressed) / 30.9 / 1e6   # MB/s  compressed bandwidth
Out[10]: 18.763559482200648

So we lose some data bandwidth through compression. Where we could previously process 500 MB/s we’re now down to only 100 MB/s. If we count bytes in terms of the amount stored on disk then we’re only hitting 18 MB/s. We’ll get around this with multiprocessing.

## Decompression and Parallel processing – 500 MB/s

Fortunately we often have more cores than we know what to do with. Parallelizing reads can hide much of the decompression cost.

In [12]: import dask.bag as db

In [13]: %time nbytes = db.from_filenames(path).map(len).sum().compute()
CPU times: user 130 ms, sys: 402 ms, total: 532 ms
Wall time: 5.5 s

In [14]: nbytes / 5.5 / 1e6
Out[14]: 573.9850932727272

Dask.bag infers that we need to use gzip from the filename. Dask.bag currently uses multiprocessing to distribute work, allowing us to reclaim our 500 MB/s throughput on compressed data. We also could have done this with multiprocessing, straight Python, and a little elbow-grease.

## Deserialization – 30 MB/s

Data Bandwidth (MB/s)

Once we decompress our data we still need to turn bytes into meaningful data structures (dicts, lists, etc..) Our GitHub data comes to us as JSON. This JSON contains various encodings and bad characters so, just for today, we’re going to punt on bad lines. Converting JSON text to Python objects explodes out in memory a bit, so we’ll consider a smaller subset for this part, a single day.

In [20]: def loads(line):
...          except: return None

In [21]: path = '/home/mrocklin/data/github/2013-05-01-*.json.gz'
In [22]: lines = list(db.from_filenames(path))

In [23]: %time blobs = list(map(loads, lines))
CPU times: user 10.7 s, sys: 760 ms, total: 11.5 s
Wall time: 11.3 s

In [24]: len(total) / 11.3 / 1e6
Out[24]: 33.9486321238938

In [25]: len(compressed) / 11.3 / 1e6
Out[25]: 6.2989179646017694

So in terms of actual bytes of JSON we can only convert about 30MB per second. If we count in terms of the compressed data we store on disk then this looks more bleak at only 6 MB/s.

### This can be improved by using faster libraries – 50 MB/s

The ultrajson library, ujson, is pretty slick and can improve our performance a bit. This is what Pandas uses under the hood.

In [28]: import ujson
...          except: return None

In [30]: %time blobs = list(map(loads, lines))
CPU times: user 6.37 s, sys: 1.17 s, total: 7.53 s
Wall time: 7.37 s

In [31]: len(total) / 7.37 / 1e6
Out[31]: 52.05149837177748

In [32]: len(compressed) / 7.37 / 1e6
Out[32]: 9.657771099050203

### Or through Parallelism – 150 MB/s

This can also be accelerated through parallelism, just like decompression. It’s a bit cumbersome to show parallel deserializaiton in isolation. Instead we’ll show all of them together. This will under-estimate performance but is much easier to code up.

In [33]: %time db.from_filenames(path).map(loads).count().compute()
CPU times: user 32.3 ms, sys: 822 ms, total: 854 ms
Wall time: 2.8 s

In [38]: len(total) / 2.8 / 1e6
Out[38]: 137.00697964285717

In [39]: len(compressed) / 2.8 / 1e6
Out[39]: 25.420633214285715

## Mapping and Grouping - 2000 MB/s

Data Bandwidth (MB/s)
Simple Python operations1400
Complex CyToolz operations2600

Once we have data in memory, Pure Python is relatively fast. Cytoolz moreso.

In [55]: %time set(d['type'] for d in blobs)
CPU times: user 162 ms, sys: 123 ms, total: 285 ms
Wall time: 268 ms
Out[55]:
{u'CommitCommentEvent',
u'CreateEvent',
u'DeleteEvent',
u'FollowEvent',
u'ForkEvent',
u'GistEvent',
u'GollumEvent',
u'IssueCommentEvent',
u'IssuesEvent',
u'MemberEvent',
u'PublicEvent',
u'PullRequestEvent',
u'PullRequestReviewCommentEvent',
u'PushEvent',
u'WatchEvent'}

In [56]: len(total) / 0.268 / 1e6
Out[56]: 1431.4162052238805

In [57]: import cytoolz
In [58]: %time _ = cytoolz.groupby('type', blobs)  # CyToolz FTW
CPU times: user 144 ms, sys: 0 ns, total: 144 ms
Wall time: 144 ms

In [59]: len(total) / 0.144 / 1e6
Out[59]: 2664.024604166667

So slicing and logic are essentially free. The cost of compression and deserialization dominates actual computation time. Don’t bother optimizing fast per-record code, especially if CyToolz has already done so for you. Of course, you might be doing something expensive per record. If so then most of this post isn’t relevant for you.

## Shuffling - 5-50 MB/s

Data Bandwidth (MB/s)
Naive groupby with on-disk Shuffle25
Clever foldby without Shuffle250

For complex logic, like full groupbys and joins, we need to communicate large amounts of data between workers. This communication forces us to go through another full serialization/write/deserialization/read cycle. This hurts. And so, the single most important message from this post:

That being said, people will inevitably ignore this advice so we need to have a not-terrible fallback.

In [62]: %time dict(db.from_filenames(path)
...                   .groupby('type')
...                   .map(lambda (k, v): (k, len(v))))
CPU times: user 46.3 s, sys: 6.57 s, total: 52.8 s
Wall time: 2min 14s
Out[62]:
{'CommitCommentEvent': 17889,
'CreateEvent': 210516,
'DeleteEvent': 14534,
'FollowEvent': 35910,
'ForkEvent': 67939,
'GistEvent': 7344,
'GollumEvent': 31688,
'IssueCommentEvent': 163798,
'IssuesEvent': 102680,
'MemberEvent': 11664,
'PublicEvent': 1867,
'PullRequestEvent': 69080,
'PullRequestReviewCommentEvent': 17056,
'PushEvent': 960137,
'WatchEvent': 173631}

In [63]: len(total) / 134 / 1e6  # MB/s
Out[63]: 23.559091

This groupby operation goes through the following steps:

2. Decompress GZip
3. Deserialize with ujson
4. Do in-memory groupbys on chunks of the data
5. Reserialize with msgpack (a bit faster)
6. Append group parts to disk
7. Read in new full groups from disk
8. Deserialize msgpack back to Python objects
9. Apply length function per group

Some of these steps have great data bandwidths, some less-so. When we compound many steps together our bandwidth suffers. We get about 25 MB/s total. This is about what pyspark gets (although today pyspark can parallelize across multiple machines while dask.bag can not.)

Disclaimer, the numbers above are for dask.bag and could very easily be due to implementation flaws, rather than due to inherent challenges.

>>> import pyspark
>>> sc = pyspark.SparkContext('local[8]')
>>> rdd = sc.textFile(path)
...         .keyBy(lambda d: d['type'])
...         .groupByKey()
...         .map(lambda (k, v): (k, len(v)))
...         .collect())

I would be interested in hearing from people who use full groupby on BigData. I’m quite curious to hear how this is used in practice and how it performs.

## Creative Groupbys - 250 MB/s

Don’t use groupby. You can often work around it with cleverness. Our example above can be handled with streaming grouping reductions (see toolz docs.) This requires more thinking from the programmer but avoids the costly shuffle process.

In [66]: %time dict(db.from_filenames(path)
...                   .foldby('type', lambda total, d: total + 1, 0, lambda a, b: a + b))
Out[66]:
{'CommitCommentEvent': 17889,
'CreateEvent': 210516,
'DeleteEvent': 14534,
'FollowEvent': 35910,
'ForkEvent': 67939,
'GistEvent': 7344,
'GollumEvent': 31688,
'IssueCommentEvent': 163798,
'IssuesEvent': 102680,
'MemberEvent': 11664,
'PublicEvent': 1867,
'PullRequestEvent': 69080,
'PullRequestReviewCommentEvent': 17056,
'PushEvent': 960137,
'WatchEvent': 173631}
CPU times: user 322 ms, sys: 604 ms, total: 926 ms
Wall time: 13.2 s

In [67]: len(total) / 13.2 / 1e6  # MB/s
Out[67]: 239.16047181818183

We can also spell this with PySpark which performs about the same.

>>> dict(rdd.map(loads)  # PySpark equivalent
...         .keyBy(lambda d: d['type'])
...         .combineByKey(lambda d: 1, lambda total, d: total + 1, lambda a, b: a + b)
...         .collect())

## Use a Database

By the time you’re grouping or joining datasets you probably have structured data that could fit into a dataframe or database. You should transition from dynamic data structures (dicts/lists) to dataframes or databases as early as possible. DataFrames and databases compactly represent data in formats that don’t require serialization; this improves performance. Databases are also very clever about reducing communication.

Tools like pyspark, toolz, and dask.bag are great for initial cleanings of semi-structured data into a structured format but they’re relatively inefficient at complex analytics. For inconveniently large data you should consider a database as soon as possible. That could be some big-data-solution or often just Postgres.

## Better data structures for semi-structured data?

Dynamic data structures (dicts, lists) are overkill for semi-structured data. We don’t need or use their full power but we inherit all of their limitations (e.g. serialization costs.) Could we build something NumPy/Pandas-like that could handle the blob-of-JSON use-case? Probably.

DyND is one such project. DyND is a C++ project with Python bindings written by Mark Wiebe and Irwin Zaid and historically funded largely by Continuum and XData under the same banner as Blaze/Dask. It could probably handle the semi-structured data problem case if given a bit of love. It handles variable length arrays, text data, and missing values all with numpy-like semantics:

>>> from dynd import nd
>>> data = [{'name': 'Alice',                       # Semi-structured data
...          'location': {'city': 'LA', 'state': 'CA'},
...          'credits': [1, 2, 3]},
...         {'name': 'Bob',
...          'credits': [4, 5],
...          'location': {'city': 'NYC', 'state': 'NY'}}]

>>> dtype = '''var * {name: string,
...                   location: {city: string,
...                              state: string[2]},
...                   credits: var * int}'''        # Shape of our data

>>> x = nd.array(data, type=dtype)                  # Create DyND array

>>> x                                               # Store compactly in memory
nd.array([["Alice", ["LA", "CA"], [1, 2, 3]],
["Bob", ["NYC", "NY"], [4, 5]]])

>>> x.location.city                                 # Nested indexing
nd.array([ "LA", "NYC"],
type="strided * string")

>>> x.credits                                       # Variable length data
nd.array([[1, 2, 3],    [4, 5]],
type="strided * var * int32")

>>> x.credits * 10                                  # And computation
nd.array([[10, 20, 30],     [40, 50]],
type="strided * var * int32")

Sadly DyND has functionality gaps which limit usability.

>>> -x.credits                                      # Sadly incomplete :(
TypeError: bad operand type for unary -

I would like to see DyND mature to the point where it could robustly handle semi-structured data. I think that this would be a big win for productivity that would make projects like dask.bag and pyspark obsolete for a large class of use-cases. If you know Python, C++, and would like to help DyND grow I’m sure that Mark and Irwin would love the help

## Comparison with PySpark

1. Doesn’t engage the JVM, no heap errors or fiddly flags to set
2. Conda/pip installable. You could have it in less than twenty seconds from now.
3. Slightly faster in-memory implementations thanks to cytoolz; this isn’t important though
4. Good handling of lazy results per-partition
5. Faster / lighter weight start-up times
6. (Subjective) I find the API marginally cleaner

PySpark pros:

1. Supports distributed computation (this is obviously huge)
2. More mature, more filled out API
3. HDFS integration

Dask.bag reinvents a wheel; why bother?

1. Given the machinery inherited from dask.array and toolz, dask.bag is very cheap to build and maintain. It’s around 500 significant lines of code.
2. PySpark throws Python processes inside a JVM ecosystem which can cause some confusion among users and a performance hit. A task scheduling system in the native code ecosystem would be valuable.
3. Comparison and competition is healthy
4. I’ve been asked to make a distributed array. I suspect that distributed bag is a good first step.

## April 20, 2015

### Titus Brown

#### Statistics from applications to the 2015 course on NGS analysis

Here are some statistics from this year's applications to the NGS course. Briefly, this is a two-week workshop on sequence analysis at the command line and in the cloud.

The short version is that demand remains high; note that we admit only 24 applicants, so generally < 20%...

Year Number of applications Note
2010 33
2011 133
2012 170
2013 210
2014 170 (shifted the timing to Aug)
2015 155 (same timing as 2014)

The demand is still high, although maybe starting to dip?

Status: Number Percent
1st or 2nd year graduate student 20 12.6%
3rd year+ graduate student 40 25.2%
Post-doctoral researcher 36 22.6%
Non-tenure faculty or staff 20 12.6%
Tenure-line faculty 24 15.1%
Other 19 11.9%

Lots of tenure-line faculty feel they need this training...

Primary training/background: Number Percent
Bioinformatics 11 6.9%
Biology 112 70.4%
Computer Science 3 1.9%
Physics 0 0%
Other 33 20.8%

I should look into "Other"!

--titus

## April 19, 2015

### Titus Brown

#### Dear Doc Brown: how can I find a postdoc where I can do open science?

I got the following e-mail from a student I know -- lightly edited to protect the innocent:

I am at the stage where I am putting together a list of people that I want to post-doc with.

So, a question to you:

1. How can I find people who do open science?

2. Say I go for an interview, would it be "polite" to ask them to see if I can do open science even if they're not doing it themselves? Do you have any suggestions on how, exactly, to ask?

The reason why I am asking is because I rarely hear about openly doing (or even talking about) science in biomedical fields, outside of the standard communication methods (e.g. presenting at a meeting). Most of the people in my field seem somewhat conservative in this regard. Plus, I really don't want to butt heads with my postdoc mentor on this kind of topic.

Any advice? I have some but I'll save it for later so I can incorporate other people's advice ;).

thanks,

--titus

p.s. Yes, I have permission to post this question!

### Paul Ivanov

#### My first 200 K

Yesterday, I rode my longest bike ride to date - the El Cerrito-Davis 200K - with the San Francisco Randonneurs. A big thank you to all the volunteers and randos who made my first 200k so much fun.

First, for the uninitiated, an aside about randonneuring:

I discovered the sport because I'm a cheapskate. I had gotten more and more into cycling over the past 2 years or so, and though I was riding through the East Bay hills mostly alone, I wanted to do a "Century" - a 100 mile ride. Looking up local rides I found out that, while most centuries cost a nontrivial amount of money for the poor grad student I was back then ($60-$200), the San Francisco Randonneur rides were all $10-$20. As I dug deeper and learned about the sport, I found out that The reason for low cost, is that rando rides are unsupported - randonneuring is all about self sufficiency. You are expected to bring gear to fix your own flats, as well as carry or procure your own snacks and beverages.

The Randonneurs USA (RUSA) website succinctly summarizes the sport.

Randonneuring is long-distance unsupported endurance cycling. This style of riding is non-competitive in nature, and self-sufficiency is paramount. When riders participate in randonneuring events, they are part of a long tradition that goes back to the beginning of the sport of cycling in France and Italy. Friendly camaraderie, not competition, is the hallmark of randonneuring.

This description is strikingly similar to my beliefs both about cycling and computing, so I knew I found a new activity I would greatly enjoy. This has been borne out on three previous occasions when I have participated in the ~100 km Populaire rides, which are intended as a way to introduce riders to the sport, yet still be within reach of wide variety of cycling abilities.

We moved in late December, and between unpacking, rainy weekends, and being sick - I haven't been able to get much riding in. However, I've been wanting to do a 200K for a long time - my century, coveted for years, seemed within reach more than ever. To seal my commitment to the ride, I went ahead and ordered a spiffy looking SFR cycling jersey:

Here's the 213km (132 miles) ride map and elevation profile, what follows is my ride report.

## Start Control

The ride started at 8am at a Starbucks by El Cerrito BART station, so I just rode there from my house as I frequently do on my morning commute. I prepared my bike, packed my gear and food the night before, and the only thing I needed to do was to fill up my water bottle before the ride started. I showed up a dozen minutes before the start, with most riders already assembled, got my short drip and a heated up croissant, and totally failed to get any water. It's not a super terrible thing, I frequently don't end up drinking much at the beginning of my rides, but I had now set myself up to ride to the second control (44 km / 27 miles) without any water.

Randonneuring isn't about racing - as you traverse from control to control, there's just a window of time that you have for each control, and so long as you make it through each control withing that time window, you complete the event! There is no ordinal placement, the first rider to finish has just as much bragging rights as the last. The only time people talk about time is when they're trying to make a new personal best. The second control was open roughly between 9 and 11am, so I could have stopped off somewhere to pickup water, but that didn't fit into my ride plan.

You see, the cue sheet is two pages - with the first page dedicated to getting you to the second control, as there are a lot of turns to make in the East bay until you get to Benicia. Accordingly, my plan was to stick with the "fast" group who know the route well, so that I wouldn't have to look at the cue sheet at all. This was a success - and I got to chat with Jesse, whom I rode a fair chunk of the 111 km Lucas Valley Populaire back in October (here's a good video of the first chunk of that ride).

## Control #2

When 7 of us got to the Benicia control at 9:45, I volunteered to guard the bikes as folks filed into the gas station mart to buy something (getting a timestamped receipt is how you prove that you did the ride from one control to the next within the allotted time). Lesson learned: I should have used that time to shed my warm second jersey and long pants, as it had warmed up by then. It just so happens that by the time I went inside, there was a line for the bathroom, and by the time I got out, the fast group was just heading out, yet I still needed to change.

Because keeping up with the "fast" group wasn't that big of a deal and actually rather fun, I really wanted to try to catch up to them, so I stepped it up, and told a couple of other riders that I'd do my best to pull us to them (in case you didn't know - the aerodynamics of cycling make it much easier for those behind the leader to keep up the pace, even if that pace isn't something they could comfortably do on their own). We didn't succeed in catching them during the first 10 mile stretch of road, but I kept pushing with a high cadence in my highest gear, and about 5 miles later, we caught up with them! The only problem was, by this point, I had wasted so much energy, that I couldn't make the last 20 feet to them, though I was happy to see the two guys who'd been letting my push the whole time, use their fresher legs and join the group. So there I was, a few dozen feet from my intended riding partners, but as more and more pavement went past our wheels, the distance between us slowly widened.

Somewhat disheartened by this (though, again, happy that I all of my pulling wasn't all for naught, since I bridged two other to the group), and now overheating, I decide to stop by the side of the road, removed my long sleeved shirt, put on sunscreen (Lesson learned: don't forget your ears, too - they're the only part of my that burned). A fellow rando rider Eric went past,to my smiling cheer of "Go get 'em!". Then when pair of riders asked if I'm alright, I nodded, and decided to hop back on the bike and ride with them for a bit. It was nice to let someone else pull for a bit, but shortly thereafter, we started the first serious ascent, and my heart was pounding to hard from exhaustion and the heat - and I had to stop to again to catch my breath and get some more food in me.

Luckily, when I resumed riding up that hill, I had a mental shift, gave myself a break, and given how tired my legs were already, even though I wasn't even half-way through the ride, I reminded myself that I'll just spin in a low gear if I have to, there's no rush, I'm not racing anyone, and though I know this won't end up being as good of a 200k as it could have been had I trained more in recent months, it was still up to me to enjoy the ride. One of the highlights of the ride were all of the different butterflies I got to see along the way that I started noticing after this change in mental attitude.

After the long climb, followed by a very nice descent, I got onto the Silverado Trail for 14 miles of a straight road with minimal elevation changes. Though my legs were again cooperating more, it was starting to get kind of old, and then out of nowhere, one of the riders I had pulled earlier rides past me, but then proceeds to slow down and ride along side me for a chat. He hadn't been able to keep up with the fast group for very long, and ended up stopping somewhere along the way to eat, which is why I didn't notice that I had passed him. We took turns pulling for each other, which took away from the monotony of Silverado, but he had a lot more in the tank, and I again wasn't able to keep up, losing him with 4 miles to go to St Helena.

Needing another break, with 4 miles to the next control, I decided I would need to spend a while there, recuperating, if I am to make it through the rest of the ride.

## Control #3

I got to Model Bakery at 1:10pm, with many familiar faces from earlier in the day already enjoying their food, but ended up staying there until 2:30 - eating my food, drinking water, just letting my legs rest.

I ended up riding out solo, and really enjoyed the early parts of 128 (after missing the turnoff by a hundred feet) - luckily this was just the spot I had my last stop at, so I quickly turned around and got back on the road.

The problem was it kept getting hotter and hotter - it seemed that I couldn't go a mile without taking a good gulp of water. I still stuck to my strategy of just spinning fast without really pushing hard, since recovering from being out of breath is way faster than waiting for exhausted legs to obey your commands. I made it a good chunk of the way to Winters, but still ended up having to stop half way up the climb near Lake Berryessa. Another SFR rider, Julie, climbed past me, checking if I was OK as she went by. It's great to have that kind of camaraderie along the ride, a couple of people even gently expressed their concern that my rear wheel was out of true - which I knew but kept putting off getting it fixed. These nudges gave me more resolve to get that taken care of.

Finishing off the last of the Haribo Gummi Candy Gold-Bears that I brought with me (and you know you're tired and dehydrated when it takes effort to just chew), I got back on the bike and headed further up the hill. Then, finally, I didn't think I could ever be so cheered up by road sign (hint: they only put "TRUCKS USE LOWER GEAR" signs at the top of big hills).

## Control #4

I finally got to Winters at 5:37pm, got myself a Chai Smoothie, and Julie, who unfortunately was going to miss her train home, proposed that we ride together the rest of the way to Davis. Again - though I really enjoy my alone time while cycling, it's also quite fun to have strong riders to ride with, so as the sun started descending behind us, and no longer scorchingly hot, we set out for the final 17 miles to Davis at a good clip, given how much riding we had already done.

## Finish Control

Dodging drunk college kids (it was Picnic Day at UC Davis) was the last challenge of the ride. As a UCD alum, this was a homecoming of sorts, so I lead the way through town as we made our way to the Amtrak station. We finished just before 7, and I caught the 7:25pm train back to the Bay Area - enjoying the company of a handful of other randonneurs.

Thanks for reading my ride report!

                   _
/ \
A*   \^   -
,./   _.\\ / \
/ ,--.S    \/   \
/  "~,_     \    \
__o           ?
_ \<,_         /:\
--(_)/-(_)----.../ | \
--------------.......J


## April 18, 2015

### Titus Brown

#### First thoughts on Galileo's Middle Finger

I'm reading Galileo's Middle Finger by Dr. Alice Dreger (@alicedreger), and it's fantastic. It's a paean to evidence-based popular discourse on scientific issues -- something I am passionate about -- and it's very well written.

I bought the book because I ran across Dr. Dreger's excellent and hilarious live-tweeting of her son's sex-ed class (see storify here), which reminded me that I'd first read about her in the article Reluctant Crusader, on "Why Alice Dreger's writing on sex and science makes liberals so angry." While I'm pretty liberal in outlook, I'm also a scientist by inclination and training, and I often see the kinds of conflicts that Dr. Dreger talks about (where what people want to be true is unsupported by evidence, or directly conflicts with evidence).

The book is chock full of examples where Dr. Dreger examines controversies in science. The common theme is that some plucky scientist or set of scientists publish some perspective that is well supported by their data, but that runs against some commonly held perspective (or at least some perspective that an activist group holds). Vested interests of some kind then follow with scurrilous public or academic attacks that take years or decades to figure out. Dr. Dreger spends much of the book (so far) exploring the "playbook" used by these attackers to smear, harass, and undermine the original researchers.

At this point, I think it's important to note that most of the controversies that Dr. Dreger discusses are not "settled" -- science rarely settles things quickly -- but that in all the cases, there appears to be strong empirical evidence to support the conclusions being published. What Dr. Dreger never argues is that the particular science she is discussing is settled; rather, she often argues that it's not settled, and that the attackers are trying to make it look like it is, as a way to shut down further investigation. The kind of "double negative" espoused by Dr. Dreger ("my research doesn't show that gravity works, it shows that there is no reason to believe that gravity doesn't work") is how I try to operate in my own research, and I have an awful lot of sympathy for this general strategy as I think it's how science should work.

There are a few things about Dr. Dreger's book that rub me the wrong way, and I may or may not blog about them in detail when I'm done with the book. (Two brief items: despite showing how easily peer review can be manipulated to support personal vendettas, she consistently uses "peer reviewed" as a label to put work that supports her own positions beyond doubt. Also, she's so impassioned about the issues that she comes off as wildly un-objective at times. I think she also downplays just how complicated a lot of the research she's examining is to do and understand.) Most of my concerns can be attributed to this being an advocacy book aimed at a popular audience, where careful and objective presentations of the underlying science need to be weighed against the audience, and in compensation for this Dr. Dreger provides plenty of footnotes and citations that let interested people follow up on specific items.

But! What I wanted to write about in this blog post is two things -- first, MONEY. And second, media, the Internet, and Internet harassment.

## First, MONEY.

I'm only about halfway through the book, but it's striking that Dr. Dreger has so far not talked about some really big issues like global warming, which is kind of a poster child for science denialism these days. All of the issues in the book have to do with controversies where relatively small amounts of money were on the line. Unlike global warming, or tobacco and cancer, "all" that is at stake in the controversies in the book is sociopolitical agendas and human identity - crucially, nothing that hits at a big industry's bottom line.

What Dr. Dreger points out, though, is that even in circumstances where money is not the main issue, it is very hard for evidence to even get a fair hearing. (Even discounting those research imbroglios that Dr. Dreger is herself involved in, she presents plenty of data that the way our society handles contentious situations is just broken. More on that below.) But it doesn't take much imagination to guess that when real money is on the line, the "plucky scientist" faces even more massive obstacles. We've seen exactly how this plays out in the case of tobacco and cancer, where it took decades for scientists and patient advocates to overcome the industry-funded nonsense.

The same thing seems to be happening in climate change policy. I'm minded of a comment on Twitter that I've since lost track of -- badly paraphrased, it goes something like this: "Giant oil companies would have us believe that scientists are the ones with the overwhelming conflict of interest in the global warming discussion. What?"

I like to think about these things in terms of Bayesian priors. Who am I more likely to believe in a disagreement? The interest group which has billions of dollars at stake, or the scientists who are at least trained in objective inquiry? It's almost insane to fail to take into account the money stakes. Moreover, there are plenty of indications that, even when wrong initially, scientists are self-correcting; but I've never seen an interest group go "oops, I guess we got that wrong, let's rethink." So I guess you know my position here :). But thinking of it in terms of priors, however strong they may be, means that I'm more alert to the possibility of counter-evidence.

Anyway, I'm not an expert on any of this; my area of expertise is confined to a few areas in genomics and computational biology at this point. So it's very hard for me to evaluate the details of some of Dr. Dreger's cases. But I think that's one of the points she's getting at in the book -- who do we trust when seemingly trustworthy academic societies get manipulated by activist agendas? How do we reach some sort of conclusion, if not broad consensus then at least academic consensus, on issues that are (at the least) not well understood yet? And (one place where Dr. Dreger excels) how do we evaluate and decide on the ethical standards to apply to biomedical (or other) research? All very relevant to many things going on today, and all very tricky.

## Media, and the Internet

At the point I'm at in the book, Dr. Dreger has undergone a transition from being press-driven to being Internet driven. She points out that in the recession, the big media companies essentially lost the staff to pursue deep, technically tricky stories; coincident, "mainstream media" lost a lot of trust as the Internet came along and helped enable audience fragmentation. This meant that she had to take her fights to the Internet to convince the masses (or at least the Google search engine) much more so than relying on deep investigation by arbiters of "official" social policy like the NY Times or New Yorker.

I think in many ways this is a massive improvement -- I hate the fact that, in science, we've got these same kinds of official arbiters -- but it's interesting to read through the book and recognize Dr. Dreger's shift in thinking and tactics. It's especially ironic given that the same kind of tactics she starts to use in the dexamethasone investigation are the ones used against her in some of her intersex work. But, again, this is kind of the point of her book - it's not enough to rely on someone's credentials and sense of justice to evaluate their message, you need to actually look into the evidence yourself.

It's particularly interesting to fold Dr. Dreger's observations about Internet harassment and shaming into my general body of knowledge about how the Internet is used to, well, harass, shame, derail, and otherwise sandbag particular messages and people. Which reminds me, I need to watch all of Dr. Gabriella Coleman's PyCon 2015 keynote on Anonymous...

## A Not-Really Conclusion

At the end of the day, I worry that the "trust no one, investigate yourself" message is too challenging for our culture to grasp in a productive way. And yet I'm firmly convinced it's the only way forward. How can we do this better as scientists and educators?

--titus

p.s. I'm thinking about instituting a commenting policy, perhaps one based on Captain Awkward's policy. Thoughts?

p.p.s. There's some sort of irony in me leaving Michigan State just as I discover that Dr. Dreger is local. I may try to track her down for coffee while I'm still in town, although I'm sure she's super busy...

## April 16, 2015

### Titus Brown

#### Please destroy this software after publication. kthxbye.

tl;dr? A while back I wrote that there are three uses of research software: replication, reproduction, and reuse. The world of computational science would be better off if people clearly delineated whether or not they wanted anyone else to reuse their software, and I think it's a massive mistake to expect that everyone's software should be reusable.

A few months back, I reviewed a pretty exciting paper - one I will probably highlight on my blog, when it comes out. The paper outlined a fairly simple concept for comparing sequences and then used that to develop some new ultra-scalable functionality. The theory seemed novel, the computational results were pretty good, and I recommended acceptance (or minor revisions). This was in spite of the fact that the authors stated quite clearly that they had produced largely unusable software.

Other reviewers were not quite so forgiving, however -- one reviewer declined to review the paper until they could run the software on their own data.

This got me thinking - I think I have a reputation as wanting people to release their software in a useful manner, but I've actually shied away from requiring it on several occasions. Here was a situation that was a pretty direct conflict: neat result, with software that was not intended for reuse. Interestingly, I've drawn this line before, although without much personal comment. In my blog post on review criteria for bioinformatics papers, there's nothing there about whether or not the software is reusable - it must just be legally readable and executable. But I'm also pretty loud-mouthed about wanting good quality (or at least better quality) software out there in the bioinformatics world!

So what gives? I felt that the new theory looked pretty awesome, and would be tremendously useful, while the implementation was (as stated) unlikely to be something I (or others) used. So what? Publish!

I think this highlights that there are two different possible goals for bioinformatics papers. One goal is the standard scientific goal: to demonstrate a new method or technique, whether it be mathematical or computational. The other goal is different, and in some ways much harder: to provide a functioning tool for use and reuse. These should have different review standards, and that maybe the authors should be given the opportunity to distinguish clearly between the two goals.

There's actually a lot of commonality between what I would request of the software from either kind of paper, a technique paper or a tool paper.

• Both need to be accessible for download and viewing - otherwise, how can I understand the details of the implementation?
• Both types of software need to be usable enough to reproduce the results in the paper, in theory (e.g. given sufficient access to compute infrastructure).
• Both should be in a publicly accessible and archived format, to avoid loss of the software from personal Web sites, etc.
• Both should show evidence of decent principles of basic software engineering, including the use of version control, some form of testing (albeit unit testing or functional testing or even just defined input with known good output), release/version information, installation/dependency information, and the like.

However, there are some interesting differences. Off the top of my head, I'm thinking that:

• Crucially, the software from the technique paper would not need to be open source - by the OSI definition, the technique code would not need to be freely modifiable or re-sharable.

(To be clear, I know of neither any formal (journal) requirements nor ethical requirements that a particular implementation be modifiable or redistributable.)

• Nor need the software from the technique paper be written in a general way (e.g. to robustly process different formats), or for broader re-use. In particular, this means that documentation and unit/functional tests might be minimal - enough to support replication but no more.

• The software from the technique paper should be accessible to basic review, but should not be subject to code review on style or implementation - correctness only.

• Software from a "tools" paper, by contrast, should be held to much higher standards, and be subject to code review (in parts) and examination for documentation and installation and ... oh, heck, just start with the sustainability evaluation checklist at the SSI!

I'm aware that by having such relaxed constraints on technique publication I'm more or less directly contradicting myself in my earlier blog post on automated testing and research software - all of that definitely holds for software that you hope to be reused.

I'm not sure how or where to draw the line here, exactly. It's certainly reasonable to say that software that doesn't have unit tests is likely to be wrong, and therefore unit tests should be required - but, in science, we should never rely on a single study to prove something anyway, so I'm not sure why it matters if software is wrong in some details. This is where the difference between "replicability" and "reproducibility" becomes important. If I can't replicate your computation (at least in theory) then you have no business publishing it; but reproducing it is something that is a much larger task, outside the scope of any given paper.

I want to quote David States, who wrote a comment two years ago on my blog:

Too often, developers work in isolation, and this creates a high risk for code errors and science errors. Good code needs to be accessible and this includes not just sharing of the source code itself but also use of effective style, inclusion of tests and validation procedures and appropriate documentation.

I think I agree - but what's the minimum, here, for a technique paper that is meant to be a demonstration of a technique and no more?

One final point: in all of this we should recognize that the current situation is quite poor, in that quite a bit of software is simply inaccessible for replication purposes. (This mirrors my personal experiences in bioinformatics review, too.)

Improving this situation is important, but I think we need to be precise about what the minimim is. I don't think we're going to get very far by insisting that all code be held to high standards; that's a generational exercise (and part of why I'm so bullish on Software Carpentry).

So: what's the minimum necessary for decent science?

--titus

p.s. In case anyone is wondering, I don't think our software really meets my own criteria for tool publication, although it's getting closer.

p.p.s. Drawing this distinction leads in some very good directions for publishers and funding bodies to think about, too. More on that in another blog post, if I get the chance.

p.p.p.s. My 2004 paper (Brown and Callan) has a table that's wrong due to a fencepost error. But it's not seriously wrong. shrug

#### The PyCon 2015 Ally's Workshop

At PyCon 2015, I had the pleasure of attending the Ally Skills Workshop, organized by @adainitiative (named after Ada Lovelace).

The workshop was a 3 hour strongly guided discussion centering around 4-6 person group discussion of short scenarios. There's a guide to running them here, although I personally would not have wanted to run one without attending one first!

I attended the workshop for at least three reasons --

First, I want to do better myself. I have put some effort into (and received a lot of encouragement for) making my lab an increasingly open and welcoming place. While I have heard concerns about being insufficiently critical and challenging of bad ideas in science (and I have personally experienced a few rather odd situations where obviously bad ideas weren't called out in my past labs), I don't see any inherent conflict between being welcoming and being intellectually critical - in fact, I rather suspect they are mutually supportive, especially for the more junior people.

But, doing better is surprisingly challenging; everyone needs a mentor, or at least guideposts. So when I heard about this workshop, I leapt at the chance to attend!

Second, I am interested in connecting these kinds of things to my day job in academia, where I am now a professor at UC Davis. UC Davis is the home of the somewhat notorious Jonathan Eisen, who is notorious for many reasons that include boycotting and calling out conferences that have low diversity. UC Davis also has an effort to increase diversity at the faculty level, and I think that this is an important effort. I'm hoping to be involved in this when I actually take up residence in Davis, and learning to be a male ally is one way to help. More, I think that Davis would be a natural home to some of these ally workshops, and so I attended the Ally Skills workshop to explore this.

And third, I was just curious! It's surprisingly tricky to confront and talk about sexism effectively, and I thought seeing how the the pros did it would a good way to start.

Interestingly, 2/3 of my lab attended the workshop, too - without me requesting it. I think they found it valuable, too.

## The workshop itself

Valerie Aurora ran the workshop, and it's impossible to convey how good it was, but I'll try by picking out some choice quotes:

"You shouldn't expect praise or credit for behaving like a decent human being."

"Sometimes, you just need a flame war to happen." (paraphrase)

"It's not up to the victim whether you enforce your code of conduct."

"The physiological effects of alcohol are actually limited, and most effects of alcohol are socially and/or culturally mediated."

"Avoid rules lawyering. I don't now if you've ever worked with lawyers, but software engineers are almost as bad."

"One problem for male allies is the assumption that you are only talking to a woman because you are sexually interested in them."

"Trolls are good at calibrating their level of awfulness to something that you will feel guilty about moderating."

Read the blog post "Tone policing only goes one way.

Overall, a great experience and something I hope to help host more of at UC Davis.

--titus

### Continuum Analytics

#### Find Continuum at PyData Dallas

PyData Dallas, the first PyData conference in Texas, is taking place next week, April 24-26. PyData has been a wonderful conference for fostering the Python community and giving developers and other Python enthusiasts the opportunity to share their ideas, projects and the future of Python. Continuum Analytics is proud to be a founding sponsor for such an innovative, community-driven conference.

## April 15, 2015

### Titus Brown

#### The three porridge bowls of sustainable scientific software development

(The below issues are very much on my mind as I think about how to apply for another NIH grant to fund continued development on the khmer project.)

Imagine that we have a graph of novel functionality versus software engineering effort for a particular project, cast in the shape of a tower or pyramid, i.e. a support structure for cool science.

The more novel functionality implemented, the taller the building, and the broader the software engineering base needs to be to support the building. If you have too much novel functionality with too little software engineering base, the tower will have too little support and catastrophe can ensue - either no new functionality can be added past a certain point, or we discover that much of the implemented functionality is actually unstable and incorrect.

Since everybody likes novel functionality - for example, it's how we grade grants in science -- this is a very common failure mode. It is particularly problematic in situations where we have built a larger structure by placing many of these individual buildings on top of others; the entire structure is not much stronger than its weakest (least supported) component.

Another possible failure mode is if the base becomes too big too soon:

That is, if too much effort is spent on software engineering at the expense of building novel functionality on top of it, then the building remains the same height while the base broadens. This is a failure for an individual project, because no new functionality gets built, and the project falls out of funding.

In the worst case, the base can become over-wrought and be designed to support functionality that doesn't yet exist. In most situations, this work will be entirely wasted, either because the base was designed for the wrong functionality, or because the extra work put into the base will delay the work put into new features.

Where projects are designed to be building blocks from the start, as opposed to a leap into the unknown like most small-lab computational science projects, a different structure is worth investing in -- but I'm skeptical that this is ever the way to start a project.

Supporting this kind of project is something that Dan Katz has written and presented about; see (for example) A Method to Select e-Infrastructure Components to Sustain.

And, of course, the real danger is that we end up in a situation where a poorly engineered structure is used to support a much larger body of scientific work:

The question that I am trying to understand is this: what are the lifecycle stages for research software, and how should we design for them (as researchers), and how should we think about funding them (as reviewers and program officers)?

To bring things back to the title, how do we make sure we mix the right amount of software development (cold porridge) with novel functionality (hot porridge) to make something edible for little bears?

--titus

### Matthieu Brucher

#### Announcement: ATKStereoCompressor 1.0.0

I’m happy to announce the release of a stereo compressor based on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats. This stereo compressor can work on two channels, left/right or middle/side, possibly in linked mode (only one set of parameters), and can be set up to mix the input signal with the compressed signal (serial/parallel compression).

The supported formats are:

• VST2 (32bits/64bits on Windows, 64bits on OS X)
• VST3 (32bits/64bits on Windows, 64bits on OS X)
• Audio Unit (64bits, OS X)

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

ATK SD1, ATKCompressor and ATKUniversalDelay were upgraded after AU validation failed. This is now fixed.

## April 14, 2015

### Jan Hendrik Metzen

#### Probability calibration

As a follow-up of my previous post on reliability diagrams, I have worked jointly with Alexandre Gramfort, Mathieu Blondel and Balazs Kegl (with reviews by the whole team, in particular Olivier Grisel) on adding probability calibration and reliability diagrams to scikit-learn. Those have been added in the recent 0.16 release of scikit-learn as CalibratedClassifierCV and calibration_curve.

This post contains an interactive version of the documentation in the form of an IPython notebook; parts of the text/code are thus due to my coauthors.

Note that the 0.16 release of scikit-learn contains a bug in IsotonicRegression, which has been fixed in the 0.16.1 release. For obtaining correct results with this notebook, you need to use 0.16.1 or any later version.

## Reliability curves

In [1]:
Expand Code
import numpy as np
np.random.seed(0)

import matplotlib
matplotlib.use("svg")
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import (brier_score_loss, precision_score, recall_score,
f1_score, log_loss)
from sklearn.cross_validation import train_test_split


When performing classification you often want not only to predict the class label, but also obtain a probability of the respective label. This probability gives you some kind of confidence on the prediction. Some models can give you poor estimates of the class probabilities and some even do not not support probability prediction. The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction.

Well calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be directly interpreted as a confidence level. For instance, a well calibrated (binary) classifier should classify the samples such that among the samples to which it gave a predict_proba value close to 0.8, approximately 80% actually belong to the positive class. The following plot compares how well the probabilistic predictions of different classifiers are calibrated:

In [2]:
Expand Code
X, y = datasets.make_classification(n_samples=100000, n_features=20,
n_informative=2, n_redundant=2)

train_samples = 100  # Samples used for training the models

X_train = X[:train_samples]
X_test = X[train_samples:]
y_train = y[:train_samples]
y_test = y[train_samples:]

# Create classifiers
lr = LogisticRegression()
gnb = GaussianNB()
svc = LinearSVC(C=1.0)
rfc = RandomForestClassifier(n_estimators=100)

In [3]:
Expand Code
plt.figure(figsize=(9, 9))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))

ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for clf, name in [(lr, 'Logistic'),
(gnb, 'Naive Bayes'),
(svc, 'Support Vector Classification'),
(rfc, 'Random Forest')]:
clf.fit(X_train, y_train)
if hasattr(clf, "predict_proba"):
prob_pos = clf.predict_proba(X_test)[:, 1]
else:  # use decision function
prob_pos = clf.decision_function(X_test)
prob_pos = \
(prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
fraction_of_positives, mean_predicted_value = \
calibration_curve(y_test, prob_pos, n_bins=10)

ax1.plot(mean_predicted_value, fraction_of_positives, "s-",
label="%s" % (name, ))

ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
histtype="step", lw=2)

ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots  (reliability curve)')

ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)

plt.tight_layout()


LogisticRegression returns well calibrated predictions by default as it directly optimizes log-loss. In contrast, the other methods return biased probabilities; with different biases per method:

• Naive Bayes (GaussianNB) tends to push probabilties to 0 or 1 (note the counts in the histograms). This is mainly because it makes the assumption that features are conditionally independent given the class, which is not the case in this dataset which contains 2 redundant features.

• RandomForestClassifier shows the opposite behavior: the histograms show peaks at approximately 0.2 and 0.9 probability, while probabilities close to 0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil and Caruana [4]: "Methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values. Because predictions are restricted to the interval [0,1], errors caused by variance tend to be one-sided near zero and one. For example, if a model should predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We observe this effect most strongly with random forests because the base-level trees trained with random forests have relatively high variance due to feature subseting." As a result, the calibration curve shows a characteristic sigmoid shape, indicating that the classifier could trust its "intuition" more and return probabilties closer to 0 or 1 typically.

• Linear Support Vector Classification (LinearSVC) shows an even more sigmoid curve as the RandomForestClassifier, which is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [4]), which focus on hard samples that are close to the decision boundary (the support vectors).

## Calibration of binary classifiers¶

Two approaches for performing calibration of probabilistic predictions are provided: a parametric approach based on Platt's sigmoid model and a non-parametric approach based on isotonic regression (sklearn.isotonic). Probability calibration should be done on new data not used for model fitting. The class CalibratedClassifierCV uses a cross-validation generator and estimates for each split the model parameter on the train samples and the calibration of the test samples. The probabilities predicted for the folds are then averaged. Already fitted classifiers can be calibrated by CalibratedClassifierCV via the paramter cv="prefit". In this case, the user has to take care manually that data for model fitting and calibration are disjoint.

The following images demonstrate the benefit of probability calibration. The first image present a dataset with 2 classes and 3 blobs of data. The blob in the middle contains random samples of each class. The probability for the samples in this blob should be 0.5.

In [4]:
Expand Code
n_samples = 50000
n_bins = 3  # use 3 bins for calibration_curve as we have 3 clusters here

# Generate 3 blobs with 2 classes where the second blob contains
# half positive samples and half negative samples. Probability in this
# blob is therefore 0.5.
centers = [(-5, -5), (0, 0), (5, 5)]
X, y = datasets.make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
centers=centers, shuffle=False, random_state=42)

y[:n_samples // 2] = 0
y[n_samples // 2:] = 1
sample_weight = np.random.RandomState(42).rand(y.shape[0])

# split train, test for calibration
X_train, X_test, y_train, y_test, sw_train, sw_test = \
train_test_split(X, y, sample_weight, test_size=0.9, random_state=42)

plt.figure()
y_unique = np.unique(y)
colors = cm.rainbow(np.linspace(0.0, 1.0, y_unique.size))
for this_y, color in zip(y_unique, colors):
this_X = X_train[y_train == this_y]
this_sw = sw_train[y_train == this_y]
plt.scatter(this_X[:, 0], this_X[:, 1], s=this_sw * 50, c=color, alpha=0.5,
label="Class %s" % this_y)
plt.legend(loc="best")
plt.title("Data")

Out[4]:
<matplotlib.text.Text at 0x5b37b10>

The following image shows on the data above the estimated probability using a Gaussian naive Bayes classifier without calibration, with a sigmoid calibration and with a non-parametric isotonic calibration. One can observe that the non-parametric model provides the most accurate probability estimates for samples in the middle, i.e., 0.5.

In [5]:
Expand Code
# Gaussian Naive-Bayes with no calibration
clf = GaussianNB()
clf.fit(X_train, y_train)  # GaussianNB itself does not support sample-weights
prob_pos_clf = clf.predict_proba(X_test)[:, 1]

# Gaussian Naive-Bayes with isotonic calibration
clf_isotonic = CalibratedClassifierCV(clf, cv=2, method='isotonic')
clf_isotonic.fit(X_train, y_train, sw_train)
prob_pos_isotonic = clf_isotonic.predict_proba(X_test)[:, 1]

# Gaussian Naive-Bayes with sigmoid calibration
clf_sigmoid = CalibratedClassifierCV(clf, cv=2, method='sigmoid')
clf_sigmoid.fit(X_train, y_train, sw_train)
prob_pos_sigmoid = clf_sigmoid.predict_proba(X_test)[:, 1]

print("Brier scores: (the smaller the better)")

clf_score = brier_score_loss(y_test, prob_pos_clf, sw_test)
print("No calibration: %1.3f" % clf_score)

clf_isotonic_score = brier_score_loss(y_test, prob_pos_isotonic, sw_test)
print("With isotonic calibration: %1.3f" % clf_isotonic_score)

clf_sigmoid_score = brier_score_loss(y_test, prob_pos_sigmoid, sw_test)
print("With sigmoid calibration: %1.3f" % clf_sigmoid_score)

Brier scores: (the smaller the better)
No calibration: 0.104
With isotonic calibration: 0.084
With sigmoid calibration: 0.109

In [6]:
Expand Code
plt.figure()
order = np.lexsort((prob_pos_clf, ))
plt.plot(prob_pos_clf[order], 'r', label='No calibration (%1.3f)' % clf_score)
plt.plot(prob_pos_isotonic[order], 'g', linewidth=3,
label='Isotonic calibration (%1.3f)' % clf_isotonic_score)
plt.plot(prob_pos_sigmoid[order], 'b', linewidth=3,
label='Sigmoid calibration (%1.3f)' % clf_sigmoid_score)
plt.plot(np.linspace(0, y_test.size, 51)[1::2],
y_test[order].reshape(25, -1).mean(1),
'k', linewidth=3, label=r'Empirical')
plt.ylim([-0.05, 1.05])
plt.xlabel("Instances sorted according to predicted probability "
"(uncalibrated GNB)")
plt.ylabel("P(y=1)")
plt.legend(loc="upper left")
plt.title("Gaussian naive Bayes probabilities")

Out[6]:
<matplotlib.text.Text at 0x623ce90>

The following experiment is performed on an artificial dataset for binary classification with 100.000 samples (1.000 of them are used for model fitting) with 20 features. Of the 20 features, only 2 are informative and 10 are redundant. The figure shows the estimated probabilities obtained with logistic regression, a linear support-vector classifier (SVC), and linear SVC with both isotonic calibration and sigmoid calibration. The calibration performance is evaluated with Brier score brier_score_loss, reported in the legend (the smaller the better).

In [7]:
Expand Code
# Create dataset of classification task with many redundant and few
# informative features
X, y = datasets.make_classification(n_samples=100000, n_features=20,
n_informative=2, n_redundant=10,
random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.99,
random_state=42)

In [8]:
Expand Code
def plot_calibration_curve(est, name, fig_index):
"""Plot calibration curve for est w/o and with calibration. """
# Calibrated with isotonic calibration
isotonic = CalibratedClassifierCV(est, cv=2, method='isotonic')

# Calibrated with sigmoid calibration
sigmoid = CalibratedClassifierCV(est, cv=2, method='sigmoid')

# Logistic regression with no calibration as baseline
lr = LogisticRegression(C=1., solver='lbfgs')

fig = plt.figure(fig_index, figsize=(9, 9))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))

ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for clf, name in [(lr, 'Logistic'),
(est, name),
(isotonic, name + ' + Isotonic'),
(sigmoid, name + ' + Sigmoid')]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
if hasattr(clf, "predict_proba"):
prob_pos = clf.predict_proba(X_test)[:, 1]
else:  # use decision function
prob_pos = clf.decision_function(X_test)
prob_pos = \
(prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())

clf_score = brier_score_loss(y_test, prob_pos, pos_label=y.max())
print("%s:" % name)
print("\tBrier: %1.3f" % (clf_score))
print("\tPrecision: %1.3f" % precision_score(y_test, y_pred))
print("\tRecall: %1.3f" % recall_score(y_test, y_pred))
print("\tF1: %1.3f\n" % f1_score(y_test, y_pred))

fraction_of_positives, mean_predicted_value = \
calibration_curve(y_test, prob_pos, n_bins=10)

ax1.plot(mean_predicted_value, fraction_of_positives, "s-",
label="%s (%1.3f)" % (name, clf_score))

ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
histtype="step", lw=2)

ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots  (reliability curve)')

ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)

plt.tight_layout()

In [9]:
# Plot calibration cuve for Linear SVC
plot_calibration_curve(LinearSVC(), "SVC", 2)

Logistic:
Brier: 0.099
Precision: 0.872
Recall: 0.851
F1: 0.862

SVC:
Brier: 0.163
Precision: 0.872
Recall: 0.852
F1: 0.862

SVC + Isotonic:
Brier: 0.100
Precision: 0.853
Recall: 0.878
F1: 0.865

SVC + Sigmoid:
Brier: 0.099
Precision: 0.874
Recall: 0.849
F1: 0.861



One can observe here that logistic regression is well calibrated as its curve is nearly diagonal. Linear SVC's calibration curve has a sigmoid curve, which is typical for an under-confident classifier. In the case of LinearSVC, this is caused by the margin property of the hinge loss, which lets the model focus on hard samples that are close to the decision boundary (the support vectors). Both kinds of calibration can fix this issue and yield nearly identical results. The next figure shows the calibration curve of Gaussian naive Bayes on the same data, with both kinds of calibration and also without calibration.

In [10]:
# Plot calibration cuve for Gaussian Naive Bayes
plot_calibration_curve(GaussianNB(), "Naive Bayes", 1)

Logistic:
Brier: 0.099
Precision: 0.872
Recall: 0.851
F1: 0.862

Naive Bayes:
Brier: 0.118
Precision: 0.857
Recall: 0.876
F1: 0.867

Naive Bayes + Isotonic:
Brier: 0.098
Precision: 0.883
Recall: 0.836
F1: 0.859

Naive Bayes + Sigmoid:
Brier: 0.109
Precision: 0.861
Recall: 0.871
F1: 0.866



One can see that Gaussian naive Bayes performs very badly but does so in an other way than linear SVC: While linear SVC exhibited a sigmoid calibration curve, Gaussian naive Bayes' calibration curve has a transposed-sigmoid shape. This is typical for an over-confident classifier. In this case, the classifier's overconfidence is caused by the redundant features which violate the naive Bayes assumption of feature-independence.

Calibration of the probabilities of Gaussian naive Bayes with isotonic regression can fix this issue as can be seen from the nearly diagonal calibration curve. Sigmoid calibration also improves the brier score slightly, albeit not as strongly as the non-parametric isotonic calibration. This is an intrinsic limitation of sigmoid calibration, whose parametric form assumes a sigmoid rather than a transposed-sigmoid curve. The non-parametric isotonic calibration model, however, makes no such strong assumptions and can deal with either shape, provided that there is sufficient calibration data. In general, sigmoid calibration is preferable if the calibration curve is sigmoid and when there is few calibration data while isotonic calibration is preferable for non- sigmoid calibration curves and in situations where many additional data can be used for calibration.

## Multi-class classification¶

CalibratedClassifierCV can also deal with classification tasks that involve more than two classes if the base estimator can do so. In this case, the classifier is calibrated first for each class separately in an one-vs-rest fashion. When predicting probabilities for unseen data, the calibrated probabilities for each class are predicted separately. As those probabilities do not necessarily sum to one, a postprocessing is performed to normalize them.

The next image illustrates how sigmoid calibration changes predicted probabilities for a 3-class classification problem. Illustrated is the standard 2-simplex, where the three corners correspond to the three classes. Arrows point from the probability vectors predicted by an uncalibrated classifier to the probability vectors predicted by the same classifier after sigmoid calibration on a hold-out validation set. Colors indicate the true class of an instance (red: class 1, green: class 2, blue: class 3).

In [11]:
Expand Code
np.random.seed(0)

# Generate data
X, y = datasets.make_blobs(n_samples=1000, n_features=2, random_state=42,
cluster_std=5.0)
X_train, y_train = X[:600], y[:600]
X_valid, y_valid = X[600:800], y[600:800]
X_train_valid, y_train_valid = X[:800], y[:800]
X_test, y_test = X[800:], y[800:]

In [12]:
Expand Code
# Train uncalibrated random forest classifier on whole train and validation
# data and evaluate on test data
clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train_valid, y_train_valid)
clf_probs = clf.predict_proba(X_test)
score = log_loss(y_test, clf_probs)

# Train random forest classifier, calibrate on validation data and evaluate
# on test data
clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
sig_clf_probs = sig_clf.predict_proba(X_test)
sig_score = log_loss(y_test, sig_clf_probs)

In [13]:
Expand Code
# Plot changes in predicted probabilities via arrows
plt.figure(0, figsize=(10, 8))
colors = ["r", "g", "b"]
for i in range(clf_probs.shape[0]):
plt.arrow(clf_probs[i, 0], clf_probs[i, 1],
sig_clf_probs[i, 0] - clf_probs[i, 0],
sig_clf_probs[i, 1] - clf_probs[i, 1],

# Plot perfect predictions
plt.plot([1.0], [0.0], 'ro', ms=20, label="Class 1")
plt.plot([0.0], [1.0], 'go', ms=20, label="Class 2")
plt.plot([0.0], [0.0], 'bo', ms=20, label="Class 3")

# Plot boundaries of unit simplex
plt.plot([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], 'k', label="Simplex")

# Annotate points on the simplex
plt.annotate(r'($\frac{1}{3}$, $\frac{1}{3}$, $\frac{1}{3}$)',
xy=(1.0/3, 1.0/3), xytext=(1.0/3, .23), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.plot([1.0/3], [1.0/3], 'ko', ms=5)
plt.annotate(r'($\frac{1}{2}$, $0$, $\frac{1}{2}$)',
xy=(.5, .0), xytext=(.5, .1), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($0$, $\frac{1}{2}$, $\frac{1}{2}$)',
xy=(.0, .5), xytext=(.1, .5), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($\frac{1}{2}$, $\frac{1}{2}$, $0$)',
xy=(.5, .5), xytext=(.6, .6), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($0$, $0$, $1$)',
xy=(0, 0), xytext=(.1, .1), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($1$, $0$, $0$)',
xy=(1, 0), xytext=(1, .1), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($0$, $1$, $0$)',
xy=(0, 1), xytext=(.1, 1), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.grid("off")
for x in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
plt.plot([0, x], [x, 0], 'k', alpha=0.2)
plt.plot([0, 0 + (1-x)/2], [x, x + (1-x)/2], 'k', alpha=0.2)
plt.plot([x, x + (1-x)/2], [0, 0 + (1-x)/2], 'k', alpha=0.2)

plt.title("Change of predicted probabilities after sigmoid calibration")
plt.xlabel("Probability class 1")
plt.ylabel("Probability class 2")
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
plt.legend(loc="best")

print("Log-loss of")
print(" * uncalibrated classifier trained on 800 datapoints: %.3f "
% score)
print(" * classifier trained on 600 datapoints and calibrated on "
"200 datapoint: %.3f" % sig_score)

Log-loss of
* uncalibrated classifier trained on 800 datapoints: 1.280
* classifier trained on 600 datapoints and calibrated on 200 datapoint: 0.536


The base classifier is a random forest classifier with 25 base estimators (trees). If this classifier is trained on all 800 training datapoints, it is overly confident in its predictions and thus incurs a large log-loss. Calibrating an identical classifier, which was trained on 600 datapoints, with method='sigmoid' on the remaining 200 datapoints reduces the confidence of the predictions, i.e., moves the probability vectors from the edges of the simplex towards the center:

In [14]:
Expand Code
# Illustrate calibrator
plt.figure(1, figsize=(10, 8))
# generate grid over 2-simplex
p1d = np.linspace(0, 1, 20)
p0, p1 = np.meshgrid(p1d, p1d)
p2 = 1 - p0 - p1
p = np.c_[p0.ravel(), p1.ravel(), p2.ravel()]
p = p[p[:, 2] >= 0]

calibrated_classifier = sig_clf.calibrated_classifiers_[0]
prediction = np.vstack([calibrator.predict(this_p)
for calibrator, this_p in
zip(calibrated_classifier.calibrators_, p.T)]).T
prediction /= prediction.sum(axis=1)[:, None]

# Ploit modifications of calibrator
for i in range(prediction.shape[0]):
plt.arrow(p[i, 0], p[i, 1],
prediction[i, 0] - p[i, 0], prediction[i, 1] - p[i, 1],
# Plot boundaries of unit simplex
plt.plot([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], 'k', label="Simplex")

plt.grid("off")
for x in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
plt.plot([0, x], [x, 0], 'k', alpha=0.2)
plt.plot([0, 0 + (1-x)/2], [x, x + (1-x)/2], 'k', alpha=0.2)
plt.plot([x, x + (1-x)/2], [0, 0 + (1-x)/2], 'k', alpha=0.2)

plt.title("Illustration of sigmoid calibrator")
plt.xlabel("Probability class 1")
plt.ylabel("Probability class 2")
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)

Out[14]:
(-0.05, 1.05)

This calibration results in a lower log-loss. Note that an alternative would have been to increase the number of base estimators which would have resulted in a similar decrease in log-loss.

## Summary

In summary, the newly added CalibratedClassifierCV allows to improve the quality of predicted class probabilities of binary and multi-class classifiers. It provides both a parametric calibration assuming a sigmoid shape of the calibration curve and a non-parametric calibration based on isotonic regression, which can cope with any shape of the calibration curve, provided that sufficient calibration data exists. The main bottlenecks of the method are the increased computation time due to the internal cross-validation loop and the necessity of additional calibration data, which can be alleviated by cross-validation.

## References

[1] Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001

[2] Transforming Classifier Scores into Accurate Multiclass Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002)

[3] Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, J. Platt, (1999)

[4] Predicting Good Probabilities with Supervised Learning, A. Niculescu-Mizil & R. Caruana, ICML 2005

In [15]:
%load_ext watermark
%watermark -a "Jan Hendrik Metzen" -d -v -m -p numpy

Jan Hendrik Metzen 14/04/2015

CPython 2.7.5+
IPython 2.4.1

numpy 1.9.2

compiler   : GCC 4.8.1
system     : Linux
release    : 3.16.0-28-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit


This post was written as an IPython notebook. You can download this notebook.

## April 13, 2015

### Titus Brown

#### PyCon sprints - playing with Docker

I'm at the PyCon 2015 sprints (day 2), and I took the opportunity to play around with Docker a bit.

First, I created a local docker container that contained an installed version of khmer. I ran a blank docker container:

docker run -it ubuntu


and then installed the khmer prereqs on it, followed by khmer (run on the docker container):

apt-get update
apt-get install -y python-pip python-dev
pip install khmer


then I logged out, and created a named docker container from that:

docker commit -m "installed khmer" -a "Titus Brown" 28138ab4095d titus/khmer:v1.3


And now I can run the khmer scripts like this --

docker run -it titus/khmer:v1.3 /usr/local/bin/normalize-by-median.py


Next, I automated all of this with a Dockerfile:

cat > Dockerfile <<EOF
FROM ubuntu:14.04
MAINTAINER Titus Brown <titus@idyll.org>
RUN apt-get update && apt-get install -y python-dev python-pip
RUN pip install -U setuptools
RUN pip install khmer==1.3
EOF

docker build -t titus/khmer:v1.3 .


Either container lets me run normalize by median on a file outside of the container like so:

mkdir foo
curl https://raw.githubusercontent.com/ged-lab/khmer/master/data/100k-filtered.fa > foo/sequence.fa

docker run -v $PWD/foo:/data -it titus/khmer:v1.3 normalize-by-median.py /data/sequence.fa -o /data/result.fa  Here, I'm running 'normalize-by-median' on the input file 'foo/sequence.fa', and the output is placed in 'foo/result.fa'. The trick is that the './foo/' directory is mounted on the docker container as '/data/', so normalize-by-median.py sees it as '/data/sequence.fa'. Nice and easy, so far... --titus #### PyCon sprints - playing with named pipes and streaming and khmer I'm at the PyCon 2015 sprints (day 2), and I took the opportunity to play around with named pipes. I was reminded of named pipes by Vince Buffalo in this great blog post, and since we at the khmer project are very interested in streaming, and named pipes fit well with a streaming perspective, I thought I'd check out named pipes as a way to communicate sequences between different khmer scripts. First, I tried using named pipes to tie digital normalization together with splitting reads out into two output files, left and right: mkfifo aa mkfifo bb # set up diginorm, reading from 'aa' and writing to 'bb' normalize-by-median.py aa -o bb & # split reads into left and right, reading from 'bb' and outputting to # output.1 / output.2 split-paired-reads.py bb -1 output.1 -2 output.2 & # feed in your sequences! cat sequence.fa > aa  Here the setup is a bit weird, because you have to set up all the commands that are going to operate on the data before feeding in the data. But it all works! Next, I tried process substitution. This does essentially the same thing as above, but is much nicer and more succinct: normalize-by-median.py sequence.fa -o >(split-paired-reads.py /dev/stdin -1 output.1 -2 output.2)  Finally, I tried to make use of functionality from our new semi-streaming paper to run 3-pass digital normalization using semi-streaming -- trim-low-abund.py -Z 20 -C 2 sequence.fa -o >(normalize-by-median.py /dev/stdin -o result.fa  but this fell apart because 'trim-low-abund.py' doesn't support -o ;). This led to a few issues being filed in our issue tracker... Very neat stuff. It has certainly given me a strong reason to make sure all of our scripts support streaming input and output! --titus #### Taking grad students to PyCon I am still up at PyCon 2015 in Montreal, and most of my lab is here with me. On Saturday, I told Terry Peppers and some others that PyCon had been one of my (limited) lifelines to (limited) sanity during my early tenure-track years. Whenever I was in danger of buying into too much of the academic bubble, events like PyCon and the various Twitter and blogging interactions I had in the Python world helped remind me how ridiculous some aspects of the academic world are. On Sunday, a colleague at another university (who had not been to PyCon before, I think) was talking to me and said (paraphrased): "I wish I could afford to bring my graduate students. This conference is way more relevant to what they'll probably be doing down the road than any academic conference." On Monday, I gave a talk in the Biology program at McGill, where I was invited by the grad student association. I spent a fair amount of time talking with grad students about career options, and when I mentioned that PyCon was just finishing up, a few of them said they wished they'd known about it. I think they'd have had a good time, especially at the job fair. Moral of the story, for me: keep on bringing my grad students to PyCon. --titus ## April 09, 2015 ### Titus Brown #### PyCon 2015 talk notes for &quot;How to interpret your own genome&quot; Here are talk notes and links for my PyCon 2015 talk. The talk slides are up on SlideShare. ## General background You should definitely check out Mike Lin's great blog posts on "Blogging my genome". I found SNPedia through this wonderful blog post on how to use 23andMe irresponsibly, on Slate Star Codex. My introduction to bcbio came from Brad Chapman's excellent blog post on evaluating and comparing variant detection methods. There are several openly available benchmarking data sets for human genetics/genomics. The Ashkenazim data set I used for my talk is here, and you can see the Personal Genome Project profile for the son, here. The raw data is available here, and you can see the resequencing report for the son, here. The Personal Genome Project is something worth checking out. More and more of human genetics and genomics is "open" -- check out the Variant Call Format (VCF) spec, now on github. ## Pipeline To run the bcbio variant calling pipeline I discuss in the talk, or examine the SNPs in the Ashkenazim trio with Gemini, take a look at my pipeline notes. The Gemini part will let you examine SNPs for the three individuals in the Ashkenazi trio, starting from the VCF files. ## Slide notes Slide 4: this link explains recombination and inheritance REALLY well. This John Hawks' blog post is my source for 300-600 novel mutations per generation. Slide 19: You can read more about the Ashkenazi Jews here. The data sets are available here. Slide 27: Canavan Disease Slides 30 and 31 from Demographic events and evolutionary forces shaping European genetic diversity by Veeramah and Novembre, 2014. Slide 35: the "narcissome" link Slide 36: a paper on lack of concordance amongst variant callers. Slide 37: the gene drive link. ## How to get involved I asked the bcbio and gemini folk if there were any opportunities for Python folk to get involved in their work. Here are some of their thoughts: ### Continuum Analytics #### What's new in Blaze tl; dr: We discuss the latest features and user facing API changes of Blaze. Blaze has undergone quite a bit of development in the last year. There are some exciting features to discuss with some important user facing API changes. ## April 08, 2015 ### NeuralEnsemble #### Elephant 0.1.0 released We are pleased to announce the first release of the Elephant toolbox for the analysis of neurophysiology data. Elephant builds on the Python scientific stack (NumPy, SciPy) to provide a set of well-tested analysis functions for spike train data and time series recordings from electrodes, such as spike train statistics, power spectrum analysis, filtering, cross-correlation and spike-triggered averaging. The toolbox also provides tools to generate artificial spike trains, either from stochastic processes or by controlled perturbations of real spike trains. Elephant is built on the Neo data model, and takes advantage of the broad file-format support provided by the Neo library. A bridge to the Pandas data analysis library is also provided. Elephant is a community-based effort, aimed at providing a common platform to test and distribute code from different laboratories, with the goal of improving the reproducibility of modern neuroscience research. If you are a neuroscience researcher or student using Python for data analysis, please consider joining us, either to contribute your own code or to help with code review and testing. Elephant is the direct successor to NeuroTools and maintains ties to complementary projects such as OpenElectrophy and SpykeViewer. It is also the default tool for electrophysiology data analysis in the Human Brain Project. As a simple example, let's generate some artificial spike train data using a homogeneous Poisson process: from elephant.spike_train_generation import homogeneous_poisson_process from quantities import Hz, s, ms spiketrains = [ homogeneous_poisson_process(rate=10.0*Hz, t_start=0.0*s, t_stop=100.0*s) for i in range(100)] and visualize it in Matplotlib: import matplotlib.pyplot as plt import numpy as np for i, spiketrain in enumerate(spiketrains): t = spiketrain.rescale(ms) plt.plot(t, i * np.ones_like(t), 'k.', markersize=2) plt.axis('tight') plt.xlim(0, 1000) plt.xlabel('Time (ms)', fontsize=16) plt.ylabel('Spike Train Index', fontsize=16) plt.gca().tick_params(axis='both', which='major', labelsize=14) plt.show() Now we calculate the coefficient of variation of the inter-spike interval for each of the 100 spike trains. from elephant.statistics import isi, cv cv_list = [cv(isi(spiketrain)) for spiketrain in spiketrains] As expected for a Poisson process, the values cluster around 1: plt.hist(cv_list) plt.xlabel('CV', fontsize=16) plt.ylabel('count', fontsize=16) plt.gca().tick_params(axis='both', which='major', labelsize=14) plt.show() ## April 06, 2015 ### Titus Brown #### Towards a bioinformatics middle class Note: Turns out Nick Loman is a C programmer. Well, that's what happens when I make assumptions, folks ;). Jared Simpson just posted a great blog entry on nanopolish, an HMM-based consensus caller for Oxford Nanopore data. In it he describes how he moved from a Python prototype to a standalone C++ program. It's a great blog post, but it struck one discordant note for me. In the post, Jared says: I was not satisfied with the Python/C++ hybrid design. I am sensitive to installation issues when releasing software as I have found that installing dependencies is a major source of problems for the user (...). I admire Heng Li's software where one usually just needs to run git clone and make to build the program. Fundamentally, moving from a lightweight Python layer on top of a heavier, optimized C++ library into a standalone binary seems like a step backwards to me. I wrote on Twitter, I worry ... that short-term convenience is lost at expense of composability and flexibility. Thoughts? which spawned an interesting conversation about dependency hell, software engineering done proper, funding, open source process and upstream contributions, and design vs language. I don't have a single coherent message to give, but I wanted to make a few points in a longer format (hence this blog post). First, format hell is directly caused by everyone developing their own programs, which consume and emit semi-standard formats. In the absence of strong format standards (which I don't particularly want), our choice may be between this format hell, and internal manipulation of rich formats and data structures. (Maybe I'm over-optimistic about the latter?) Second, a component ecosystem, based on scripting language wrappers around C/C++, strikes me as a major step forward in terms of flexibility and composability. (That's what we're working towards with some of our khmer development.) A bunch of lightweight components that each do one interesting thing, well, would let us separate specific function from overall processing logic and format parsing. Moreover, it would be testable - a major concern with our current "stack" in bioinformatics, which is only amenable to high level testing. Third, and this is my real concern - C++ is an utterly opaque language to most people. For example, Nick Loman - a pretty sophisticated bioinformatics dude, IMO - is almost certainly completely incapable of doing anything with nanopolish's internals. This is because his training is in biology, not in programming. I'm picking on Nick because he's Jared's partner in nanopolish, but I think this is a generally true statement of many quite capable bioinformaticians. Heck, I'm perfectly capable in C and can scratch my way through C++ programming, but I do my best to avoid packages that have only a C++ interface because of the procedural overhead involved. I disagree strongly with Jared's black & white statement that "this isn't a language problem" -- part of it absolutely is! Scripting languages enable a much more flexible and organic interaction with algorithms than languages like Java and C++, in my extensive personal experience. People can also pick up scripting languages much more easily than they can C++ or Java, because the learning curve is not so steep (although the ultimate distance climbed may be long). This leads into my choice of title - at the end of the day, what do we want? Do we want a strong division between bioinformatics "haves" - those who can grok serious C++ code at a deep level, and interact with the C++ interface when they need to adapt code, vs those who consume text I/O at the command line or via hacked-together pipelines? Or do we want a thicker "middle class", who appreciate the algorithms and can use and reuse them in unexpected ways via a scripting language, but without the investment and time commitment of digging into the underlying library code? I've put my energy squarely on the latter vision, with teaching, training, software development, and publication, so you know where I stand :). Maybe I'm naive in thinking that this approach will build a better, stronger set of approaches to bioinformatics and data-intensive biology than the other approach, but I'm giving it the ol' college try. --titus p.s. I'm hoping to post a powerful demonstration of the component/library approach in a few weeks; I'll revisit the topic then. p.p.s. It should go without saying that Jared and Nick and Heng Li are all great; this is not a personal diatribe, and given the amount of time and energy we put into building khmer as a Python library, I wouldn't recommend this approach in the current funding climate. But I want to (re)start the discussion! ## April 04, 2015 ### Juan Nunez-Iglesias #### jnuneziglesias Over the past few weeks, I’ve been heavily promoting the SciPy conference, a meeting about scientific programming in Python. I’ve been telling everyone who would listen that they should submit a talk abstract and go, because scientific programming is increasingly common in any scientist’s work and SciPy massively improves how you do that. I have also been guiltily ommitting that the speaker and attendee diversity at SciPy is shockingly bad. Last year, for example, 15% of attendees were women, and that was an improvement over the ratio three years ago, when just 3% (!!!) were women. I rationalised continuing to promote this conference because there was talk from past organisers about making efforts to improve. (And indeed, the past three years have been on an upward trajectory.) A couple of days ago, however, the full list of keynote speakers was announced, and lo and behold, it’s three white guys. I have to acknowledge that they are extremely accomplished in the SciPy universe, and, if diversity were not more generally a problem at this conference and in tech in general, I wouldn’t bat an eye. Excellent choice of speakers, really. Looking forward to it. But diversity is a problem. It’s an enormous problem. I’m inclined to call it catastrophic. Let me try to quantify it. Men and women are equally capable scientific programmers. So out of a total pool of 100 potential SciPy attendees/contributors, 50 are women and 50 are men. Now, let’s assume the male side of the community is working at near-optimum capacity, so, 50 of 50 those men are at SciPy. 15% of attendees being women means just 9 of the 50 potential female contributors are making it out to the conference (9/59 ≈ 15%). Or, slice it another way, a whopping (50 – 9) / 50 = 82% of women who could be contributing to SciPy are missing. Now, think of where we would be if we took 82% of male science-Pythonistas and just erased their talks, their discussions, and their code contributions. The SciPy ecosystem would suck. Yet that is exactly how many coders are missing from our community. Now, we can sit here and play female conference speaker bingo until the cows come home to roost, but that is missing the point: these are all only excuses for not doing enough. “Not my fault” is not good enough anymore. It is everyone’s fault who does not make an active and prolonged effort to fix things. The keynote speakers are an excellent place to make a difference, because you can eliminate all sorts of confounders. I have a certain sympathy, for example, for the argument that one should pick the best abstracts/scholarship recipients, rather than focus on race or gender. (Though the process should be truly blind to remove pervasive bias, as studies and the experience of orchestra auditions have repeatedly shown.) For keynotes though, organisers are free to pursue any agenda they like. For example, they can make education a theme, and get Lorena Barba and Greg Wilson in, as they did last year. Until the gender ratio begins to even remotely approach 1:1, diversity as an agenda should be a priority for the organisers. This doesn’t mean invite the same number of women and men to give keynotes. This means keep inviting qualified women until you have at least one confirmed female keynote speaker, and preferably two. Then, and only then, you can look into inviting men. Women have been found to turn down conference invitations more often than men, irrespective of ability or accomplishment. I don’t know why, but I suspect one reason is lack of role models, in the form of previous female speakers. That’s why this keynote roster is so disappointing. There’s tons of accomplished female Pythonistas out there, and there would be even more if we all made a concerted effort to improve things. I don’t want to retread the same territory that Jonathan Eisen (@phylogenomics) has already covered in “Calling attention to meeting with skewed gender ratios, even when it hurts“. In particular, see that article for links to many others with ideas to improve gender ratios. But this is my contribution in the exact same series: love SciPy. See my previous posts for illustration. Even looking back at my recent post, when I looked for a picture that I thought captured the collegial, collaborative feel of the conference, I unintentionally picked one featuring only men. This needs to improve, massively, if I’m going to keep supporting this conference. I really hope the organisers place diversity at the centre of their agenda for every decision going forward. I thank Jonathan Eisen, Andy Ray Terrel, and April Wright for comments on earlier versions of this article. ## April 02, 2015 ### Titus Brown #### OpenCon webcast: &quot;How to get tenure (while being open)&quot; This is a stub blog post for the talk notes for my OpenCon talk on how to get tenure as an open scientist. A few links -- More soon! --titus playing on easy ## April 01, 2015 ### Titus Brown #### Cultural confusions about data - the intertidal zone between two styles of biology A few weeks back, a journalist contacted me about my old blog post comparing physics and biology, and amidst other conversation, I pointed them at my latest blog post on data and said that I thought a lot of (molecular) biologists were "culturally confused about data". The next question was, perhaps obviously, "what do you mean by that?" and I wrote in response: ... for molecular biologists, "data" is what they collect piece by piece with PCR, qPCR, clone sequencing, perturbation experiments, image observations. It is so individual and specialized to a problem that to share it prior to publication makes no sense; no one could understand it all unless it were fit into a narrative as part of a pub, and the only useful product of the data is the publication; access to the data is only useful for verifying that it wasn't manufactured. Sequencing data was one of the first outputs (as opposed to things like reagents, antibodies and QPCR primers) that was useful beyond a particular narrative. I might sequence a gene because I want to knock it down and need to know its leader sequence for that, but then you might care about its exonic structure for evolutionary reasons, and Phil over there might be really interested in its protein domains, while Kim might be looking at an allele of that gene that is only in part of the population. I'm probably overstating that distinction but it helps explain a LOT of what I've seen in terms of cultural differences between my grad/pd labs (straight up bio) and where I think bio is going. I'm sure I'm wrong (certainly incomplete) about lots of this, but it does fit my own personal observations. Other perspectives welcome! I decided to write this up as a blog post because I read Carly Strasser's excellent blog post introducing open science, which emphasizes data, and it made me think about my response above. I think it's interesting to think about how "data" can be interpreted by different fields, and I'd like to stress how important it is that we bridge the gap between these high-level views and day-to-day practice in each subdomain - the culture and language can vary so significantly between even neighboring fields! Oh, and Carly Strasser is now one of the Moore Data Driven Discovery Initiative Program Officers - I'm really happy to see the Moore Foundation confronting these aspects of data head on by hiring someone with Carly's experience and expertise, and I look forward to interacting with her more on these issues! --titus ### Gaël Varoquaux #### Job offer: working on open source data processing in Python We, Parietal team at INRIA, are recruiting software developers to work on open source machine learning and neuroimaging software in Python. In general, we are looking for people who: • have a mathematical mindset, • are curious about data (ie like looking at data and understanding it) • have an affinity for problem-solving tradeoffs • love high-quality code • worry about users • are good scientific Python coders, • enjoy interacting with a community of developers We welcome candidates people without all the skills, but are strongly motivated to acquire them. Prior open-source experience is a big plus. One example of such position with application in Neuroimaging is: http://gael-varoquaux.info/programming/hiring-a-programmer-for-a-brain-imaging-machine-learning-library.html Which was opened a year ago and has now resulted in nilearn: http://nilearn.github.io/ Other positions may be more focused on general machine learning or computing tools such as scikit-learn and joblib, which are reference open-source libraries for data processing in Python. We are a tightly knit team, with a high degree of programming, data analysis and neuroimaging skills. Please contact me and Olivier Grisel if you are interested, ### NeuralEnsemble #### ANN: HoloViews 1.0 data visualization and ImaGen 2.0 pattern generation in Python We are pleased to announce the first public release of HoloViews, a free Python package for scientific and engineering data visualization: http://ioam.github.io/holoviews and version 2.0 of ImaGen, a free Python package for generating two-dimensional patterns useful for vision research and computational modeling: http://ioam.github.io/imagen HoloViews provides composable, sliceable, declarative data structures for building even complex visualizations of any scientific data very easily. With HoloViews, you can see your data as publication-quality figures almost instantly, so that you can focus on the data itself, rather than on laboriously putting together your figures. Even complex multi-subfigure layouts and animations are very easily built using HoloViews. ImaGen provides highly configurable, resolution-independent input patterns, directly visualizable using HoloViews but also available without any plotting package so that they can easily be incorporated directly into your computational modeling or visual stimulus generation code. With ImaGen, any software with a Python interface can immediately support configurable streams of 0D, 1D, or 2D patterns, without any extra coding. HoloViews and ImaGen are very general tools, but they were designed to solve common problems faced by vision scientists and computational modelers. HoloViews makes it very easy to visualize data from vision research, whether it is visual patterns, neural activity patterns, or more abstract measurements or analyses. Essentially, HoloViews provides a set of general, compositional, multidimensional data structures suitable for both discrete and continuous real-world data, and pairs them with separate customizable plotting classes to visualize them without extensive coding. ImaGen 2.0 uses the continuous coordinate systems provided by HoloViews to implement flexible resolution-independent generation of streams of patterns, with parameters controlled by the user and allowing randomness or other arbitrary changes over time. These patterns can be used for visual stimulus generation, testing or training computational models, initializing connectivity in models, or any other application where random or dynamic but precisely controlled streams of patterns are needed. Features: - Freely available under a BSD license - Python 2 and 3 compatible - Minimal external dependencies -- easy to integrate into your workflow - Declarative approach provides powerful compositionality with minimal coding - Include extensive, continuously tested IPython Notebook tutorials - Easily reconfigurable using documented and validated parameters - Animations are supported natively, with no extra work - Supports reproducible research -- simple specification, archived in an IPython Notebook, providing a recipe for regenerating your results - HoloViews is one of three winners of the 2015 UK Open Source Awards To get started, check out ioam.github.io/holoviews and ioam.github.io/imagen! Jean-Luc Stevens Philipp Rudiger Christopher Ball James A. Bednar ## March 31, 2015 ### Continuum Analytics #### Continuum Analytics - April Tech Events Catch us this month at PyCon Montreal and PyData Dallas - the first PyData Conference in Texas - plus all the events below. Email info@continuum.io to schedule a meeting or to let us know about your meetup/event! ### Matthieu Brucher #### Book: Building Machine Learning Systems in Python Almost 18 months ago, I posted a small post on the first version of this book (http://matt.eifelle.com/2013/09/04/book-building-machine-learning-systems-in-python/). At the time, I was really eager to see the second edition of it. And there it is! I had once again the privilege of being a technical reviewer for this book, and I havce to say that the quality didn’t lower one bit, it went even higher. Of course, there is still room for a better book, when all Python module for Machine Learning are even better. I guess that will be for the third edition! To get the book from the publisher: https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python-second-edition On other matters, the blog was quiet for a long time, I’m hoping to get some time to post a few new posts soon, but it is quite hard as I’m currently studying for another master’s degree! ## March 30, 2015 ### Titus Brown #### Some ideas for workshops and unconference models for data-intensive biology Here at the Lab for Data-Intensive Biology (TM) we are constantly trying to explore new ideas for advancing the practice of biological data sciences. Below are some ideas that originated with or were sharpened by conversations with Greg Wilson (Executive Director, Software Carpentry) and Tracy Teal (Project Lead, Data Carpentry) that I am interested in turning into reality, as part of my training efforts in Data Intensive Biology at UC Davis (also see my blog post). ## Quarterly in-depth "unconferences" on developing advanced domain-specific data analyses I spend an increasing amount of time working to teach people how to analyze sequencing data, but in practice we kinda suck at analyzing this data, especially when it's from non-model systems. We need some workshops to advance those efforts, too. So, I am thinking of running quarterly (4x year) week-long unconferences, each on a different topic. The idea is to get together a small group of people (~15-25) actively and openly working on various aspects of a specific type of data analysis, to hang out and collaborate/coordinate their efforts for a week. The plan would be to intersperse a few presentations with lots of hacking and communication time, with the goal of making progress on topics of mutual interest and nucleating collaborative online interactions. The following topics are areas where I can easily imagine a week-long technical workshop making some substantive progress: • vet/ag genome (re)annotation and curation • non-model transcriptome analysis • geomicrobiology/function-focused metagenome analysis • bioinformatics training (e.g. train the trainers) • reproducibility in genomic analysis • computational protocols and benchmarking • advanced statistical approaches to data integration Importantly, these workshops would be inexpensive and largely unfunded - I would ask participants to fund themselves, rather than seeking to write grants to support a bunch of people. If we can locate them in an inexpensive place then the total cost would be in the$1000-2000/person range, which most active research labs could probably support. I would seek funding for scholarships to increase diversity of participants, but beyond that my goal would be to make these workshops so useful that active and funded researchers want to come. (I mean, I wouldn't turn down money that dropped into my lap, but I've had too many workshop proposals rejected as not quite what the PMs wanted to get on that merry-go-round again, unless it's critical to make something super important happen.)

One non-negotiable component for me would be that everything worked on at these meetings would be under open licenses, and already being developed openly. Although I suppose we could have a meeting where people interested in opening their software could get together to do so in a guided fashion... proponents of open science, take note!

Such workshops would not need to be hosted at UC Davis, or by me; I just want to make 'em happen and am happy to co-organize or coordinate, and think I could do ~4 a year myself. There are a lot of people invested in progressing on these issues who already have some money, and so one option for moving forward more generally would be to find those people and co-opt them :).

## A workshop consisting of half-day focused lessons

Last week, I ran a workshop on starting a new project with reproducibility in mind --

Reproducible Computational Analysis - How to start a new project

Description: Computational science projects, from data analysis to modeling, can benefit dramatically from a little up-front investment in automation; starting off with version control and automated building of results will pay off in efficiency, agility, and both transparency and reproducibility of the results. However, most computational researchers have never been exposed to a completely automated analysis pipeline. I will demonstrate the process of initiating a new project, building a few initial scripts, and automating the generation of results, as well as building some graphs. While the topic will be from my own research in bioinformatics, the overall approach should apply to anyone doing data analysis or simulations.

Technology used will include git, IPython Notebook, and 'make'.

This is an interactive seminar intended for computational science researchers with some experience in version control and scripting (for example, if you've taken a Software Carpentry workshop, you will be at a good starting point).

This is an idea that originated with Greg - he nucleated the idea, and then I went ahead and tried it out. More on that workshop later, but... why not do this a lot?

We are thinking about how to do a focused series of these ~3 hour learning opportunities, either all demo or half-demo/half participation, each on a different topic. For example, my lab could do a section on k-mer analyses of large sequencing data sets, or on GitHub Flow, or on software testing, or on whatever; the important thing is that there are tons of Software Carpentry instructors with deep roots in one discipline or another, and it'd be a fun way to learn from each other while teaching to a larger audience.

This is something we might try during the third week of our NGS summer course; if you're a badged SWC instructor and want to demo something related to sequence analysis, drop me a note with a brief proposal.

## Instructor gatherings for lesson development and testing

Tracy Teal, Jason Williams, Mike Smorul, Mary Shelley, Shari Ellis, and Hilmar Lapp just ran a Data Carpentry hackathon focused on lesson development and assessment. Riffing off of that, what about getting instructors together to do lesson development and testing on a regular basis, and then present it in front of a more advanced crowd? This would be an opportunity for people to develop and test lessons for Software Carpentry and Data Carpentry on a tolerant audience, with other instructors around to offer help and advice, and without the challenges of a completely novice audience for the first time through.

This is also something we might try during the third week of our NGS summer course; if you're a badged SWC instructor and want to do something on sequence analysis, please drop me a note and tell me what!

Any other thoughts on things that have worked, or might work, for advancing training and practice in a hands-on manner?

thanks,

--titus

## March 27, 2015

### Gaël Varoquaux

#### Euroscipy 2015: Call for paper

EuroScipy 2015, the annual conference on Python in science will take place in Cambridge, UK on 26-30 August 2015. The conference features two days of tutorials followed by two days of scientific talks & posters and an extra day dedicated to developer sprints. It is the major event in Europe in the field of technical/scientific computing within the Python ecosystem. Scientists, PhD’s, students, data scientists, analysts, and quants from more than 20 countries attended the conference last year.

The topics presented at EuroSciPy are very diverse, with a focus on advanced software engineering and original uses of Python and its scientific libraries, either in theoretical or experimental research, from both academia and the industry.

Submissions for posters, talks & tutorials (beginner and advanced) are welcome on our website at http://www.euroscipy.org/2015/ Sprint proposals should be addressed directly to the organisation at euroscipy-org@python.org

Important dates:

• Apr 30, 2015 Talk and tutorials submission deadline
• May 1, 2015 Registration opens
• May 30, 2015 Final program announced
• Jun 15, 2015 Early-bird registration ends
• Aug 26-27, 2015 Tutorials
• Aug 28-29, 2015 Main conference
• Aug 30, 2015 Sprints

We look forward to an exciting conference and hope to see you in Cambridge

The EuroSciPy 2015 Team - http://ww.euroscipy.org/2015/

## March 26, 2015

### Titus Brown

#### A new preprint on semi-streaming analysis of large data sets

We just posted a new preprint (well, ok, a few weeks back)! The preprint title is "Crossing the streams: a framework for streaming analysis of short DNA sequencing reads", by Qingpeng Zhang, Sherine Awad, and myself. Note that like our other recent papers, this paper is 100% reproducible, with all source code in github. we haven't finished writing up the instructions yet, but happy to share (see AWS.md for notes).

This paper's major point is that you can use the massive redundancy of deep short-read sequencing to analyze data as it comes in, and that this could easily be integrated into existing k-mer based error correctors and variant callers. Conveniently the algorithm doesn't care what you're sequencing - no assumptions are made about uniformity of coverage, so you can apply the same ol' spectral approaches you use for genomes to transcriptomes, metagenomes, and probably single-cell data. Which is, and let's be frank here, super awesome.

The paper also provides two useful tools, both implemented as part of khmer: one is an efficient approach to k-mer-abundance-based error trimming of short reads, and the other is a streaming approach to looking at per-position error profiles of short-read sequencing experiments.

A colleague, Erich Schwarz, suggested that I more strongly make the point that is really behind this work: we're in for more data. Scads and scads of it. Coming it with ways of efficiently dealing with it at an algorithmic level is important. (We didn't strengthen this point in the posted preprint - the feedback came too late -- but we will hopefully get a chance to stick it in in the revisions.)

The really surprising thing for me is that the general semi-streaming approach has virtually no drawbacks - the results are very similar to the full two-pass offline approaches used currently for error correction etc. Without implementing a huge amount of stuff we had to make this argument transitively, but I think it's solid.

For entertainment, take a look at the error profile in Figure 6. That's from a real data set, published in Nature something or other...

And, finally, dear prospective reviewers: the biggest flaws that I see are these:

• we chose to make most of our arguments by analyzing real data, and we didn't spend any time developing theory. This is a choice that our lab frequently makes -- to implement effective methods rather than developing the underlying theory -- but it leaves us open for a certain type of criticism.
• to extend the previous point, the algorithmic performance depends critically on details of the data set. We didn't know how to discuss this and so the discussion is maybe a bit weak. We'd love reviewers to ask pointed questions that we can address in order to shore it up.
• Repeats! How does all this stuff work with repeats!? I did a lot of work simulating repetitive sequences and couldn't find any place where repeats actually caused problems. My intuition now tells me that repeats are not actually a problem for methods that interrogate De Bruijn graphs using entire reads as an index into the graph, but I'd welcome someone telling me I'm wrong and either telling me where to look, or asking concrete questions that illuminate better directions to take it.

--titus

## March 25, 2015

### Matthew Rocklin

#### Partition and Shuffle

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

This post primarily targets developers.

tl;dr We partition out-of-core dataframes efficiently.

## Partition Data

Many efficient parallel algorithms require intelligently partitioned data.

For time-series data we might partition into month-long blocks. For text-indexed data we might have all of the “A”s in one group and all of the “B”s in another. These divisions let us arrange work with foresight.

To extend Pandas operations to larger-than-memory data efficient partition algorithms are critical. This is tricky when data doesn’t fit in memory.

## Partitioning is fundamentally hard

Data locality is the root of all performance
-- A Good Programmer


Partitioning/shuffling is inherently non-local. Every block of input data needs to separate and send bits to every block of output data. If we have a thousand partitions then that’s a million little partition shards to communicate. Ouch.

Consider the following setup

  100GB dataset
/ 100MB partitions
= 1,000 input partitions


To partition we need shuffle data in the input partitions to a similar number of output partitions

  1,000 input partitions
* 1,000 output partitions
= 1,000,000 partition shards


If our communication/storage of those shards has even a millisecond of latency then we run into problems.

  1,000,000 partition shards
x 1ms
= 18 minutes


Previously I stored the partition-shards individually on the filesystem using cPickle. This was a mistake. It was very slow because it treated each of the million shards independently. Now we aggregate shards headed for the same out-block and write out many at a time, bundling overhead. We balance this against memory constraints. This stresses both Python latencies and memory use.

## BColz, now for very small data

Fortunately we have a nice on-disk chunked array container that supports append in Cython. BColz (formerly BLZ, formerly CArray) does this for us. It wasn’t originally designed for this use case but performs admirably.

Briefly, BColz is…

• A binary store (like HDF5)
• With columnar access (useful for tabular computations)
• That stores data in cache-friendly sized blocks
• With a focus on compression
• Written mostly by Francesc Alted (PyTables) and Valentin Haenel

It includes two main objects:

• carray: An on-disk numpy array
• ctable: A named collection of carrays to represent a table/dataframe

## Partitioned Frame

We use carray to make a new data structure pframe with the following operations:

• Append DataFrame to collection, and partition it along the index on known block divisions blockdivs
• Extract DataFrame corresponding to a particular partition

Internally we invent two new data structures:

• cframe: Like ctable this stores column information in a collection of carrays. Unlike ctable this maps perfectly onto the custom block structure used internally by Pandas. For internal use only.
• pframe: A collection of cframes, one for each partition.

Through bcolz.carray, cframe manages efficient incremental storage to disk. PFrame partitions incoming data and feeds it to the appropriate cframe.

## Example

Create test dataset

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': [1, 2, 3, 4],
...                        'b': [1., 2., 3., 4.]},
...                       index=[1, 4, 10, 20])

Create pframe like our test dataset, partitioning on divisions 5, 15. Append the single test dataframe.

In [3]: from pframe import pframe
In [4]: pf = pframe(like=df, blockdivs=[5, 15])
In [5]: pf.append(df)

Pull out partitions

In [6]: pf.get_partition(0)
Out[6]:
a  b
1  1  1
4  2  2

In [7]: pf.get_partition(1)
Out[7]:
a  b
10  3  3

In [8]: pf.get_partition(2)
Out[8]:
a  b
20  4  4

Continue to append data…

In [9]: df2 = pd.DataFrame({'a': [10, 20, 30, 40],
...                         'b': [10., 20., 30., 40.]},
...                        index=[1, 4, 10, 20])
In [10]: pf.append(df2)

… and partitions grow accordingly.

In [12]: pf.get_partition(0)
Out[12]:
a   b
1   1   1
4   2   2
1  10  10
4  20  20

We can continue this until our disk fills up. This runs near peak I/O speeds (on my low-power laptop with admittedly poor I/O.)

## Performance

I’ve partitioned the NYCTaxi trip dataset a lot this week and posting my results to the Continuum chat with messages like the following

I think I've got it to work, though it took all night and my hard drive filled up.
Down to six hours and it actually works.
Three hours!
By removing object dtypes we're down to 30 minutes
20!  This is actually usable.
OK, I've got this to six minutes.  Thank goodness for Pandas categoricals.
Five.
Down to about three and a half with multithreading, but only if we stop blosc from segfaulting.


And thats where I am now. It’s been a fun week. Here is a tiny benchmark.

>>> import pandas as pd
>>> import numpy as np
>>> from pframe import pframe

>>> df = pd.DataFrame({'a': np.random.random(1000000),
'b': np.random.poisson(100, size=1000000),
'c': np.random.random(1000000),
'd': np.random.random(1000000).astype('f4')}).set_index('a')

Set up a pframe to match the structure of this DataFrame Partition index into divisions of size 0.1

>>> pf = pframe(like=df,
...             blockdivs=[.1, .2, .3, .4, .5, .6, .7, .8, .9],
...             chunklen=2**15)

Dump the random data into the Partition Frame one hundred times and compute effective bandwidths.

>>> for i in range(100):
...     pf.append(df)

CPU times: user 39.4 s, sys: 3.01 s, total: 42.4 s
Wall time: 40.6 s

>>> pf.nbytes
2800000000

>>> pf.nbytes / 40.6 / 1e6  # MB/s
68.9655172413793

>>> pf.cbytes / 40.6 / 1e6  # Actual compressed bytes on disk
41.5172952955665

We partition and store on disk random-ish data at 68MB/s (cheating with compression). This is on my old small notebook computer with a weak processor and hard drive I/O bandwidth at around 100 MB/s.

## Theoretical Comparison to External Sort

There isn’t much literature to back up my approach. That concerns me. There is a lot of literature however on external sorting and they often site our partitioning problem as a use case. Perhaps we should do an external sort?

I thought I’d quickly give some reasons why I think the current approach is theoretically better than an out-of-core sort; hopefully someone smarter can come by and tell me why I’m wrong.

We don’t need a full sort, we need something far weaker. External sort requires at least two passes over the data while the method above requires one full pass through the data as well as one additional pass through the index column to determine good block divisions. These divisions should be of approximately equal size. The approximate size can be pretty rough. I don’t think we would notice a variation of a factor of five in block sizes. Task scheduling lets us be pretty sloppy with load imbalance as long as we have many tasks.

I haven’t implemented a good external sort though so I’m only able to argue theory here. I’m likely missing important implementation details.

• PFrame code lives in a dask branch at the moment. It depends on a couple of BColz PRs (#163, #164)

## March 24, 2015

### Martin Fitzpatrick

#### Live Demo - Wooey: Web UIs for Python scripts

A new live demo of Wooey is now up and running with a few simple example scripts. Features:

• Web UIs for Python scripts generated automatically from argparse
• An automated background worker to run scripts and collect outputs
• Run Queue to schedule (and prioritise) jobs

Try it out, fork it or leave feedback below.

### Juan Nunez-Iglesias

#### Photo by Ian Rees (from the SciPy 2012 conference website)

SciPy is my favourite conference. My goal with this post is to convince someone to go who hasn’t had that chance yet.

## Why SciPy?

Most scientists go to conferences in their own field: neuroscientists go to the monstrous Society for Neuroscience (SfN); Bioinformaticians go to RECOMB, ISMB, or PSB; and so on.

People go to these to keep up with the latest advances in their field, and often, to do a bit of networking.

SciPy is a different kind of conference. It changes the way you do science. You learn about the latest free and open source software to help you with your work. You learn to write functions and interfaces instead of scripts, and to write tests so you don’t break your code. You also learn to contribute these to bigger projects, maximising the reach and impact of your work (see “sprints”, below).

And you learn these things by doing them, with the people who are the best at this, rather than by reading books and blog posts. (Which maybe I shouldn’t knock, since I’m writing a book about all this and you are reading my blog!)

Attendees to SciPy have scientific software in common, but come from diverse fields, including physics, pure maths, data visualisation, geosciences, climatology, and yes, biology and bioinformatics. Mingling with such a diverse group is a fantastic way to get your creative juices flowing!

The conference lasts a full week and is broken up into three parts: tutorials, main conference, and sprints.

### the tutorials

With a few exceptions, you won’t learn about your own field. But you will learn an enormous amount about tools that will help you be a better scientist. If you are a newbie to Python, you can go to the beginner tutorials track and learn about the fantastic scientific libraries available in Python. If you already use NumPy, SciPy, pandas, and the IPython notebook, you can go to the intermediate or advanced tracks, and learn new things about those. Even as an advanced SciPy user I still get tons of value from the tutorials. (Last year Min RK gave a wild demo of IPython parallel’s capabilities by crawling wikipedia remotely while building up a graph visualisation on his live notebook.) (Fast-forward to the 1h mark to see just the payoff.) Here’s last year’s tutorial schedule for an idea of what to expect.

### the main conference track

You will also hear about the latest advances in the scientific libraries you know and use, and about libraries you didn’t know about but will find useful (such as scikit-bio, yt or epipy). The main conference track features software advances written by presenters from many different fields. Hearing about these from the authors of the software lets you ask much different questions compared to hearing someone say, for example, “we used the Matlab image processing toolbox”. If you ever had a feature request for your favourite library, or you wondered why they do something in a particular way, there’s no better opportunity to get some closure.

The crosstalk between different fields is phenomenal. Hearing how a diverse set of people deal with their software problems really opens your mind to completely different approaches to what you had previously considered.

### the sprints

Finally, there’s two days of coding sprints. Even if you are a complete beginner in software development, do yourself a favour and participate in one of these.

Two and a half years after my first SciPy in 2012, I’m writing a scientific Python book for O’Reilly, and I can 100% trace it to participating in the scikit-image sprint that year. With their guidance, I wrote my first ever GitHub pull request and my first ever unit test. Both were tiny and cute, and I would consider them trivial now, but that seed grew into massive improvements in my code-writing practice and many more contributions to open source projects.

And this is huge: now, instead of giving up when a software package doesn’t do what I need it to do, I just look at the source code and figure out how I can add what I want. Someone else probably wants that functionality, and by putting it into a major software library instead of in my own code, I get it into the hands of many more users. It’s a bit counterintuitive but there is nothing more gratifying than having some random person you’ve never met complain when you break something! This never happens when all your code is in your one little specialised repository containing functionality for your one paper.

## How SciPy

The SciPy calls for tutorials, talks, posters, and its plotting contest are all out. There’s specialised tracks and most of you reading this are probably interested in the computational biology and medicine track. It’s taken me a while to write this post, so there’s just one week left to submit something: the deadline is April 1st Update: the deadline for talks and posters has been extended to April 10th!

Even if you don’t get something in, I encourage you to participate. Everything I said above still applies if you’re not presenting. You might have a bit more trouble convincing your funders to pay for your travels, but if that’s the case I encourage you to apply for financial assistance from the conference.

I’ve written about SciPy’s diversity problem before, so I’m happy to report that this year there’s specific scholarships for women and minorities. (This may have been true last year, I forget.) And, awesomely, Matt Davis has offered to help first-time submitters with writing their proposals.

Give SciPy a try: submit here and/or register here. And feel free to email me or comment below if you have questions!

Update: A colleague pointed out that I should also mention the awesomeness of the conference venue, so here goes: Austin in July is awesome. If you love the heat like I do, well, it doesn’t get any better. If you don’t, don’t worry: the AT&T Conference Center AC is on friggin overdrive the whole time. Plus, there’s some nearby cold springs to swim in. The center itself is an excellent hotel and the conference organises massive discounts for attendees. There’s a couple of great restaurants on-site; and the Mexican and Texas BBQ in the area are incredible — follow some Enthought and Continuum folks around to experience amazing food. Finally, Austin is a great city to bike in: last time I rented a road bike for the whole week from Mellow Johnny’s, and enjoyed quite a few lunchtime and evening rides.