September 02, 2014

Jake Vanderplas

On Frequentism and Fried Chicken

My recent series of posts on Frequentism and Bayesianism have drawn a lot of comments, but recently Frederick J. Ross, a UW colleague whom I have not yet had the pleasure of meeting, penned a particularly strong-worded critique: Bayesian vs frequentist: squabbling among the ignorant. Here I want to briefly explore and respond to the points he makes in the post.

Mr. Ross doesn't mince words. He starts as follows:

Every so often some comparison of Bayesian and frequentist statistics comes to my attention. Today it was on a blog called Pythonic Perambulations. It's the work of amateurs.

He goes on to lodge specific complaints about subtleties I glossed-over in the four posts, all of which seem to miss one salient detail: the posts were an explicit response to my observation that "many scientific researchers never have opportunity to learn the distinctions between Frequentist and Bayesian methods and the different practical approaches that result..." That is, I aimed the discussion not toward someone with a deep background in statistics, but at someone who can't even name the fundamental differences between frequentism and Bayesianism.

Did I gloss over advanced subtleties in this introductory primer? Certainly. As interesting as it may have been for Mr. Ross and other well-read folks had I delved into, say, the deeper questions of assumptions implicit in frequentist constraints vs. Bayesian priors, it would have distracted from the purpose of the posts, and would have lost the very readers for whom the posts were written.

Rethinking the Debate

Thus, we see that his first set of complaints can be chalked-up to a simple misunderstanding of the intended audience: that's an honest mistake, and I won't make more of it. But he goes beyond this, and proposes his own final answer to the centuries-old debate between frequentists and Bayesians. As he writes: "Which one is right? The answer, as usual when faced with a dichotomy, is neither."

This should pique your interest: he's claiming that not only am I, a humble blogger, an ignorant amateur (which may be true), but that luminaries of the science and statistics world — people like Neyman, Pearson, Fisher, Jaynes, Jeffries, Savage, and many others who sometimes ardently addressed this question — are simply ignorant squabblers within the field which they all but created. I doubt I'm alone in finding this sweeping indictment a bit suspect.

But let's skip these charges and dig further: what third route does Mr. Ross propose to trample all this precedent? The answer is decision theory:

Probability, as a mathematical theory, has no need of an interpretation... the real battleground is statistics, and the real purpose is to choose an action based on data. The formulation that everyone uses for this, from machine learning to the foundations of Bayesian statistics, is decision theory.

His argument is that frequentist and Bayesian methods, in a reductionist sense, are both simply means of reaching a decision based on data, and can therefore be viewed as related branches of decision theory. He goes on to define some notation which explains how any statistical procedure can be formulated as a question of progressing from data, via some loss function, to a particular decision. Frequentist and Bayesian approaches are simply manifestations of this unified theory which use particular loss functions, and thus squabbling about them is the pastime of the ignorant.

I'd like to offer an analogy in response to this idea.

Baked or Fried?

One day in the kitchen, two chefs begin arguing about who makes the best chicken. Chef Hugh prefers his chicken fried: the quick action of the hot oil results light, crispy spiced outer breading complementing the tender meat it encloses. Chef Wolfgang, on the other hand, swears by baked chicken, asserting that its gentler process leaves more moisture, and allows more time for complex flavors to seep into the meat. They decide to have a cook-off: Fried vs. Baked, to decide once and for all which method is the best.

They're just beginning their preparations in the test kitchen when Rick, the local Food Theorist, storms through the door. He follows these chefs on Twitter, and has heard about this great Fried vs. Baked debate. Given his clear expertise on the matter, he wants to offer his final say on the question. As Food Theorists are wont to do, he starts lecturing them:

"Truly, I'm not really sure what this whole contest is about. Don't you know that baking and frying are both doing essentially the same thing? Proteins denature as they heat. Water evaporates, sugar caramelizes, and the Maillard Reaction turns carbohydrates and amino acids into a crispy crust. If you could just get into your ignorant heads that any cooking method is simply a manifestation of these simple principles, you'd realize that neither method is better, and we wouldn't need to waste our time on this silly competition."

At this point, Chef Hugh and Chef Wolfgang pause, catch each other's gaze for a moment, and burst into a hearty laughter. They turn around continue the task of actually turning the raw chicken meat into an edible meal, enjoying the craft and camaraderie of cooking even in the midst of their friendly disagreement. Rick slips out the door and heads home to gaze at his navel while eating a dinner of microwaved pizza bagels.

Baked or Fried? Bayesian or Frequentist?

So what's my point here? The fact is that everything our Food Theorist has said is technically correct: from a completely reductionist perspective, cooking meat is nothing more than controlled denaturing of proteins, evaporation of water, and other well-understood chemical processes. But to actually prepare a meal, you can't stop with the theory. You have to figure out how to apply that knowledge in practice, and that requires decisions about whether to use an oven, a deep fryer, or a charcoal grill.

Similarly, everything Mr. Ross said in his blog post is more or less true, but you can't stop there. Applying his decision theory in practice requires making some choices: despite his protests, you actually do have to decide how to map your theory of probability onto reality reflected in data, and that requires some actual philosophical choices about how you treat probability, which lead to fundamentally different questions being answered.

Frequentism vs. Bayesianism, Again

This brings us back to the original question Mr. Ross (not I) posed: Frequentism vs. Bayesianism: which is correct? As I've maintained throughout my posts (and as Mr. Ross seems to have overlooked when reading them): neither is correct. Or both. It really depends on the situation. As I have attempted to make clear, if you're asking questions about long-term limiting frequencies of repeated processes, classical frequentist approaches are probably your best bet. If you're hoping to update your knowledge about the world based on a finite set of data, Bayesian approaches are more appropriate.

While I have argued that Frequentist approaches answer the wrong question in most scientific settings, I have never claimed that frequentism is fundamentally flawed, or that it is "wrong": on the contrary, in that particular post I went to great length to use Monte Carlo simulations to show that the frequentist approach does in fact give the correct answer to the question it asks. Frequentist and Bayesian approaches answer different statistical questions, and that is a fact you must realize in order to use them.

So where does this leave us? Well, Mr. Ross seems to have based his critique largely on misunderstandings: my intended audience was novices rather than experts, and despite his claims otherwise I never attempted to promote frequentism over Bayesianism or vice versa. His protests notwithstanding, I maintain that in practice, frequentism and Bayesianism remain as different as fried and baked chicken: you can huff and puff about unified theoretical frameworks until your face is blue, but at the end of the day you need to choose between the oven and the fryer.

by Jake Vanderplas at September 02, 2014 11:00 PM

Matthieu Brucher

Book review: A Brief Introduction to Continuous Evolutionary Optimization

When I developed my first tool based on genetic algorithms, it was to replace local optimization algorithm (“Simulated Annealing”, as advertised by Numerical Recipes) as a global optimization algorithm. Now a couple of years later, I found a small book on GE that seemed on topic with what I had to implement (I relied at the time on another book that I never reviewed ; perhaps another time). Is it a good brief introduction as the book says?

Content and opinions

First of all, the book is really small, only 100 pages. At more than 50$, that’s quite a high page rate, especially when there are only 70 pages of actual content…

There are three parts in the book, the foundation, advanced optimization and learning. Let’s start with the first one.

The first part also contains the introduction. It mainly describes what evolutionary optimization means, and how to compare it to the other techniques, and a description of the other chapters. So the first interesting/technical chapter is the second one. It explains what a genetic algorithm actually is, and the way the algorithm is described is better that what I found in bigger books. There are also additional variants of algorithms used by GE (different types of recombinations for instance), other usual may be missing (like selection by tournament). As the book is about evolutionary optimization, and not GE, there is also the introduction of other algorithms, namely Particle Swarm Optimization, and Covariance Matrix Adaptation Evolution Strategies. Those were properly defined, better than most online resources as well. The third chapter is about defining the parameters of the algorithms, and this is an interesting chapter as the main topic is to allow this evolutionary algorithm to evolve the parameters during the optimization. This was really interesting to see, and I’m looking forward to implementing this in my tool.

More evolution in part 2, with using evolution to handle constraints (really funny, yet something else to try on my case), and then a variant of the Powell method mixed with evolutionary optimization. It is less useful perhaps, as it seems to be a local optimization algorithm, related to simulated annealing in some way, but with the population approach. Still interesting, but you won’t apply this as often as what you learned in the first part or the chapter before. The last chapter in this part tackles multiple objective optimization. This is really a good approach (a colleague of mine made his PhD on this kind of problems, now I understand his thesis better!) to multiple objective optimization. Usually, you have to combine them in some way, but as we use populations, we get actual borders, and possibly an infinite number of possible solutions, without having to choose which function to favor.

The last chapter is called “Learning”, but I would have called it “Applications”. The applications are model learning, but that’s about it. It is nice though that the book tackles more or less textbook functions in the first parts, and then tackles a problem that makes some actual sense (no, I don’t think the Rosenbrock cost function optimization can compare to solving an actual problem). Someone thought about what was missing in the other books!


In the end, the main issue is the cost of the book, although compared to other scientific book, it is not that expensive. Otherwise, I found it really clear, efficient, covering several subjects
properly. I think I found my new reference for people wanting to start evolutionary computation!

by Matthieu Brucher at September 02, 2014 07:30 AM

September 01, 2014

Titus Brown

Running Software Carpentry instructor training at UC Davis in January, 2015

I am pleased to announce that Dr. Greg Wilson will be giving a two-day Software Carpentry Instructor Training workshop at UC Davis, January 6-7, 2015. This will be an in-person version of the instructor training that Greg runs every quarter; see my blog post about the first such instructor training, here to get some of idea of what it's like. Note that the training is necessary for those who want to become accredited Software Carpentry or Data Carpentry instructors.

If you're interested in further information, go subscribe to this github issue; we should have room for a dozen extra people to attend.

NOTE: This workshop is hosted by my new lab, but I am seeking additional support; drop me a line if you're interested in co-sponsoring the workshop.

On January 8th, I plan to run a third day of activities. The three things I have tentatively planned are,

  1. run people through the GitHub Flow process of contributing to a github repository, as in our hackathon (also see our writeup here);
  2. walk people through our process for generating maintainable command-line tutorials, recipes, and protocols (more on that at ABIC);
  3. have an open discussion about what and how to do training in data intensive biology at Davis.

This third day of activities is going to be more restricted in attendance than the first two days, but if you're into teaching computation and biology and want to see how we're doing things, drop me a note ( and I'll see if we can fit you in.


by C. Titus Brown at September 01, 2014 10:00 PM

August 30, 2014

Titus Brown

Some naive ideas about training efforts at UC Davis

As I mentioned, I am hoping to significantly scale up my training efforts at UC Davis; it's one of the reasons they hired me, it's a big need in biology, and I'm enthusiastic about the whole thing! A key point is that, at least at the beginning, it may replace some or all of my for-credit teaching. (Note that the first four years of Analyzing Next-Generation Sequencing Data counted as outreach, not teaching, at MSU.)

I don't expect to fully spool up before fall 2015, but I wanted to start outlining my thoughts.

The ideas below came in large part from conversations with Tracy Teal, a Software Carpentry instructor who is one of the people driving Data Carpentry, and who also was one of the EDAMAME course instructors.

How much training, how often, and to whom?

I think my initial training efforts will center on Software Carpentry-style workshops, on a variety of (largely bio-specific) topics. These would be two-day in-person workshops, 9-5am, each focused on a specific topic.

I think I can sustainably lead one a month, with perhaps a few months where I organize two in the same week (M/Tu and Th/Fri, perhaps).

These would be on top of at least one NGS course a year, too. I also expect I will participate in various Genome Center training workshops.

The classes would be targeted at grad students, postdocs, and faculty -- same as the current NGS course. I would give attendees from VetMed some priority, followed by attendees with UC Davis affiliations, and then open to anyone. I imagine doing this in a tiered way, so that some outsiders could always come; variety and a mixed audience are good things!

On what topics?

I have a laundry list of ideas, but I'm not sure what to start with or how to make decisions about what to teach when. ...suggestions welcome. (I also can't teach all of these myself, but I want to get the list of ideas down!)

I'd like to preface this list with a few comments: I've been teaching and training in these topics for five years (at least) now, so I'm not naive about how hard (or easy) it is to teach this to computationally inexperienced biologists. It's clear that there's a progression of skills that need to be taught for most of these, as well as a need for careful lesson planning, tutorial design, and pre/post assessment. These workshops would also be but one arrow in the quiver -- I have many other efforts that contribute to my lab's teaching and training.

With that having been said, here's a list of general things I'd like to teach:

  • Shell and UNIX (long running commands, remote commands, file and path management)
  • Scripting and automation (writing scripts, make, etc.)
  • Bioinformatics and algorithms
  • "Big data" statistics
  • Data integration for sequencing data
  • Software engineering (testing, version control, code review, etc.) on the open source model
  • Practical bioinformatics (See topics below)
  • Modeling and simulations
  • Workflows and replication tracking
  • Software Carpentry
  • Data Carpentry

I have many specific topics that I think people know they want to learn:

  • Mapping and variant calling
  • Genome assembly and evaluation (microbial & large genomes both)
  • Transcriptome assembly and evaluation (reference free & reference based)
  • Genome annotation
  • Differential expression analysis
  • ChIP-seq
  • Metagenome assembly
  • Microbial ecology and 16s approaches
  • Functional inference (pathway annotations)
  • Phylogenomics
  • Marker development
  • Genotyping by sequencing
  • Population genomics

And finally, here are two shorter workshop ideas that I find particularly neat: experimental design (from sample prep through validation), and sequencing case studies (success and failure stories). In the former, I would get together a panel of two or three people to talk through the issues involved in doing a particular experiment, with the goal of helping them write a convincing grant For the latter, I would find both success and failure stories and then talk about what other approaches could have rescued the failures, as well as what made the successful stories successful.

To what end? Community building and collaborations.

Once I started focusing in on NGS data at MSU as an assistant professor, I quickly realized that I could spend all my time in collaborations. I learned to say "no" fairly fast :). But all those people still need to do data analysis. What to do? I had no clear answer at MSU, but this was one reason I focused on training.

At Davis, I hope to limit my formal collaborations to research topics, and concentrate on training everybody to deal with their own data; in addition to being the only scalable approach, this is career-building for them. This means not only investing in training, but trying to build a community around the training topics. So I'd like to do regular (weekly? fortnightly?) "help desk" afternoons for the campus, where people can come talk about their issue du jour. Crucially, I would limit this to people that have gone through some amount of training - hopefully both incentivizing people to do the training, and making sure that some minimal level of effort has been applied. The goal would be to move towards a self-sustaining community of people working in bioinformatic data analysis across multiple levels.

Cost and materials.

Since UCD VetMed is generously supporting my salary, I am naively expecting to charge nothing more than a nominal fee -- something that would discourage people from frivolously signing up or canceling. Perhaps lunch money? (This might have to be modified for people from outside of VetMed, or off-campus attendees.)

All materials would continue to be CC0 and openly available, of course. 'cause life's too short to limit the utility of materials.

Other thoughts

I'd love to put together a slush fund so that I can invite out speakers to run workshops on topics that I don't know that well (most of 'em).

How about a workshop focused on teaching people how to teach with the materials we put together? (I would expect most of these workshops to be cloud-based.)


p.s. In addition to Tracy, thanks to Keith Bradnam, Aaron Darling, Matt MacManes and Ethan White, for their comments and critiques on a draft.

by C. Titus Brown at August 30, 2014 10:00 PM

The Critical Assessment of Metagenome Interpretation and why I'm not a fan

If you're into metagenomics, you may have heard of CAMI, the Critical Assessment of Metagenome Interpretation. I've spoken to several people about it in varying amounts of detail, and it seems like the CAMI group is working to generate some new shotgun metagenome data sets and will then encourage tool developers to bang on them. (You can also read a short Methagora blog on CAMI.)

I've been asked by about a dozen people now what I think of it, so I'm blogging about it now rather than answering people individually :).

tl; dr? I'm not a fan.

First, what's this subscription nonsense? (We'll keep you in the loop if you register here.) There are a plethora of ways to keep people in the loop without asking them to subscribe to anything (blogs, Twitter, Web sites, e-mail lists...). Is there a reason to keep the details of the contest secret or hidden behind a subscriber wall? (Answer: no.)

Second, I am told that the contest will be run once, with software that is blind to the content of the metagenome. I understand the idea of blind software evaluation, but this is not a long-term sustainable approach; we need to generate a nice set of orthogonal data sets and then LART people who fine-tune their software against one or another.

Third, it looks like the CAMI folk will make the same mistake as the Assemblathon 2 folk, and not require that the analyses be completely replicable. So in the end there will be a massive expenditure of effort that results in a paper, which will then be a nice static record of how things were back in 2014. Given the pace of tool and technology change, this will have a very short shelf-life (although no doubt tool developers will cite it for years to come, to prove that IDBA was once worse than their own assembler is now). Why not re-run it every 6 months with the latest versions of softwares X, Y, and Z? We have plenty of ways to automate analyses, and there is simply no excuse for not doing so at this point. (ht to Aaron Darling for alerting me to

Fourth, there are several mock metagenome data sets that are already underanalyzed. For example, we're currently working with the Shakya et al. (2013) data set, but I don't think anyone else is (and it's pretty clear most people don't realize what a stinging indictment of 16s this paper is - or, at least, don't care. Discussion for another time.).

So why are we not poking at existing data first? I don't know, but I suspect it speaks to an underlying bias against re-analyzing published data sets, which we really need to get over in this field.

As I said during the Q&A for my BOSC 2014 keynote, I'm not a big fan of how we do benchmarks in bioinformatics; I think this is a fine exemplar.

I probably won't participate in the CAMI benchmark, not just for the above reasons but because I simply don't have the people to do so at the moment., as with Assemblathon 2, they're going to need reviewers, right? ;)

Note that Aaron Darling raised many similar points at ISME, and they were apparently very well received. Maybe CAMI will adapt their approach in response, which would be great!


p.s. Thanks to Alice McHardy and Aaron Darling for their detailed discussions of this with me!

by C. Titus Brown at August 30, 2014 10:00 PM

August 29, 2014

Titus Brown

Some software development plans for Davis

I've been thinking a lot about what I want to do at UC Davis, and I would like to announce the following plans:

  1. I hope to write new and better transcriptome and metagenome assemblers. The current assemblers are hopelessly bad and I think we can probably eke out a 1% improvement, at least, by starting from scratch.
  2. I've been very disappointed with the workflow systems that are out there, so for people that want to run our new assembler, we'll be packaging that with a new workflow management system.
  3. We'll also be developing a new Web site for running analyses of biological data. Galaxy may have market penetration, but I really dislike their default CSS, and I think my lab can probably do a better job if we start from scratch.
  4. Needless to say, I'll need my own physical cloud hardware to run it all. So I'm planning a considerable expansion of our lab's cluster. I think we can probably make a big impact by writing our own virtualization management software, too; the existing systems are written by amateurs like Amazon and OpenStack, after all!

The bad news is that after some serious consideration, I have decided not to invest in a new mapping program; on top of the above it's a bit too much. I think I trust Heng Li's bwa, for now; we may revisit in a few months.

I will be seeking large scale NSF, NIH, USDA, and DOE funding for the above. Let me know if you want to sign on!


by C. Titus Brown at August 29, 2014 10:00 PM

Stefan Pfenninger

IPython notebook extensions to ease day-to-day work

Since I use the IPython notebook for a lot of explorative data analysis, I wanted some extra functionality to make day-to-day notebook work just a little bit more pleasant. I use several of Min RK’s excellent notebook extensions, particularly the floating table of contents and the Gist button.

This post describes some additional extensions I created to do things I need in day-to-day work, as well as how to leverage the customization possibilities of the CodeMirror editor to add vertical rulers to code cells. The notebook extensions described here are available on GitHub.


Sometimes larger analysis tasks take more than a couple of seconds to compute, and I want to alt-tab to do something else while I wait. The IPython-notify extension ensures that you don’t forget what you were originally doing, by displaying a notification once the kernel becomes idle again after a set minimum busy time:

Notification screenshot

When the plugin is activated, a button to request notification permissions is shown in the toolbar. After notification permissions have been granted, this button is replaced by a dropdown menu with five choices: Disabled, 0, 5, 10 and 30.

To activate notifications, select a minimum kernel busy time required to trigger a notification (e.g. if selecting 5, a notification will only be shown if the kernel was busy for more than 5 seconds). The selection is saved in the notebook’s metadata and restored when the notebook is re-opened.

Theme switching

As described in a previous post, depending on ambient lighting and time of day I prefer seeing my code in light text on a dark background, so I created a simple extension to switch between two themes.

The default CSS included only changes code cells (this has the added effect that it’s easier to tell code cells apart from markdown or raw cells):

Dark IPython Notebook

The CSS files can easily be exchanged for a different ones, for example a Base16 theme.

Customizing the CodeMirror editor

IPython notebooks use the CodeMirror editor, which includes a wide range of addons. IPython also allows the user to add additional javascript and CSS via the custom.js and custom.css files in the static/custom directory inside the IPython profile (likely ~/.ipython/profile_default/static/custom on Linux or OS X).

I want to display some rulers to indicate line lengths I shouldn’t exceed to conform with PEP8. Thanks to CodeMirror’s flexible configuration system, this can be achieved by adding the following code to custom.js, making use of the rulers addon:


var clsname = "ipynb_ruler";
var rulers = [{column: 79, className: clsname},
              {column: 99, className: clsname}];

IPython.Cell.options_default.cm_config.rulers = rulers;

By adding the class ipynb_ruler to custom.css, the look of the rulers can be tweaked, for example to use a lighter color:

.ipynb_ruler {
    border-color: #DFDFDF;

Unfortunately, there is a bug in older CodeMirror versions that in some browsers causes extra scrollbars when enabling rulers. For the time being, this can be fixed by placing a fixed rulers.js into ~/.ipython/profile_default/static/components/codemirror/addon/display (adjust as needed if your IPython profile directory is different). I backported a fix into the rulers.js shipped with IPython 2.2, available here.

by Stefan Pfenninger at August 29, 2014 02:31 PM

August 27, 2014

Titus Brown

Estimating (meta)genome size from shotgun data

This is a recipe that provides a time- and memory- efficient way to loosely estimate the likely size of your assembled genome or metagenome from the raw reads alone. It does so by using digital normalization to assess the size of the coverage-saturated de Bruijn assembly graph given the reads provided by you. It does take into account coverage, so you need to specify a desired assembly coverage - we recommend starting with a coverage of 20.

Uses for this recipe include estimating the amount of memory required to achieve an assembly and providing a lower bound for metagenome assembly size and single-copy genome diversity.

This recipe will provide inaccurate estimates on transcriptomes (where splice variants will end up confusing the issue - this looks at single-copy sequence only) or for metagenomes with high levels of strain variation (where the assembler may collapse strain variants that this estimate will split).

Note: at the moment, the khmer script needs to be updated from the master branch of khmer to run this code properly. Once we've cut a new release, we'll remove this note and simply specify the khmer release required.

Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing to a coverage of 150 or higher. If your reads are in reads.fa, you can get an estimate of the single-copy genome size (here known to be 5500 bp) by running

~/dev/khmer/scripts/ -x 1e8 -k 20 -C 20 -R report.txt reads.fa
./ -C 20 -k 20 reads.fa.keep report.txt

This yields the output:

Estimated (meta)genome size is: 8727 bp

This is off by about 50% for reasons that we don't completely understand. Note that you can get more accurate estimates for this data set by increasing C and decreasing k, but 20/20 should work about this well for most data sets. (For an E. coli data set, it returns 6.5 Mbp, which is only about 25% off.)

by C. Titus Brown at August 27, 2014 10:00 PM

Estimate whether your sequencing has saturated your sample to a given coverage

This recipe provides a time-efficient way to determine whether you've saturated your sequencing depth, i.e. how much new information is likely to arrive with your next set of sequencing reads. It does so by using digital normalization to generate a "collector's curve" of information collection.

Uses for this recipe include evaluating whether or not you should do more sequencing.

This approach is more accurate for low coverage than normalize-by-median's reporting, because it collects the redundant reads rather than discarding them.

Note: at the moment, the khmer script needs to be updated from the master branch of khmer to run this code properly. Once we've cut a new release, we'll remove this note and simply specify the khmer release required.

Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing, and you want to know whether or not you've saturated to a coverage of 5 with your sequencing. You can use a variant of digital normalization, saturate-by-median, to run a collector's curve:

~/dev/khmer/sandbox/ -x 1e8 -k 20 -C 5 -R report.txt --report-frequency 10 reads.fa

Then, plot the resulting saturation curve:

./ report.txt saturation.png --xmin 0 --ymin 0 --xmax 1500

The x axis here is the number of reads examined (column 1 in report.txt), while the y axis (column 2) is the number of reads that are below a coverage of 5 in the data set at that point. You can see here that by the time you had sampled 1000 reads, you'd stopped seeing new coverage=5 reads, which suggests that further sequencing is unnecessary.

If you zoom out on the graph, you'll see that the curve keeps on climbing, albeit much more slowly. This is due to the influence of error rate on prediction of "novel" reads, and is something we have to fix.

by C. Titus Brown at August 27, 2014 10:00 PM

William Stein

What is SageMathCloud: let's clear some things up

[PDF version of this blog post]
"You will have to close source and commercialize Sage. It's inevitable." -- Michael Monagan, cofounder of Maple, told me this in 2006.
SageMathCloud (SMC) is a website that I first launched in April 2013, through which you can use Sage and all other open source math software online, edit Latex documents, IPython notebooks, Sage worksheets, track your todo items, and many other types of documents. You can write, compile, and run code in most programming languages, and use a color command line terminal. There is realtime collaboration on everything through shared projects, terminals, etc. Each project comes with a default quota of 5GB disk space and 8GB of RAM.

SMC is fun to use, pretty to look at, frequently backs up your work in many ways, is fault tolerant, encourages collaboration, and provides a web-based way to use standard command-line tools.

The Relationship with the SageMath Software

The goal of the SageMath software project, which I founded in 2005, is to create a viable free open source alternative to Magma, Mathematica, Maple, and Matlab. SMC is not mathematics software -- instead, SMC is best viewed by analogy as a browser-based version of a Linux desktop environment like KDE or Gnome. The vast majority of the code we write for SMC involves text editor issues (problems similar to those confronted by Emacs or Vim), personal information management, support for editing LaTeX documents, terminals, file management, etc. There is almost no mathematics involved at all.

That said, the main software I use is Sage, so of course support for Sage is a primary focus. SMC is a software environment that is being optimized for its users, who are mostly college students and teachers who use Sage (or Python) in conjunction with their courses. A big motivation for the existence of SMC is to make Sage much more accessible, since growth of Sage has stagnated since 2011, with the number one show-stopper obstruction being the difficulty of students installing Sage.

Sage is Failing

Measured by the mission statement, Sage has overall failed. The core goal is to provide similar functionality to Magma (and the other Ma's) across the board, and the Sage development model and community has failed to do this across the board, since after 9 years, based on our current progress, we will never get there. There are numerous core areas of research mathematics that I'm personally familiar with (in arithmetic geometry), where Sage has barely moved in years and Sage does only a few percent of what Magma does. Unless there is a viable plan for the areas to all be systematically addressed in a reasonable timeframe, not just with arithmetic geometry in Magma, but with everything in Mathematica, Maple., etc, we are definitely failing at the main goal I have for the Sage math software project.

I have absolutely no doubt that money combined with good planning and management would make it possible to achieve our mission statement. I've seen this hundreds of times over at a small scale at Sage Days workshops during the last decade. And let's not forget that with very substantial funding, Linux now provides a viable free open source alternative to Microsoft Windows. Just providing Sage developers with travel expenses (and 0 salary) is enough to get a huge amount done, when possible. But all my attempts with foundations and other clients to get any significant funding, at even the level of 1% of the funding that Mathematica gets each year, has failed. For the life of the Sage project, we've never got more than maybe 0.1% of what Mathematica gets in revenue. It's just a fact that the mathematics community provides Mathematica $50+ million a year, enough to fund over 600 fulltime positions, and they won't provide enough to fund one single Sage developer fulltime.

But the Sage mission statement remains, and even if everybody else in the world gives up on it, I HAVE NOT. SMC is my last ditch strategy to provide resources and visibility so we can succeed at this goal and give the world a viable free open source alternative to the Ma's. I wish I were writing interesting mathematical software, but I'm not, because I'm sucking it up and playing the long game.

The Users of SMC

During the last academic year (e.g., April 2014) there were about 20K "monthly active users" (as defined by Google Analytics), 6K weekly active users, and usually around 300 simultaneous connected users. The summer months have been slower, due to less teaching.

Numerically most users are undergraduate students in courses, who are asked to use SMC in conjunction with a course. There's also quite a bit of usage of SMC by people doing research in mathematics, statistics, economics, etc. -- pretty much all computational sciences. Very roughly, people create Sage worksheets, IPython notebooks, and Latex documents in somewhat equal proportions.

What SMC runs on

Technically, SMC is a multi-datacenter web application without specific dependencies on particular cloud provider functionality. In particular, we use the Cassandra database, and custom backend services written in Node.js (about 15,000 lines of backend code). We also use Amazon's Route 53 service for geographically aware DNS. There are two racks containing dedicated computers on opposites sides of campus at University of Washington with 19 total machines, each with about 1TB SSD, 4TB+ HDD, and 96GB RAM. We also have dozens of VM's running at 2 Google data centers to the east.

A substantial fraction of the work in implementing SMC has been in designing and implementing (and reimplementing many times, in response to real usage) a robust replicated backend infrastructure for projects, with regular snapshots and automatic failover across data centers. As I write this, users have created 66677 projects; each project is a self-contained Linux account whose files are replicated across several data centers.

The Source Code of SMC

The underlying source of SMC, both the backend server and frontend client, is mostly written in CoffeeScript. The frontend (which is nearly 20,000 lines of code) is implemented using the "progressive refinement" approach to HTML5/CSS/Javascript web development. We do not use any Javascript single page app frameworks, though we make heavy use of Bootstrap3 and jQuery. All of the library dependencies of SMC, e.g., CodeMirror, Bootstrap, jQuery, etc. for SMC are licensed under very permissive BSD/MIT, etc. libraries. In particular, absolutely nothing in the Javascript software stack is GPL or AGPL licensed. The plan is that any SMC source code that will be open sourced will be released under the BSD license. Some of the SMC source code is not publicly available, and is owned by University of Washington. But other code, e.g., the realtime sync code, is already available.
Some of the functionality of SMC, for example Sage worksheets, communicate with a separate process via a TCP connection. That separate process is in some cases a GPL'd program such as Sage, R, or Octave, so the viral nature of the GPL does not apply to SMC. Also, of course the virtual machines are running the Linux operating system, which is mostly GPL licensed. (There is absolutely no AGPL-licensed code anywhere in the picture.)

Note that since none of the SMC server and client code links (even at an interpreter level) with any GPL'd software, that code can be legally distributed under any license (e.g., from BSD to commercial).
Also we plan to create a fully open source version of the Sage worksheet server part of SMC for inclusion with Sage. This is not our top priority, since there are several absolutely critical tasks that still must be finished first on SMC, e.g., basic course management.

The SMC Business Model

The University of Washington Center for Commercialization (C4C) has been very involved and supportive since the start of the projects. There are no financial investors or separate company; instead, funding comes from UW, some unspent grant funds that were about to expire, and a substantial Google "Academic Education Grant" ($60K). Our first customer is the "US Army Engineer Research and Development Center", which just started a support/license agreement to run their own SMC internally. We don't currently offer a SaaS product for sale yet -- the options for what can be sold by UW are constrained, since UW is a not-for-profit state university. Currently users receive enhancements to their projects (e.g., increased RAM or disk space) in exchange for explaining to me the interesting research or teaching they are doing with SMC.

The longterm plan is to start a separate for-profit company if we build a sufficient customer base. If this company is successful, it would also support fulltime development of Sage (e.g., via teaching buyouts for faculty, support of students, etc.), similar to how Magma (and Mathematica, etc.) development is funded.

In conclusion, in response to Michael Monagan, you are wrong. And you are right.

by William Stein ( at August 27, 2014 07:55 AM

You don't really think that Sage has failed, do you?

I just received an email from a postdoc in Europe, and very longtime contributor to the Sage project.  He's asking for a letter of recommendation, since he has to leave the world of mathematical software development (after a decade of training and experience), so that he can take a job at hedge fund.  He ends his request with the question:

> P.S. You don't _really_ think that Sage has failed, do you?

After almost exactly 10 years of working on the Sage project, I absolutely do think it has failed to accomplish the stated goal of the mission statement: "Create a viable free open source alternative to Magma, Maple, Mathematica and Matlab.".     When it was only a few years into the project, it was really hard to evaluate progress against such a lofty mission statement.  However, after 10 years, it's clear to me that not only have we not got there, we are not going to ever get there before I retire.   And that's definitely a failure.   

Here's a very rough quote I overheard at lunch today at Sage Days 61, from John Voight, who wrote much quaternion algebra code in Magma: "I'm making a list of what is missing from Sage that Magma has for working with quaternion algebras.  However, it's so incredibly daunting, that I don't want to put in everything.  I've been working on Magma's quaternion algebras for over 10 years, as have several other people.  It's truly daunting how much functionality Magma has compared to Sage."

The only possible way Sage will not fail at the stated mission is if I can get several million dollars a year in money to support developers to work fulltime on implementing interesting core mathematical algorithms.  This is something that Magma has had for over 20 years, and that Maple, Matlab, and Mathematica also have.   That I don't have such funding is probably why you are about to take a job at a hedge fund.    If I had the money, I would try to hire a few of the absolute best people (rather than a bunch of amateurs), people like you, Robert Bradshaw, etc. -- we know who is good. (And clearly I mean serious salaries, not grad student wages!)

So yes, I think the current approach to Sage has failed.    I am going to try another approach, namely SageMathCloud.  If it works, maybe the world will get a free open source alternative to Magma, Mathematica, etc.  Otherwise, maybe the world never ever will.      If you care like I do about having such a thing, and you're teaching course, or whatever, maybe try using SageMathCloud.   If enough people use SageMathCloud for college teaching, then maybe a business model will emerge, and Sage will get proper funding.   

by William Stein ( at August 27, 2014 07:52 AM

August 26, 2014

Matthieu Brucher

Announcement: ATKUniversalDelay 1.1.0

I’m happy to announce the release of a mono fixed delay line on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats. The three knobs adjust the direct signal (blend), the delayed signal (feedforward) as well as the feedback signal from the delay line injected in the input. The delay can be set from 0 ms to 1 s by steps of 0.1 ms.



The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKUniversalDelay

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matthieu Brucher at August 26, 2014 07:31 AM

August 25, 2014

Paul Ivanov

pedestrian musings

I walk in monologue 
    through Berkeley's Hills
Feet pressing into sidewalk firmly
I eat the pensive mood 
    solitude brings
And bite into the juiciness of
I write, first time in years,
    free verse impromptu
Taking few dozen steps
    between each pair of lines
I yearn, on tip-toes
    stretching high, to be expressive
A mode of being longtime
I'm walking home - from job
    I'll soon be leaving
To find myself believing once 
That which I do defines 
    me not and feeling
That which I am is
    good. enough. a lot.

by Paul Ivanov at August 25, 2014 07:00 AM

August 24, 2014

Titus Brown

The first six years of faculty life

Inspired by Sarah Bisbing's excellent post on her first year as a faculty member, here are the questions I remember asking myself during my first six years:

Year 0: What science do I want to do?

Year 1: What the hell am I doing all day and why am I always so exhausted?

Year 2: Why don't my grad students know how to do anything yet?

Year 3: What does this data mean?

Year 4: Why are grant and paper reviewers so dumb?

Year 5: Why are grant and paper authors so dumb?

Year 6: Will I get tenure, and if so, do I really want to be doing this for the rest of my life?


by C. Titus Brown at August 24, 2014 10:00 PM

August 23, 2014

Titus Brown

Downsampling shotgun reads to a given average coverage (assembly-free)

The below is a recipe for subsetting a high-coverage data set to a given average coverage. This differs from digital normalization because the relative abundances of reads should be maintained -- what changes is the average coverage across all the reads.

Uses for this recipe include subsampling reads from a super-high coverage data set for the purpose of assembly, as well as more esoteric reasons (see the bottom of the post). This approach won't work on digitally normalized reads, and is primarily intended for genomes and low-complexity metagenomes. For high-complexity metagenomes we recommend partitioning.

Note: at the moment, the khmer scripts and are in the khmer repository under branch feature/collect_reads. Once we've merged it into the master branch and cut a release, we'll remove this note and simply specify the khmer release required.

This recipe uses code from khmer-recipes and dbg-graph-null.

Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing to a coverage of 150 or higher. If your reads are in reads.fa -x 1e8 -k 20 reads.fa
~/dev/khmer/sandbox/ reads.fa reads-cov.dist
./ reads-cov.dist reads-cov.png --xmax=600 --ymax=500

and looks like this:


You see the same peaks at roughly the same places.

Now use to subset the data to a lower average coverage of 50

~/dev/khmer/sandbox/ -x 1e8 -C 50 -k 20 reads.fa -o reads.subset.fa

Here, is walking through the data set and computing a running average of the coverage of the last 1000 reads. Once it hits the specified average coverage of 50 (-C 50) it stops collecting the reads. Take a look at the read coverage spectrum for the subsetted data:

~/dev/khmer/sandbox/ reads.subset.fa reads-subset.dist
./ reads-subset.dist reads-subset.png --xmax=600 --ymax=500

and compare the resulting plot with the one above --


Here you can see that the coverage spectrum has been shifted left and down by the subsampling (which is what you'd expect).

Note that picking the coverage that you want is a bit tricky, because it will be the average across the reads. If you have a highly repetitive genome you may need to go for something higher than your desired single-copy genome coverage, because the repeats will skew your average to the right.


If the peaks look good, you can use the output counting table as an argument to slice-reads-by-coverage (see this post). If you use the original reads, this will then give you _all_ the reads that cluster by coverage with that peak. For example,

~/dev/khmer/sandbox/ reads.fa reads-repeats.fa -m 100 -M 200

will give you all the reads from the repetitive component, which will be much higher coverage in the combined data set; take a look: -x 1e8 -k 20 reads-repeats.fa
~/dev/khmer/sandbox/ reads-repeats.fa reads-repeats.dist
./ reads-repeats.dist reads-repeats.png --xmax=600 --ymax=500

Here the slice specified (-m and -M) is with respect to the read abundances in This allows you to more explore and subset large data sets than you would otherwise be able to, and also avoids some khmer-specific issues with counting k-mers that are higher abundance than 255.

by C. Titus Brown at August 23, 2014 10:00 PM

Extracting shotgun reads based on coverage in the data set (assembly-free)

In recent days, we've gotten several requests, including two or three on the khmer mailing list, for ways to extract shotgun reads based on their coverage with respect to the reference. This is fairly easy if you have an assembled genome, but what if you want to avoid doing an assembly? khmer can do this fairly easily using techniques taken from the digital normalization work, which allows you to estimate read coverage directly from the data.

The below is a recipe for computing coverage spectra and slicing reads out of a data set based on their coverage, with no assembly required. It's the first of what I hope to be many practical and useful recipes for working with shotgun data. Let me know what you think!

Uses for extracting reads by coverage include isolating repeats or pulling out mitochondrial DNA. This approach won't work on digitally normalized reads, and is primarily intended for genomes and low-complexity metagenomes. For high-complexity metagenomes we recommend partitioning.

Note: at the moment, the khmer script slice-reads-by-coverage is in the khmer repository under branch feature/collect_reads. Once we've merged it into the master branch and cut a release, we'll remove this note and simply specify the khmer release required.

This recipe uses code from khmer-recipes and dbg-graph-null.

Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing to a coverage of 150. If your reads are in reads.fa, you can generate a k-mer spectrum from your genome with k=20 -x 1e8 -k 20 reads.fa -s reads.fa reads.dist
./ reads.dist reads-dist.png --ymax=300

and it would look something like this:


For this (simulated) data set, you can see three peaks: one on the far right, which contains the high-abundance k-mers from your repeats; one in the middle, which contains the k-mers from the single-copy genome; and one all the way on the left at ~1, which contains all of the erroneous k-mers.

This is a useful diagnostic tool, but if you wanted to extract one peak or another, you'd have to compute a summary statistic of some sort on the reads. The khmer package includes just such a 'read coverage' estimator. On this data set, the read coverage spectrum can be generated like so::

~/dev/khmer/sandbox/ reads.fa reads-cov.dist
./ reads-cov.dist reads-cov.png --xmax=600 --ymax=500

and looks like this:


You see the same peaks at roughly the same places. While superficially similar to the k-mer spectrum, this is actually more useful in its own right -- because now you can grab the reads and do things with them.

We provide a script in khmer to extract the reads; slice-reads-by-coverage will take either a min coverage, or a max coverage, or both, and extract the reads that fall in the given interval.

First, let's grab the reads between 50 and 200 coverage -- these are the single-copy genome components. We'll put them in reads-genome.fa.

~/dev/khmer/sandbox/ reads.fa reads-genome.fa -m 50 -M 200

Next, grab the reads greater in abundance than 200; these are the repeats. We'll put them in reads-repeats.fa.

~/dev/khmer/sandbox/ reads.fa reads-repeats.fa -m 200

Now let's replot the read coverage spectra, first for the genome: -x 1e8 -k 20 reads-genome.fa
~/dev/khmer/sandbox/ reads-genome.fa reads-genome.dist
./ reads-genome.dist reads-genome.png --xmax=600 --ymax=500

and then for the repeats: -x 1e8 -k 20 reads-repeats.fa
~/dev/khmer/sandbox/ reads-repeats.fa reads-repeats.dist
./ reads-repeats.dist reads-repeats.png --xmax=600 --ymax=500
images/slice/reads-genome.png images/slice/reads-repeats.png

and voila! As you can see we have the reads of high coverage in reads-repeats.fa, and the reads of intermediate coverage in reads-genome.fa.

If you look closely, you might note that some reads seem to fall outside the specified slice categories above -- that's presumably because their coverage was predicated on the coverage of other reads in the whole data set, and now that we've sliced out various reads their coverage has dropped.

by C. Titus Brown at August 23, 2014 10:00 PM

August 22, 2014

Titus Brown

A Review: Neanderthal Man: In Search of Lost Genomes

I just finished reading Svante Paabo's autobiography, Neanderthal Man: In Search of Lost Genomes. The book is perfect -- if you're a biologist of any kind, you'll understand most of it without any trouble, and even physicists can probably get a lot out of the story (heh).

The book describes Svante Paabo's journey towards sequencing the Neanderthal and Denisovan genomes, and the attendant scientific and popular science implications. It's a fantastic portrayal of how science really works, from the perspective of a driven, charismatic, and successful scientist.

Beyond the scientific story, which I had not known much about at all, there were two particularly interesting parts of the book.

First, I was surprised at the candor and simultaneous shallowness with which Dr. Paabo discussed his various relationships and bisexuality. Throughout the book there is occasional mention of men, women, marriage, and relationships, and while I don't get much of a sense of how impactful all of this was on him personally, it is striking how little it seems to have impacted his scientific life. There was one particular bit in the middle that I found very understated in reporting, where he and a couple all move into the same apartment building, and then depart with different spouses. I guess I was surprised at the choice to report the personal relationships while avoiding any depth whatsoever. (Craig Venter still takes the cake with a single paragraph in A Life Decoded where he starts the paragraph with a divorce and ends with an engagement. It's a long paragraph, but still.)

Second, it was somewhat dispiriting to watch Dr. Paabo's transition from an enthusiastic young scientist concerned primarily with getting accurate scientific knowledge out there to one who was very focused not only on scientific correctness but on publicity. There's a great section at the beginning where he talks about publishing the first mtDNA sequence from Neanderthals, and he chooses Cell because it allowed longer articles and a better representation of the science; he also spends a lot of time making sure his mtDNA sequence can be reproducibly generated by another lab. By the end, however, he's publishing in Science and Nature, and agonizing over whether or not he'll be first with the publication. (The latter approach seem much more harmful to good scientific practice to me; both the secrecy & competitiveness and the desire to push out papers in Science and Nature are problematic.)

I don't know how much self-reflection went into the narrative, but it would be interesting to know why his approach towards publication changed so much during his career. Is it because science changed as a whole towards requiring the splashy high impact papers? Or is it because he grew in stature and funding requirements to the point where he needed the splashy high impact publications to justify the funding? Or (as he says towards the end) is it because of the perceived career needs of his junior colleagues? Or all three? Or more?

Anyway, a great read. Highly recommended for the sci-curious.


by C. Titus Brown at August 22, 2014 10:00 PM

Jake Vanderplas

Hacking Academia: Data Science and the University

A reflection on our SciFoo breakout session, where we discussed issues of data science within academia.

Almost a year ago, I wrote a post I called the Big Data Brain Drain, lamenting the ways that academia is neglecting the skills of modern data-intensive research, and in doing so is driving away many of the men and women who are perhaps best equipped to enable progress in these fields. This seemed to strike a chord with a wide range of people, and has led me to some incredible opportunities for conversation and collaboration on the subject. One of those conversations took place at the recent SciFoo conference, and this article is my way of recording some reflections on that conversation.

SciFoo is an annual gathering of several hundred scientists, writers, and thinkers sponsored by Digital Science, Nature, O'Reilly Media & Google. SciFoo brings together an incredibly eclectic group of people: I met philosophers, futurists, alien hunters, quantum physicists, mammoth cloners, magazine editors, science funders, astrophysicists, musicians, mycologists, mesmerists, and many many more: the list could go on and on. The conference is about as unstructured as it can be: the organizers simply provide food, drink, and a venue for conversation, and attendees put together breakout discussions on nearly any imaginable topic. If you ever get the chance to go, my advice is to drop everything else and attend. It was one of the most quirky and intellectually stimulating weekends I've ever spent.

The SciFoo meeting is by invitation only, and given the incredible work of other attendees, I'm still not quite sure how I ended up on the invite list (it was perhaps the worst flare-up of impostor syndrome I've ever had!) I forced myself to get over it, though, and teamed-up with Chris Mentzel, a program director in the Moore Foundation, and led a session: we called it Hacking Academia from Inside and Out. The session was in many ways a conversation around the general topic of my Brain Drain post, though it was clear that many of the folks in attendance had been thinking in these terms long before I penned that particular essay.

The Problem

The problem we discussed is laid out in some detail in my Brain Drain post, but a quick summary is this: scientific research in many disciplines is becoming more and more dependent on the careful analysis of large datasets. This analysis requires a skill-set as broad as it is deep: scientists must be experts not only in their own domain, but in statistics, computing, algorithm building, and software design as well. Many researchers are working hard to attain these skills; the problem is that academia's reward structure is not well-poised to reward the value of this type of work. In short, time spent developing high-quality reusable software tools translates to less time writing and publishing, which under the current system translates to little hope for academic career advancement.

In my Brain Drain post, I observed the rise of data-intensive research and lamented that researchers with the requisite interdisciplinary knowledge are largely unrecognized and unrewarded in academia, even as they are highly-sought-after in the tech industry. I previously labeled this type of person a "new breed of scientist", but since then it's become clear that the working label for this type of person has become (for better or worse) a data scientist.

Defining Data Science

The term "Data Science" generally seems to get a bad rap: it's variously dismissed as misleading, an empty buzzword, or begrudgingly conceded to be flawed, but useful. Perhaps "Data Scientist" can be understood as just a more subdued term for the "sexy statistician" that Hal Varian predicted would become the top career of this decade.

I think the best illustration of data science's definition comes from Drew Conway's Data Science Venn Diagram, which applies the label "Data Science" to the intersection of hacking skills, statistical knowledge, and domain expertise.

The key is that in addition to the normal depth of knowledge in one's own field, there as a rare breadth to the knowledge and skill-set of a data scientist.

In the words of Alex Szalay, these sorts of researchers must be "Pi-shaped" as opposed to the more traditional "T-shaped" researcher. In Szalay's view, a classic PhD program generates T-shaped researchers: scientists with wide-but-shallow general knowledge, but deep skill and expertise in one particular area. The new breed of scientific researchers, the data scientists, must be Pi-shaped: that is, they maintain the same wide breadth, but push deeper both in their own subject area and in the statistical or computational methods that help drive modern research:

Perhaps neither of these labels or descriptions is quite right. Another school of thought on data science is Jim Gray's idea of the "Fourth Paradigm" of scientific discovery: First came the observational insights of empirical science; second were the mathematically-driven insights of theoretical science; third were the simulation-driven insights of computational science. The fourth paradigm involves primarily data-driven insights of modern scientific research. Perhaps just as the scientific method morphed and grew through each of the previous paradigmatic transitions, so should the scientific method across all disciplines be modified again for this new data-driven realm of knowledge.

Regardless of what metaphor, definition, or label you apply to this class of researcher, it is clear that their skill set is highly valuable in both academia and industry: the brain drain that many have observed comes from the unfortunate fact that academia fails to properly reward the valuable skill set of the data scientist.

Our Discussion: Academia and Data Science

With this label in mind, our SciFoo discussion focused largely around the following questions:

  1. Where does Data Science fit within the current structure of the university?
  2. What is it that academic data scientists want from their career? How can academia offer that?
  3. What drivers might shift academia toward recognizing & rewarding data scientists in domain fields?
  4. Recognizing that graduates will go on to work in both academia and industry, how do we best prepare them for success in both worlds?

I'll go through some of the thoughts we discussed below:

1. Where does Data Science fit within the current structure of the university?

The question of data science's place in academia drew a variety of responses and ideas:

The "Fourth Paradigm": data science is simply a label for a new skill-set, and shouldn't be separated from the departments in which it is useful. The thinking here is that data science is simply an umbrella term for an essential skill in modern scientific research. For example, laboratory biologists are dependent on pipetting skills: this doesn't mean that the university should create a new "Department of Applied Pipetting". On the contrary, it simply means that pipetting technique should be part of a laboratory biologist's normal training. Similarly, departments across the university should simply incorporate relevant data science techniques into their normal curriculum.

Data science as a consulting service. Perhaps data science is more like Information Technologies (IT). All modern science labs depend on some sort of computer infrastructure, and most universities long ago realized that it's counter-productive to expect their specialized researchers to effectively manage that infrastructure. Instead, IT organizations were created which provide these services to multiple departments. Data science might be treated the same way: we can't expect every scientist to be fluent in the statistical and computational methods required to work with large datasets, so we might instead out-source these tasks to data science experts.

Data science as an applied computer science department. There has been a trend in the 20th-century of academic subjects splitting into "pure" and "applied" sub-domains. Many Universities have departments of "applied math" and "applied physics", which (loosely speaking) distinguish themselves from the non-applied version by employing the techniques of the field within practical rather than theoretical contexts. Perhaps data science is best viewed as an applied branch of computer science or of statistics which should become its own academic department.

Data science as a new role for libraries. It is no secret that digitization is changing the role of libraries on university campuses. The general public thinks of libraries little more than warehouses for books, but those in the field see printed books as just one particular manifestation of their focus, which has always been data curation. Many library scientists I've talked with recently are excited about the role that new methods and technologies can play in this task of curating and extracting information from their stores of data. From this perspective, Library & Information Science departments may be a natural home for interdisciplinary data science.

Data science as a new interdisciplinary institute. A middle ground to the above approaches may be to organize data science within an interdisciplinary institute; this is a common approach for topics that are inherently multi-disciplinary. Such institutes are already common in academia: for example, the University of Washington is home to the Joint Institute for the Study of Atmosphere and Ocean, which brings together dozens of department, schools, and labs to collaborate on topics related to the climate and the environment. Perhaps such an umbrella institute is the place for data science in the University.

2. What is it that academic data scientists want from their job? How can academia offer that?

Moving from university-level issues to personal-level issues, we brainstormed a list of goals that drive data scientists within academia and industry. While scientists are by no means a homogeneous group, most are driven by some combination of the following concerns:

  • Salary & other financial compensation
  • Stability: the desire to live in one place rather than move every few years
  • Opportunity for Advancement
  • Respect of Peers
  • Opportunity to work on open source software projects
  • Opportunity to travel & attend conferences
  • Flexibility to work on interesting projects
  • Opportunity to publish / freedom from the burden of publishing
  • Opportunity to teach / freedom from the burden of teaching
  • Opportunity to mentor students / freedom from the burden of mentoring students

I've tried to generally order these from "perks of industry" to "perks of academia" from the perspective of a recently-minted PhD, but this rough one-dimensional categorization misses the variety of opportunities in both worlds. For example, some academic research scientists do not spend time teaching or mentoring students, and some tech industry jobs contain the type of flexibility usually associated with academic research.

Those younger participants in our conversation who have most recently been "in the game", so to speak, especially noted problems in academia with the first three points. Compared to an industry data scientist position, an academic post-doc has some distinct disadvantages:

  • Money: the NIH postdoc salary hovers somewhere around $\$$40,000 - $\$$50,000 per year. Industry often starts recent PhDs at several times that salary.
  • Stability: before finding a permanent position, it is common now for young scientists to do several short-term post-doctoral appointments, often in different corners of the country. This means that many academics cannot expect to root themselves in a community until their mid-30s or later. For an industry data scientist, on the other hand, there may be many attractive and permanent job options near home.
  • Advancement: those who focus on software and data in academia, rather than a breakneck pace of publication, have little hope for a tenure-track position under the current system. A researcher who spends significant time building software tools can expect little reward within academia, no matter how influential those tools might be.

Anecdotally, these seem to be the top three issues for talented data scientists in academia. With the tech-industry market for data scientists as hot as it is, skilled PhD candidates have many compelling incentives to leave their field of research.

3. What drivers might shift academia toward recognizing & rewarding data scientists in domain fields?

We discussed several drivers that might make academia a more friendly place for data scientists. Generally, these centered around notions of changing the current broken reward structure to recognize success metrics other than classic publication or citation counts. These drivers may be divided into two different categories: those outside academia who might push for change (e.g. funding agencies, publishers, etc.) and those inside academia who might implement the change themselves (e.g. university leadership and department leadership). We discussed the following ideas:

From the outside:

  • Publishers might provide a means of publishing code as a primary research project. Introducing a DOI for software with easy means of updating contributor lists over time, for example, might lead to increased citation and recognition of alternative research activities.
  • Publishers might push for reproducibility as a requirement for publication. Reproducibility of data-intensive research requires the well-engineered code that data scientists can produce, and would increase the value placed on these skills.
  • Funding agencies might provide funding specifically for interdisciplinary data scientists within specific domain fields.
  • Funding agencies might encourage interdisciplinary collaboration through specific grant requirements.

From the inside:

  • University leadership might set aside funding for new data science departments or for interdisciplinary institutes focused on data science.
  • University leadership might create joint positions: e.g. paying half of a professor or researcher's salary so that he or she can focus that time on tasks such as creating, maintaining, or teaching about important software tools.
  • Department leadership might take the lead to adapt their curriculum to include data science topics, and specifically hire faculty who emphasize these areas in their work.
  • Department leadership might adjust their hiring practice to recognize alternative metrics that go beyond the H-index: for example, recognizing the importance of an open source software tool that is well-used, but may not generate a classic citation record.

4. Recognizing that graduates will go on to work in both academia and industry, how do we best prepare them for success in both worlds?

This question is the flip-side of the Brain Drain theme: the number of PhDs granted each year far exceeds the number of academic positions available, so it is simply impossible for every graduate to remain in academia.

This is, for some reason, a somewhat taboo subject in academia: I've talked to many who at the end of their PhD program were leaning toward leaving academia, and dreaded having "the talk" with their thesis advisor. But academic departments should take seriously the job prospects for their graduates, and that involves making sure they have marketable skills upon graduation.

Fortunately, these marketable skills overlap highly with the skills that can lead to success in modern data-intensive scientific research: the ability to design and write good software, to create tools that others can reuse, to collaboratively work on large software projects, to effectively extract insight from large datasets, etc. Unfortunately, development of this skill set takes time and energy at the level of both graduate coursework and research mentorship. Departments should push to make sure that their students are learning these skills, and are rewarded for exercising them: this will have the happy side-effect of increasing demand within academia for the scientists who nurture these skills.

Hacking Academia: How is this problem being addressed?

At the beginning of the SciFoo conference, each of the ~250 people present were invited to introduce themselves with a one line description of who they are and what they spend their time thinking about. The title of this post, Hacking Academia, is a phrase that I borrowed from Chris Mentzel's one-line description of his own focus.

This is hacking not in the Hollywood sense of using computers for malicious purposes, but hacking in the sense of working within a rigid system to accomplish something it is not necessarily designed to do. This type of "institutional hacking" is the best description of what we're after: finding shortcuts that will mold the world of academic scientific research into a more effective version of itself.

Complaining about the state of academia is a favorite past-time of academics, but it is far rarer to see actual solutions to these problems. One of the best pieces of the SciFoo discussion was just this: hearing about the steps that various individuals, foundations, and institutions are taking to address the concerns outlined above. I'll mention a few examples here:

UW, Berkeley, NYU: The Moore-Sloan Initiative

As I mentioned, my co-conspirator in leading this session was Chris, whom I know through his involvement with the recent Moore-Sloan Data Science Initiative. For years within his role in the Moore Foundation, Chris has been thinking about these issues from the perspective of a funder of scientific research. His efforts in this area have recently led to this $\$$38 million initiative, which is built around a five-year grant to three institutions: University of Washington, University of California Berkeley, and New York University. One of the primary and explicit goals of the grant is to jump-start new career paths for data scientists in scientific domain fields, and each of the three universities has a team who is approaching the work in their own way.

In January, I was hired by UW's eScience Institute to help lead the UW portion of this effort. We are in the middle of building a data science studio space that will be a central hub for multi-disciplinary data-intensive research on campus, and are currently in the process of hiring our first round of interdisciplinary postdocs, data scientists, and research scientists. These positions are designed especially to attract those skilled "Pi-shaped" researchers who may fall through the cracks in classic academic tracks, and we place particular value on alternative metrics such as open source contributions and efforts toward reproducibility. Berkeley and NYU are undertaking similar efforts on their own campuses with the Berkeley Institute for Data Science and NYU's Center for Data Science.

The Moore Foundation: Data Driven Discovery

The Moore Foundation is not stopping with this Data Science Initiative. They will soon be announcing their Data Driven Discovery grant winners: 14 individuals who will split a total of $\$$21 million over five years, along with up to $\$$9 million in additional grants to scale-up specific data-driven software and methods. The Moore foundation seems intent on using their endowment to effect real change in this area, and I am excited to see the results (I can assure you that I'm not just saying this because they currently pay my salary!)

NSF: the IGERT program

The NSF's long-standing Interactive Graduate Education and Research Traineeship (IGERT) program provides funding for interdisciplinary training for PhDs and postdocs, and recent grants in particular have had a data science focus. UW was recently awarded an IGERT grant for an interdisciplinary data-focused PhD program, and the first group of these students will be starting this fall. An important piece of this was that home departments agreed to allow these students to forego some of their normal degree requirements to make room for joint courses on data science skills and techniques. IGERT remains an active NSF program, and has similar interdisciplinary grants to schools and departments around the United States.

UW: Provost's Data Science Initiative

At UW, the university-wide leadership is also thinking along these lines with some concrete actions of their own: the provost has set aside funding for half-positions to encourage hiring of interdisciplinary faculty. The next Astronomy professor hired by UW, for example, might spend half of his or her time working and teaching through the eScience institute on general computational and data science methods.

Publishing Code: the Journal of Statistical Software

Though this isn't a recent initiative, the Journal of Statistical Software (JSS) is an example of a non-profit publisher which is having a positive impact in the area of statistical software, by giving scientists a forum to publish the software they write and cite the software they use. Perhaps in part because of the extreme usefulness of well-written, reusable software, JSS is very highly-ranked (see, for example, the SCImago rankings). More journals like this, which place explicit value on reproducible computation and well-written software tools, could be a huge benefit to the academic data scientist. (full disclosure: I'm on the editorial board of JSS).

Harvard University: Initiative in Innovative Computing & Institute for Advanced Computational Science

Alyssa Goodman of Harvard was part of our discussion, and mentioned that nearly a decade ago Harvard foresaw and began addressing the value of interdisciplinary data-intensive science and research. They created a short-lived Initiative in Innovative Computing (IIC), which existed from 2005-2009, until the global financial crisis led its funding to be cut. At its peak, the IIC was supported to the tune of around $4 million per year and was home to roughly 40 staff, most working jointly between the IIC and other departments. After the IIC funding dissipated, it seems that most of this momentum (and many of the IIC staff) moved to the Harvard's Institute for Advanced Computational Science (IACS), started by Tim Kaxiras and the Harvard School of Engineering in 2010. Though IACS has traditionally focused more on simulation and computation, it has recently begun to branch out to visualization and data science as well. Alyssa seems optimistic that momentum in this area is again building, and mentioned also Harvard's Institute for Quantitative Social Science, the Seamless Astronomy Program, and the Library as key players.

Michigan State: new Applied Computation Department

Another active participant in the discussion was Steve Hsu, Vice President for Research at Michigan State University. He shared some exciting news about a new development at Michigan State: the creation of a new department, tentatively called Applied Computation and Mathematics. According to Steve, MSU will soon be hiring around twenty interdisciplinary tenure-track faculty, who will have one foot in their home department and one foot in this new department. MSU already has many impressive researchers working in data-intensive areas across the university: with this effort I'm excited to see the dynamic interdisciplinary community they will build.


The SciFoo discussion was excellent, as was the weekend as a whole: I would like to take this chance to thank O'Reilly Media, Digital Science, Nature Publishing Group, and Google for making it all possible.

Nearly a year after my Brain Drain brain dump, I am thrilled to see that there are so many good people thinking about and working on these problems, and so many real solutions both in process and on the horizon. While the lure of a well-funded tech industry will doubtless still attract a steady stream of scientists away from their academic research, I'm heartened to see such focused effort toward carving-out niches within academia for those who have so much to contribute.

by Jake Vanderplas at August 22, 2014 03:00 PM

August 21, 2014

Manoj Kumar


I was postponing the last post for the last of my Pull Requests to get merged. Now since it got merged, I do not have any reason to procrastinate. This is the work that I have done across summer, with a short description of each,

(Just in case you were wondering why the “another” in the title, )

1. Improved memory mangement in the coordinate descent code.
Status: merged
Pull Request:
Changing the backend from multiprocessing to threading by removing the GIL, and replacing the function calls with pure cblas. A huge improvement 3x – 4x in terms of memory was seen without compromising much on speed.

2. Randomised coordinate descent
Status: merged
Pull Request:
Updating a feature randomnly with replacement instead of doing an update across all features can make descent converge quickly.

3. Logistic Regression CV
Status: merged
Pull Request:
Fitting a cross validation path across a grid of Cs, with new solvers based on newton_cg and lbfgs. For high dimensional data, the warm start makes these solvers converge faster.

4. Multinomial Logistic Regression
Status: merged
Pull Request:
Minimising the cross-entropy loss instead of doing a OvA across all classes. This results in better probability estimates of the predicted classes.

5. Strong Rules for coordinate descent
Status: Work in Progress
Pull Request:
Rules which help skip over non-active features. I am working on this and it should be open for review in a few days.

Apart from these I have worked on a good number of minor bug fixes and enhancements, including exposing the n_iter parameter across all estimates, fixing incomplete download of newsgroup datasets, and soft coding the max_iter param in liblinear.

I would like to thank my mentor Alex who is the best mentor one can possibly have, (I’m not just saying this because of hope that he will pass me :P), Jaidev, Olivier, Vlad, Arnaud, Andreas, Joel, Lars, and the entire scikit-learn community for helping me to complete an important project to an extent of satisfaction. (It is amazing how people manage to contribute so much, inspite of having other full time jobs). I will be contributing to scikit-learn full-time till December at least as part of my internship.
EDIT: And of course Gael (how did I forget), the awesome project manager who is always full of enthusiasm and encouragement.

As they say one journey ends for the other to begin. The show must go on.

by Manoj Kumar at August 21, 2014 11:31 PM

August 20, 2014

Titus Brown

The August Carnival of Evolution

Every month, Bjørn Østman finds another sucker^W^W^W organizes a Carnival of Evolution blog post, that does a roundup of blogs on evolution from a previous month. This month, I'm hosting it -- it's a bit late, due to some teaching duties, so apologies!

Trigger warning: This blog post contains discussions of evolution, which may cause anxiety in those who don't want to be exposed to ideas with which they are pretty sure they disagree.


My favorite blog post from July was the Marc Srour's post on Cone snail venoms and their awesomeness. Marc reviews a paper, Dutertre et al. (2014), that discusses how one cone snail venom duct manufactures different kinds of defensive and offensive venoms, presumably in response to the different needs of defense and predation.

Jane Hu's post on how the largest known flying dinosaur avoided crashing reviews Han et al., 2014, which describes how the long feathers on a raptor helped stabilize it during flight.

I found this post on why there are no ring species by Jerry Coyne to be interesting from two perspectives. First, I'd never read about the ring species concept before; and second, I thought it was interesting to devote so much discussion to why no actual examples of this concept seemed to exist :).

Bjørn's post on death of the fittest is a nice example of how simple bottom-up simulations can help you understand evolutionary concepts. I'm mildly disappointed that Bjørn didn't make the MATLAB code available as a link in the blog, though - what's with that?

Craig Benkman's blog on the outsized impact of a small mammal explores his PNAS paper, Talluto and Benkman (2014) on selection pressures from seed predation and fire. The short version is that pine squirrels prefer to harvest hard woody pine cones that also help repopulate the forest quickly after fires. In what I think is the best line in the blog post, "when squirrel densities exceed one and a half squirrels per hectare", the anti-hard-woody-pine-cone selection pressure from squirrels eating them dominates over the pro-repopulate-after-fire selection pressure.

Turning to humans, Bradly Alicea has a nice discussion of how dual process models that take into account both genetic fitness and cultural adaptation could be a better way to understand human biological variation.

Veering to something much smaller, Viking wannabe Jeff Morris wrote a nice blog post on microbial ecosystems and the Black Queen Hypothesis, talking about how the Black Queen Hypothesis can foster certain kinds of apparent "cooperation".

Next, returning to Jerry Coyne and Why Evolution is True, check out this great blog post on Poelstra et al. 2014, looking at the genomic and transcriptomic underpinnings of two closely related crows. Despite very little genetic variation, these crows maintain distinct appearance and territories. tl; dr? The closer we look at the concept of "species", the harder it is to draw clear lines. Also, the primary point of difference between the two ...subspecies? seems to be located at one very small portion of their chromosome 18. How odd!

Wrapping up my Carnival, Samuel Perez blogs about stamen size in some populations of Arabidopsis thaliana. There is a mystery here: what is causing populations at low altitudes to lose their short stamens?

And that's it for this month! From a personal perspective, it was nice to compare and contrast so many different topics and styles of blog posts -- it's clear there are many ways to blog, and this is a fine selection of blogs! Thanks for roping me in, Bjørn!


The next host: the September edition will be at Sam Hardman's blog Ecologica (

by C. Titus Brown at August 20, 2014 10:00 PM

In which I declare my intentions to move to UC Davis

This past weekend, I accepted an offer to join UC Davis as an Associate Professor of Genetics in the Department of Population Health and Reproduction, in the School of Veterinary Medicine. The appointment is still pending tenure review, but I expect to join Davis whether or not they give me tenure (sshh! don't tell them!)

I am very sad to be leaving many good friends and colleagues in Michigan, but the personal and professional opportunities available at UC Davis are simply outstanding; we decided as a family that Davis was our future.

I'm sure you must all have lots of questions. So here's a handy list of answers!

  1. Wait, tenure review? Didn't you have tenure at MSU?!


  2. VetMed... wait, what?


  3. Really? You're going to leave it there?

    Turns out veterinary animals have genomes, too! And PHR and VetMed are extremely broad in their research programs -- they have great people in ecology, evolution, microbiology, and many other fields. UC Davis overall is truly excellent, but I am really enthusiastic about joining VetMed specifically. There are several other recent faculty hires that I'm thrilled about, and the existing faculty are just outstanding; I expect Davis VetMed to offer a wonderful and fertile ground for the growth of my research program.

  4. OK, seriously, why did they even interview you, much less hire you?

    Well, I agree that my fit for the position description as posted is not quite perfect. So I asked the same question! Among other things, several members of the hiring committee said that they really liked my education efforts. Without a strong research program, they would probably not have looked seriously at my application; but, once they did, they said that they really liked the mix of research and education and outreach.

    Just as a reminder, I've made all of my application materials available here.

  5. What talk did you give? Your soil metagenomics talk?! How'd that go?

    My talk is posted on Slideshare so you can see for yourself -- it was almost entirely about the work that my student Dr. Likit Preeyanon did on Marek's Disease resistance in chicken.

    Yeah, I work on that stuff, too :).

  6. Did you apply for any other jobs?

    Yep. I applied for about six academic positions, including positions in Big Data, mol bio/ecology/evolution/bioinformatics, and microbiology. Got one interview, and one job offer. shrug

  7. Are you going to continue doing ... whatever it is you do?

    Yep. And more!

  8. Did Jonathan Eisen have anything to do with this?

    Jonathan was one of my references, but AFAIK he was uninvolved in the decision past that. Needless to say, however, the fact that UC Davis is supportive of his social media and open access efforts was a strong positive at Davis (although MSU is no slouch there either).

    We do hope his open access policies extend to his backyard pool.

  9. When are you going!?

    TBD. Sometime in early 2015, I think.

  10. Will you be hiring?

    Yeppers. More on that down the road.

  11. Do you even like granola?

    I do! Note that I went to Reed and Caltech for undergrad and grad school respectively, so I think Davis is a good average between the two -- and not just geographically :).

  12. How did you get hired without a passel of Cell/Nature/Science papers?


    I have a few high profile papers, most notably a bunch o' PNAS papers. But my publication record was never at any point brought up by anyone, so I don't know what they thought of it.

  13. It was your klout score, wasn't it! That's why they hired you!

    Honestly, as far as I can tell they were largely unaware of my social media and open science interests. The search committee seemed interested in it over dinner, though.

  14. What else makes you excited about Davis?

    The new Big Data initiative at Davis.

    The Davis Genome Center.

    The opportunities for interactions with faculty from the CS, Ag, Microbiology, and Developmental Biology parts of campus.

    The proximity to the Bay Area, the JGI, UC Berkeley and BIDS, Stanford, Silicon Valley, and of course the Perlstein Lab.

    The granola.

    The proximity to Big Sur, a.k.a. "the most beautiful area in the world."

    The weather.

  15. Do you have any plans to scale up (or back) your education and training efforts?

    Funny you should ask! Yes, I do plan to scale up my training efforts significantly! More on that once I figure out where my office will be, how to get paid, and where all the campus coffee shops are.

  16. Do you plan to change the name of your lab?

    Well, right now it's called the Lab of Genomics, Evolution, and Development (GED), which has become increasingly inaccurate :). I'm think of naming it the Lab for Data Intensive Biology (DIB). Other suggestions welcome; given my inability to focus, the Lab of Life, the Universe, and Everything might work just as well...

  17. How do you feel about losing your short e-mail address? It doesn't get much shorter than ''!

    Hopefully I can get That will more than make up for it.

  18. The City of Davis just received a new armored vehicle. Coincidence?

    No comment.


by C. Titus Brown at August 20, 2014 10:00 PM


GSoC Open Source Brain: Cortical Connections

Cortical Connections

Cortical Connections

In the same vein that the post before this one we will show here how to construct the connections between the cortical layers. In order to do so we will construct a function that works in general for any arbitrary connectivity, we describe in the following its structure. First, as in the thalamo-cortical connectivity, we have again the same structure of a function that loops over the target population extracting the relevant parameters that characterize these neurons. Furthermore we have another function that loops over the source population creating the corresponding tuples for the connection list. Is in this last function where the particular connectivity rule is implemented.

In the particular case of the Troyer model the connectivity between the cortical cells is determined by the correlation between the receptive fields of the neurons, the receptive fields here being Gabor functions. In more detail the neurons whose receptive fields are more correlated will be the ones more likely to have excitatory connections between them. On the other hand the ones whose receptive fields are less correlated will be more likely to receive inhibitory connections. In this post we show two schemes that accomplish this connectivity. The first one uses the fact the parameters of the receptive field to calculate a connectivity and the second one uses the receptive fields directly to calculate the correlations. We present the determining functions in the stated order down here.

Now we present the function that creates the connectivity for a given neuron par. The circular distance between the orientation and phases are calculated as a proxy to estimate how similar the receptive fields of the neurons are. After that, the distance between them is weighted and normalized with a normal function in order to obtain a value that we can interpret as a probability value. Finally in order to calculate the connectivity we sample n_pick times with the given probability value to see hwo strong a particular connection should be.

def cortical_to_cortical_connection(target_neuron_index, connections, source_population, n_pick, g, delay, source_orientations,
source_phases, orientation_sigma, phase_sigma, target_neuron_orientation,
target_neuron_phase, target_type):
Creates the connections from the source population to the target neuron

for source_neuron in source_population:
# Extract index, orientation and phase of the target
source_neuron_index = source_population.id_to_index(source_neuron)
source_neuron_orientation = source_orientations[source_neuron_index]
source_neuron_phase = source_phases[source_neuron_index]

# Now calculate phase and orientation distances
or_distance = circular_dist(target_neuron_orientation, source_neuron_orientation, 180)

if target_type:
phase_distance = circular_dist(target_neuron_phase, source_neuron_phase, 360)
phase_distance = 180 - circular_dist(target_neuron_phase, source_neuron_phase, 360)

# Now calculate the gaussian function
or_gauss = normal_function(or_distance, mean=0, sigma=orientation_sigma)
phase_gauss = normal_function(phase_distance, mean=0, sigma=phase_sigma)

# Now normalize by guassian in zero
or_gauss = or_gauss / normal_function(0, mean=0, sigma=orientation_sigma)
phase_gauss = phase_gauss / normal_function(0, mean=0, sigma=phase_sigma)

# Probability is the product
probability = or_gauss * phase_gauss
probability = np.sum(np.random.rand(n_pick) probability) # Samples
synaptic_weight = (g / n_pick) * probability

if synaptic_weight > 0:
connections.append((source_neuron_index, target_neuron_index, synaptic_weight, delay))

return connections

Note that the overall strength is weighted by the conductivity value g that is passed as an argument. Furthermore a delay that is also passed as an argument is added to the list as the last element of the tuple.

Secondly we present the full correlation scheme. In this scheme we utilize the kernels directly to calculate the spatial correlation between them. In particular after we have flattened our kernels Z to have a series instead of a matrix we use the function perasonr from scipy.stats to calculate the correlation. Again as in the case above we use this probability to sample n_pick times and then calculate the relative connectivity strength with this.

def cortical_to_cortical_connection_corr(target_neuron_index, connections, source_population, n_pick, g, delay,
source_orientations, source_phases, target_neuron_orientation, target_neuron_phase,
Z1, lx, dx, ly, dy, sigma, gamma, w, target_type):
Creates the connections from the source population to the target neuron

for source_neuron in source_population:
# Extract index, orientation and phase of the target
x_source, y_source = source_neuron.position[0:2]
source_neuron_index = source_population.id_to_index(source_neuron)
source_neuron_orientation = source_orientations[source_neuron_index]
source_neuron_phase = source_phases[source_neuron_index]

Z2 = gabor_kernel(lx, dx, ly, dy, sigma, gamma, source_neuron_phase, w, source_neuron_orientation,
x_source, y_source)

if target_type:
probability = pearsonr(Z1.flat, Z2.flat)[0]
probability = (-1) * pearsonr(Z1.flat, Z2.flat)[0]

probability = np.sum(np.random.rand(n_pick) probability) # Samples
synaptic_weight = (g / n_pick) * probability

if synaptic_weight > 0:
connections.append((source_neuron_index, target_neuron_index, synaptic_weight, delay))

return connections

Note that the overall strength is weighted by the conductivity value g that is passed as an argument. Furthermore a delay that is also passed as an argument is added to the list as the last element of the tuple.

We now show how a plot that illustrates how the probabilities change when the parameters that determined the gabor function are changed for each scheme.

In the figure above we have int he upper part how the probability for the first scheme a neuron with phase 0 and orientation 0 change as we vary the phase (left) and orientation (right). In the two graphs bellow we have the same for the second scheme we presented

by H ( at August 20, 2014 02:48 AM

GSoC Open Source Brain: Thalamo-Cortical Connections

Thalamo-cortical connections

Thalamo-cortical connections

In this post I will show how to build arbitrary custom connections in PyNN. We will illustrate the general technique in the particular case of the Troyer model. In the Troyer model the connections from the LGN to the cortex are determined with a gabor-profile therefore I am going to describe the required functions to achieve such an aim.

In the PyNN documentation we find that one of the ways of implementing arbitrary connectivity patterns is to use the FromListConnector utility. In this format we have to construct a list of tuples with a tuple for each connection. In each tuple we need to include the index of the source neuron (the neuron from which the synapse originates), the index of the target neuron (the neuron into which the synapse terminates), the weight and the delay. For example (0, 1, 5, 0.1) would indicate that we have a connection from the neuron 0 to the neuron 1 with a synaptic weight of 5 and a delay of 0.1.

In the light of the explanation above we need to construct a function that is able to construct a list with the appropriate weights given a target and a source populations. In order to start moving towards this goal we will first write a function that connects a given neuron in the target population to all the neurons in the source population. We first present the function here bellow and we will explain it later:

def lgn_to_cortical_connection(cortical_neuron_index, connections, lgn_neurons, n_pick, g, delay, polarity, sigma,
gamma, phi, w, theta, x_cortical, y_cortical):
Creates connections from the LGN to the cortex with a Gabor profile.

This function adds all the connections from the LGN to the cortical cell with index = cortical_neuron_index. It
requires as parameters the cortical_neruon_index, the current list of connections, the lgn population and also
the parameters of the Gabor function.

cortical_neuron_index : the neuron in the cortex -target- that we are going to connect to
connections: the list with the connections to which we will append the new connnections
lgn_neurons: the source population
n_pick: How many times we will sample per neuron
g: how strong is the connection per neuron
delay: the time it takes for the action potential to arrive to the target neuron from the source neuron
polarity: Whether we are connection from on cells or off cells
sigma: Controls the decay of the exponential term
gamma: x:y proportionality factor, elongates the pattern
phi: Phase of the overall pattern
w: Frequency of the pattern
theta: Rotates the whole pattern by the angle theta
x_cortical, y_cortical : The spatial coordinate of the cortical neuron


for lgn_neuron in lgn_neurons:
# Extract position
x, y = lgn_neuron.position[0:2]
# Calculate the gabbor probability
probability = polarity * gabor_probability(x, y, sigma, gamma, phi, w, theta, x_cortical, y_cortical)
probability = np.sum(np.random.rand(n_pick) probability) # Samples

synaptic_weight = (g / n_pick) * probability
lgn_neuron_index = lgn_neurons.id_to_index(lgn_neuron)

# The format of the connector list should be pre_neuron, post_neuron, w, tau_delay
if synaptic_weight > 0:
connections.append((lgn_neuron_index, cortical_neuron_index, synaptic_weight, delay))

The first thing to note from the function above are its arguments. It contains the source population and the particular target neuron that we want to connect to. It also contains all the connectivity and gabor-function related parameters. In the body of the function we have one loop over the whole source population that decides whether we add a connection from a particular cell or not. In order to decide if we add a connection we have to determine the probability from the gabor function. Once we have this we sample n_pick times and add a weighted synaptic weight accordingly to the list for each neuron.

In the function above we have the values of the gabor function passed as arguments. However, the values of each gabor function depend on the nature of the cell of the target population. In the light of this we will construct another function that loops over the target population and extracts the appropriate gabor values for each function in this population. We again present the function and then explain it:

def create_lgn_to_cortical(lgn_population, cortical_population, polarity, n_pick, g, delay, sigma, gamma, phases,
w, orientations):
Creates the connection from the lgn population to the cortical population with a gabor profile. It also extracts
the corresponding gabor parameters that are needed in order to determine the connectivity.

print 'Creating connection from ' + lgn_population.label + ' to ' + cortical_population.label

# Initialize connections
connections = []

for cortical_neuron in cortical_population:
# Set the parameters
x_cortical, y_cortical = cortical_neuron.position[0:2]
cortical_neuron_index = cortical_population.id_to_index(cortical_neuron)
theta = orientations[cortical_neuron_index]
phi = phases[cortical_neuron_index]

# Create the connections from lgn to cortical_neuron
#lgn_to_cortical_connection(cortical_neuron_index, connections, lgn_population, n_pick, g, polarity, sigma,
#gamma, phi, w, theta, x_cortical, y_cortical)

lgn_to_cortical_connection(cortical_neuron_index, connections, lgn_population, n_pick, g, delay, polarity, sigma,
gamma, phi, w, theta, 0, 0)

return connections

This function requires as arguments the source and target populations as well as the necessary parameters that characterize each cell connectivity: orientation and phase. In the body of the function we have a loop over the cortical population that extracts the relevant parameters -position, orientation and phase- and then calls the function that we already describe previously in order to create the connectivity from the source population to the cell in place.

So now we have the necessary functions to construct a list. Now, we can use FromListConnector to transform the list into a connector. And the use this to define a Projection. We define both the excitatory and inhibitory connections. We abstract this complete set into the following function:

def create_thalamocortical_connection(source, target, polarity, n_pick, g, delay, sigma, gamma, w, phases, orientations, simulator):
Creates a connection from a layer in the thalamus to a layer in the cortex through the mechanism of Gabor sampling

# Produce a list with the connections
connections_list = create_lgn_to_cortical(source, target, polarity, n_pick, g, delay, sigma, gamma, phases, w, orientations)

# Transform it into a connector
connector = simulator.FromListConnector(connections_list, column_names=["weight", "delay"])

# Create the excitatory and inhibitory projections
simulator.Projection(source, target, connector, receptor_type='excitatory')
simulator.Projection(source, target, connector, receptor_type='inhibitory')

With this we can create in general connections from one target population to the other. We can even change change the gabor function for whatever we want if we want to experiment with other connectivity patterns. Finally we present down here an example of a sampling from a Gabor function with the aglorithm we just constructed:

So in the image we show in the left the sampling from the ideal Gabor function in the right.

by H ( at August 20, 2014 02:43 AM

August 19, 2014

Titus Brown

The fifth workshop on Analyzing Next Generation Sequencing Data

The fifth annual Analyzing Next Generation Sequencing Data workshop just finished - #ngs2014. As usual the schedule and all of the materials are openly available.

tl; dr? Good stuff.

We've been running this thing since 2010, and we now have almost 120 alumni (5 classes of roughly 24 students each). The students come from all over the world (although I think we're missing attendees from Africa and Antarctica), and from many different types of institutions, including top tier research universities, biotech, and non-profit research centers. The "students" vary in rank from graduate students and staff to full professors; the age range goes from low 20s to a fair bit older than me.

The class invariably starts with an introduction to cloud computing, followed by a number of step-by-step tutorials through command line BLAST, quality analysis of Illumina data, mapping, and assembly. We cover basic analysis of your data, statistical considerations in sample design and data analysis, automation, version control, R, Python, IPython Notebook, mRNAseq analysis, and a variety of other topics. The students take full advantage of the opportunity to ask an immense variety of questions, ranging from algorithmic and technical to fundamental scientific questions.

It's truly awesome, and I am unreasonably proud of this workshop!

This year the previous workshop faculty and instructors (Ian Dworkin, Istvan Albert, Chris Chandler, Adina Howe) were joined by a newer cohort, including Meg Staton, Matt MacManes, Daniel Standage, Aaron Darling, and Martin Schilling. The younger group of scientists (Chris, Adina, Meg, Matt, Daniel, Aaron, and Martin) make me feel old: they're young, enthusiastic, all about open science, and totally into reproducibility and Software Carpentry and good bioinformatics. Many of them are now junior faculty, which really gives me hope for the future! Of particular note, Adina Howe took the course in 2010 (the first time we offered it) and is now taking a position as a big data biology junior faculty member -- how cool is that!?

Our TAs this year did a fantastic job. Amanda Charbonneau was the overall "cruise director", and Aswathy Sebastian, Will Pitchers, Elijah Lowe, and Qingpeng Zhang served as TAs.

How did the course go this year?

Each year we do an end-of-course discussion where I try to stay quiet as the students dissect all of the bad decisions I made in organizing the course. This year, I took a page from Greg Wilson's handbook and made the students offer up one good thing and one bad thing -- every student had to provide at least one of each, and they had to be non-overlapping. We didn't entirely succeed at getting completely non-overlapping feedback, but the lists are still interesting and informative:


I think my favorite is "Good: covered a lot of material; bad: covered A LOT of material!" although "I am worried that we now have a false sense of hope" comes in as a close second. Meg Staton should feel proud that one of the comments boiled down to "more Meg", although it came out as "more of Meg's flowcharts." And of course there's the always popular opinion that "if only you'd given us more to read up front, we'd have come better prepared", which in my experience is an incredibly over-optimistic lie, if well intentioned :).

Assessing the workshop

This is now the third year we've run assessments on the workshop. Our expert assessment company, StemEd LLC, hasn't yet finished the assessment report for 2014, but you can read the 2012 and 2013 evaluations here and here. These aren't complete assessments -- we are still working on processing the long-form answers -- and they are somewhat superficial, but overall paint a very positive picture of the workshop. This is in line with what we hear from the students both informally throughout, as well as more formally at the end-of-workshop discussions.

One thing that surprised me this year (and I'm mentioning it because I wouldn't have noticed if someone hadn't said it for their "one good thing") was that people commented very positively on the diversity of students, who came from all over the world (we covered all but two South American countries!), had many different research backgrounds and interests, and were at many different career stages. While we are in fact tempted to roll dice to choose the students (this year, we accepted 24 of 170 applicants), we actually do spend some time trying to balance the class -- and it seemed to work well this year!

Next years' plans

We are planning to run the workshop against next year, at about the same time and in the same place, but there are some potential complications. (More about those soon.)

The biggest change that I think I'm going to put into place next year is that the first ~5 days will be paced much more carefully. Some people come in with a lot of self-confidence and some solid expertise, while others come in with little of either; the latter group is really sensitive to being overwhelmed, while the former group is usually eager to drink from the firehose (sometimes until they see just how high we can turn up the pressure, hah). I plan to address this by making the first 5 days all about gentle-but-thorough introductions to UNIX, mapping, assembly, and scripting. In my experience, even the people who come in with some knowledge get a lot out of the more thorough introductions. Then in the second week we'll go crazy with a range of subjects.

As part of this change, I may restrict the first week lecturers to trained Software Carpentry instructors. This would include Adina Howe, Meg Staton, and Martin Schilling, as well as myself (although I'd like to offer to pay for other instructors to do the in-person training if they were interested). From what I observed, people who haven't gone through Greg Wilson's training bootcamp are really lacking in an ability to "read" the classroom - for one, they don't pay enough attention to stickies, and for another, they are often too enthusiastic to get to the cool stuff. These instructors are incredibly valuable after people have learned to tread water, but can too easily drown students in the first week. I got a lot of feedback this year that they needed to be introduced more carefully. (There are some people that are just ill-suited to instructing non-experts, but I tend not to invite them -- I'm thinking of the first lecture at another workshop, which (literally) started with "OK, now after compiling and installing my software package, fire up vi and edit the config file to reflect your local system settings. Then run the program on the first demo file XXXX.")

Something else I need to make sure of is that I (or someone) remains heavily involved in the course throughout. This year I was distracted by several different things, including three (!) thesis defenses on main campus that took place during the course, and I didn't bring the intensity. Next year, more intensity (or perhaps a new workshop director ;).

So that's my report for this year.


by C. Titus Brown at August 19, 2014 10:00 PM

Matthieu Brucher

Announcement: ATKLimiter 1.0.0, ATKCompressor 1.1.0, ATKExpander 1.1.0

I’m happy to announce the release of one new limiter plugin based on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats. I also updated the compressor and the expander with improved UI controls. The compressor also has now a dry/wet knob, allowing to use it for parallel compression.







The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

The files can be downloaded on SourceForge, as well as the source code.

Direct link for ATKLimiter
Direct link for ATKCompressor
Direct link for ATKExpander

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matthieu Brucher at August 19, 2014 07:13 AM

August 18, 2014

Hamzeh Alsalhi

Google Summer of Code 2014 Final Summary

Now at the end of this GSoC I have contributed four pull requests that have been merged into the code base. There is one planed pull request that has not been started and another pull request nearing its final stages. The list below gives details of each pull request and what was done or needs to be done in the future.

This GSoC has been an excellent experience. I wan't to thank the members of the scikit-learn community, most of all Vlad, Gael, Joel, Oliver, and my mentor Arnaud, for their guidance and input which improved the quality of my projects immeasurably.

Sparse Input for Ensemble Methods

PR #3161 - Sparse Input for AdaBoost
StatusCompleted and Merged
Summary of the work done: The ensemble/weighted_boosting class was edited to avoid densifying the input data and to simply pass along sparse data to the base classifiers to allow them to proceed with training and prediction on sparse data. Tests were written to validate correctness of the AdaBoost classifier and AdaBoost regressor when using sparse data by making sure training and prediction on sparse and dense formats of the data gave identical results, as well verifying the data remained in sparse format when the base classifier supported it. Go to the AdaBoost blog post to see the results of sparse input with AdaBoost visualized.

PR - Sparse input Gradient Boosted Regression Trees (GBRT)
StatusTo be started
Summary of the work to be done: Very similar to sparse input support for AdaBoost, the classifier will need modification to support passing sparse data to its base classifiers and similar tests will be written to ensure correctness of the implementation. The usefulness of this functionality depends on the sparse support for decision trees which is a pending mature pull request here PR #3173.

Sparse Output Support

PR #3203 - Sparse Label Binarizer
StatusCompleted and Merged
Summary of the work done: The label binarizing function in scikit-learns label code was modified to support conversion from sparse formats and helper functions to this function from the utils module were modified to be able to detect the representation type of the target data when it is in sparse format. Read about the workings of the label binarizer.

PR #3276 - Sparse Output One vs. Rest
StatusCompleted and Merged
Summary of the work done: The fit and predict functions for one vs. rest classifiers modified to detect sparse target data and handle it without densifying the entire matrix at once, instead the fit function iterates over densified columns of the target data and fits an individual classifier for each column and the predict uses binarizaion on the results from each classifier individually before combining the results into a sparse representation. A test was written to ensure that classifier accuracy was within a suitable range when using sparse target data.

PR #3438 - Sparse Output Dummy Classifier
StatusCompleted and Merged
Summary of the work done: The fit and predict functions were adjusted to accept the sparse format target data. To reproduce the same behavior of prediction on dense target data first a sparse class distribution function was written to get the classes of each column in the sparse matrix, second a random sampling function was created to provide a sparse matrix of randomly drawn values from a user specified distribution. Read the blog post to see detailed results of the sparse output dummy pull request.

PR #3350 - Sparse Output KNN Classifier
StatusNearing Completion
Summary of the work done: In the predict function of the classifier the dense target data is indexed one column at a time. The main improvement made here is to leave the target data in sparse format and only convert a column to a dense array when it is necessary. This results in a lower peak memory consumption, the improvement is proportional to the sparsity and overall size of the target matrix.

Future Directions 

It is my goal for the Fall semester to support the changes I have made to the scikit-learn code base the best I can. I also hope to see myself finalize the remaining two pull requests.

by (Hamzeh) at August 18, 2014 01:13 AM

August 17, 2014

Vighnesh Birodkar


This years GSoC coding period has nearly come to an end. This post aims to briefly summarize everything that happened during the last three months. My task was to implement Region Adjacency Graph based segmentation algorithms for scikit-image. This post provides a good explanation about them. Below I will list out my major contributions.


Region Adjacency Graphs

Fixing the API for RAGs was very important, since it was directly going to affect everything else that followed. After a long discussion and some benchmarks we finally decided to have NetworkX as a dependency. This helped a lot, since I had a lot of graph algorithms already implemented for me. The file implements the RAG class and the RAG construction methods. I also implemented threshold_cut, a function which segments images by simply thresholding edge weights. To know more, you can visit, RAG Introduction.

Normalized Cut

The function cut_normazlied, implements the Normalized Cut algorithm for RAGs. You can visit Normalized Cut on RAGs to know more. See the videos at the end to get a quick idea of how NCut works. Also see, A closer look at NCut, where I have benchmarked the function and indicated bottlenecks.

Drawing Regions Adjacency Graphs

In my posts, I had been using a small piece of code I had written to display RAGs. This Pull Request implements the same functionality for scikit-image. This would be immensely useful for anyone who is experimenting with RAGs. For a more detailed explanation, check out Drawing RAGs.

Hierarchical Merging of Region Adjacency Graphs

This Pull Request implements a simple form of Hierarchical Merging. For more details, see Hierarchical Merging of Region Adjacency Graphs. This post also contains videos at the end, do check them out. This can also be easily extended to a boundary map based approach, which I plan to do post-GSoC


Final Comments

The most important thing for me is that I am a better Python programmer as compared to what I was before GSoC began this year. I was able to see how some graph based segmentation methods work at their most basic level. Although GSoC has come to an end, I don’t think my contributions to scikit-image have. Contributing to it has been a tremendous learning experience and plan to continue doing so. I have been been fascinated with Image Processing since me and my friends wrote an unholy piece of Matlab code about 3 years ago to achieve this. And as far as I can see its a fascination I will have for the rest of my life.

Finally, I would like to thank my mentors Juan, Johannes Schönberger and Guillaume Gay. I would also like to thank Stefan for reviewing my Pull Requests.



by Vighnesh Birodkar at August 17, 2014 07:15 PM


Region Adjacency Graphs model regions in an image as nodes of a graph with edges between adjacent regions. Superpixel methods tend to over segment images, ie, divide into more regions than necessary. Performing a Normalized Cut and Thresholding Edge Weights are two ways of extracting a better segmentation out of this. What if we could combine two small regions into a bigger one ? If we keep combining small similar regions into bigger ones, we will end up with bigger regions which are significantly different from its adjacent ones. Hierarchical Merging explores this possibility. The current working code can be found at this Pull Request

Code Example

The merge_hierarchical function performs hierarchical merging on a RAG. It picks up the smallest weighing edge and combines the regions connected by it. The new region is adjacent to all previous neighbors of the two combined regions. The weights are updated accordingly. It continues doing so till the minimum edge weight in the graph in more than the supplied thresh value. The function takes a RAG as input where smaller edge weight imply similar regions. Therefore, we use the rag_mean_color function with the default "distance" mode for RAG construction. Here is a minimal code snippet.

from skimage import graph, data, io, segmentation, color

img =
labels = segmentation.slic(img, compactness=30, n_segments=400)
g = graph.rag_mean_color(img, labels)
labels2 = graph.merge_hierarchical(labels, g, 40)
g2 = graph.rag_mean_color(img, labels2)

out = color.label2rgb(labels2, img, kind='avg')
out = segmentation.mark_boundaries(out, labels2, (0, 0, 0))

I arrived at the threshold 40 after some trial and error. Here is the output.


The drawback here is that the thresh argument can vary significantly depending on image to image.

Comparison with Normalized Cut

Loosely speaking the normalized cut follows a top-down approach where as the hierarchical merging follow a bottom-up approach. Normalized Cut starts with the graph as a whole and breaks it down into smaller parts. On the other hand hierarchical merging, starts with individual regions and merges them into bigger ones till a criteria is reached. The Normalized Cut however, is much more robust and requires little tuning of its parameters as images change. Hierarchical merging is a lot faster, even though most of its computation logic is written in Python.

Effect of change in threshold

Setting a very low threshold, will not merge any regions and will give us back the original image. A very large threshold on the other hand would merge all regions and give return the image as just one big blob. The effect is illustrated below.











Hierarchical Merging in Action

With this modification the following code can output the effect of all the intermediate segmentation during each iteration.

from skimage import graph, data, io, segmentation, color
import time
from matplotlib import pyplot as plt

img =
labels = segmentation.slic(img, compactness=30, n_segments=400)
g = graph.rag_mean_color(img, labels)
labels2 = graph.merge_hierarchical(labels, g, 60)

c = 0

out = color.label2rgb(graph.graph_merge.seg_list[-10], img, kind='avg')
for label in graph.graph_merge.seg_list:
    out = color.label2rgb(label, img, kind='avg')
    out = segmentation.mark_boundaries(out, label, (0, 0, 0))
    io.imsave('/home/vighnesh/Desktop/agg/' + str(c) + '.png', out)
    c += 1

I then used avconv -f image2 -r 3 -i %d.png -r 20 car.mp4 to output a video. Below are a few examples.

In each of these videos, at every frame, a boundary dissapears. This means that the two regions separated by that boundary are merged. The frame rate is 5 FPS, so more than one region might be merged at a time.

Coffee Image


Car Image


Baseball Image


by Vighnesh Birodkar at August 17, 2014 05:52 PM

Issam Laradji

(GSoC 2014) Final Summary (Neural Networks)

GSoC 2014 has been an extraordinary experience. Not only did it encourage me to develop much needed open-source implementation of neural network algorithms, but also exposed me to a great, diverse community. I also learned useful practices for maintaining clean, quality code and writing accessible documentation. This prepared me to work well, and efficiently under pressure, since quality work had to be produced in a short period of time.

In terms of the requirements, the three algorithms mentioned in the proposal - (1) Multi-layer perceptron, (2) Multi-layer perceptron with pre-training, and (3) Extreme Learning Machines - are completed (see below for a comprehansive description). In terms of specific requirements, there has been a lot of changes in order to accommodate positive suggestions, especially for MLP, and ELM. While a part of MLP was completed prior to the start of GSoC, the code went through a complete renovation, which made it faster, more readable, more scalable, and more robust. In fact, most of the work involved was optimizing the speed of execution, improving readability - this includes proper naming and convenient infrastructure of the code base, and writing a comprehensive documentation. The algorithms are explained in more detail below.

This wouldn't have been possible without the profound, sincere assistance of my mentor Olivier Grisel, and the scikit-learn team - including, Arnaud Joly, Gael Varoquaux, Kyle Kastner, Jnothman, Lars Buitinck, and many more. I sincerely thank the PSA team for emphasizing on summarizing my work as blog posts here and I do greatly appreciate Google's significant support it offered, which was instrumental in the successful completion of this project.

(1) Multi-layer perceptron (MLP) (link: #3204)
Figure 1: One hidden layer MLP

This  implements the classic backpropagation algorithm supporting one or more hidden layers (see Figure 1). Depending on the problem type (classification or regression), backpropagation optimizes an objective function whose main purpose is to have the predicted output as close to the target as possible, though subject to some constraints like regularization.

The MLP supports L2 regularization which controls the degree to which it is overfitting. Increased regularization constrains the trained weights to be of smaller value which makes the decision function more linear.

We also added a renowned activation function known as rectified linear unit (ReLU) function, which not only is faster, but allows training more than one hidden layer efficiently, at least more than hyperbolic tan and logistic [1].

Unit testing was made thorough as 99.17% of the statements were covered. A gradient unit test helped make sure the activation functions - hyperbolic tan, logistic, and ReLU - work as expected.

After the mid-term, much of the code was renovated. Many methods were combined to simplify the code and improve readability. Performance was improved by removing redundant calls and  taking advantage of pre-allocation of matrices - including, values of activation layers, gradients, and weights. Many private variables were removed, making pickling less prone to error and less dense.

MLP might benefit from a scheme known as pre-training which is explained in section 2.

(2) Multi-layer perceptron with pre-training (link: #3281 )

                                           Figure 2: Pre-training scheme using restricted boltzmann machines.

Prior to running backpropagation, an unsupervised learner can provide the MLP with initial weights that might be better than randomized weights. The parameters optimized by the unsupervised learner - after training on the dataset - can be assigned to the MLP weight parameters as starting points.

The motivation is that these initial weights are meant to allow backpropagation to converge in a better local optima than otherwise [2].

Figure 2 illustrates the scheme of using pre-training with multi-layer perceptron. For each set of weights between two layers, a restricted boltzmann machine (RBMs) trains on the input data of the previous layer and the final parameters are assigned to these set of weights in the large multi-layer perceptron.

An example was set to compare the performance of multi-layer perceptron (MLP) with and without pre-training using RBMs [3]. MLP without pre-training had its parameters initialized using scaled, random distribution. For pre-training, an RBM trains on the digits dataset and the resultant parameters are given to MLP as initial coefficient and intercept parameters. Below are the testing scores against the digits dataset [4],

  Testing accuracy of mlp without pretraining: 0.967 
  Testing accuracy of mlp with pretraining: 0.978

However, it is not always the case that pretraining improves performance. In some occasions, especially when dealing with large training sets, it could even decrease the score.

(3) Extreme Learning Machines (link: #3306)

                                                                          Figure 3: Neural network for ELM

The main focus after the mid-term evaluations was on developing extreme learning machines (ELMs). First we implemented the standard algorithm of ELMs that optimize an objective function using least-square solutions.

An ELM has a similar network as a one hidden layer MLP, except the output layer has no bias (see Figure 3). ELM, basically, trains a network through these 3 steps,

  • it applies a random projection to the input space, onto a possibly higher dimensional space;
  • the result passes through an element-wise non-linear activation function, typically a sigmoid such as the tanh, and logistic functions; and
  • last, it trains a linear one vs. rest classifier or a multi-output ridge regression model.

The algorithm trains a single-hidden layer feedforward network by computing the hidden layer values using randomized parameters, then solving  for the output weights using least-square solutions. For prediction, after computing the forward pass, the continuous output values pass through a gate function converting them to integers that represent classes. The function representing  ELM is given as, $y=\beta\cdot f(W^T \cdot X + b ) $

where matrices $X$ and $y$ represent the input samples and target
values, respectively; matrices $W$ and $b$ are randomly generated based on a uniform distribution; matrix $\beta$ contains unknown variables; and $f(\cdot)$ is the non-linear, component-wise activation function.
ELM solves for $\beta$ using the ridge regression implementation, given as, $(H^T H + (1 / C) * I)^{-1} H^T y$ where $H = f(W^TX+b)$, $C$ is a regularization term which controls the linearity of the decision function, and $I$ is the identity matrix.
We demonstrated the effects of tuning two main hyperparameters, 
  • weight_scale, which controls the variance of the random projection weights, the higher the value the more the less the regularization and therefore more overfitting. 
  • C, which controls the regularization strength of the output linear model, which regularizes the hidden-to-output weights in the same way as weight_scale regularizes the input-to-hidden weights.

and another main hyperparameter,
  • n_hidden, which controls the number of hidden layer nodes.

Below are 3 plots that illustrate the effect of these parameters on score,

Figure 4: Effect of weight_scale on decision function.

                                                                        Figure 5: Effect of C on decision function.

                              Figure 6: Effect of weight_scale and C on the scores against the Digits dataset.

Figures 4 and 5 show how increasing the regularization terms C would lead to a more non-linear decision function.

Figure 6 shows a colour map representing scores returned by grid-search illustrating the fact that a balance between C and weight_scale is important to have a higher score. C=1.0 and weight_scale=10  achieved the highest score as indicated by the darkest shade of the relevant blue square.

We re-used ridge regression [5] implementation for solving the least-square solution as it optimizes training speed for different data types. Next, we implemented the sequential algorithm of the ELM. It allows ELM to train on the dataset in batches, while, interestingly, the end result is exactly the same as though the whole dataset is put into memory. However, decreasing the size of the batches, can potentially increase training time. Below is a benchmark showing the training time in seconds of training ELMs with different batch sizes on a 10000 image MNIST dataset.

  batch_size        50 hidden neurons            500 hidden neurons
   None                0.32s                                 2.21s  
   10                     0.71s                                13.62s  
   100                   0.33s                                3.30s  
   1000                 0.32s                                2.44s

batch_size=None means that the whole dataset is put into memory. As shown in the benchmark, Training on larger batch sizes improves speed but could cause memory error if the batch size is too large. Nevertheless, using batches the algorithm supports on-line learning and therefore it can update its solutions as the data arrives in chunks.

Also, support for kernel was added to ELM, which was later removed for reasons that will soon appear. ELM originally transforms the input data into hidden activations depending on the number of hidden neurons. Similarly, kernels, like in SVM, transform the input data into another dimensional space where hidden neurons play no role. Empirically, kernels were found to be slow, yet lead to no accuracy improvement over the standard ELM. For that reason and to avoid feature creep, we removed kernel support.

However, we added support of the ReLU activation function, hyperbolic tan, and logistic. They were put in an external file so that they can be shared between different modules in scikit-learn .

Further, we updated another file [6] that is responsible for assigning class weights, useful for several algorithms that support weighted classification. We added method that computes the weights  corresponding to each sample as a vector to allow ridge-regression to run weighted least-square solutions in the ELM.

We also improved testing coverage. ELM has a coverage of 100% of the code, making it reliable. Testings were made to make sure, that weighted ELM does improve results in instances of imbalanced datasets; that higher number of hidden neurons does improve the training score; and that whether the algorithm runs using batch-based or not should produce the same end result.

To conclude, this experience was special and useful in that it brought me closer to the scikit-learn community and other open-source communities. It also encouraged me to satisfy my long ambition of implementing useful algorithms and writing accessible documentation for any user who wish to delve into the world of neural networks.

I sincerely look forward to continue working with the scikit-learn team for the years to come and I sincerely look forward to participating in GSoC 2015, either as a mentor or as a student.


[1] Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng. "Rectifier nonlinearities improve neural network acoustic models." ICML Workshop on Deep Learning for Audio, Speech, and Language Processing. 2013.

[2] Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.




[6] References

1) Multi-layer perceptron:

2) Multi-layer perceptron with pre-training:

3) Extreme learning machines:

by (Issam Laradji) at August 17, 2014 10:13 AM

Maheshakya Wijewardena

Performance comparison among LSH Forest, ANNOY and FLANN

Finally, it is time to compare performance of Locality Sensitive Hashing Forest(approximate nearest neighbor search implementation in scikit-learn), Spotify Annoy and FLANN.


Synthetic datasets of different sizes (varying n_samples and n_features) are used for this evalutation. For each data set, following measures were calculated.

  1. Index building time of each ANN (Approximate Nearest Neighbor) implementation.
  2. Accuracy of nearest neighbors queries with their query times.

Python code used for this evaluation can be found in this Gist. Parameters of LSHForest (n_estimators=10 and n_candidates=50) are kept fixed during this experiment. Accuracies can be raised by tuning these parameters.


For each dataset, two graphs have been plotted according to the measures expressed in the above section. n_samples=1000, n_features=100 n_samples=1000, n_features=500 n_samples=6000, n_features=3000 n_samples=10000, n_features=100 n_samples=10000, n_features=500 n_samples=10000, n_features=1000 n_samples=10000, n_features=6000 n_samples=50000, n_features=5000 n_samples=100000, n_features=1000 It is evident that index building times of LSH Forest and FLANN are almost incomparable to that of Annoy for almost all the datasets. Moreover, for larger datasets, LSH Forest outperforms Annoy at large margins with respect to accuracy and query speed. Observations from these graphs prove that LSH Forest is competitive with FLANN for large datasets.

August 17, 2014 12:00 AM

A demonstration of the usage of Locality Sensitive Hashing Forest in approximate nearest neighbor search

This is a demonstration to explain how to use the approximate nearest neighbor search implementation using locality sensitive hashing in scikit-learn and to illustrate the behavior of the nearest neighbor queries as the parameters vary. This implementation has an API which is essentially as same as the NearestNeighbors module as approximate nearest neighbor search is used to speed up the queries at the cost of accuracy when the database is very large.

Before beginning the demonstration, background has to be set. First, the required modules are loaded and a synthetic dataset is created for testing.

import time
import numpy as np
from sklearn.datasets.samples_generator import make_blobs
from sklearn.neighbors import LSHForest
from sklearn.neighbors import NearestNeighbors

# Initialize size of the database, iterations and required neighbors.
n_samples = 10000
n_features = 100
n_iter = 30
n_neighbors = 100
rng = np.random.RandomState(42)

# Generate sample data
X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
                  centers=10, cluster_std=5, random_state=0)

There are two main parameters which affect queries in the LSH Forest implementation.

  1. n_estimators : Number of trees in the LSH Forest.
  2. n_candidates : Number of candidates chosen from each tree for distance calculation.

In the first experiment, average accuracies are measured as the value of n_estimators vary. n_candidates is kept fixed. slearn.neighbors.NearestNeighbors used to obtain the true neighbors so that the returned approximate neighbors can be compared against.

# Set `n_estimators` values
n_estimators_values = np.linspace(1, 30, 5).astype(
accuracies_trees = np.zeros(n_estimators_values.shape[0], dtype=float)

# Calculate average accuracy for each value of `n_estimators`
for i, n_estimators in enumerate(n_estimators_values):
    lshf = LSHForest(n_candidates=500, n_estimators=n_estimators,
    nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')
    for j in range(n_iter):
        query = X[rng.randint(0, n_samples)]
        neighbors_approx = lshf.kneighbors(query, return_distance=False)
        neighbors_exact = nbrs.kneighbors(query, return_distance=False)

        intersection = np.intersect1d(neighbors_approx,
        ratio = intersection/float(n_neighbors)
        accuracies_trees[i] += ratio

    accuracies_trees[i] = accuracies_trees[i]/float(n_iter)

Similarly, average accuracy vs n_candidates is also measured.

# Set `n_candidate` values
n_candidates_values = np.linspace(10, 500, 5).astype(
accuracies_c = np.zeros(n_candidates_values.shape[0], dtype=float)

# Calculate average accuracy for each value of `n_candidates`
for i, n_candidates in enumerate(n_candidates_values):
    lshf = LSHForest(n_candidates=n_candidates, n_neighbors=n_neighbors)
    nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')
    # Fit the Nearest neighbor models
    for j in range(n_iter):
        query = X[rng.randint(0, n_samples)]
        # Get neighbors
        neighbors_approx = lshf.kneighbors(query, return_distance=False)
        neighbors_exact = nbrs.kneighbors(query, return_distance=False)

        intersection = np.intersect1d(neighbors_approx,
        ratio = intersection/float(n_neighbors)
        accuracies_c[i] += ratio

    accuracies_c[i] = accuracies_c[i]/float(n_iter)

You can get a clear view of the behavior of queries from the following plots. accuracies_c_l

The next experiment demonstrates the behavior of queries for different database sizes (n_samples).

# Initialize the range of `n_samples`
n_samples_values = [10, 100, 1000, 10000, 100000]
average_times = []
# Calculate the average query time
for n_samples in n_samples_values:
    X, labels_true = make_blobs(n_samples=n_samples, n_features=10,
                                centers=10, cluster_std=5,
    # Initialize LSHForest for queries of a single neighbor
    lshf = LSHForest(n_candidates=1000, n_neighbors=1)

    average_time = 0

    for i in range(n_iter):
        query = X[rng.randint(0, n_samples)]
        t0 = time.time()
        approx_neighbors = lshf.kneighbors(query,
        T = time.time() - t0
        average_time = average_time + T

    average_time = average_time/float(n_iter)

n_samples space is defined as [10, 100, 1000, 10000, 100000]. Query time for a single neighbor is measure for these different values of n_samples. query_time_vs_n_samples

August 17, 2014 12:00 AM


GSoC Open Source Brain: Arbitrary Spike-trains in PyNN

Arbitrary Spikes in PyNN

Arbitrary Spike-trains in PyNN

In this example we are going to create a population of cells with arbitrary spike trains. We will load the spike train from a file where they are stored as a list of arrays with the times at which they occurred. In order to so we are going to use the SpikeSourceArray class model of PyNN

First we start importing PyNN in with nest and all the other required libraries. Furthermore we start the simulators and given some general parameters. We assume that we already have produced spikes for the cells with 0.50 contrast:

import numpy as np
import matplotlib.pyplot as plt
import cPickle
import pyNN.nest as simulator

contrast = 0.50
Nside_lgn = 30
Ncell_lgn = Nside_lgn * Nside_lgn
N_lgn_layers = 4
t = 1000 # ms

simulator.setup(timestep=0.1, min_delay=0.1, max_delay=5.0)

So we are going to suppose that we have our data stored in './data'. The spike-trains are lists as long as the cell population that contain for each element an array with the times at which the spikes occurred for that particular neuron. In order to load them we will use the following code

directory = './data/'
format = '.cpickle'

spikes_on = []
spikes_off = []

for layer in xrange(N_lgn_layers):

# Layer 1
layer = '_layer' + str(layer)

polarity = '_on'
contrast_mark = str(contrast)
mark = '_spike_train'
spikes_filename = directory + contrast_mark + mark + polarity + layer + format
f2 = open(spikes_filename, 'rb')

polarity = '_off'
contrast_mark = str(contrast)
mark = '_spike_train'
spikes_filename = directory + contrast_mark + mark + polarity + layer + format
f2 = open(spikes_filename, 'rb')

Now this is the crucial part. If we want to utilize the SpikeSourceArray model for a cell in PyNN we can define a function that pass the spike-train for each cell in the population. In order to so we use the following code:

def spike_times(simulator, layer, spikes_file):
return [simulator.Sequence(x) for x in spikes_file[layer]]

Note that we have to change every spike-train array to a sequence before using it as a spike-train. After defining this function we can create the LGN models:

# Cells models for the LGN spikes (SpikeSourceArray)
lgn_spikes_on_models = []
lgn_spikes_off_models = []

for layer in xrange(N_lgn_layers):
model = simulator.SpikeSourceArray(spike_times=spike_times(simulator, layer, spikes_on))
model = simulator.SpikeSourceArray(spike_times=spike_times(simulator, layer, spikes_off))

Now that we have the corresponding model for the cells we can create the populations in the usual way:

# LGN Popluations

lgn_on_populations = []
lgn_off_populations = []

for layer in xrange(N_lgn_layers):
population = simulator.Population(Ncell_lgn, lgn_spikes_on_models[layer], label='LGN_on_layer_' + str(layer))
population = simulator.Population(Ncell_lgn, lgn_spikes_off_models[layer], label='LGN_off_layer_' + str(layer))

In order to analyze the spike-trains patterns for each population we need to declare a recorder for each population:

layer = 0 # We declare here the layer of our interest

population_on = lgn_on_populations[layer]
population_off = lgn_off_populations[layer]


Note here that we can chose the layer of our interest by modifying the value of the layer variable. Finally we run the model with the usual instructions and extract the spikes:

# Run model
############################# # Run the simulations for t ms

# Extract the data
data_on = population_on.get_data() # Creates a Neo Block
data_off = population_off.get_data()

segment_on = data_on.segments[0] # Takes the first segment
segment_off = data_off.segments[0]

In order to visualize the spikes we use the following function:

# Plot spike trains
def plot_spiketrains(segment):
Plots the spikes of all the cells in the given segments
for spiketrain in segment.spiketrains:
y = np.ones_like(spiketrain) * spiketrain.annotations['source_id']
plt.plot(spiketrain, y, '*b')
plt.ylabel('Neuron number')

Here the spiketrain variable contains the spike-train for each cell, that is, an array with the times at which the action potentials happened for each cell. In order to tell them apart we assigned them the value of the cell id. Finally we can plot the spikes of the on and off cells with the following code:

plt.subplot(2, 1, 1)
plt.title('On cells ')

plt.subplot(2, 1, 2)
plt.title('Off cells ')

We now show the plot produced by the code above. Note that the on and off cells are off-phase by 180.

by H ( at August 17, 2014 12:17 AM

GSoC Open Source Brain: Firing Rate Induced by a Sinus Grating

Firing Rate Induced by a Sinus Grating

Firing Rate induced by a Sinus Grating

Now that we know how to do convolutions with our center-surround kernel we can chose any other kind of stimulus to carry this out. In the neuoscience of vision it is very common to use a sinus grating in a wide array of experimental setings so we are going to use it now. In short, in this post we are going to see the see what signal does a center-surround kernel produces when is convolved with a sinus grating.

Center-Surround Kernel

In order to do the convolution we are going to define the kernel in the usual way using a function that we have utilized from our work before:

# First we define the size and resolution of the space in which the convolution is going to happen
dx = 0.05
dy = 0.05
lx = 6.0 # In degrees
ly = 6.0 # In degrees

# Now we define the temporal parameters of the kernel
dt_kernel = 5.0 # ms
kernel_duration = 150 # ms
kernel_size = int(kernel_duration / dt_kernel)

# Now the center surround parameters
factor = 1 # Controls the overall size of the center-surround pattern
sigma_center = 0.25 * factor # Corresponds to 15'
sigma_surround = 1 * factor # Corresponds to 1 degree

# Finally we create the kernel
kernel_on = create_kernel(dx, lx, dy, ly, sigma_surround, sigma_center, dt_kernel, kernel_size)

Sinus Grating

Now we are going to construct our sinus grating. But first, we need to think on how long our stimulus is going to last which is a function of how long the we want to simulate the convolution and of the resolutions of the stimulus and the simulation:

## Now we define the temporal l parameters of the sinus grating
dt_stimuli = 5.0 # ms

# We also need to add how long do we want to convolve
dt = 1.0 # Simulation resolution
T_simulation = 1 * 10 ** 3.0 # ms
T_simulation += int(kernel_size * dt_kernel) # Add the size of the kernel
Nt_simulation = int(T_simulation / dt) # Number of simulation points
N_stimuli = int(T_simulation / dt_stimuli) # Number of stimuli points

Finally we now present the parameters that determine the sinus grating. First the spatial frequency (K), followed by the spatial phase (Phi) and orientation (Theta). Furthermore we have also a parameter for the amplitude and the temporal frequency:

# And now the spatial parameters of the sinus grating
K = 0.8 # Cycles per degree
Phi = 0 # Spatial phase
Theta = 0 # Orientation
A = 1 # Amplitude
# Temporal frequency of sine grating
w = 3 # Hz

Now with all the spatial parameters in our possession we can call the function that produces the sine grating, we define it as the following function that we present below:

stimuli = sine_grating(dx, lx, dy, ly, A, K, Phi, Theta, dt_stimuli, N_stimuli, w)

def sine_grating(dx, Lx, dy, Ly, A, K, Phi, Theta, dt_stimuli, N_stimuli, w):
Returns a sine grating stimuli
Nx = int(Lx / dx)
Ny = int(Ly / dy)

# Transform to appropriate units
K = K * 2 * np.pi # Transforms K to cycles per degree
w = w / 1000.0 # Transforms w to kHz

x = np.arange(-Lx/2, Lx/2, dx)
y = np.arange(-Ly/2, Ly/2, dy)
X, Y = np.meshgrid(x, y)
Z = A * np.cos(K * X *cos(Theta) + K * Y * sin(Theta) - Phi)
t = np.arange(0, N_stimuli * dt_stimuli, dt_stimuli)
f_t = np.cos(w * 2 * np.pi * t )

stimuli = np.zeros((N_stimuli, Nx, Ny))

for k, time_component in enumerate(f_t):
stimuli[k, ...] = Z * time_component

return stimuli


Now that we have the stimulus and the kernel we can do the convolution, in order to do that we use again our functions and indexes that we use in the last post:

## Now we can do the convolution

# First we define the necessary indexes to the convolution
signal_indexes, delay_indexes, stimuli_indexes = create_standar_indexes(dt, dt_kernel, dt_stimuli, kernel_size, Nt_simulation)
working_indexes, kernel_times = create_extra_indexes(kernel_size, Nt_simulation)

# Now we calculate the signal
signal = np.zeros(Nt_simulation)

for index in signal_indexes:
signal[index] = convolution(index, kernel_times, delay_indexes, stimuli_indexes, kernel_on, stimuli)

We can visualize signal with the following code:

#Plot the signal
t = np.arange(kernel_size*dt_kernel, T_simulation, dt)
plt.plot(t, signal[signal_indexes])

We can see that the signal is also a sinus with a frequency that is consistent with the one from the sinus grating.

by H ( at August 17, 2014 12:17 AM

August 16, 2014

Vighnesh Birodkar


Variation with number of regions

In this post I explained how the Normalized Cut works and demonstrated some examples of it. This post aims to take a closer look at the code. I ran the following code to monitor the time taken by NCut with respect to initial number of regions.

from __future__ import print_function
from skimage import graph, data, io, segmentation, color
import time
from matplotlib import pyplot as plt

image =
segment_list = range(50, 801, 50)

for nseg in segment_list:
    labels = segmentation.slic(image, compactness=30, n_segments=nseg)
    rag = graph.rag_mean_color(image, labels, mode='similarity')
    T = time.time()
    new_labels = graph.ncut(labels, rag)
    time_taken = time.time() - T
    out = color.label2rgb(new_labels, image, kind='avg')
    io.imsave('/home/vighnesh/Desktop/ncut/' + str(nseg) + '.png', out)
    print(nseg, time_taken)


Here is the output sequentially.
thresh=50 thresh=100 thresh=150 thresh=200 thresh=250 thresh=300 thresh=350 thresh=400 thresh=450 thresh=500 thresh=550 thresh=600 thresh=650 thresh=700 thresh=750 thresh=700

By a little guess-work, I figured that the curve approximately varies as x^2.2. For 800 nodes, the time taken is around 35 seconds.

Line Profile

I used line profiler to examine the time taken by each line of code in threshold_normalized. Here are the results.

   218                                           @profile
   219                                           def _ncut_relabel(rag, thresh, num_cuts):
   220                                               """Perform Normalized Graph cut on the Region Adjacency Graph.
   222                                               Recursively partition the graph into 2, until further subdivision
   223                                               yields a cut greather than `thresh` or such a cut cannot be computed.
   224                                               For such a subgraph, indices to labels of all its nodes map to a single
   225                                               unique value.
   227                                               Parameters
   228                                               ----------
   229                                               labels : ndarray
   230                                                   The array of labels.
   231                                               rag : RAG
   232                                                   The region adjacency graph.
   233                                               thresh : float
   234                                                   The threshold. A subgraph won't be further subdivided if the
   235                                                   value of the N-cut exceeds `thresh`.
   236                                               num_cuts : int
   237                                                   The number or N-cuts to perform before determining the optimal one.
   238                                               map_array : array
   239                                                   The array which maps old labels to new ones. This is modified inside
   240                                                   the function.
   241                                               """
   242        59       218937   3710.8      3.2      d, w = _ncut.DW_matrices(rag)
   243        59          151      2.6      0.0      m = w.shape[0]
   245        59           61      1.0      0.0      if m > 2:
   246        44         3905     88.8      0.1          d2 = d.copy()
   247                                                   # Since d is diagonal, we can directly operate on its data
   248                                                   # the inverse of the square root
   249        44          471     10.7      0.0 = np.reciprocal(np.sqrt(,,
   251                                                   # Refer Shi & Malik 2001, Equation 7, Page 891
   252        44        26997    613.6      0.4          vals, vectors = linalg.eigsh(d2 * (d - w) * d2, which='SM',
   253        44      6577542 149489.6     94.9                                       k=min(100, m - 2))
   255                                                   # Pick second smallest eigenvector.
   256                                                   # Refer Shi & Malik 2001, Section 3.2.3, Page 893
   257        44          618     14.0      0.0          vals, vectors = np.real(vals), np.real(vectors)
   258        44          833     18.9      0.0          index2 = _ncut_cy.argmin2(vals)
   259        44         2408     54.7      0.0          ev = _ncut.normalize(vectors[:, index2])
   261        44        22737    516.8      0.3          cut_mask, mcut = get_min_ncut(ev, d, w, num_cuts)
   262        44           78      1.8      0.0          if (mcut < thresh):
   263                                                       # Sub divide and perform N-cut again
   264                                                       # Refer Shi & Malik 2001, Section 3.2.5, Page 893
   265        29        78228   2697.5      1.1              sub1, sub2 = partition_by_cut(cut_mask, rag)
   267        29          175      6.0      0.0              _ncut_relabel(sub1, thresh, num_cuts)
   268        29           92      3.2      0.0              _ncut_relabel(sub2, thresh, num_cuts)
   269        29           32      1.1      0.0              return
   271                                               # The N-cut wasn't small enough, or could not be computed.
   272                                               # The remaining graph is a region.
   273                                               # Assign `ncut label` by picking any label from the existing nodes, since
   274                                               # `labels` are unique, `new_label` is also unique.
   275        30          685     22.8      0.0      _label_all(rag, 'ncut label')

As you can see above 95% of the time is taken by the call to eigsh.

To take a closer look at it, I plotted time while ensuring only one iteration. This commit here takes care of it. Also, I changed the eigsh call to look for the largest eigenvectors instead of the smallest ones, with this commit here. Here are the results.


A single eignenvalue computation is bounded by O(n^1.5) as mentioned in the original paper. The recursive NCuts are pushing the time required towards more than O(n^2).eigsh solves the eigenvalue problem for a symmetric hermitian matrix. It in turn relies on a library called ARPack. As documented here ARPack isn’t very good at finding the smallest eigenvectors. If the value supplied as the argument k is too small, we get the ArpackNoConvergence Exception. As seen from the above plot, finding the largest eigenvectors is much more efficient using the eigsh function.

Since the problem is specific to ARPack, using other libraries might lead to faster computation. slepc4py is one such BSD licensed library. The possibility of optionally importing slec4py should be certainly explored in the near future.

Also, we can optionally ask the user for a function to solve the eigenvalue problem, so that he can use a matrix library of his choice if he/she so desires.

Final Thoughts

Although the current Normalized Cut implementation takes more than quadratic time, the preceding over segmentation method does most of the heavy lifting. With something like SLIC, we can be sure of the number of nodes irrespective of the input image size. Although, a better eigenvalue finding technique for smallest eigenvectors would immensely improve its performance.

by Vighnesh Birodkar at August 16, 2014 03:12 PM

Jan Hendrik Metzen

Compare Classifier Predictions using Reliability Diagrams

This notebook generates reliability diagrams for some classifiers on an artificial data set. Reliability diagrams allow checking if the predicted probabilities of a binary classifier are well calibrated. For perfectly calibrated predictions, the curve in a reliability diagram should be as close as possible to the diagonal/identity. This would correspond to a situation in which among \(N\) instances for which a classifier predicts probability \(p\) for class \(A\), the ratio of instances which actually belong to class \(A\) is approx. \(p\) (for any \(p\) and sufficiently large \(N\)).

This notebook reproduces some of the results from the paper Predicting Good Probabilities with Supervised Learning.

In [1]:
%load_ext watermark
%watermark -a "Jan Hendrik Metzen" -d -v -m -p numpy,scikit-learn
Jan Hendrik Metzen 16/08/2014 

CPython 2.7.7
IPython 2.1.0

numpy 1.8.1
scikit-learn 0.14.1

compiler   : GCC 4.1.2 20080704 (Red Hat 4.1.2-54)
system     : Linux
release    : 3.13.0-29-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit

In [2]:
import numpy as np

from sklearn import datasets

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
In [3]:
import matplotlib.pyplot as plt
%matplotlib inline

Function for reliability curve computation

In [4]:
def reliability_curve(y_true, y_score, bins=10, normalize=False):
    """Compute reliability curve

    Reliability curves allow checking if the predicted probabilities of a
    binary classifier are well calibrated. This function returns two arrays
    which encode a mapping from predicted probability to empirical probability.
    For this, the predicted probabilities are partitioned into equally sized
    bins and the mean predicted probability and the mean empirical probabilties
    in the bins are computed. For perfectly calibrated predictions, both
    quantities whould be approximately equal (for sufficiently many test

    Note: this implementation is restricted to binary classification.


    y_true : array, shape = [n_samples]
        True binary labels (0 or 1).

    y_score : array, shape = [n_samples]
        Target scores, can either be probability estimates of the positive
        class or confidence values. If normalize is False, y_score must be in
        the interval [0, 1]

    bins : int, optional, default=10
        The number of bins into which the y_scores are partitioned.
        Note: n_samples should be considerably larger than bins such that
              there is sufficient data in each bin to get a reliable estimate
              of the reliability

    normalize : bool, optional, default=False
        Whether y_score needs to be normalized into the bin [0, 1]. If True,
        the smallest value in y_score is mapped onto 0 and the largest one
        onto 1.

    y_score_bin_mean : array, shape = [bins]
        The mean predicted y_score in the respective bins.

    empirical_prob_pos : array, shape = [bins]
        The empirical probability (frequency) of the positive class (+1) in the
        respective bins.

    .. [1] `Predicting Good Probabilities with Supervised Learning

    if normalize:  # Normalize scores into bin [0, 1]
        y_score = (y_score - y_score.min()) / (y_score.max() - y_score.min())

    bin_width = 1.0 / bins
    bin_centers = np.linspace(0, 1.0 - bin_width, bins) + bin_width / 2

    y_score_bin_mean = np.empty(bins)
    empirical_prob_pos = np.empty(bins)
    for i, threshold in enumerate(bin_centers):
        # determine all samples where y_score falls into the i-th bin
        bin_idx = np.logical_and(threshold - bin_width / 2 < y_score,
                                 y_score <= threshold + bin_width / 2)
        # Store mean y_score and mean empirical probability of positive class
        y_score_bin_mean[i] = y_score[bin_idx].mean()
        empirical_prob_pos[i] = y_true[bin_idx].mean()
    return y_score_bin_mean, empirical_prob_pos

Training data

Generate a toy dataset on which different classifiers are compared. Among the 20 features, only 2 are actually informative. 2 further features are redundant, i.e., linear combinations of the 2 informative features. The remaining features are just noise.

In [5]:
X, y = datasets.make_classification(n_samples=100000, n_features=20,
                                    n_informative=2, n_redundant=2)
In [6]:
bins = 25
train_samples = 100  # Samples used for training the models
calibration_samples =  400  # Additional samples for claibration using Isotonic Regression

X_train = X[:train_samples]
X_calibration = X[train_samples:train_samples+calibration_samples]
X_test = X[train_samples+calibration_samples:]
y_train = y[:train_samples]
y_calibration = y[train_samples:train_samples+calibration_samples]
y_test = y[train_samples+calibration_samples:]

Compute reliability curves for different classifiers

Compute reliability curves for different classifiers: * Logistic Regression * Naive Bayes * Random Forest * Support-Vector Classification (scores) * Support-Vector Classification + Isotonoc Calibration

In [7]:
classifiers = {"Logistic regression": LogisticRegression(),
               "Naive Bayes": GaussianNB(),
               "Random Forest": RandomForestClassifier(n_estimators=100),
               "SVC": SVC(kernel='linear', C=1.0),
               "SVC + IR": SVC(kernel='linear', C=1.0)}
In []:
reliability_scores = {}
y_score = {}
for method, clf in classifiers.items():, y_train)
    if method == "SVC + IR":  # Calibrate SVC scores using isotonic regression.
        n_plus = (y_calibration == 1.0).sum()  # number of positive examples
        n_minus = (y_calibration == 0.0).sum()  # number of negative examples
        # Determine target values for isotonic calibration. See 
        # "Predicting Good Probabilities with Supervised Learning"
        # for details.
        y_target = np.where(y_calibration == 0.0, 
                            1.0 / (n_minus + 2.0),
                            (n_plus + 1.0) / (n_plus + 2.0))
        # Perform actual calibration using isotonic calibration
        svm_score = clf.decision_function(X_calibration)[:, 0]
        ir = IsotonicRegression(out_of_bounds='clip').fit(svm_score, y_target)
        y_score[method] = ir.transform(clf.decision_function(X_test)[:, 0])
        reliability_scores[method] = \
            reliability_curve(y_test, y_score[method], bins=bins, normalize=False)
    elif method == "SVC":
        # Use SVC scores (predict_proba returns already calibrated probabilities)
        y_score[method] = clf.decision_function(X_test)[:, 0]
        reliability_scores[method] = \
            reliability_curve(y_test, y_score[method], bins=bins, normalize=True)
        y_score[method] = clf.predict_proba(X_test)[:, 1]
        reliability_scores[method] = \
            reliability_curve(y_test, y_score[method], bins=bins, normalize=False)

Plot reliability diagram

In [9]:
plt.figure(0, figsize=(8, 8))
plt.subplot2grid((3, 1), (0, 0), rowspan=2)
plt.plot([0.0, 1.0], [0.0, 1.0], 'k', label="Perfect")
for method, (y_score_bin_mean, empirical_prob_pos) in reliability_scores.items():
    scores_not_nan = np.logical_not(np.isnan(empirical_prob_pos))
             empirical_prob_pos[scores_not_nan], label=method)
plt.ylabel("Empirical probability")

plt.subplot2grid((3, 1), (2, 0))
for method, y_score_ in y_score.items():
    y_score_ = (y_score_ - y_score_.min()) / (y_score_.max() - y_score_.min())
    plt.hist(y_score_, range=(0, 1), bins=bins, label=method,
             histtype="step", lw=2)
plt.xlabel("Predicted Probability")
plt.legend(loc='upper center', ncol=2)
<matplotlib.legend.Legend at 0x7f8e75ee1890>

The following observations can be made:

  • Logistic regression returns well-calibrated probabilities close to the "perfect" line
  • Naive Bayes tends to push probabilties to 0 or 1 (note the counts in the histograms). This is mainly because it makes the assumption that features are conditionally independent given the class, which is not the case in this dataset which contains 2 redundant features.
  • Random Forest shows the opposite behavior: The histograms show peaks at approx. 0.1 and 0.9 probability, while probabilities close to 0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil and Caruana: "Methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values. Because predictions are restricted to the interval [0,1], errors caused by variance tend to be one-sided near zero and one. For example, if a model should predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We observe this effect most strongly with random forests because the base-level trees trained with random forests have relatively high variance due to feature subseting." As a result, the calibration curve shows a characteristic sigmoid shape, indicating that the classifier could trust its "intuition" more and return probabilties closer to 0 or 1 typically. A post-processing such as Platt-calibration, which fits a sigmoid to the probabilities on a separate calibration dataset, would typically help if the calibration curve is sigmoid.
  • The scores of a Support Vector Classification (SVC), which are linearly related to the distance of the sample from the hyperplane, show a similar but even stronger effect as the Random Forest. This is not too surprising as the scores are in no sense probabilties and must not be interpreted as such as the curve shows.
  • One alternative to Platt-calibration is Isotonic Regression. While Platt-calibration fits a sigmoid, Isotonic Regression fits an arbitrary increasing (isotonic) function. Thus, it has a weaker inducttive bias and can be applied more broadly (also in situations where the calibration curve is not sigmoid). The downside is that it typically requires more calibration data because its inductive bias is weaker. This can also be seen in the SVC + IR curve: While the sigmoid shape of the pure SVC scores is removed and the calibration curve does not show a clear bias, it is quite noisy, indicating much variance. Thus, the used calibration dataset (even thoug 4 times larger than the training data) is too small in this case.

For a further discussion and more extensive experiments, please refer to Niculescu-Mizil and Caruana.

This post was written as an IPython notebook. You can download this notebook.

by Jan Hendrik Metzen at August 16, 2014 02:08 PM

August 15, 2014

Janani Padmanabhan

The deadlines ringing do remind that 3 months have flown past in a jiffy. Time to bid sayonara! Good times seem to roll by too soon, always; sad thing! Nevertheless it has been a great experience under a brilliant mentoring. The post has come up much later the previous one. Things have taken shape since then.

1. Ellipsoidal Harmonics:
 Ellipsoidal harmonic functions, the first kind; also known as Lames functions have been implemented in Cython and had been integrated with ufuncs.
 Ellipsoidal Harmonic function of the second kind and the calculation normalization constant for Lames function were implemented as .pyx files due to the involvement of global variables. The calculation of normalization constant was implemented in 2 different ways, using integration and using recurrence. Though recurrence seemed to be the more basic and faster way of implementation, the numerical stability wasn't that good; so we adopted integration.
The process involved many new things to me, like the integration of awesome LAPACK library, the speed and awesomeness of Cython etc!
The pull request is here:
2. Hypergeometric functions:
The present implementation of hypergeometric functions is buggy. For real values of x, C implementation of the function from Cephes library has few errors while the FORTRAN implementation for complex values of x suffers with errors for much wider domain. There has been an attempt to re-implement the function in Cython and make it less ridden with errors. Though the shortage of time denied a bug-free implementation a few bugs have been removed successfully.
The implementation so far has been posted here

I would yet again stress on the fact that the flipping of bits this summer has been of great fun and well as a great skill and knowledge booster. Never was my summer so productive!

Signing off with loads of great memories and experiences

by (janani padmanabhan) at August 15, 2014 06:56 PM

Vighnesh Birodkar


A lot of Image Processing algorithms are based on intuition from visual cues. Region Adjacency Graphs would also benefit if they were somehow drawn back on the images they represent. If we are able to see the nodes, edges, and the edges weights, we can fine tune our parameters and algorithms to suit our needs. I had written a small hack in this blog post to help better visualize the results. Later, Juan suggested I port if for scikit-image. It will indeed be a very helpful tool for anyone who wants to explore RAGs in scikit-image.

Getting Started

You will need to pull for this Pull Request to be able to execute the code below. I’ll start by defining a custom show_image function to aid displaying in IPython notebooks.

from skimage import graph, data, io, segmentation, color
from matplotlib import pyplot as plt
from skimage.measure import regionprops
import numpy as np
from matplotlib import colors

def show_image(img):
    width = img.shape[1] / 50.0
    height = img.shape[0] * width/img.shape[1]
    f = plt.figure(figsize=(width, height))

We will start by loading a demo image just containing 3 bold colors to help us see how the draw_rag function works.

image = io.imread('/home/vighnesh/Desktop/images/colors.png')


We will now use the SLIC algorithm to give us an over-segmentation, on which we will build our RAG.

labels = segmentation.slic(image, compactness=30, n_segments=400)

Here’s what the over-segmentation looks like.

border_image = segmentation.mark_boundaries(image, labels, (0, 0, 0))


Drawing the RAGs

We can now form out RAG and see how it looks.

rag = graph.rag_mean_color(image, labels)
out = graph.draw_rag(labels, rag, border_image)


In the above image, nodes are shown in yellow whereas edges are shown in green. Each region is represented by its centroid. As Juan pointed out, many edges will be difficult to see because of low contrast between them and the image, as seen above. To counter this we support the desaturate option. When set to True the image is converted to grayscale before displaying. Hence all the image pixels are a shade of gray, while the edges and nodes stand out.

out = graph.draw_rag(labels, rag, border_image, desaturate=True)


Although the above image does very well to show us individual regions and their adjacency relationships, it does nothing to show us the magnitude of edges. To give us more information about the magnitude of edges, we have the colormap option. It colors edges between the first and the second color depending on their weight.

blue_red = colors.ListedColormap(['blue', 'red'])
out = graph.draw_rag(labels, rag, border_image, desaturate=True,


As you can see, the edges between similar regions are blue, whereas edges between dissimilar regions are red. draw_rag also accepts a thresh option. All edges above thresh are not considered for drawing.

out = graph.draw_rag(labels, rag, border_image, desaturate=True,
                     colormap=blue_red, thresh=10)


Another clever trick is to supply a blank image, this way, we can see the RAG unobstructed.

cyan_red = colors.ListedColormap(['cyan', 'red'])
out = graph.draw_rag(labels, rag, np.zeros_like(image), desaturate=True,


Ahhh, magnificent.

Here is a small piece of code which produces a typical desaturated color-distance RAG.

image =
labels = segmentation.slic(image, compactness=30, n_segments=400)
rag = graph.rag_mean_color(image, labels)
cmap = colors.ListedColormap(['blue', 'red'])
out = graph.draw_rag(labels, rag, image, border_color=(0,0,0), desaturate=True,


If you notice the above image, you will find some edges crossing over each other. This is because, some regions are convex. Hence their centroid lies outside their boundary and edges emanating from it can cross other edges.


I will go over some examples of RAG drawings, since most of it is similar, I won’t repeat the code here. The Ncut technique, wherever used, was with its default parameters.

Color distance RAG of Coffee on black background


Color distance RAG of Coffee after applying NCut


Notice how the centroid of the white rim of the cup is placed at its centre. It is the one adjacent to the centroid of the gray region of the upper part of the spoon, connected to it via a blue edge. Notice how this edge crosses others.

Color distance RAG of Lena


A futuristic car and its color distance RAG after NCut



Coins Image and their color distance RAG after NCut


Further Improvements

  • A point that was brought up in the PR as well is that thick lines would immensely enhance the visual
    appeal of the output. As and when they are implemented, rag_draw should be modified to support drawing
    thick edges.
  • As centroids don’t always lie in within an objects boundary, we can represent regions by a point other than their centroid, something which always lies within the boundary. This would allow for better visualization of the actual RAG from its drawing.

by Vighnesh Birodkar at August 15, 2014 06:14 PM

August 12, 2014

Matthieu Brucher

Announcement: Audio TK 0.5.0

Audio Toolkit is now almost ready for its first stable release. Its content will now move toward more advanced DSP algorithms (zero delay filters, amplifiers).

Download link: ATK 0.5.0


* Renamed slope attribute to ratio for Gain Compressor and Expander Filters
* Renamed the Chamberlin filter
* Added a StereoUniversalFixedDelayLineFilter that can make mix two channels together with different delay for each channel with Python wrappers
* Added a GainLimiterFilter (maximum ratio) with Python wrappers
* Added a MaxFilter with Python wrappers
* Added a DryWetFilter with Python wrappers

* Bug fixes

* Added a PipelineGlobalSinkFilter with Python wrapper
* Changed the MiddleSideFilter scale (no more dividing by 2 in the code)
* Additional tools additions (cos generator, offset+volume filter)
* Added a second order all pass filter with Python wrappers

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matthieu Brucher at August 12, 2014 07:46 AM

August 11, 2014

Richard Tsai

GSoC2014: Recent progress

Hi! It has been several weeks since I talked about my work last time. In the past serveral weeks I mainly worked on the optimization of cluster.hierarchy.

The SLINK Algorithm

The most important optimization is the SLINK alogrithm1 for single linkage. The naive hierarchical agglomerative clustering (HAC) algorithm has a \(O(n ^ 3)\) time complexity, while SLINK is \(O(n ^ 2)\) and very easy to implement (even easier than the naive algorithm).

The Pointer Representation

SLINK is very good in performance but requires a special linkage representation – the pointer representation, which is different from what cluster.hierarchy is using. The pointer representation can be described as follows.

  • A cluster is represented by the member with the largest index
  • \(\Pi[i] (i \in [0, n - 1])\) is the first cluster that cluster \(i\) joins, \(\Lambda[i] (i \in [0, n - 1]\) is the distance between cluster \(i\) and cluster \(\Pi[i]\) when they join

For example, the pointer representation of the following dendrogram is

  • \(\Pi[i] = \{6, 3, 3, 5, 9, 9, 8, 9, 9, 9\}\)
  • \(\Lambda[i] = \{1.394, 0.419, 0.831, 1.561, 3.123, 10.967, 1.633, 1.198, 4.990, \infty\}\)


The implementation of SLINK is very simple. There’s a pesudo-code in the orignal paper. The following Cython code need two pre-defined function condensed_index, which calculate the index of element (i, j) in a square condensed matrix, and from_pointer_representation, which convert the pointer representation to what you need.

def slink(double[:] dists, double[:, :] Z, int n):
    cdef int i, j
    cdef double[:] M = np.ndarray(n, dtype=np.double)
    cdef double[:] Lambda = np.ndarray(n, dtype=np.double)
    cdef int[:] Pi = np.ndarray(n, dtype=np.int32)

    Pi[0] = 0
    Lambda[0] = NPY_INFINITYF
    for i in range(1, n):
        Pi[i] = i
        Lambda[i] = NPY_INFINITYF

        for j in range(i):
            M[j] = dists[condensed_index(n, i, j)]

        for j in range(i):
            if Lambda[j] >= M[j]:
                M[Pi[j]] = min(M[Pi[j]], Lambda[j])
                Lambda[j] = M[j]
                Pi[j] = i
                M[Pi[j]] = min(M[Pi[j]], M[j])

        for j in range(i):
            if Lambda[j] >= Lambda[Pi[j]]:
                Pi[j] = i

    from_pointer_representation(Z, Lambda, Pi, n)


On a N = 2000 dataset, the improvement is significant.

In [20]: %timeit _hierarchy.slink(dists, Z, N)
10 loops, best of 3: 29.7 ms per loop

In [21]: %timeit _hierarchy.linkage(dists, Z, N, 0)
1 loops, best of 3: 1.87 s per loop

Other Attempts

I’ve also tried some other optimizations, some of which succeed while the others failed.

I used binary search in cluster_maxclust_monocrit and there was a bit improvement (though it is not a time-consuming function in most cases).

Before (N = 2000):

In [14]: %timeit hierarchy.fcluster(Z, 10, 'maxclust')
10 loops, best of 3: 35.6 ms per loop

After (N = 2000):

In [11]: %timeit hierarchy.fcluster(Z, 10, 'maxclust')
100 loops, best of 3: 5.86 ms per loop

Besides, I tried an algorithm similar to SLINK but for complete linkage – the CLINK algorithm2. However, it seems that CLINK is not the complete linkage that we used today. It did not always result in the best linkage in some cases on my implementation. Perhaps I have misunderstood some things in that paper.

At the suggestion of Charles, I tried an optimized HAC algorithm using priority queue. It has \(O(n^2 \log n)\) time complexity. However, it didn’t work well as expected. It was slower even when N = 3000. The algorithm needs to delete a non-root node in a priority queue, so when a binary heap is used, it needs to keep track of the index of every node and results in the increase of the time constant. Some other priority queue algorithms might perform better but I haven’t tried.

  1. Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30-34. 

  2. Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364-366. 

by Richard at August 11, 2014 12:36 PM

August 10, 2014

Juan Nunez-Iglesias

min/max vs time

The python-bioformats library lets you seamlessly read microscopy images into numpy arrays from pretty much any file format.

I recently explained how to use Fiji’s Jython interpreter to open BioFormats images, do some processing, and save the result to a standard format such as TIFF. Since then, the
CellProfiler team has released an incredible tool, the python-bioformats library. It uses the Java Native Interface (JNI) to access the Java BioFormats library and give you back a numpy array within Python. In other words, it’s magic.

Some of the stuff I was doing in the Jython interpreter was not going to fly for a 400GB image file produced by Leica (namely, setting the flag setOpenAllSeries(True)). This file contained 3D images of multiple zebrafish embryos, obtained every 30 minutes for three days. I needed to process each image sequentially.

The first problem was that even reading the metadata from the file resulted in a Java out-of-memory error! With the help of Lee Kamentsky, one of the creators of python-bioformats, I figured out that Java allocates a maximum memory footprint of just 256MB. With the raw metadata string occupying 27MB, this was not enough to contain the full structure of the parsed metadata tree. The solution was simply to set a much larger maximum memory allocation to the JVM:

import javabridge as jv, bioformats as bf
jv.start_vm(class_path=bf.JARS, max_heap_size='8G')

Once that was done, it was straightforward enough to get an XML string of metadata to get information about all the images contained in the file:

md = bf.get_omexml_metadata('Long time Gfap 260314.lif')

This was my first experience with XML, and I found the official schema extremely intimidating. Thankfully, as usual, Python makes common things easy, and parsing XML is a very common thing. Using the xml package from the Python standard library, it is easy enough to get the information I want from the XML string:

from xml.etree import ElementTree as ETree
mdroot = ETree.fromstring(md)

You might have gathered that the ElementTree object contains a rooted tree with all the information about our file. Each node has a tag, attributes, and children. The root contains information about each series:

>>> print(mdroot[300].tag, mdroot[300].attrib)
('{}Image', {'ID': 'Image:121', 'Name': '41h to 47.5 hpSCI/Pos021_S001'})

And each series contains information about its acquisition, physical measurements, and pixel measurements:

>>> for a in mdroot[300]:
...     print((a.tag, a.attrib))
('{}AcquiredDate', {}),
('{}InstrumentRef', {'ID': 'Instrument:121'}),
('{}ObjectiveSettings', {'RefractiveIndex': '1.33', 'ID': 'Objective:121:0'}),
('{}Pixels', {'SizeT': '14', 'DimensionOrder': 'XYCZT', 'PhysicalSizeY': '0.445197265625', 'PhysicalSizeX': '0.445197265625', 'PhysicalSizeZ': '1.9912714979001302', 'SizeX': '1024', 'SizeY': '1024', 'SizeZ': '108', 'SizeC': '2', 'Type': 'uint8', 'ID': 'Pixels:121'})

I only need a fraction of this metadata, so I wrote a function, parse_xml_metadata, to parse out the image names, their size in pixels, and their physical resolution.

Armed with this knowledge, it is then straightforward to preallocate a numpy array for each image and read the image from disk:

from matplotlib import pyplot as plt, cm
from lesion.lifio import parse_xml_metadata
import numpy as np
from __future__ import division

filename = 'Long time Gfap 260314.lif'
rdr = bf.ImageReader(filename, perform_init=True)
names, sizes, resolutions = parse_xml_metadata(md)
idx = 50 # arbitrary series for demonstration
size = sizes[idx]
nt, nz = size[:2]
image5d = np.empty(size, np.uint8)
for t in range(nt):
    for z in range(nz):
        image5d[t, z] =, t=t, series=idx, rescale=False)
plt.imshow(image5d[nt//2, nz//2, :, :, 0], cmap=cm.gray)

2D slice from 5D volume

Boom. Using Python BioFormats, I’ve read in a small(ish) part of a quasi-terabyte image file into a numpy array, ready for further processing.

Note: the dimension order here is time, z, y, x, channel, or TZYXC, which I think is the most efficient way to read these files in. My wrapper allows arbitrary dimension order, so it’ll be good to use it to figure out the fastest way to iterate through the volume.

In my case, I’m looking to extract statistics using scikit-image’s profile_line function, and plot their evolution over time. Here’s the min/max intensity profile along the embryo for a sample stack:

min/max vs time

I still need to clean up the code, in particular to detect bad images (no prizes for guessing which timepoint was bad here), but my point for now is that, thanks to Python BioFormats, doing your entire bioimage analysis in Python just got a heck of a lot easier.

by Juan Nunez-Iglesias at August 10, 2014 07:43 AM

August 05, 2014

Matthieu Brucher

Book review: Global Optimization Methods In Geophysical Inversion

When I worked on the common reflection surface stack, one of our biggest issues was selecting the proper optimization algorithms. There are so many for global problems! The book tries to browse through several classical algorithms.

Content and opinions

First things first, I read the first edition, I couldn’t get the latest version, as the authors switched publishers, and I only had access to the old one (French ranting: why are scientific papers/books/… so expensive in digital formats???). I still could check the first chapter, see how it was improved, and based on the former content and the new content, I think I can also give an opinion on the second edition.

The book main format is to have all mathematical explanations in different chapters, and then int he last chapters, the methods are used to solve different geophysical problems. The authors tried to use as much geoscientific vocabulary as possible. It’s the only real geoscientific material in the first chapters, as it is mainly presentation of methods that could be found in any book.

So the first chapter presents all the general statistical tools you need. Although the rest of the book uses all the tools, the central point of the book is random number generation, as all the next methods use it, one way or another. My biggest concern is that the authors think that random numbers can be generated from a simple modulo distribution. And when you are doing statistics on final numbers, that sounds really odd. In the 90′s, this could be explained, there were not that many good random number generators, Mersenne Twister appeared a few years after the first edition, so OK. But how come this is still the case in the second edition? We can now generate random number with an almost-cryptographic quality (see Random 123), with no additional complexity, and some people are still advocating for out-dated practice? This mainly means one thing: content was added to the book, but nothing was updated. So when reading the book, you need to keep this in mind.

Next are tackled direct, linear and iterative linear inverse methods. Some of those methods are specific to the geophysical inverse problem: contrary to CT, you can’t turn your object that you want to “image”, you only get one aspect, one view. So this means that you need to use specific algorithms. But it still stems to the same usual cost functions, so even though the algorithms themselves can be specific, there is a general aspect to the inverse problem. A last part in the chapter is about the probabilistic formulation and the usual maximum likelihood and maximum a posterior.

The third chapter starts the issue of solving a problem on a specific space, with several local minimas, and where the previous methods would fail. After the simple grid search (always useful when you want to have a broad picture of the problem) come the Monte Carlo methods. They are usually very costly, but they are a simple statistical tool to sample a problem after a grid search.

Then Simulated Annealing is introduced. It is logical to have it after MC methods, as they can see as a better way of sampling the search space for the best solution. I appreciated the fact that several different ways of doing SA are explained, with their issues and their qualities. After that chapter come Genetic Algorithms. There are different ways to introduce them, and I have another book review on the subject that I will publish soon. The explanation here is really basic, but the main points are there. You could do better, but you could do worse.

There is an additional chapter in the second edition on additional evolutionary algorithms. I don’t think that in 6 pages you can tackle everything, but these algorithms are a little bit more trendy than SA and the simple GA, so it is always welcome.

The last but not least chapter presents several applications of SA and GA. They don’t tackle real fields scale tests, only one group of trace, but at least you can see something, and it is also reproducible if you want to. There are (too?) many images, seismograms with everything that happens, explanations on the number of iterations… I’m still missing the total run time for the examples, as it is an important aspect when doing field-scale experiments, as well as some additional robustness tests. There is one additional part in the second edition about joint inversion.

The book finishes with uncertainties on the solutions. All methods draw some kind of statistical models, so it is possible to have uncertainties thanks to the thousands drawn models. I think one warning is missing, as samplers have a starting period before they are stable, and I think it is missing in this section.


The book tries to bridge a rather large gap, between new global optimization methods (even now, some methods can still be considered new) and the industrial oil and gas industry. I always saw it as being late on new algorithms in the geophysics department (reservoir tends to be less late, perhaps because of the far lesser scale of their data), so it is interesting to see a book, with all its missed opportunities, trying to bridge that gap.

by Matthieu Brucher at August 05, 2014 07:49 AM

Maheshakya Wijewardena

Improvements for LSH Forest implementation and its applications

GSoC 2014 is coming to an end. But LSH Forest implementation requires a little more work to be completed. Following are the list of tasks to be done. They will be completed during the next two weeks.

1. Improving LSH Forest implementation

I have got a lot of feedback from scikit-learn community about my implementation of LSH Forest. Many of them are about the possible optimizations. Making those optimizations happen will cause a significant improvement in the performance.

2. Applying LSH Forest in DBSCAN

The idea of this is to speed up the clustering method using approximate neighbor search, rather than spending much time on exhaustive exact nearest neighbor search. DBSCAN requires the neighbors of which the distance from a data point is less than a given radius. We use the term radius neighbors for this. As LSH Forest is implemented to adhere with the scikit-learn API, we have a radius_neighbors function in LSH Forest(NearestNeighbors in scikit-learn has radius_neighbors function). Therefore, LSH Forest can be directly applied in place of exact nearest neighbor search.

After this application, it will be benchmarked to analyze the performance. Approximate neighbors are more useful when the size of the database (often the number of features) is very large. So the benchmark will be based on following facts. What are the sample sizes and number of features, where approximate neighbor search reasonably outperforms the exact neighbor search.

3. Documentation and wrapping up work

After completing the implementation and benchmarking, it will be documented with the scikit-learn documentation standards.

August 05, 2014 12:00 AM

July 31, 2014

William Stein

SageMathCloud -- history and status

2005: I made first release the SageMath software project, with the goal to create a viable open source free alternative to Mathematica, Magma, Maple, Matlab.

2006: First web-based notebook interface for using Sage, called "sagenb". It was a cutting edge "AJAX" application at the time, though aimed at a small number of users.

2007-2009: Much work on sagenb. But it's still not scalable. Doesn't matter since we don't have that many users.

2011-: Sage becomes "self sustaining" from my point of view -- I have more time to work on other things, since the community has really stepped up.

2012: I'm inspired by the Simons Foundation's (and especially Jim Simon's) "cluelessness" about open source software to create a new online scalable web application to (1) make it easier for people to get access to Sage, especially on Windows, and (2) generate a more longterm sustainable revenue stream to support Sage development. (I was invited to a day-long meeting in NYC at Simon's headquarters.)

2012-2013: Spent much of 2012 and early 2013 researching options, building prototypes, some time talking with Craig Citro and Robert Bradshaw (both at Google), and launched SageMathCloud in April 2013. SMC got some high-profile use, e.g., by UCLA's 400+ student calculus course.

2014: Much development over the last 1.5 years. Usage has also grown. There is some growth information here. I also have useful google analytics data from the whole time, which shows around 4000 unique users per week, with an average session duration of 97 minutes (see attached). Number of users has actually dropped off during the summer, since there is much less teaching going on.

SMC itself is written mostly in CoffeeScript using Node.js on the backend. There's a small amount of Python as well.

It's a highly distributed multi-data center application. The database is Cassandra. The backend server processes are mostly Node.js processes, and also nginx+haproxy+stunnel.

A copy of user data is stored in every data center, and is snapshotted every few minutes, both via :
  • ZFS -- for rolling snapshots that vanish after a month -- and via
  • bup -- for snapshots that remain forever, and are consistent across data centers.
These snapshots are critical for making it possible to trust collaborators on projects to not (accidentally) destroy your work. It is not possible for users to delete the bup snapshots, by design.
Here's what it does: realtime collaborative editing of Latex docs, IPython notebooks, Sage worksheets; use the command line terminal; have several people collaborate on a project (=a Linux account).
The main applications seem to be:
  • teaching courses with a programming or math software components -- where you want all your students to be able to use something, e.g., IPython, Julia, etc, and don't want to have to deal with trying to get them to install said software themselves. Also, you want to easily be able to share files with students, see their work in realtime, etc. It's a much, much easier for people to get going that with naked VM's they have to configure -- and also I provide cross-data center replication.
  • collaborative research mathematics -- all co-authors of a paper work together in an SMC project, both writing the paper there and doing computations.
Active development work right now:
  • course management for homework (etc)
  • administration functionality (mainly motivated by self-hosting and better moderation)
  • easy history slider to see all pasts states of a document
  • switching from bootstrap2 to bootstrap3.

by William Stein ( at July 31, 2014 11:17 PM

July 30, 2014

Continuum Analytics

Advanced Features of Conda Part 1

Conda is the package manager that comes with Continuum’s Anaconda distribution. Conda makes it easy to install and manage all your packages and environments on Windows, Mac OS X, and Linux.

If you are new to conda, I recommend starting at the documentation, and watching my presentation from SciPy 2014.

In this post, I will assume that you are already familiar with conda and its basic usage, both for installing and building packages. I will look at a few advanced features or tips that may not be well known, even among advanced users of conda. These features will help you with feature discovery, customization, and allow you to manage your packages and environments in more advanced ways.

by Aaron Meurer at July 30, 2014 04:44 PM

July 29, 2014

Vighnesh Birodkar


In my last post I demonstrated how removing edges with high weights can leave us with a set of disconnected graphs, each of which represents a region in the image. The main drawback however was that the user had to supply a threshold. This value varied significantly depending on the context of the image. For a fully automated approach, we need an algorithm that can remove edges automatically.

The first thing that I can think of which does something useful in the above mention situation is the Minimum Cut Algorithm. It divides a graph into two parts, A and B such that the weight of the edges going from nodes in Set A to the nodes in Set B is minimum.

For the Minimum Cut algorithm to work, we need to define the weights of our Region Adjacency Graph (RAG) in such a way that similar regions have more weight. This way, removing lesser edges would leave us with the similar regions.

Getting Started

For all the examples below to work, you will need to pull from this Pull Request. The tests fail due to outdated NumPy and SciPy versions on Travis. I have also submitted a Pull Request to fix that. Just like the last post, I have a show_img function.

from skimage import graph, data, io, segmentation, color
from matplotlib import pyplot as plt
from skimage.measure import regionprops
from skimage import draw
import numpy as np

def show_img(img):

    width = img.shape[1]/75.0
    height = img.shape[0]*width/img.shape[1]
    f = plt.figure(figsize=(width, height))

I have modified the display_edges function for this demo. It draws nodes in yellow. Edges with low edge weights are greener and edges with high edge weight are more red.

def display_edges(image, g):
    """Draw edges of a RAG on its image

    Returns a modified image with the edges drawn. Edges with high weight are
    drawn in red and edges with a low weight are drawn in green. Nodes are drawn
    in yellow.

    image : ndarray
        The image to be drawn on.
    g : RAG
        The Region Adjacency Graph.
    threshold : float
        Only edges in `g` below `threshold` are drawn.

    out: ndarray
        Image with the edges drawn.

    image = image.copy()
    max_weight = max([d['weight'] for x, y, d in g.edges_iter(data=True)])
    min_weight = min([d['weight'] for x, y, d in g.edges_iter(data=True)])

    for edge in g.edges_iter():
        n1, n2 = edge

        r1, c1 = map(int, rag.node[n1]['centroid'])
        r2, c2 = map(int, rag.node[n2]['centroid'])

        green = 0,1,0
        red = 1,0,0

        line  = draw.line(r1, c1, r2, c2)
        circle =,c1,2)
        norm_weight = ( g[n1][n2]['weight'] - min_weight ) / ( max_weight - min_weight )

        image[line] = norm_weight*red + (1 - norm_weight)*green
        image[circle] = 1,1,0

    return image

To see demonstrate the display_edges function, I will load an image, which just has two regions of black and white.

demo_image = io.imread('bw.png')


Let’s compute the pre-segmenetation using the SLIC method. In addition to that, we will also use regionprops to give us the centroid of each region to aid the display_edges function.

labels = segmentation.slic(demo_image, compactness=30, n_segments=100)
labels = labels + 1  # So that no labelled region is 0 and ignored by regionprops
regions = regionprops(labels)

We will use label2rgb to replace each region with its average color. Since the image is so monotonous, the difference is hardly noticeable.

label_rgb = color.label2rgb(labels, demo_image, kind='avg')


We can use mark_boundaries to display region boundaries.

label_rgb = segmentation.mark_boundaries(label_rgb, labels, (0, 1, 1))


As mentioned earlier we need to construct a graph with similar regions having more weights between them. For this we supply the "similarity" option to rag_mean_color.

rag = graph.rag_mean_color(demo_image, labels, mode="similarity")

for region in regions:
    rag.node[region['label']]['centroid'] = region['centroid']

label_rgb = display_edges(label_rgb, rag)


If you notice above the black and white regions have red edges between them, i.e. they are very similar. However the edges between the black and white regions are green, indicating they are less similar.

Problems with the min cut

Consider the following graph


The minimum cut approach would partition the graph as {A, B, C, D} and {E}. It has a tendency to separate out small isolated regions of the graph. This is undesirable for image segmentation as this would separate out small, relatively disconnected regions of the image. A more reasonable partition would be {A, C} and {B, D, E}. To counter this aspect of the minimum cut, we used the Normalized Cut.

The Normalized Cut

It is defined as follows
Let V be the set of all nodes and w(u,v) for u, v \in V be the edge weight between u and v

NCut(A,B) = \frac{cut(A,B)}{Assoc(A,V)} + \frac{cut(A,B)}{Assoc(B,V)}
cut(A,B) = \sum_{a \in A ,b \in B}{w(a,b)}

Assoc(X,V) = cut(X,V) = \sum_{x \in X ,v \in V}{w(x,v)}

With the above equation, NCut won’t be low is any of A or B is not well-connected with the rest of the graph. Consider the same graph as the last one.


We can see that minimizing the NCut gives us the expected partition, that is, {A, C} and {B, D, E}.

Normalized Cuts for Image Segmentation

The idea of using Normalized Cut for segmenting images was first suggested by Jianbo Shi and Jitendra Malik in their paper Normalized Cuts and Image Segmentation. Instead of pixels, we are considering RAGs as nodes.

The problem of finding NCut is NP-Complete. Appendix A of the paper has a proof for it. It is made tractable by an approximation explained in Section 2.1 of the paper. The function _ncut_relabel is responsible for actually carrying out the NCut. It divides the graph into two parts, such that the NCut is minimized. Then for each of the two parts, it recursively carries out the same procedure until the NCut is unstable, i.e. it evaluates to a value greater than the specified threshold. Here is a small snippet to illustrate.

img =

labels1 = segmentation.slic(img, compactness=30, n_segments=400)
out1 = color.label2rgb(labels1, img, kind='avg')

g = graph.rag_mean_color(img, labels1, mode='similarity')
labels2 = graph.cut_normalized(labels1, g)
out2 = color.label2rgb(labels2, img, kind='avg')



NCut in Action

To observe how the NCut works, I wrote a small hack. This shows us the regions as divides by the method at every stage of recursion. The code relies on a modification in the original code, which can be seen here.

from skimage import graph, data, io, segmentation, color
from matplotlib import pyplot as plt
import os

#img =
os.system('rm *.png')
img =
#img = color.gray2rgb(img)

labels1 = segmentation.slic(img, compactness=30, n_segments=400)
out1 = color.label2rgb(labels1, img, kind='avg')

g = graph.rag_mean_color(img, labels1, mode='similarity')
labels2 = graph.cut_normalized(labels1, g)

offset = 1000
count = 1
tmp_labels = labels1.copy()
for g1,g2 in graph.graph_cut.sub_graph_list:
    for n,d in g1.nodes_iter(data=True):
        for l in d['labels']:
            tmp_labels[labels1 == l] = offset
    offset += 1
    for n,d in g2.nodes_iter(data=True):
        for l in d['labels']:
            tmp_labels[labels1 == l] = offset
    offset += 1        
    tmp_img = color.label2rgb(tmp_labels, img, kind='avg')
    io.imsave(str(count) + '.png',tmp_img)
    count += 1

The two components at each stage are stored in the form of tuples in sub_graph_list. Let’s say, the Graph was divided into A and B initially, and later A was divided into A1 and A2. The first iteration of the loop will label A and B. The second iteration will label A1, A2 and B, and so on. I used the PNGs saved and converted them into a video with avconv using the command avconv -f image2 -r 1 -i %d.png -r 20 demo.webm. GIFs would result in a loss of color, so I made webm videos. Below are a few images and their respective successive NCuts. Use Full Screen for better viewing.

Note that although there is a user supplied threshold, it does not have to vary significantly. For all the demos below, the default value is used.

Colors Image


During each iteration, one region (area of the image with the same color) is split into two. A region is represented by its average color. Here’s what happens in the video

  • The image is divided into red, and the rest of the regions (gray at this point)
  • The grey is divided into a dark pink region (pink, maroon and yellow) and a
    dark green ( cyan, green and blue region ).
  • The dark green region is divided into light blue ( cyan and blue ) and the
    green region.
  • The light blue region is divided into cyan and blue
  • The dark pink region is divided into yellow and a darker pink (pink and marron
  • The darker pink region is divided into pink and maroon regions.

Coins Image



Camera Image



Coffee Image



Fruits Image

apples group fruit vegetable isolated on white


Baby Image ( Scaled )



Car Image


by Vighnesh Birodkar at July 29, 2014 08:48 PM

July 28, 2014

Manoj Kumar


Hi, It has been a long time since I had posted something on my blog. I had the opportunity to participate in the scikit-learn sprint recently, with the majority of the core-developers. The experience was awesome, but most of the time I had no idea what people were talking about, and I realised I have to learn a lot. I read somewhere that if you need to keep improving in life, you need to make sure the worst person in the job, and if that is meant to be true, I’m well on the right track.

Anyhow on a more positive note, recently one of my biggest pull requests got merged, ( ) and we shall have a quick look at the background, what it can do and what it cannot.

1. What is Logistic Regression?
A Logistic Regression is a regression model that uses the logistic sigmoid function to predict classification. The basic idea is to predict the feature vector \omega sucht that it fits the Logistic_log function, \frac{1}{1 + e^{-w'*X}} . A quick look at the graph (taken from wikipedia), when y is one, we need our estimator to predict w'X to be infinity and vice versa.


Now if we want to fit labels [-1, 1] the sigmoid function becomes \frac{1}{1 + e^{-y*w'*X}}. The logistic loss function is given by, log(1 + e^{-y*w*x}. Intuitively this seems correct because when y is 1, we need our estimator to predict w*x to be infinity, to suffer zero loss. Similarly when y is -1, we need out estimator to predict w*x to be -1. Our basic focus is to optimize for loss.

2. How can this be done?
This can be done either using block coordinate descent methods, like lightning does, or use the inbuilt solvers that scipy provides like newton_cg and lbfgs. For the newton-cg solver, we need the hessian, or more simply the double derivative matrix of the loss and for the lbfgs solver we need the gradient vector. If you are too lazy to do the math (like me?), you can obtain the values from here, Hessian

3. Doesn’t scikit-learn have a Logistic Regression already?
Oh well it does, but it is dependent on an external library called Liblinear. There are two major problems with this.
a] Warm start, (one cannot warm start, with liblinear since it does not have a coefficient parameter), unless we patch the shipped liblinear code.
b] Penalization of intercept. Penalization is done so that the estimator does not overfit the data, however the intercept is independent of the data (which can be considered analogous to a column of ones), and so it does not make much sense to penalize it.

4. Things that I learnt
Apart from adding a warm start, (there seems to be a sufficient gain in large datasets), and not penalizing the intercept,
a] refit paramter – generally after cross-validating, we take the average of the scores obtained across all folds, and the final fit is done according to the hyperparameter (in this case C) that corresponds to the perfect score. However Gael suggested that one could take the best hyperparameter across every fold (in terms of score) and average these coefficients and hyperparameters. This would prevent the final refit.
b] Parallel OvA – For each label, we perform a OvA, that is to convert y into 1 for the label in question, and into -1’s for the other labels. There is a Parallel loop across al loops and folds, and this is to supposed to make it faster.
c] Class weight support: The easiest way to do it is to convert to per sample weight and multiply it to the loss for each sample. But we had faced a small problem with the following three conditions together, class weight dict, solver liblinear and a multiclass problem, since liblinear does not support sample weights.

5. Problems that I faced.
a] The fit intercept = True case is found out to be considerably slower than the fit_intercept=False case. Gaels hunch was that it was because the intercept varies differently as compared to the data, We tried different things, such as preconditioning the intercept, i.e dividing the initial coefficient with the square root of the diagonal of the Hessian, but it did not work and it took one and a half days of time.

b] Using liblinear as an optimiser or a solver for the OvA case.

i] If we use liblinear as a solver, it means supplying the multi-label problem directly to liblinear.train. This would affect the parallelism and we are not sure if liblinear internally works the same way as we think we do. So after a hectic day of refactoring code, we finally decided (sigh) using liblinear as an optimiser is better (i.e we convert the labels to 1 and -1). For more details about Gaels comment, you can have a look at this

Phew, this was a long post and I’m not sure if I typed everything as I wanted to. This is what I plan to accomplish in the coming month
1. Finish work on Larsmans PR
2. Look at glmnet for further improvements in the cd_fast code.
3. ElasticNet regularisation from Lightning.

by Manoj Kumar at July 28, 2014 11:36 PM

Titus Brown

The Moore DDD Investigator competition: my presentation

Here are my talk notes for the Data Driven Discovery grant competition ("cage match" round). Talk slides are on slideshare You can see my full proposal here as well.

Hello, my name is Titus Brown, and I'm at Michigan State University where I run a biology group whose motto is "better science through superior software". I'm going to tell you about my vision for building infrastructure to support data-intensive biology.

Our research is focused on accelerating sequence-based biology with algorithmic prefilters that help scale downstream sequence analysis. The basic idea is to take the large amounts of data coming from sequencers and squeeze out the information in the data for downstream software to use. In most cases we make things 10 or 100x easier, in some cases we've been able to make analyses possible where they weren't doable before;

In pursuit of this goal, we've built three super awesome computer science advances: we built a low-memory approach for counting sequence elements, we created a new data structure for low-memory graph storage, and developed a streaming lossy compression algorithm that puts much of sequence analysis on a online and streaming basis. Collectively, these are applicable to a wide range of basic sequence analysis problems, including error removal, species sorting, and genome assembly.

We've implemented all three of these approaches in an open source analysis package, khmer, that we developed and maintain using good software engineering approaches. We have primarily focused on using this to drive our own research, but since you can do analyses with it can't be done any other way, we've had some pretty good adoption by others. It's a bit hard to tell how many people are using it because of the many ways people can download it, but we believe it to be in the 1000s and we know we have dozens of citations.

The most important part of our research is this: we have enabled some excellent biology! For a few examples, we've assembled the largest soil metagenome ever with Jim Tiedje and Janet Jansson, we've helped look at deep sea samples of bone eating worms with Shana Goffredi, we're about to publish the largest de novo mRNASeq analysis ever done, and we're enabling evo devo research at the phylogenetic extremes of the Molgula sea squirts. This was really what we set out to do at the beginning but the volume and velocity of data coming from sequencers turned out to be the blocking problem.

Coming from a bit of a physics background, when I started working in bioinformatics 6 years ago, I was surprised at our inability to replicate others' results. One of our explicit strategies now is to try to level up the field by doing high quality, novel, open science. For example, our lamprey analysis is now entirely automated, taking three weeks to go from 3 billion lamprey mRNASeq reads to an assembled, annotated transcriptome that we can interactively analyze in an IPython Notebook, which we will publish with the paper. Camille, who is working on this, is a combination software engineer and scientist, and this has turned out to be a really productive combination.

We've also found that 1000s of people want to do the kinds of things we're doing, but most don't have the expertise or access to computational infrastructure. So, we're also working on open protocols for analyzing sequence data in the cloud - going from raw mRNASeq data to finished analysis for about $100. These protocols are open, versioned, forkable, and highly replicable, and we've got about 20 different groups using them right now.

So that's what I work on now. But looking forward I see a really big problem looming over molecular biology. Soon we will have whatever 'omic data set we want from, to whatever resolution we want, limited only by sampling. But we basically don't have any good way of analyzing these data sets -- most groups don't have the capacity or capability to analyze them themselves, we can't store these data sets in one place and -- perhaps the biggest part of the catastrophe -- people aren't making these data available until after publication, which means that I expect many of them to vanish. We need to incentivise pre-publication sharing by making it useful to share your data. We can do individual analyses now, but we're missing the part that links these analyses to other data sets more broadly.

My proposal, therefore, is to build a distributed graph database system that will allow people to interconnect with open, walled-garden, and private data sets. The idea is that researchers will spin up their own server in the cloud, upload their raw or analyzed data, and have a query interface that lets them explore the data. They'll also have access to other public servers, and be able to opt-in to exploring pre-published data; this opt-in will be in the form of a walled-garden approach where researchers who use results from analyzing other unpublished data sets will be given citation information to those data sets. I hope and expect that this will start to incentivise people to open their data sets up a bit, but to be honest if all it does is make it so that people can analyze their own data in isolation it will already be a major win.

None of this is really a new idea. We published a paper exploring some of these ideas in 2009, but have been unable to find funding. In fact, I think this is the most traditionally unfundable project I have ever proposed, so I hope Moore feels properly honored. In any case, the main point here is that graph queries are a wonderful abstraction that lets you set up an architecture for flexibly querying annotations and data when certain precomputed results already exist. The pygr project showed me the power of this when implemented in a distributed way and it's still the best approach I've ever seen implemented in bioinformatics.

The idea would be to enable basic queries like this across multiple servers, so that we can begin to support the queries necessary for automated data mining and cross-validation.

My larger vision is very buzzwordy. I want to enable frictionless sharing, driven by immediate utility. I want to enable permissionless innovation, so that data mining folk can try out new approaches without first finding a collaborator with an interesting data set, or doing a lot of prep work. By building open, federated infrastructure, and avoiding centralized infrastructure, I am planning for poverty: everything we build will be sustainable and maintainable, so when my funding goes away others can pick it up. And my focus will be on solving people's current problems, which in biology are immense, while remaining agile in terms of what problems I tackle next.

The thing is, everybody needs this. I work across many funding agencies, and many fields, and there is nothing like this currently in existence. I'm even more sure of this because I posted my Moore proposal and requested feedback and discussed it with a number of people on Twitter. NGS has enabled research on non-model organisms but its promise is undermet due to lack of cyberinfrastructure, basically.

How would I start? I would hire two domain postdocs who are tackling challenging data analysis tasks, and support them with my existing lab; this would involve cross-training the postdocs in data intensive methodologies. For example, one pilot project is to work on the data from the DeepDOM cruise, where they did multi-omic sampling across about 20 points in the atlantic, and are trying to connect the dots between microbial activity and dissolved organic matter, with metagenomic and metabolomic data.

Integrated with my research, I would continue and expand my current efforts in training. I already run a number of workshops and generate quite a bit of popular bioinformatics training material; I would continue and expand that effort as part of my research. One thing that I particularly like about this approach is that it's deeply self-interested: I can find out what problems everyone has, and will be having soon, by working with them in workshops.

by C. Titus Brown at July 28, 2014 10:00 PM

Issam Laradji

(GSoC 2014) Progress report for 07/27/14

Great progress! I submitted implementations of multi-layer perceptron (mlp-link) (mlp-pretraining-link) and extreme learning machines (elm-link) and their documentations as well. Yet many improvements could be made through revisions and my mentors' momentous support that they always provided throughout the summer.

Besides many corrections, a lot have been added since the last post - here is an overview,

1)  Documentation

I  wrote, with the help of my mentor, a documentation (link) on extreme learning machines (ELM), which briefly describes ELM's scheme and the main hyperparameters it possesses.  It also explains tips on why adjusting those parameters are important since noisy, small datasets need a different approach than large, clean datasets. Further, a brief tutorial was given to help users set up and train ELM objects. Finally, a mathematical overview was given describing the function developed from training an ELM and the kind of algorithm it uses to solve the unknown coefficients in that function.

I believe the document can be made more necessarily comprehensive by adding details that describe other ELM parameters such as recursive least-square learning, and details that describe how different kernels affect the decision function. I plan to address these fixes before next week.

2) Example

I added an example illustrating the effect of weight_scale and C, two hyperparameters in ELM.

C is a regularization term that constrains the coefficients of the hidden-to-output weights
weight_scale scales the range of values that input-to-hidden weights can take.

Their effects are illustrated in Figure 1 and 2, where the value of the chosen parameter is given above the corresponding decision function.

Figure 1: Effect of varying the regularization term C on variance.

Figure 2: Effect of varying the regularization term weight_scale on variance.

As shown, increasing the value of weight_scale or C makes for a more nonlinear decision function as you may notice the plots corresponding to higher values are of more curvy structure.

I am currently running ELM on the Covertype dataset (link). The results, however, aren't yet promising as ELM achieved a poor performance of 17% error-rate with as many as 1000 hidden neurons. The training error is still high, which means higher number of hidden neurons will likely reduce the error-rate. But even with 2000 hidden neurons, the error-rate was only reduced to 16,8%. The reason is,  Covertype has 54 features, so a much larger representation (produced by the hidden neurons) of the dataset is not adding any significant information. Therefore, I will explore other parameters using grid search in the hopes to significantly reduce that error-rate.

by (Issam Laradji) at July 28, 2014 05:57 AM

July 27, 2014

Hamzeh Alsalhi

Sparse Output Dummy Classifier

The Scikit-learn dummy classifier is a simple way to get naive predictions based only on the target data of your dataset. It has four strategies of operation.

  • constant - always predict a value manually specified by the use
  • uniform - label each example with a label chosen uniformly at random from the target data given
  • stratified  - label the examples with the class distribution seen in the training data
  • most-frequent - always predict the mode of the target data
The dummy classifier has built in support for multilabel-multioutput data. I have made a pull request #3438 this week that has introduced support for sparsely formatted output data. This is useful because memory consumption can be vastly improved when the data is highly sparse. Below a benchmark these changes with two memory consumption results graphed for each of the four strategies, once in with sparsely formatted target data and once with densely formatted data as the control.

Benchmark and Dataset

I used the Eurlex eurovoc dataset available here in libsvm format for use with the following script.  The benchmark script will let you recreate the results in this post easily. When run with the python module memory_profiler it measures the total memory consumed when doing an initialization of a dummy classifier, along with a fit and predict on the Eurlex data.

The dataset used has approximately 17,000 samples and 4000 classes for the training target data, and the test data is similar. They both have sparsity of 0.001.

Results Visualized

Constant Results: Dense 1250 MiB, Sparse 300 MiB

The constants used in the fit have a level of sparsity similar to the data because they were chosen as an arbitrary row from the target data.

Uniform Results: Dense 1350 MiB, Sparse 1200 MiB

Stratified Results: Dense 2300 MiB, Sparse 1350 MiB

Most-Frequent Results: Dense 1300 MiB, Sparse 300 MiB


We can see that in all cases expect for Uniform we get significant memory improvements by supporting sparse matrices. The sparse matrix implementation for uniform is not useful because of the dense nature of the output even when the input shows high levels of sparsity. It is possible this case will be revised to warn the user or even throw an error.

Remaining Work

There is work to be done on this pull request to make the predict function faster in the stratified and uniform cases when using sparse matrices. Although the uniform cases is not important in itself the underlying code for generating sparse random matrices is used in the stratified case. Any improvements to uniform will come for free is the stratified case speed is improved.

Another upcoming focus is to return to the sparse output knn pull request and make some improvements. There will be code written in the sparse output dummy pull request for gathering a class distribution from a sparse target matrix that can be abstracted to a utility function and will be reusable in the knn pull request.

by (Hamzeh) at July 27, 2014 02:04 AM

July 24, 2014

Gaël Varoquaux

The 2014 international scikit-learn sprint

A week ago, the 2014 edition of the scikit-learn sprint was held in Paris. This was the third time that we held an internation sprint and it was hugely productive, and great fun, as always.

Great people and great venues

We had a mix of core contributors and newcomers, which is a great combination, as it enables us to be productive, but also to foster the new generation of core developers. Were present:

  • Laurent Direr
  • Michael Eickenberg
  • Loic Esteve
  • Alexandre Gramfort
  • Olivier Grisel
  • Arnaud Joly
  • Kyle Kastner
  • Manoj Kumar
  • Balazs Kegl
  • Nicolas Le Roux
  • Andreas Mueller
  • Vlad Niculae
  • Fabian Pedregosa
  • Amir Sani
  • Danny Sullivan
  • Gabriel Synnaeve
  • Roland Thiolliere
  • Gael Varoquaux

As the sprint extended through a French bank holiday and the week end, we were hosted in a variety of venues:

  • La paillasse, a Paris bio-hacker space
  • INRIA, the French computer-science national research, and the place where I work :)
  • Criteo, a French company doing word-wide add-banner placement. The venue there was absolutely gorgeous, with a beautiful terrace on the roofs of Paris.
    And they even had a social event with free drinks one evening.
  • Tinyclues, a French startup mining e-commerce data.

I must say that we were treated like kings during the whole stay; each host welcoming us as well they could. Thank you to all of our hosts!

Sponsored by the Paris-Saclay Center for Data Science

Beyond our hosts, we need to thank the Paris-Saclay Center for Data Science. The CDS gave us funding that covered some of the lunches, acomodations, and travel expenses to bring in our
contributors from abroad.

Achievements during the sprint

The first day of the sprint was dedicated to polishing the 0.15 release, which was finally released on the morning of the second day, after 10 months of development.

A large part of the efforts of the sprint were dedicated to improving the coding base, rather than directly adding new features. Some files were reorganized. The input validation code was cleaned up (opening the way for better support of pandas structures in scikit-learn). We hunted dead code, deprecation warnings, numerical instabilities and tests randomly failing. We made the test suite faster, and refactored our common tests that scan all the model.

Some work of our GSOC student, Manoj Kumar, was merged, making some linear models faster.

Our online documentation was improved with the API documentation pointing to examples and source code.

Still work in progress:

  • Faster stochastic gradient descent (with AdaGrad, ASGD, and one day SAG)
  • Calibration of probabilities for models that do not have a ‘predict_proba’ method
  • Warm restart in random forests to add more estimators to an existing ensemble.
  • Infomax ICA algorithm.

by gael at July 24, 2014 10:59 PM

Maheshakya Wijewardena

Testing LSH Forest

There are two types of tests to perform in order to ensure the correct functionality the LSH Forest.

  1. Tests for individual functions of the LSHForest class.
  2. Tests for accuracy variation with parameters of implementation.

scikit-learn provides a nice set testing tools for this task. It is elaborated in the utilities for developers section. I have used following assertions which were imported from sklearn.utils.testing. Note that numpy as imported as np.

  1. assert_array_equal - Compares each element in an array.
  2. assert_equal - Compares two values.
  3. assert_raises - Checks whether the given type of error is raised.

Testing individual functions

Testing fit function

Requirement of this test is to ensure that the fit function does not work without the necessary arguments provision and it produces correct attributes in the class object(in the sense of dimensions of arrays).

Suppose we initialize a LSHForest as lshf = LSHForest()

If the estimator is not fitted with a proper data, it will produce a value error and it is testes as follows:

X = np.random.rand(samples, dim)
    lshf = LSHForest(n_trees=n_trees)

We define the sample size and the dimension of the dataset as samples and dim respectively and the number of trees in the LSH forest as n_trees.

# Test whether a value error is raised when X=None
    assert_raises(ValueError,, None)

Then after fitting the estimator, following assertions should hold true.

# _input_array = X
    assert_array_equal(X, lshf._input_array)
    # A hash function g(p) for each tree
    assert_equal(n_trees, lshf.hash_functions_.shape[0])
    # Hash length = 32
    assert_equal(32, lshf.hash_functions_.shape[1])
    # Number of trees in the forest
    assert_equal(n_trees, lshf._trees.shape[0])
    # Each tree has entries for every data point
    assert_equal(samples, lshf._trees.shape[1])
    # Original indices after sorting the hashes
    assert_equal(n_trees, lshf._original_indices.shape[0])
    # Each set of original indices in a tree has entries for every data point
    assert_equal(samples, lshf._original_indices.shape[1])

All the other tests for functions also contain the tests for valid arguments, therefore I am not going to describe them in those sections.

Testing kneighbors function

kneighbors tests are based on the number of neighbors returned, neighbors for a single data point and multiple data points.

We define the required number of neighbors as n_neighbors and crate a LSHForest.

n_neighbors = np.random.randint(0, samples)
    point = X[np.random.randint(0, samples)]
    neighbors = lshf.kneighbors(point, n_neighbors=n_neighbors,
    # Desired number of neighbors should be returned.
    assert_equal(neighbors.shape[1], n_neighbors)

For multiple data points, we define number of points as n_points:

points = X[np.random.randint(0, samples, n_points)]
    neighbors = lshf.kneighbors(points, n_neighbors=1,
    assert_equal(neighbors.shape[0], n_points)

The above tests ensures that the maximum hash length is also exhausted because the data points in the same data set are used in queries. But to ensure that hash lengths less than the maximum hash length also get involved, there is another test.

# Test random point(not in the data set)
    point = np.random.randn(dim)
    lshf.kneighbors(point, n_neighbors=1,

Testing distances of the kneighbors function

Returned distances should be in sorted order, therefore it is tested as follows: Suppose distances is the returned distances from the kneighbors function when the return_distances parameter is set to True.

assert_array_equal(distances[0], np.sort(distances[0]))

Testing insert function

Testing insert is somewhat similar to testing fit because what we have to ensure are dimensions and sample sizes. Following assertions should hold true after fitting the LSHFores.

# Insert wrong dimension
    assert_raises(ValueError, lshf.insert,
    # Insert 2D array
    assert_raises(ValueError, lshf.insert,
                  np.random.randn(dim, 2))


    # size of _input_array = samples + 1 after insertion
    assert_equal(lshf._input_array.shape[0], samples+1)
    # size of _original_indices[1] = samples + 1
    assert_equal(lshf._original_indices.shape[1], samples+1)
    # size of _trees[1] = samples + 1
    assert_equal(lshf._trees.shape[1], samples+1)

Testing accuracy variation with parameters

The accuracy of the results obtained from the queries depends on two major parameters.

  1. c value - c.
  2. number of trees - n_trees.

Separate tests have been written to ensure that the accuracy variation is correct with these parameter variation.

Testing accuracy against c variation

Accuracy should be in an increasing order as the value of c is raised. Here, the average accuracy is calculated for different c values.

c_values = np.array([10, 50, 250])

Then the calculated accuracy values of each c value is stored in an array. Following assertion should hold true in order to make sure that, higher the c value, higher the accuracy.

# Sorted accuracies should be equal to original accuracies
    assert_array_equal(accuracies, np.sort(accuracies))

Testing accuracy against n_trees variation

This is as almost same as the above test. First, n_trees are stored in the ascending order.

n_trees = np.array([1, 10, 100])

After calculating average accuracies, the assertion is performed.

# Sorted accuracies should be equal to original accuracies
    assert_array_equal(accuracies, np.sort(accuracies))

What is left?

In addition to the above tests, precision should also be tested against c values and n_trees. But it has already been tested in the prototyping stage and those tests consume a reasonably large amount of time which makes them unable to be performed in the scikit-learn test suit. Therefore, a separate benchmark will be done on this topic.

About Nose

As provided in the guidelines in scikit-learn contributors' page, Nose testing framework has been used to automate the testing process.

July 24, 2014 12:00 AM

July 23, 2014

Continuum Analytics

Bokeh 0.5.1 Released!

We are excited to announce the release of version 0.5.1 of Bokeh, an interactive web plotting library for Python!

This release includes many bug fixes and improvements over our last recent 0.5 release:

  • Hover activated by default
  • Boxplot in bokeh.charts
  • Better messages when you forget to start the bokeh-server
  • Fixed some packaging bugs
  • Fixed NBviewer rendering
  • Fixed some Unicodeencodeerror

by Damián Avila at July 23, 2014 01:30 PM

July 22, 2014


Webinar: Work Better, Smarter, and Faster in Python with Enthought Training on Demand

  Join Us For a Webinar We’ll demonstrate how Enthought Training on Demand can help both new Python users and experienced Python developers be better, smarter, and faster at the scientific and analytic computing tasks that directly impact their daily productivity and drive results. Space is limited! Click a webinar session link below to reserve […]

by info at July 22, 2014 11:40 PM

Matthieu Brucher

ATKCompressor and ATKExpander Implementation with Audio ToolKit

I’d like to talk a little bit about the way a compressor and an expander can be written with the Audio Toolkit. Even if both effects use far less filters than the SD1 emulation, they still implement more than just one filter in the pipeline, contrary to a fixed delay line (another audio plugin that I released with the compressor and the expander).

Block diagram

So first the block diagram of a compressor (it is the same for an expander):

Compressor pipeline

Compressor pipeline

The pipeline is quite obvious: start by detecting the power of the signal (a simple square with an AR(1) and a memory of 1ms for these plugins), create from this power an amplitude function through the GainCompressionFilter, and pass it through an attack/release filter (modifying the perception of the signal’s amplitude). Then the amplitude gain is multiplied by the actual gain, and that’s it!


Let’s see how they behave. First this is the formula I used to compute the gain for a compressor:

Power to amplitude formula for a compressor

Power to amplitude formula for a compressor

The idea was to use a C1 function for the signal. First everything is relative to the threshold, so the power divided by the threshold is an absolute curve. Now that this is settled, the power is converted to dB (as all curves are usually in that domain) and then an approximation of the absolute() function is used, with its parameter, the softness. If the power in dB is added to this function and then divided by 0, the resulting function will be of value 0 up to a relative power of 0dB, which means a constant amplitude gain of 0. For values higher than 0dB, the gain can be reduced by a factor depending on the slope. The final amplitude gain can be achieved by converting back the result to amplitudes.

Here are a few curves for the compressor:

Compressor gain curves

Compressor gain curves

And the same can be done for the expander:
Expander gain curves

Expander gain curves

The gain filters work on power values, from threshold to slope, and only returns a gain. The script takes an amplitude as input, squares it to produce the power values, processes them and then applies the gain value to the original amplitude to make the traditional graph.


The profile for the basic compressor is quite obvious:

Compressor profile before optimization

Compressor profile before optimization

After tabulating the gain reduction values in a table (the table has to be properly set up for a compressor and an expander), the compressor can be blazingly fast:
Compressor profile after optimization

Compressor profile after optimization


Creating a compressor when you have all elements is easy. Now it is time to create a stereo compressor/expander from these elements!

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matthieu Brucher at July 22, 2014 07:34 AM

July 19, 2014

Titus Brown

Citing our software and algorithms - a trial

Our lab is part of the ongoing online conversation about how to properly credit software and algorithms; as is my inclination, we're Just Trying Stuff (TM) to see what works. Here's an update on our latest efforts!

A while back (with release 1.0 of khmer) we added a CITATION file to the khmer repository and distribution. This was in response to Robin Wilson's call for CITATION files, and dovetails with some of the efforts of the R and Debian communities.

In the short term, we don't expect many people to pay attention to these kinds of infrastructure efforts, and for much of our work we actually have publications on the algorithms involved. More to the point, our software isn't just software -- it's the instantiation of novel data structures and algorithms, or at least novel applications of data structures. The people who did the research are not necessarily the same people as the developers and maintainers of our software implementation, and we'd like to reward both appropriately with citations.

Additionally, for things like tenure and promotion and grants, often it is the case that only peer reviewed articles count. In this case, having citations accrue to those articles is a good idea!

So, rather than directly citing our tarballs or repositories (see F1000 Research and Mozilla Science Lab's efforts) we have modified our scripts to output the appropriate citation information. For example, if you run '', you get this output

|| This is the script '' in khmer.
|| You are running khmer version 1.1-9-g237d5ad
|| You are also using screed version 0.7
|| If you use this script in a publication, please cite EACH of the following:
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
|| Please see the CITATION file for details.

The first citation is the software description, and the second is the normalization algorithm.

Likewise, if you run '', you will see:

|| If you use this script in a publication, please cite EACH of the following:
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * J Pell et al., PNAS, 2014 (PMID 22847406)
|| Please see the CITATION file for details.

which is our De Bruijn graph paper.

Interestingly, GNU Parallel also provides citation information:

When using programs that use GNU Parallel to process data for publication please cite:

 O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
 ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; and it won't cost you a cent.

which is pretty cool!

Note also that Michael Crusoe, who works with me on khmer (side note: find Michael some completely over-the-top title - "the khmer software maestro"?), has been working with the Galaxy folk to build citation infrastructure into the Galaxy Tool Shed. Mo' infrastructure mo' bettah.

What's next?

Now that we're starting to provide unambiguous and explicit citation information, it's up to individual actors to cite our software appropriately. That's something we can help with (by mentioning it in e.g. reviews) but I'm not sure how much more we can do in the khmer project specifically. (Suggestions welcome here!)

My biggest unanswered concern in this space is now something completely different: it's providing (and getting) credit for the CS research. For example, there are several implementations of the digital normalization idea -- in silico normalization (in Trinity) and also BBnorm (see here and here). Those are implementations of the underlying idea of normalization, and I (perhaps selfishly) think that in most cases people using the BBnorm or Trinity code should be citing our digital normalization preprint.

This concern emerges from the fact that good algorithm development is largely different from good software development. Many bioinformaticians provide basic proof of concept for an algorithm or a data structure, but do not invest much time and energy in software engineering and community engagement. While this is a great service -- we often do need new algorithms and data structures -- we also need good implementations of data structures and algorithms. Academia tends to reward the DS&A and not the implementation folk, but I think we need to do both, and need to do both separately. Shifting to a system where only the implementers get credit doesn't seem like a great improvement to me ;).

So my thought here is that any tool that uses a research algorithm or data structure developed by others should output citation information for that other work. This follows the advice given by Sarah Callaghan to "cite what you use".

A specific example we're planning: someone is porting some abandoned thesisware to khmer. The citation information will specify both khmer (for the software implementation) and the methods publication (already published) for basic validation information.

I'm not sure where to draw the line, though -- there are clearly cases where the data structures and algorithms have been developed further and our work no longer deserves to be cited in the software, and other cases where the DS&A work may be irrelevant. People using Minia, for example, should not need to cite Pell et al., 2012, because the Minia folk extended our work significantly in their paper and implementation. And, for many instances of k-mer counting, our low-memory k-mer counting work (soon to be published in Zhang et al., 2014) is not necessary -- so if they're using khmer because of its convenient Python API, but not depending in any particular way on low-memory k-mer counting, they probably shouldn't bother citing Zhang et al. Or maybe they should, because of accuracy concerns addressed in that paper?

I'd be interested in your thoughts and experiences here. (Note, I haven't sent anything to the Trinity or BBnorm folk because (a) I haven't thought things through, (b) that'd probably be too aggressive anyway, and (c) we should figure out what the community norms should be first...)


p.s. Thanks to Michael Crusoe and Varsha Khodiyar for reading a preview of this blog post and giving me feedback!

by C. Titus Brown at July 19, 2014 10:00 PM

Preprints and double publication - when is some exposure too much?

Note to all: this is satire... As Marcia McNutt says below, please see Science Magazine's Contributors FAQ for more detailed information.

Recently I had some conversations with Science Magazine about preprints, and when they're counted as double publication (see: Ingelfinger Rule). Now, Science has an enlightened preprint policy:

...we do allow posting of research papers on not-for-profit preprint servers such as Please contact the editors with questions regarding allowable postings.

but details are not provided. Now, on Facebook and elsewhere I've seen people be confused about whether (for example) posting a preprint and then discussing it counts as "distribution" -- here, Science muddies the waters by saying,

Distribution on the Internet may be considered prior publication and may compromise the originality of the paper as a submission to Science, although...

(followed by the previous quote).

So, spurred by some recent questions along this vein that a friend asked on Facebook, I followed up with Science. I asked the editors a broad set of questions about when preprints could be publicized on blogs, via Twitter, or on social media sites such as Facebook. Here is their response, which I think provides a valuable and specific set of guidelines for us.

Dear Dr. Brown,

thank you for your detailed questions. In our role as guardians of the veracity and impact of scientific literature, we have long been concerned about how to minimize pre-publication dissemination of scientific information, and we are happy to communicate our deliberations and decisions for you.

We do allow posting to preprint servers, as explicitly noted in our policy. This is to preserve scholarly communication while making sure that the public are not exposed to research conclusions unvalidated by peer review. However, as the line between journalism and Internet opining continues to blur, we are concerned that non-scientists may occasionally be exposed to incorrect research conclusions through discussion of these preprints on the World Wide Web and social media.

We have therefore developed the following guidelines:

First, any researcher (any member of the team) may discuss their preprint in a blog post, as long as their blog is not read by more than 1,000 unique visitors in any given 30-day period.

Second, researchers may convey these preprints to others via e-mail or Facebook. We have designated an upper limit of dissemination to 25 researchers (via e-mail) or 150 "friends" (via Facebook).

Third, a blog post and/or a link to the preprint may be publicized via Twitter or other social media Web sites, as long as it is not "reblogged" (via retweet) to more than 5,000 total individuals or acknowledged (e.g. "favorited") by more than 150.

We believe that these numbers provide a fair and equitable balance between protecting the scientific literature from undue dissemination of research results, and allowing scientists to discuss their work with friends. We hope you agree.

Please note that we have contracted with the National Security Agency as part of their new Freedom and Liberty Outreach Program to make sure these dissemination limits are observed. We continue to reserve the right to reject any paper with or without consideration for any reason, including when researchers violate these limits.


[ redacted ]

They've said they welcome feedback at, so please go ahead and send them your thoughts.


by C. Titus Brown at July 19, 2014 10:00 PM

July 18, 2014

Juan Nunez-Iglesias


I just got back home from the SciPy 2014 conference in Austin, TX. Here are my thoughts after this year’s conference.

About SciPy in general

Since my first SciPy in 2012, I’ve decided to prioritise programming conferences over scientific ones, because the value I get for my time is just that much higher. At a scientific conference, I certainly become aware of way-cool stuff going on in other research labs in my area. Once I get home, however, I go back to whatever I was doing. In contrast, at programming conferences, I become aware of new tools and practices that change the way I do science. In his keynote this year, Greg Wilson said of Software Carpentry, “We save researchers a day a week for the rest of their careers.” I feel the same way about SciPy in general.

In the 2012 sprints, I learned about GitHub Pull Requests and code review, the lingua franca of open source development today. I can’t express how useful that’s been. I also started my ongoing collaboration with the scikit-image project, which has enabled my research to reach far more users than I ever could have achieved on my own.

No scientific conference I’ve been to has had such an impact on my science output, nor can I imagine one doing so.

This year’s highlights

This year was no different. Without further ado, here are my top hits from this year’s conference:

  • Michael Droettboom talked about his continuous benchmarking project, Airspeed Velocity. It is hilariously named and incredibly useful. asv checks out code from your Git repo at regular intervals and runs benchmarks (which you define), and plots your code’s performance over time. It’s an incredible guard against feature creep slowing down your code base.
  • IPython recently unveiled their modal version 2.0 interface, sending vimmers worldwide rejoicing. But a few key bindings are just wrong from a vim perspective. Most egregiously, i, which should enter edit mode, interrupts the kernel when pressed twice! That’s just evil. Paul Ivanov goes all in with vim keybindings with his hilarious and life-changing IPython vimception talk. The title is more appropriate than I realised. Must-watch.
  • Damián Avila revealed (heh) his live IPython presentations with Reveal.js, forever changing how Python programmers present their work. I was actually aware of this before the conference and used it in my own talk, but you should definitely watch his talk and get the extension from his repo.
  • Min RK gave an excellent tutorial on IPython parallel (repo, videos 1, 2, 3). It’s remarkable what the IPython team have achieved thanks to their decoupling of the interactive shell and the computation “kernel”. (I still hate that word.)
  • Brian Granger and Jon Frederic gave an excellent tutorial on IPython interactive widgets (notebook here). They provide a simple and fast way to interactively explore your data. I’ve already started using these on my own problems.
  • Aaron Meurer gave the best introduction to the Python packaging problem that I’ve ever seen, and how Continuum’s conda project solves it. I think we still need an in-depth tutorial on how package distributors should use conda, but for users, conda is just awesome, and this talk explains why.
  • Matt Rocklin has a gift for crystal clear speaking, despite his incredible speed, and it was on full display in his (and Mark Wiebe’s) talk on Blaze, Continuum’s new array abstraction library. I’m not sure I’ll be using Blaze immediately but it’s certainly on my radar now!
  • Lightning talks are always fun: days 1, 2, 3. Watch out for Fernando Pérez’s announcement of Project Jupyter, the evolution of the IPython notebook, and for Damon McDougall’s riveting history of waffles. (You think I’m joking.)

Apologies if I’ve missed anyone: with three tracks, an added one with the World Cup matches ;) , and my own talk preparations, “overwhelming” does not begin to describe the conference! I will second Aaron Meurer’s assertion that there were no bad talks. Which brings us to…

On my to-watch

Jake Vanderplas recently wrote a series of blog posts (last one here, with links to earlier posts) comparing frequentist and Bayesian approaches to statistical analysis, in which he makes a compelling argument that we should all relegate frequentism to the dustbin of history. As such, I intend to go over Chris Fonnesbeck’s tutorial (2, 3) and talk about Bayesian analysis in Python using PyMC.

David Sanders also did a Julia tutorial (part 2) that was scheduled at the same time as my own scikit-image tutorial, but I’m interested to see how the Julia ecosystem is progressing, so that should be a good place to start. (Although I’m still bitter that they went with 1-based indexing!)

The reproducible science tutorial (part 2) generated quite a bit of buzz so I would be interested to go over that one as well.

For those interested in computing education or in geoscience, the conference had dedicated tracks for each of those, so you are bound to find something you like (not least, Lorena Barba’s and Greg Wilson’s keynotes). Have a look at the full listing of videos here. These might be easier to navigate by looking at the conference schedule.

The SciPy experience

I want to close this post with a few reflections on the conference itself.

SciPy is broken up into three “stages”: two days of tutorials, three days of talks, and two days of sprints. Above, I covered the tutorials and talks, but to me, the sprints are what truly distinguish SciPy. The spirit of collaboration is unparalleled, and an astonishing amount of value is generated in those two days, either in the shape of code, or in introducing newcomers to new projects and new ways to collaborate in programming.

My biggest regret of the conference was not giving a lightning talk urging people to come to the sprints. I repeatedly asked people whether they were coming to the sprints, and almost invariably the answer was that they didn’t feel they were good enough to contribute. To reiterate my previous statements: (1) when I attended my first sprint in 2012, I had never done a pull request; (2) sprints are an excellent way to introduce newcomers to projects and to the pull request development model. All the buzz around the sprints was how welcoming all of the teams were, but I think there is a massive number of missed opportunities because this is not made obvious to attendees before the sprints.

Lastly, a few notes on diversity. During the conference, April Wright, a student in evolutionary biology at UT Austin, wrote a heartbreaking blog post about how excluded she felt from a conference where only 15% of attendees were women. That particular incident was joyfully resolved, with plenty of SciPyers reaching out to April and inviting her along to sprints and other events. But it highlighted just how poorly we are doing in terms of diversity. Andy Terrel, one of the conference organisers, pointed out that 15% is much better than 2012’s three (women, not percent!), but (a) that is still extremely low, and (b) I was horrified to read this because I was there in 2012… And I did not notice that anything was wrong. How can it be, in 2012, that it can seem normal to be at a professional conference and have effectively zero women around? It doesn’t matter what one says about the background percentage of women in our industry and so on… Maybe SciPy is doing all it can about diversity. (Though I doubt it.) The point is that a scene like that should feel like one of those deserted cityscapes in post-apocalyptic movies. As long as it doesn’t, as long as SciPy feels normal, we will continue to have diversity problems. I hope my fellow SciPyers look at these numbers, feel appalled as I have, and try to improve.

… And on cue, while I was writing this post, Andy Terrel wrote a great post of his own about this very topic:

I still consider SciPy a fantastic conference. Jonathan Eisen (@phylogenomics), whom I admire, would undoubtedly boycott it because of the problems outlined above, but I am heartened that the organising committee is taking this as a serious problem and trying hard fix it. I hope next time it is even better.

by Juan Nunez-Iglesias at July 18, 2014 01:48 AM

July 17, 2014

Titus Brown

Charting the future of data science at the NIH -- topics to discuss?

In September, I will be visiting the NIH to "chart the next 5 years of data science at the NIH." This meeting will use an open space approach, and we were asked to provide some suggested topics. Here are five topics that I suggested, and one that Jeramia Ory suggested (the last one) in response to my posting these on Twitter.

Additions? Thoughts? Suggestions?

  1. Education and training in data science: from undergrad through independence
  2. Fostering the careers of data scientists in biomedical research
  3. Building sustainable cyberinfrastructure to support data intensive research in biomedical sciences
  4. Supporting and incentivizing a transition to more open science
  5. 5 years out: now that we have all the data, what do we do next?
  6. How do we support/train curators or a culture of curation? (Good curation is a bottleneck to insight.)


by C. Titus Brown at July 17, 2014 10:00 PM

Andy Terrel

Diversity at SciPy2014

SciPy2014 logo

SciPy2014 is over, but there is so much more to do. I expect to be reviving my blog with a set of prose on my experience. Today I want to talk about a topic that was dear to my heart when I started this blog, diversity in the scientific python ecosystem. I'm focusing on gender diversity, but we also need to be aware of racial and other diversities, as well. I hope helping one will lead to helping others.

For the record, I am writing this post from my personal perspective. It is not representative of any company or community that has graciously chosen me as a representative.

SciPy, we have a diversity problem and it's not anyone else's job to fix it. We've been trying to fix this problem, but this has uncovered other problems with our methodologies. It is my sincere hope that this amazingly brave post by April Wright and follow up will help shape the future of our community. Thank you, April, your courageous anecdote has been hugely impactful ... if the community listens.

Working on diversity at SciPy

First let me outline some of the efforts that have been taken in the past three years that I have been involved with organizing the SciPy Conference. Even getting these seemingly trivial changes has been a challenge I never expected. I've had community members tell me I'm wasting my time, others tell me not enough is being done, and the larger part of the community a bit uninterested in the topic. Folks, SciPy is a community conference, it is on us to work hard to make it the community we want and I don't want to be in an exclusive white male club any longer.

First, in 2012, Matt Davis noted via twitter that more men had walked on the moon than women attending SciPy. How dare he say such slander! ... wait let's count how many women are here ... I only see 3 ... très ... ¡Qué hostias!

The first thought on how to change was to find more women to be on our organizing committee and keynotes. Becoming chair in 2013, I asked every woman I knew in the community to help out. We increased to 8 out of 58 in our organizers. Jonathan and I asked a half dozen women to keynote, but were roundly rejected. Not only do we have a problem with diverse groups coming to the conference, they have to be heavily persuaded to be a part of the organization at all. I could send one or two emails to community men I knew and get a avid contributor. This is the hill we must climb.

In 2013, we also had a social funded by NumFOCUS and the PSF. The result was about 20 women of a variety of races coming together to talk about the issues. The message the organizers took home was cast a wider net. There is amazing quality from women in scientific computing, but they will not be looking at conferences that are primarily white men. We are also a very small growing conference (150 in 2012 to 460 in 2014) so people outside our clique aren't going to come for prestige. Also the overwhelming opinion was, a social is nice, but less alcohol more food.

This year, 2014, we did similar things as last. Increase women on the organizing committee (19 of 84), added a diversity committee, asked my uber-prof friend to give one of the best SciPy keynotes ever and host an event for diversity. We also worked with NumFOCUS to fund scholarships explicitly for women in technology through a generous donation from JP Morgan. This funding helped 5 more women attend the conference, in addition to the 15 other students (which included 3 women) supported to attend. We estimated 15% of attendees were women, not great but better than that original 1.5% in 2012.

The conference also had several discussions on the topic during the conference.

Additionally, I opened up the mailing list for the conference organization with the hope of encouraging more participation from the community. The efforts to take the conference as open as possible is one other effort to build the community we want. Being open isn’t sufficient to be diverse, it’s a necessary but not even close to sufficient condition.

Needed Changes

First, April hit the nail on the head. Who cares if you do all the things above if your conference is not welcoming to everyone. Below, I outline a few specific actions the SciPy2015 community can take.

  • Welcome to SciPy track in all event categories
  • Better social events to get to know each other
  • Cast wider nets via smaller user groups
  • Teach the community how to run events better
  • Have a welcome committee to help people know who we are.

I've discussed this with many many community members and SciPy2015 will have a "Welcome to SciPy" arc to help folks know who we are, what tools we use, and why. The comment that it is a great conference if you already know all the things to succeed with SciPy has been repeated often.

Additionally, we have discussed quite a bit about having social events that are not alcohol or food related. There have always been smaller events not organized by the conference but we need to encourage this activity more broadly. This year if you watched twitter, you would have found folks swimming, running, and climbing, but having this in a program would be a big help.

As Katy pointed out we need to advertise our activities more. Some folks are chatting about starting meetups around the SciPy community. A small way to make sure folks are seeing local faces more often.

Sheila Miguez pointed out the incredible in-person event handbook from Shauna G. of Open Hatch. I think taking up the principles in this handbook is really needed. We have not made welcoming, goal setting, and clarifying structures a priority at events.

In the past, I have held events for other orgs where we made sure that a small set of folks were explicitly tasked with shaking hands and talking to people who were alone. It seems foolish to some, but making folks break from the clique and talk to others only helps build the community.

We Need You

Finally, SciPy, we need you. Today all these events happen because of a few concerned volunteers. We need more people helping with our events, advocating for the minority, and building the community.

Community building doesn't fit nicely on a resume nor does it always feel good, but it is critical to everyone. You will be depressed after you read a blog showing how far our community has to go. You will have colleagues tell you that you are wasting your time. You will miss deadlines that affect your funding or day job. Without community we will not sustain.

When I sat down with folks to encourage them to join our executive committee, I typically made the argument that if we don't build the community we want, no one else will. There are a lot of challenges we need to face, e.g. promoting code as a real scholarly work and promoting an open sharing culture among coders in scientific computing. We cannot do this without a diverse functional community.

Please go sign up for the mailing list and start making this a better place.


Thank you to: April Wright, Kelsey Jordahl, Anthony Scopatz, Matthew Turk, Dan Katz, Sheila Miguez, and Kyle Mandli for reading an early version and edits to this document.

by Andy R. Terrel at July 17, 2014 07:00 AM

July 16, 2014

Luis Pedro Coelho


Tonight, I’m doing a Python for Image Analysis tutorial in Heidelberg. See the meetup page for information. Materials are at:


by luispedro at July 16, 2014 12:13 PM