February 06, 2016

Filipe Saraiva

Cantor migrating to Phabricator: which tools our contributors must use

Projects and software developed by the KDE community are migrating to a new tool to manage our code, commits, reviews, tasks, and more. This tool is Phabricator, and you can visit the instance for KDE projects at this address.

Since November 2015 we have been migrating Cantor to Phabricator. After our first successful code review some days ago, I decided to write a post about which tools our contributors should use while the migration process is not yet finished.

Project

Phabricator has an app for project management where we can put some useful information and coordinate tasks. The page for the Cantor project is online and configured.

Another interesting feature is the possibility of joining a project or watching its activities. If you have a KDE Identity, log in to KDE Phabricator and follow us!

Workboard

KDE provides an application to manage tasks using a kanban board, KDE TODO. Although it is a nice tool, we never used it.

The Projects app in Phabricator has an application for this same purpose, Workboard. We are currently using it to track the tasks of SoK student Fernando Telles. I intend to use it to manage the development of Cantor for each new release.

Tasks, bugs, wishes

The Phabricator app named Maniphest is the tool for creating and tracking bugs, tasks, and wishes (feature requests).

But KDE has a heavily customized Bugzilla, so for the moment there is no decision about how to migrate our bug reporting tool.

Therefore, KDE Bugzilla is still our bug reporting tool. However, I invite contributors to use Maniphest to submit wishes for new features. We never used Bugzilla for that purpose, so there is no problem in starting to use the new tool for it.

Repository

Like most KDE projects, Cantor has its source code managed by git. Phabricator has an application named Diffusion to navigate a source code repository and see a lot of data about it.

This application is configured for Cantor and is available at this link.

Code review

The Phabricator app for code review is called Differential, and it is available for Cantor as well.

However, there is no decision yet about the migration and the shutdown of the current code review tool used by KDE, ReviewBoard. Therefore our contributors can use either tool (please, not both at once!), but I strongly recommend using Differential.

Wiki

Yes, Phabricator has its own application for wiki pages, named Phriction. Currently Cantor has a wiki page only on UserBase. We are not using wiki pages at the moment, so we will decide whether Phriction will be our wiki tool at some point in the future.

Communication

OK, Phabricator also has a tool for communication, Conpherence. However, Cantor contributors can continue to use our current communication tools provided by KDE Edu: the #kde-edu IRC channel on the Freenode network and the KDE Edu mailing list.

Although I have some criticisms of Phabricator (for instance, I don’t like the Application -> Project architecture; I prefer Project -> Application), it is a very nice tool for project management and it has a lot of applications for specific tasks. In this text I listed several of them, but there are many others to be explored and evaluated.

I hope this post helps Cantor contributors figure out which tool to use for each task in the project. Maybe the text also presents some news to future Phabricator users and helps KDE developers along the path of the migration.

The impact of Phabricator on the KDE community is something to be analyzed in the future. This tool has a lot of applications, and it can change the way KDE subprojects are organized. Let’s see what the future holds for us.

by Filipe Saraiva at February 06, 2016 07:28 PM

February 05, 2016

Mark Fenner

Testing a Cholesky Implementation

In working through the Cholesky, QR, and SVD methods for solving the normal regression equations, I got curious about the underlying implementations of each. I did QR and Cholesky in grad school (a great Numerical Analysis sequence). I never got around to SVD (I think we did some variation of tridiagonalization). Anyway, Cholesky is so […]

by Mark Fenner at February 05, 2016 04:33 PM

Into the Weeds III – Hacking LAPACK Calls To Save Memory

In the last post, we tried to use lower-level LAPACK/LAPACKE calls to remove some of the copying overhead of our operations. Well, it turns out that it didn’t work out so well (see the table at the end of the post). The fundamental reason why is that LAPACKE took the easy way out, when they […]

by Mark Fenner at February 05, 2016 04:30 PM

Into the Weeds II – LAPACK, LAPACKE, and Two Paths without Copies

In the last post, we tried to use lower-level LAPACK/LAPACKE calls to remove some of the copying overhead of our operations. Well, it turns out that it didn’t work out so well (see the table at the end of the post). The fundamental reason why is that LAPACKE took the easy way out, when they […]

by Mark Fenner at February 05, 2016 04:30 PM

Into the Weeds I – LAPACK, LAPACKE, and Three+ Paths

In the last post, we looked at three ways of solving the least-squares regression problem (Cholesky, QR, and SVD decompositions) using LAPACK routines in a round-about fashion. The different decompositions (i.e., factorizations) were performed by (1) calling a NumPy np.linalg routine that called (2) an LAPACK driver routine for a "bigger" problem like linear equation […]

by Mark Fenner at February 05, 2016 04:28 PM

Three Paths to Least Squares Linear Regression

In the prior installment, we talked about solving the basic problem of linear regression using a standard least-squares approach. Along the way, I showed a few methods that gave the same results to the least squares problem. Today, I want to look at three methods for solving linear regression: solving the full normal equations via […]

by Mark Fenner at February 05, 2016 04:28 PM

Linear Regression Basic Solution

Before we diverged into an investigation of representation/storage of Python variables \(\mathtt{X}\) and \(\mathtt{Y}\) (math variables \(\bf{X}\) and \(\bf{Y}\)), we had the following equation: \[\min RSS(\beta)=(\bf{y}-\bf{X}\beta)^T(\bf{y}-\bf{X}\beta)\] The question is, how do we solve it? Well, if you recall your calculus, you should remember (or I’ll remind you) that we can optimize (find the minima and […]

by Mark Fenner at February 05, 2016 04:28 PM

LinearRegression And Notation

The precision of mathematical notation is a two-edged sword. On one side, it should clearly and concisely represent formal statements of mathematics. On the other side, sloppy notation, abbreviations, and time/space-saving shortcuts (mainly in the writing of mathematics) can lead to ambiguity — which is a time sink for the reader! Even acceptable mathematical notation […]

by Mark Fenner at February 05, 2016 04:27 PM

Game of Life in NumPy

In a previous post, we explored generating the grid-neighborhoods of an element in a NumPy array. The punch line of that work was grid_nD, shown below. In [1]: import numpy as np import matplotlib.pyplot as plt import time from IPython import display %matplotlib inline from numpy.lib.stride_tricks import as_strided In [2]: def grid_nD(arr): assert all(_len>2 for _len in […]

by Mark Fenner at February 05, 2016 04:25 PM

Game of Life in NumPy (Preliminaries)

You might be familiar with Conway’s Game of Life. It is a simple game based on the idea of modeling a dynamic system progressing over time. We progress from discrete time points \(t_i \rightarrow t_i+1\) and apply very simple rules at each time point. The game progresses on a square grid of discrete points. Each […]

by Mark Fenner at February 05, 2016 04:24 PM

Visualizing Probabilities in a Venn Diagram

I came across a little problem that lends itself nicely to visualization. And, it’s the kind of visualization that many of us have seen before: Venn diagrams. In [103]: import matplotlib.pyplot as plt %matplotlib inline # create some circles circle1 = plt.Circle((-.5,0), 1, color='r', alpha=.2) circle2 = plt.Circle(( .5,0), 1, color='b', alpha=.2) # create a figure […]

by Mark Fenner at February 05, 2016 04:19 PM

February 04, 2016

Continuum Analytics news

Anaconda 2.5 Release - now with MKL Optimizations

Posted Friday, February 5, 2016

We are happy to announce that Anaconda 2.5 has been released, which includes the Intel Math Kernel Library (MKL) optimizations (version 11.3.1) for improved performance. While we have supported MKL-enabled versions of numpy, scipy, scikit-learn and numexpr for several years as our commercial "MKL Optimizations" product, we are now making these packages available for free for everyone, and they are included by default in the standard Anaconda installers. This means that you do not have to do anything extra to see the performance benefits of MKL when using these packages! All you need to do is download and install Anaconda 2.5 (or higher), and you're ready to go. If you already have Anaconda installed, update to Anaconda 2.5 by using conda:

conda update conda
conda install anaconda=2.5

The list of changes, fixes and updates can be found in the changelog.

Anaconda now also includes a small utility package called mkl-service which provides a Python interface to some useful MKL functions declared in mkl_service.h, such as setting the number of threads to use. For example:

>>> import mkl
>>> mkl.get_max_threads()
2
>>> mkl.set_num_threads(1)
>>> mkl.get_max_threads()
1

The full list of available interface functions is documented on github.

Finally, in case you do not need or want MKL, it is possible to opt out of installing MKL. We provide this option on Linux and OS X, because MKL is a large package (roughly 100MB), and for many tasks it is not necessary. The alternatives to MKL are OpenBLAS (for Linux), and the native Accelerate Framework (for OS X). To use the non-MKL versions of packages on Linux or OS X, first install Miniconda, and then execute:

conda install nomkl

This effectively adds the nomkl feature, which makes conda prefer non-MKL versions of packages in all cases. For example, executing:

conda install scipy

will install the non-MKL version of scipy, and also the non-MKL version of numpy. The installation of non-MKL dependencies happens automatically for any package, even if the requested package itself does not depend on MKL (pandas, for example).
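If you want to confirm which backend a given environment ended up with, one quick check (not part of the original announcement, just a convenient numpy introspection call) is to print numpy's build configuration:

import numpy as np

# MKL-related entries appear in the output when the MKL-linked build is installed;
# with the nomkl variant you will see OpenBLAS or Accelerate instead.
np.show_config()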

If you have already downloaded and installed the full Anaconda, you can (again on Linux and OS X only) remove MKL using the following commands:

conda install nomkl numpy scipy scikit-learn numexpr
conda remove mkl mkl-service

Go download Anaconda today!

by swebster at February 04, 2016 08:30 PM

February 02, 2016

Matthieu Brucher

Book review: Slide Rules: Design, Build, and Archive Presentations in the Engineering and Technical Fields

I have trouble with slides. I hate them. I’ve taken a presentation training to make better ones, and I now use more or less no slides at all. I liked that training very much, but it’s difficult to apply to scientific presentations. As such, I decided to read this book, which is about scientific presentations (published by IEEE-Wiley), to see how other people approach slides.

Content and opinions

There are five rules according to the book. Rule 0 is actually to acknowledge the bad state of current presentations. When I see some presentations at my current company, I want to tear my eyes out of their sockets or rip my ears off my head. They are that bad. But we can do better.

Rule 1 is “Revisit Presentation Assumptions”. I like this one; it’s true. Presenters say that people expect slides to be the way they are. That’s untrue. Listeners don’t want to fall asleep during a presentation; they want a story, and they want to be fed what they need. That’s the first rule.

Rule 2 is “Write Sentence Headers”. This one intrigued me; I wasn’t expecting it. But when confronted with the result, I have to say that it seems more promising than my former titles. It’s all about telling the proper story and getting rid of text in the slide.

Rule 3 is “Use Targeted Visuals”. That was actually my main rule before, or more exactly: don’t use any visuals unless you need one. It starts with not exposing the complete slide all at once (typically when you have to use a bullet point list), or selecting relevant numbers on a graph, a proper photo… It also means that you may have to modify your company slide template… But actually, I’m not sure my CEO uses it anyway, because he knows it was made by bad internal communication people.

Rule 4 is “Archive Details for Future Use”. I usually left this one out, but now I realize it is an important rule. Your bosses or colleagues will reuse the slides, so you need to make them self-contained, with your knowledge inside: not necessarily on the slide that you present, but somewhere.

Rule 5 is “Keep Looking Forward”. Yes, we need to accept the fact that today’s presentations won’t be tomorrow’s. They change every day, and it is good to ride the trend and not be too late. That is why I read the book in the first place: to keep changing my ways.

Conclusion

I liked the strong position the book takes on changing presentations in firms, and the compromises it offers, which always aim to improve the situation the reader is faced with. A definite mandatory read.

by Matt at February 02, 2016 08:37 AM

February 01, 2016

Continuum Analytics news

Anaconda for R users: SparkR and rBokeh

Posted Monday, February 1, 2016

In this post, we present two projects for the R programming language that are powered by Anaconda. We will explore how rBokeh allows you to create beautiful interactive visualizations and how easy it is to scale your predictive models with SparkR through Anaconda’s cluster management capabilities.

Bokeh and rBokeh

Bokeh is an interactive visualization framework that targets modern web browsers for presentation. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards and data applications, without having to learn web technologies such as JavaScript. Bokeh currently provides interfaces in Python, R, Lua and Scala. rBokeh is the R library that allows you to write interactive visualizations in R.

Spark and SparkR

Spark is a popular open source processing framework for large scale in-memory distributed computations. SparkR is an R package that provides a frontend API to use Apache Spark from R.

Getting started with R in Anaconda

Conda, R Essentials and the R channel on Anaconda Cloud.

The easiest way to get started with R in Anaconda is installing R Essentials, a bundle of over 80 of the most used R packages for data science, including dplyr, shiny, ggplot2, tidyr, caret and nnet. R Essentials also includes Jupyter Notebooks and the IRKernel. To learn about conda and Jupyter, visit our previous blog post, "Jupyter and conda for R."

Conda is Anaconda's package, dependency and environment manager. Conda works across platforms (Windows, Linux, OS X) and across languages (R, Python, Scala...). R users can also use conda and benefit from its capabilities: create and install R packages and manage portable sandboxes of environments that might have different packages or versions of packages. An R channel is available on Anaconda Cloud with over 200 R packages.

To install R Essentials run:

$ conda install -c r r-essentials

To learn more about the benefits of conda, how to create R conda packages, and manage projects with Python and R dependencies, visit our previous blogpost “Conda for data science."

rBokeh: Interactive Data Visualizations in R

rBokeh is included in R Essentials, but it can also be separately installed from the R channel:

$ conda install -c r r-rbokeh

Once you have rBokeh installed, you can start an R console by typing `r` in your terminal:

$(r-essentials):~/$ r
 

Import the rBokeh library and start creating your interactive visualizations. The following example draws a scatterplot of the iris dataset with different glyphs and marker colors depending on the Species class, and the hover tool indicates their values as you mouse over the data points:

> library(rbokeh)
> p <- figure() %>%
  ly_points(Sepal.Length, Sepal.Width, data = iris,
    color = Species, glyph = Species,
    hover = list(Sepal.Length, Sepal.Width))
> p

rBokeh plots include a toolbox with the following functionality: panning, box zooming, resizing, wheel zooming, resetting, saving and tooltip hovering.

rBokeh and Shiny

Besides the interactivity that is offered through the toolbox, rBokeh also integrates nicely with Shiny, allowing you to create visualizations that can be animated.

Here’s an example of a simple Shiny App using rBokeh that generates a new hexbin plot from randomly sampling two normal distributions (x and y).

library("shiny")
library("rbokeh")
library("htmlwidgets")

ui <- fluidPage(
  rbokehOutput("rbokeh")
)

server <- function(input, output, session) {
  output$rbokeh <- renderRbokeh({
    invalidateLater(1500, session)
    figure(plot_width = 400, plot_height = 800) %>% ly_hexbin(rnorm(10000), rnorm(10000))
    })
}

shinyApp(ui, server)

For more information and examples on rBokeh, visit the rBokeh documentation.

Using Anaconda for cluster management

The simplicity that conda brings to package and environment management can be extended to your cluster through Anaconda’s capabilities for cluster management. Anaconda for cluster management is freely available for unlicensed, unsupported use with up to 4 cloud-based or bare-metal cluster nodes.

You can install the cluster management library from the anaconda-cluster channel on Anaconda Cloud. You must have an Anaconda Cloud account and be logged in via anaconda login.

conda install anaconda-client
anaconda login
conda install anaconda-cluster -c anaconda-cluster

For detailed installation instructions, refer to the Installation section in the documentation.

Setting up your cloud cluster

In this example, we will create and provision a 4-node cloud-based cluster on Amazon EC2. After installing the Anaconda cluster management library, run the following command:

$ acluster

This will create the ~/.acluster directory, which contains all of the configuration information. Edit the ~/.acluster/providers.yaml file and add your AWS credentials and key file information.

aws_east:
  cloud_provider: ec2
  keyname: my-private-key
  location: us-east-1
  private_key: ~/.ssh/my-private-key.pem
  secret_id: AKIAXXXXXX
  secret_key: XXXXXXXXXX

Next, create a profile in the ~/.acluster/profiles.d/ directory that defines the cluster and includes the spark-standalone and notebook plugins.

name: aws_sparkr
node_id: ami-d05e75b8
node_type: m3.large
num_nodes: 4
plugins:
  - spark-standalone
  - notebook
provider: aws_east
user: ubuntu

You can now launch and provision your cluster with a single command:


$ acluster create spark_cluster --profile aws_sparkr

Notebooks and R Essentials on your cluster

After the cluster is created and provisioned with conda, Spark, and a Jupyter Notebook server, we can install R Essentials on all of the cluster nodes with:

$ acluster conda install r-essentials -c r

You can open the Jupyter Notebook server that is running on the head node of your cluster with:

$ acluster open notebook

The default notebook password is acluster, which can be customized in your profile. You can now open an R notebook that is running on the head node of your cloud-based cluster.

For example, here’s a simple notebook that runs on a single node with R Essentials.

The full example notebook can be viewed and downloaded from Anaconda Cloud.

Running SparkR on your cluster


You can open the Spark UI in your browser with the following command:

$ acluster open spark-standalone

 

 

Now, let’s create a notebook that uses SparkR to distribute a predictive model across your 4-node cluster. Start a new R notebook and execute the following lines:

Sys.setenv(SPARK_HOME='/opt/anaconda/share/spark')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))

library(SparkR)
sc <- sparkR.init("spark://{{ URL }}:7077", sparkEnvir = list(spark.r.command='/opt/anaconda/bin/Rscript'))

Replace {{ URL }} in the above command with the URL displayed in the Spark UI. In my case, the URL is spark://ip-172-31-56-255:7077.


The following notebook uses three Spark workers across the cluster to fit a predictive model.

You can view the running Spark applications in the Spark UI, and you can verify that the three Spark workers are running the application.

You can view and download the SparkR notebook from Anaconda Cloud.

Once you’ve finished, you can easily destroy the cluster with:

$ acluster destroy spark_cluster

Conclusion

We have presented two useful projects for doing Data Science in R with Anaconda: rBokeh and SparkR. Learn more about these projects in their respective documentation pages: rBokeh and SparkR. I also recommend downloading the Anaconda for cluster management cheat sheet to help you set up, manage, and provision your clusters.

Thanks to Ryan Hafen, rBokeh developer; and all of the Spark and SparkR developers.

For more information about scaling up Python and R in your enterprise, or Anaconda's cluster management features, contact sales@continuum.io.

 

 

by swebster at February 01, 2016 05:18 PM

January 31, 2016

Wei Xue

Abcd

Abcd was originally published by 4D at 4D's space on January 31, 2016.

by 4D (xuewei4d [@] gmail.com) at January 31, 2016 09:45 PM

January 29, 2016

Martin Fitzpatrick

Create Simple GUI Applications with Python and Qt

Buy the book

Create Simple GUI Applications will show you how to use Python and Qt to do just that.

This ebook was written to accompany the video tutorial course on Udemy, adding background and detail to the lectures, with more examples and reference documentation. However, it stands perfectly well alone as a solid introduction to programming Python GUI applications using the PyQt framework. The first chapter covers all that you need to quickly start building functional applications.

This is my first ebook, and my first time publishing on the Leanpub platform, which is designed for in-progress publishing. The book will continue to be extended with more material as the course expands, and these updates will remain free to everyone who has previously bought the book.

A web version of the book is available to read online for free.

Hope you find the book useful, and let me know what you think!

by Martin Fitzpatrick at January 29, 2016 07:25 AM

January 26, 2016

Titus Brown

An advanced git lesson for scientists, broadcast

Today at 9:15am PT, Raniere Silva will be giving a lesson on advanced git usage, including bisect, branches, pull requests, and rebasing.

This lesson will be broadcast via YouTube as a Hangout on Air - if you're interested in watching it, we will post the link on the workshop page closer to 9:15am.

--titus

by C. Titus Brown at January 26, 2016 11:00 PM

Continuum Analytics news

Continuum Analytics Expands Executive Leadership Team Amid Rapid Adoption of Anaconda in 2015

Posted Tuesday, February 2, 2016

AUSTIN, TX - February 2, 2016 - Continuum Analytics, developer of Anaconda, the leading modern open source analytics platform powered by Python, today announced it has added two new leaders. Jon Shepherd steps into the leadership team as Senior Vice President of Sales to grow enterprise sales on the heels of more than 2.25M downloads of Anaconda in 2015. Matt Roberts joins as Vice President of Product Engineering to establish world-class software development processes. Shepherd and Roberts join a rapidly growing team that added 90 new hires in 2015.

“As the driving force behind Anaconda, Continuum is at a pivotal time where the velocity of community and industry adoption positions us not only as the de facto enterprise distribution of Python but also as the foundation of the open data science movement. Jon and Matt’s arrival signals our continued success in building out a world-class leadership team,” explained Travis Oliphant, CEO and co-founder of Continuum Analytics. “Jon’s deep experience in enterprise sales of data and analytics platforms will accelerate our growth and deepen our customer relationships, while Matt brings expertise in leading world-class software development teams.”

Shepherd and Roberts join Continuum Analytics’ growing team of visionaries and thought leaders in the open source community.

“Anaconda is the leader in open data science. I’m looking forward to leading the company’s growth in the enterprise market,” said Shepherd. “The full stack analytics capability of Anaconda and its powerful package management, along with Continuum’s training, support and expert consulting, deliver complete solutions in a single platform.”

Shepherd brings over 20 years of experience to Continuum Analytics, having led high performance sales teams responsible for sales of innovative enterprise technologies including DataRobot, Netezza and RapidMiner. As an entrepreneurial sales executive, Shepherd was most recently in sales leadership at DataRobot and also held executive sales positions at RapidMiner, IBM, Netezza, among other software startups. Shepherd has a B.A. from Vanderbilt in Economics/Psychology.

Roberts comes to Continuum Analytics from Socialware, where he was vice president of product development and was responsible for transforming a small team into a high performing, collaborative product development organization that offered critical value to enterprise customers. Prior to Socialware, Roberts held engineering and product development roles at CA Technologies/Hyperformix, Motive (Alcatel-Lucent), Meetrix and Concero. Roberts holds a B.S. from Virginia Tech in Management Science.

About Continuum Analytics:

Continuum Analytics is the creator and driving force behind Anaconda, the leading, modern open source analytics platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 2.25M downloads annually and growing, Anaconda is trusted by the world’s leading businesses across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team – data scientists, developers, DevOps, architects, and business analysts – to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, Pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

To learn more about Continuum Analytics, visit www.continuum.io.

Media Contact:

Treble

Aaron DeLucia

(512) 960-8222

continuum@treblepr.com

 

by swebster at January 26, 2016 08:08 PM

Pierre de Buyl

Code as a paper

Publishing scientific software is a nontrivial problem. Here I present, and ask for feedback on, the idea of uploading the software to arXiv with a presentation paper and using it as an evolving reference.

https://blog.datacite.org/software-citation-workflows/

Many people have already thought about this issue, there are several old and new solutions:

  • Paper: Publishing a software paper in a computing oriented journal. In physics, a well-known venue is Computer Physics Communications.
  • Web page: Setup a web page for your software and link to the source (a tarball, a version control system, etc).
  • Crazy (in a good sense): Launching a new journal: http://gael-varoquaux.info/programming/a-journal-promoting-high-quality-research-code-dream-and-reality.html
  • Zenodo/GitHub: Obtain a DOI for your GitHub-hosted code via Zenodo. https://guides.github.com/activities/citable-code/

Here are my criteria:

  • Open-access (by principle).
  • Low or nonexistent author charges (pragmatic condition).
  • Provides a single citable reference point for the software and its updates. Be explicit on the version, though.
  • Persistent.
  • Allows more than "just" the code. That is, I would like to document the first release by an extended note, similar to what would be found in a traditional software paper.

Let us review the propositions above:

  • Paper: Only allows one version. Also, it is either closed-access or expensive.
  • Web page: Hard to cite. Likely not persistent enough.
  • Crazy: Experience shows that it is not realistic, at least now.
  • Zenodo/GitHub: One DOI per version. Also, the DOI points to an exact reproduction of the source repository. It is thus rather inflexible.

The arXiv identifier is, I believe, stable and citable. One can attach a tarball of code with the paper and it is possible to update a paper with updated versions.

by Pierre de Buyl at January 26, 2016 10:00 AM

January 19, 2016

Titus Brown

Partitioning with khmer is not my recommended approach any more

I wrote the below in response to someone who e-mailed me about trying out our partitioning approach for metagenome assembly.


yes, the original partitioning approach worked only on low coverage data sets. The main reason is that highly connected regions (repeats, from biology; and some kinds of sequencing errors) are incredibly hard to traverse.

Our 2014 paper used a modified protocol, which included both digital normalization and a "knot removal" approach. This should work, but...

...since then, we've continued to refine the approach, but it's not something we're actively working on. A big part of the reason is that there are some nifty new metagenome assemblers that resolve most of the problems we set out to solve -- MEGAHIT is the one that we recommend.

If you're interested in genome separation/binning/strain extraction on raw reads, the Cleary et al. 2015 paper is where I would look. I wrote a News and Views piece on it.


The leading edge of technology in metagenomics keeps on moving. Exciting times ;).

--titus

by C. Titus Brown at January 19, 2016 11:00 PM

Matthieu Brucher

Book review: Googled: The End of the World As We Know It

Google is intriguing, and there are plenty of books on the company. How did it grow into such a beast? What is its true purpose? This is one of those books, published 5 years ago. Is it still relevant today?

Content and opinions

Google started as a search engine and ended up in more or less every domain of our lives. It’s a pattern that is underlined in the book: wherever the engineers thought they could improve something in our lives, they created something. That’s the idea of the first part, messing with the magic. Google earns money with ads, automated ads, and ads are the bread and butter of all publishing companies. And they did mess with the magic (in a good way, I think), as they undermined the traditional way of doing things.

The second part is the story of Google’s beginnings, from the garage to the point where they could earn money. Seeing what they went through reminded me that they didn’t have a business model right from the start; they just wanted to solve an issue that we all had on the Web.

The third part starts more or less when Google could do whatever it pleased, like buying YouTube. At the time, it was a thunderbolt to see Google absorb it, as the site didn’t have any business model (I still struggle to understand how it makes money now). It also stirred up lots of controversies, but that was just the tip of the iceberg; the book describes all the other struggles Google faced.

The last part is about the future of the media companies (still struggling; Netflix is there with Amazon Prime, still pushing them to change their way of working) and of Google as well. I think Google moved further from its original corpus of ideas with Android, for instance (keeping lots of things closed when it started as open source). It is morphing into a more traditional company, in my opinion. And that’s for the worse…

Conclusion

Google is not a company like another. Its directors are one of a kind, and this book shows it quite well. Whether they will stick to their ideas is another matter.

by Matt at January 19, 2016 08:18 AM

January 18, 2016

Continuum Analytics news

Using Bokeh at NIST

Posted Monday, January 18, 2016

Disclaimer

Certain trade names or company products are mentioned in the text to specify adequately the procedure and equipment used. In no case does such identification imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment is the best available for the purpose.


Using Bokeh at NIST

Over the last couple years, researchers in the Fire Research Division at the National Institute of Standards and Technology (NIST) have used Python/Anaconda tools on a wide range of projects. Only recently have we started using Bokeh. Two projects that use Bokeh are described in this post: the first uses Bokeh to create an interactive map of NIST fire studies and the second uses Bokeh to plot thermal data from a firefighter's immediate environment in real time.


Interactive Map of NIST Fire Studies

Fire research at the National Institute of Standards and Technology (NIST) dates back to the early 1900s, following the need for a standardized fire hydrant coupling after the Great Baltimore Fire. Fast-forward to the 1980s: for the last 30 years, NIST researchers have been performing experiments and running computer simulations following fire incidents to advance the field of fire safety.

NIST researchers wrote detailed reports of their work following these fires, but an intuitive, centralized source of this material did not exist. Therefore, we wanted to build an interface to fix this. We felt a map that showed where the fires occurred, with interactivity to link to the reports, would be an interesting project.

import pandas as pd
from bokeh.models.glyphs import Circle
from bokeh.plotting import show, output_notebook,figure
from bokeh.models import (
    GMapPlot, GMapOptions, Range1d, ColumnDataSource, LinearAxis,
    PanTool, WheelZoomTool,HoverTool, TapTool, OpenURL)
output_notebook()

First, let's load and take a look at the data file of fires.

data = pd.read_csv('study_data.csv')
data.head()
Study Latitude Longitude Color Type Date Report
0 Simulation of an Attic Fire in a Wood Frame Re... 41.802088 -87.681981 Navy LODD/LODI 2-Nov-12 http://dx.doi.org/10.6028/NIST.TN.1838
1 Simulation of a Fire in a Hillside Residential... 37.739505 -122.439223 Navy LODD/LODI 2-Jun-11 http://dx.doi.org/10.6028/NIST.TN.1856
2 Simulation of a Residential Wind Driven Baseme... 38.965891 -76.917754 Navy LODD/LODI 24-Feb-12 http://dx.doi.org/10.6028/NIST.TN.1870
3 Simulation of the Dynamics of a Wind-Driven Fi... 29.679490 -95.282329 Navy LODD/LODI 12-Apr-09 http://www.nist.gov/customcf/get_pdf.cfm?pub_i...
4 Simulation of the Dynamics of a Fire in a Two-... 40.404689 -91.382416 Red LODD/LODI & Civilian Loss 22-Dec-99 http://www.nist.gov/customcf/get_pdf.cfm?pub_i...

The addresses of the fires were geocoded using the GoogleV3 geocoder to generate the latitude and longitude for each fire. The studies were then coded into one of four categories: LODD/LODI (a fire with a firefighter Line of Duty Death or Line of Duty Injury), Civilian Loss, WUI (wildland urban interface fire), or LODD/LODI and Civilian Loss. The data file also includes the NIST hosting link of the report so that we can link to it through first class Bokeh features.
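As an illustration only (this snippet is not part of the original workflow, and the address shown is just a placeholder), the geocoding step could be done with geopy's GoogleV3 geocoder along these lines:

from geopy.geocoders import GoogleV3

# Placeholder example: depending on your setup, GoogleV3(api_key='...') may be required.
geolocator = GoogleV3()
location = geolocator.geocode("100 Bureau Drive, Gaithersburg, MD")  # hypothetical address
print(location.latitude, location.longitude)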

The next step is to load in a simple map.

The latitude, longitude, and zoom are set to view all of the United States and Puerto Rico.

x_range = Range1d()
y_range = Range1d()

map_options = GMapOptions(lat=39., lng=-98, zoom=3)

plot = GMapPlot(
    x_range=x_range, y_range=y_range,
    map_options=map_options,
    title = "NIST Fire Studies", plot_width=875, plot_height=500
)
plot.map_options.map_type="terrain"

Adding Interaction

First, we wanted the study name, type, and date of the fire to be available upon hover. Second, we wanted the glyphs to be clickable so they direct users to the full study reports (the main purpose of this exercise).

source = ColumnDataSource({'lat':data['Latitude'],'lon':data['Longitude'],'studys': data['Study'],
                           'report': data['Report'],'fill':data['Color'],'type':data['Type'],'date':data['Date']})
circle = Circle(x="lon",y="lat",size=15,fill_color="fill")
plot.add_glyph(source, circle)

pan = PanTool()
wheel_zoom = WheelZoomTool()
hover = HoverTool()
hover.tooltips = [('Study Title','@studys'),('Date','@date'),('Type','@type')]
tap = TapTool()
url = "@report"
tap.callback = OpenURL(url=url)

plot.add_tools(pan,wheel_zoom,hover,tap)
show(plot)

This is clearly a simple example, but that is the point. The final product is an interactive map that clearly highlights where fires have been studied by NIST researchers, with minimal coding overhead.


Real-Time Data Plotting of a Firefighter's Thermal Exposure

Bokeh was also utilized by our group at NIST to improve an existing project [1] focused on characterizing a firefighter's thermal exposure in fire scenarios.

To overcome challenges associated with measuring a firefighter's continuously changing environment, a portable measurement and data acquisition system was designed to be used with firefighter personal protective equipment (PPE). The system needed to:

  • measure the temperature and heat flux (heat rate per area) of a firefighter's immediate environment

  • store and circulate cooling water around the heat flux sensor

  • withstand the typical movements and actions of firefighters and the elevated thermal conditions encountered in a fire environment

  • accompany a firefighter as he/she moves throughout the environment to conduct various tasks

Description of Portable Measurement and Data Acquisition System

The portable system can be divided into two main parts: the helmet and the pack.

Helmet Portion

The helmet portion of the system consists of a type K thermocouple and a Schmidt-Boelter heat flux gauge (SBG) to measure temperature and heat flux, respectively. Both sensors are mounted to the aluminum shield on the front of a firefighter helmet.

Two sections of 1/8” diameter rubber tubing connected to the SBG are used to transport the cooling water around the gauge. The rubber tubing and sensor wires travel out the backside of the helmet and to a hydration backpack.

Pack Portion

The data logger rests in the front pocket of the pack, the water reservoir used to store the cooling water for the SBG is located in the rear pocket of the pack, and the water pump used to circulate the cooling water is located in a side pocket of the pack.

The low-profile hydration pack is worn under a PPE coat on the chest of the firefighter to avoid interference with the Self-Contained Breathing Apparatus worn on the firefighter’s back. Because the hydration pack is worn underneath the PPE coat, its components are protected from the intense thermal conditions encountered in fire environments.

New Data Logger with WiFi Support

Recently, we decided to replace the original data logger with a microcontroller that supports a Linux distribution and WiFi. Using Python with the microcontroller data logger expands the capabilities of the portable measurement and data acquisition system and allows the logged data to be plotted in real time. The workflows to collect, store, and plot the sensor data using the original data logger and using the new microcontroller data logger are presented and discussed below.

The workflow improved significantly by using the microcontroller data logger. The steps to collect, save, and plot the data occur immediately after one another and have been automated using Python. The microcontroller's built-in WiFi support is used to transmit the sensor data to a host computer in real time. Collected data are now saved in two separate locations (the logger's microSD card and the host computer) during the logging process. Additionally, the host computer utilizes the Bokeh server to plot data in real time.

Plotting Helmet Sensor Data in Real Time

The procedure outlined below assumes:

  • the thermocouple and heat flux sensors from the portable system have been connected to the microcontroller
  • the microcontroller has been configured to connect to the same WiFi network as the local computer and to run a Python script to receive and save the sensor data upon booting up
  • the host computer has been configured with the packages, tools, and files necessary to run the message broker RabbitMQ and Bokeh server and to execute two Python scripts: one to read and save data from the broker and one to plot the data

Prepare Host Computer to Receive Data

After launching the RabbitMQ server on the host computer, the Python script to read and save data from the message broker to the host computer is executed with the following arguments:

$ python broker_to_host.py <network_ip_address> <path/to/log_file.csv>

Turn on Microcontroller

Upon booting up, the microcontroller connects to the appropriate WiFi network and runs a Python script to read sensor data, save the data on the logger's microSD card (path /mnt/sda1/ ), and send the data to the message broker:

$ python /mnt/sda1/logger_to_broker.py <network_ip_address> <logger_name> /mnt/sda1/log_file.csv

Log & Send Data to Broker

The script executed by the logger converts the voltages output by the helmet sensors to their corresponding measurements, saves the data to a microSD card, and publishes the data to the RabbitMQ message broker. The Python client Pika is used to communicate with the message broker.

First, using the argparse module, the RabbitMQ message broker's ip address (WiFi ip address), name of the microcontroller, and log file on microSD card to save data to are read from the command used to run the script:

parser = argparse.ArgumentParser()
parser.add_argument('broker', help='Network or IP address of message broker')
parser.add_argument('logger_id', help='Logger name or ID')
parser.add_argument('log_file', help='Location of log file')
args = parser.parse_args()

Next, a connection is established with the message broker and an exchange to transmit data to the broker queue is created:

connection = pika.BlockingConnection(pika.ConnectionParameters(host=args.broker))
channel = connection.channel()
channel.exchange_declare(exchange='logs', type='fanout')

After establishing a connection to the broker, sensor voltages are read and converted to their corresponding measurements every second and the data are sent to the broker and saved on the logger's microSD card:

while True:
    # Read voltages from ADC channels
    T_voltage = read_voltage(1)
    HF_voltage = read_voltage(2)

    # Calculate temperature, HF
    T = calculate_T(T_voltage)
    HF = calculate_HF(HF_voltage)

    # Construct message and publish to broker
    message = (time.ctime()+',%s,%d,%0.1f,%0.2f') % (args.logger_id, total_time, T, HF)
    channel.basic_publish(exchange='logs', routing_key='', body=message)

    # Save data to microSD card
    with open(args.log_file, 'a+') as text_file:
        text_file.write(message+'\n')

    total_time = total_time + 1        
    time.sleep(1)

read_voltage() is a user defined function specific to the microcontroller that reads the voltage from a defined channel (here, the thermocouple is connected to channel 1 and the SBG is connected to channel 2). The sensor voltages are converted to their corresponding measurements through a user defined function specific to the sensor type and certain calibration constants specified by the manufacturer. For example, the calibration constants specified by the manufacturer for the SBG correspond to the slope (m) and y-intercept (b) of a linear fit across data from calibration tests. The function calculate_HF() uses the constants and basic equation of a line to convert voltage to heat flux:

def calculate_HF(voltage):
    HF = (voltage)*m + b
    HF = round(float(HF), 2)
    return HF
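calculate_T() is not shown in the post. Assuming the thermocouple signal is conditioned by an amplifier with an approximately linear output, a sketch might look like the following, where T_SLOPE and T_OFFSET are hypothetical calibration constants rather than values from the actual system:

T_SLOPE = 200.0   # degrees C per volt (placeholder value)
T_OFFSET = 0.0    # degrees C at zero volts (placeholder value)

def calculate_T(voltage):
    T = voltage * T_SLOPE + T_OFFSET
    T = round(float(T), 1)
    return T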

After the voltages are converted, a message is constructed in the format <time stamp>, <x-axis time value for plot>, <temperature>, <heat flux> and is published to the exchange.

Read & Save Data from Broker

broker_to_host.py was executed earlier to prepare the host computer to read data from the RabbitMQ message broker. Similar to the script used to send data to the broker, broker_to_host.py begins by using the argparse module to read the arguments defined in the execution command:

parser = argparse.ArgumentParser()
parser.add_argument('broker', help='network or IP address of message broker')
parser.add_argument('log_file', help='Location of log file')
args = parser.parse_args()

Then, Pika is used to establish a connection with the message broker and data are read from the broker and saved to log_file.csv by calling the user defined function callback:

connection = pika.BlockingConnection(pika.ConnectionParameters(host=args.broker))
channel = connection.channel()
channel.exchange_declare(exchange='logs', type='fanout')
result = channel.queue_declare(exclusive=True)  # create a temporary, broker-named queue
queue_name = result.method.queue
channel.queue_bind(exchange='logs', queue=queue_name)
channel.basic_consume(callback, queue=queue_name, no_ack=True)
channel.start_consuming()
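The callback function itself is not shown in the post. A minimal sketch of what it might do, assuming the comma-separated message format described earlier, is:

def callback(ch, method, properties, body):
    # Pika delivers the message body as bytes; append the record to the host log file.
    line = body.decode('utf-8')
    with open(args.log_file, 'a+') as text_file:
        text_file.write(line + '\n')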

Generate Bokeh Plots from Saved Data

The Bokeh server is used to publish and update plots of the sensor data in real time. The command bokeh-server is used to run the Bokeh server on the host computer. Next, a Python script is executed to create, display, and update the plots as new data are received. The script creates plots in a manner similar to that outlined below.

import pandas as pd
from bokeh.plotting import figure, vplot, output_notebook, show
from bokeh.models import (HoverTool, LinearAxis, Range1d, GlyphRenderer)
output_notebook()
# Bokeh options for interaction
TOOLS="pan,box_zoom,reset,resize,save,hover"

# Read in example log file
log_data = pd.read_csv('log_file.csv', index_col=0)
time_x = log_data.iloc[ : , 1]
T_data = log_data.iloc[ : , 2]
HF_data = log_data.iloc[ : , 3]

# Format plots and display data
p1 = figure(title='FF Helmet - Ambient Temp', x_axis_label = 'Time (s)', y_axis_label = 'Temperature (°C)', 
            tools=TOOLS, plot_width=750, plot_height=500)
p1.line(time_x, T_data, color='red', line_width = 3, legend='Amb T')

p2 = figure(title='FF Helmet - Heat Flux', x_axis_label = 'Time (s)', y_axis_label = 'Heat Flux (kW/m²)', 
            tools=TOOLS, plot_width=750, plot_height=500)
p2.line(time_x, HF_data, color='blue', line_width = 3, line_dash = 'dashed', legend='Heat Flux')

p = vplot(p1, p2)
show(p)

Data from an example file were plotted here. The code used to plot the helmet sensor data in real time incorporates a while loop to read the log file and update the data on the Bokeh server (and thus the plots) every second:

# Assign data source to use when rendering glyphs to update data
renderer = p1.select(dict(type=GlyphRenderer))
ds1 = renderer[0].data_source

renderer = p2.select(dict(type=GlyphRenderer))
ds2 = renderer[0].data_source

while True:
    # Update plots with data
    ds1.data["x"] = time_x
    ds1.data["y"] = T_data
    ds1._dirty = True
    cursession().store_objects(ds1)

    ds2.data["x"] = time_x
    ds2.data["y"] = HF_data
    ds2._dirty = True
    cursession().store_objects(ds2)

    time.sleep(1)

    # read log file
    new_data = pd.read_csv('../Data/log_file.csv', index_col=0)
    time_x = new_data.iloc[ : , 1]
    T_data = new_data.iloc[ : , 2]
    HF_data = new_data.iloc[ : , 3]

A video (2X speed) showing the real-time data plotting in action is presented below.

import io
import base64
from IPython.display import HTML

video = io.open('media/realtime_plotting.mp4', 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<video controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii')))

By implementing basic Python code using Bokeh and other tools, data from a firefighter's immediate environment in fire scenarios are now able to be transmitted and plotted in real-time.

References

  1. Willi J, Horn G, Madrzykowski D (2015) Characterizing a Firefighter's Immediate Thermal Environment in Live-Fire Training Scenarios. Fire Technology. Link

by Max Gamurar at January 18, 2016 06:03 AM

January 15, 2016

William Stein

Thinking of using SageMathCloud in a college course?

SageMathCloud course subscriptions

"We are  college instructors of the calculus sequence and ODE’s.  If the college were to purchase one of the upgrades for us as we use Sage with our students, who gets the benefits of the upgrade?  Is is the individual students that are in an instructor’s Sage classroom or is it the  collaborators on an instructor’s project?"

If you were to purchase just the $7/month plan and apply the upgrades to *one* single project, then all collaborators on that one project would benefit from those upgrades while using that project.

If you were to purchase a course plan for say $399/semester, then you could apply the upgrades (network access and members only hosting) to 70 projects that you might create for a course. When you create a course by clicking +New, then "Manage a Course", then add students, each student has their own project created automatically. All instructors (anybody who is a collaborator on the project where you clicked "Manage a course") are also added to the student's project. In course settings you can easily apply the upgrades you purchase to all projects in the course.

Also I'm currently working on a new feature where instructors may choose to require all students in their course to pay for the upgrade themselves.  There's a one time $9/course fee paid by the student and that's it.  At some colleges (in some places) this is ideal, and at other places it's not an option at all.   I anticipate releasing this very soon.





Getting started with SageMathCloud courses


You can fully use the SMC course functionality without paying anything in order to get familiar with it and test it out.  The main benefit of paying is that you get network access and all projects get moved to members only servers, which are much more robust; also, we greatly prioritize support for paying customers.   

This blog post is an overview of using SMC courses:

  http://www.beezers.org/blog/bb/2015/09/grading-in-sagemathcloud/

This has some screenshots and the second half is about courses:

  http://blog.ouseful.info/2015/11/24/course-management-and-collaborative-jupyter-notebooks-via-sagemathcloud/

Here are some video tutorials made by an instructor that used SMC with a large class in Iceland recently:

  https://www.youtube.com/watch?v=dgTi11ZS3fQ
  https://www.youtube.com/watch?v=nkSdOVE2W0A
  https://www.youtube.com/watch?v=0qrhZQ4rjjg

Note that the above videos show the basics of courses, then talk specifically about automated grading of Jupyter notebooks.  That might not be at all what you want to do -- many math courses use Sage worksheets, and probably don't automate the grading yet.

Regarding using Sage itself for teaching your courses, check out the free pdf book to "Sage for Undergraduates" here, which the American Mathematical Society just published (there is also a very nice print version for about $23):

   http://www.gregorybard.com/SAGE.html

by William Stein (noreply@blogger.com) at January 15, 2016 09:14 AM

Continuum Analytics news

Numba 0.23 Has Been Released!

Posted Friday, January 15, 2016

With this latest release of Numba, we're excited to be able to deliver several frequently requested features:

  • JIT classes
  • np.dot support in nopython mode
  • a multithreaded CPU target for @guvectorize.

Experimental Support for JIT-friendly Classes

Numba derives its performance from the combination of two important ingredients:

  1. Data structures that can be accessed directly, bypassing the Python interpreter (such as the NumPy ndarray).
  2. Just-in-time generation of machine code by the LLVM compiler library.

While much of our focus is on the compiler-side of things, we also want to promote machine-friendly data structures in Python. We look forward to someday being able to pass Pandas DataFrames, xray DataArrays, and DyND arrays to Numba-compiled functions. (More on that below...)

However, sometimes you just want to use some simple objects as your data structure, along with compiled methods that operate on the attributes. For those cases, we've created a new decorator that can be applied to a Python class: @jitclass.

Here's an example of how this works:

# conda create -n numba_023_test python=3.4 numba scipy bokeh jupyter
import numpy as np
from numba import jit, jitclass, int64, float64, guvectorize
from bokeh.plotting import figure, output_notebook, show
output_notebook()

 

@jitclass([    
    ('xmin', float64),
    ('xmax', float64),
    ('nbins', int64),
    ('xstep', float64),
    ('xcenter', float64[:]),
    ('bins', int64[:]),
    ('moments', float64[:])
])
class Hist1D(object):
    '''A 1D histogram that can be updated, and computes mean and stddev incrementally'''
    def __init__(self, xmin, xmax, nbins):
        self.xmin = xmin
        self.xmax = xmax
        self.nbins = nbins
        self.xstep = (xmax - xmin) / nbins
        self.xcenter = self.xmin + (np.arange(nbins) + 0.5) * self.xstep
        self.bins = np.zeros(self.nbins, dtype=np.int64)
        self.moments = np.zeros(3, dtype=np.float64)
    
    def fill_many(self, values):
        for value in values:
            bin_index = np.int64((value - self.xmin) / self.xstep)
            if 0 <= bin_index < len(self.bins):
                self.bins[bin_index] += 1
                self.moments[0] += 1
                self.moments[1] += value
                self.moments[2] += value**2
    
    @property
    def count(self):
        return np.int64(self.moments[0])
    
    @property
    def mean(self):
        return self.moments[1] / self.moments[0]
    
    @property
    def stddev(self):
        return np.sqrt(self.moments[2] / self.moments[0] - self.mean**2)
        

 

This example shows all the basic features that @jitclass currently supports. The attributes and Numba types are described in a specification that is passed to the @jitclass decorator.

h = Hist1D(-4, 4, 25)
h.fill_many(np.random.normal(size=5000))

fig = figure(plot_width=600, plot_height=300)
fig.line(h.xcenter, h.bins)
show(fig)

print('Count: %f, Mean: %f, StdDev: %f' % (h.count, h.mean, h.stddev))
Count: 4999.000000, Mean: -0.013222, StdDev: 0.997465

The great thing about JIT classes is that they can also be passed to any nopython mode functions:

@jit(nopython=True)
def add_uniform_noise(hist, noise_fraction):
    '''Add uniformly distributed noise to a histogram.
    The final histogram will have the specified fraction of noise samples.
    '''
    n = np.int64(hist.count / (1 - noise_fraction))
    samples = np.empty(n, dtype=np.float64)
    for i in range(n):
        samples[i] = np.random.uniform(hist.xmin, hist.xmax)
    hist.fill_many(samples)
    

Let's try it out:

h2 = Hist1D(-4, 4, 25)
h2.fill_many(np.random.normal(size=5000))

add_uniform_noise(h2, noise_fraction=0.3)

fig = figure(plot_width=600, plot_height=300, y_range=(0, h2.bins.max() * 1.1))
fig.line(h2.xcenter, h2.bins)
_ = show(fig)

One important caveat about JIT classes is that access to attributes from Python will be significantly slower than a normal Python object:

class PythonObject(object):
  def __init__(self):
      self.count = 1

python_obj = PythonObject()

%timeit python_obj.count + 2
%timeit h2.nbins + 2
10000000 loops, best of 3: 103 ns per loop
The slowest run took 5270.13 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 8.28 µs per loop

This performance difference is because JIT classes store their attribute data in a non-Python object form, so when the attribute value is requested from the Python interpreter, Numba has to wrap the value in a new Python object to return it. This is similar to the tradeoffs associated with accessing NumPy array elements from Python. In moderation, attribute access is fine, but if you need to do it frequently, then you should consider compiling that code with Numba as well.
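For example (this snippet is not from the original post), moving attribute-heavy code into a nopython-mode function keeps the attribute reads in compiled code and avoids the boxing overhead measured above:

@jit(nopython=True)
def nbins_plus(hist, k):
    # attribute access happens inside compiled code, with no Python-level boxing
    return hist.nbins + k

print(nbins_plus(h2, 2))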

Our support for JIT classes is very limited today. Only nopython mode is supported, and our interface for handling recursive types (types that contain instances of themselves) is still in flux. We will be working to remove the rough spots, and expand the functionality to cover more use cases over the next few releases.

Initial support for np.dot in nopython mode (requires SciPy 0.16 or later)

This simple-sounding request turned out to be much more complicated because we wanted to make sure that Numba would use the same high performance BLAS library (MKL, OpenBLAS, etc.) likely being used by NumPy and SciPy. We discovered that SciPy exports C-callable BLAS functions for Cython which Numba will also take advantage of. Those SciPy functions in turn will call whatever BLAS implementation SciPy was configured with.

The feature itself is fairly straightforward. We support the np.dot() function in nopython mode for contiguous arrays when doing:

  • 1D vector × 1D vector dot product
  • 2D matrix × 1D vector multiplication
  • 2D matrix × 2D matrix multiplication

Future releases will implement the broadcast rules for higher dimensional dot products, and also support non-contiguous arrays by copying to temporary storage before calling the BLAS library. (Note that np.dot inside nopython mode will be no faster than outside nopython mode, since both will use the same optimized BLAS library to do the heavy lifting.)
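As a small illustrative sketch of my own (not from the release notes), calling np.dot on contiguous arrays inside a nopython-mode function looks like this:

from numba import jit
import numpy as np

@jit(nopython=True)
def matvec(A, x):
    # 2D matrix x 1D vector product on C-contiguous inputs; np.dot
    # dispatches to the BLAS routines exposed through SciPy
    return np.dot(A, x)

A = np.random.uniform(size=(500, 500))
x = np.random.uniform(size=500)
y = matvec(A, x)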

Multithreaded CPU @guvectorize

We love Universal Functions ("ufuncs") and Generalized Universal Functions ("gufuncs")! They are an underappreciated abstraction for expressing array-oriented functions that are intuitive to write and easy for a compiler to parallelize. In fact, many people may not realize that Numba supports four different targets for both ufuncs and gufuncs:

  • target=cpu: Single-threaded, CPU execution
  • target=parallel: Multi-threaded, CPU execution
  • target=cuda: Execution on NVIDIA GPUs that support CUDA (most of them)
  • target=hsa: Execution on AMD APUs that support HSA (Kaveri and Carrizo)

With the Numba 0.22.1 release, we open sourced the parallel, cuda, and hsa targets that had been in our numbapro package (see Deprecating NumbaPro: The New State of Accelerate in Anaconda for more details).

However, there was one missing implementation: @guvectorize(target=parallel). During the Numba 0.23 release cycle we filled in that gap, so now you can take advantage of gufuncs on your multicore processors:

@guvectorize([(float64[:], float64[:])], '(n)->()')
def l2norm_cpu(vec, result):
    result[0] = np.sqrt((vec**2).sum())

@guvectorize([(float64[:], float64[:])], '(n)->()', target='parallel')
def l2norm_parallel(vec, result):
    result[0] = np.sqrt((vec**2).sum())
    

On my quad-core MacBook Pro laptop:

n = 100000
dims = 10
random_vectors = np.random.uniform(size=n*dims).reshape((n, dims))
%timeit l2norm_cpu(random_vectors)
%timeit l2norm_parallel(random_vectors)
10 loops, best of 3: 48 ms per loop
10 loops, best of 3: 19.5 ms per loop

For more details on gufuncs, check out our @guvectorize documentation.

What's Next?

In the next release cycle, we'll be refining and improving the features described above, upgrading Numba to LLVM 3.7, and documenting an official API for 3rd parties to extend Numba. Our hope is that this will allow other libraries to add new types and function implementations to the Numba compiler without needing to modify the Numba codebase itself. This opens up a lot of possibilities (nopython support for accessing Pandas DataFrames, anyone?) and we look forward to seeing what you can build with Numba.

As always, you can install the latest Numba with conda:

conda install numba # Don't forget to install scipy for np.dot support!

or find our release at PyPI.

If you have questions or suggestions, we welcome feedback in our GitHub repository and on our mailing list.

(You can download this blog post in notebook form from https://notebooks.anaconda.org/seibert/numba-0-23-release)

 

by Max Gamurar at January 15, 2016 06:17 AM

January 14, 2016

Continuum Analytics news

Anaconda Cloud: Introducing Labels

Posted Wednesday, January 20, 2016

Today we’re introducing a new concept in Anaconda Cloud called labels. You may have known them previously as “channels,” but Continuum’s conda package management tool also has 'channels' with a different meaning, so we've renamed Anaconda Cloud 'channels' to 'labels' to avoid confusion going forward.

Labels allow you to organize your package files on Anaconda Cloud into categories. They are similar to folders, however, unlike folders, you can apply more than one label to a single file.

Why are labels useful?

We at Continuum are using labels internally to move our package files through the development, testing, and production lifecycle.

How do I find out more?

The documentation on labels, which can be found here, provides examples of how to create labels and use them in both the Anaconda Cloud web interface and the anaconda-client command-line tool.

What if I need more help?

If you have any questions about the new functionality, you can contact us at https://anaconda.org/contact or report an issue at https://anaconda.org/contact/report

by swebster at January 14, 2016 10:45 PM

January 13, 2016

Titus Brown

Why software development practice matters, Containerization version

A while back, Kai Blin (via Nick Loman) asked Michael Barton:

If we containerize all these things won't it just encourage worse software development practices; right now developers still need to consider someone other than themselves installing the software.

and Michael Barton's response, transcribed, was:

"It's a good point. Ultimately, though, if I can get a container, and it works, and I know it will work, do you care how well it was developed? I think if a software developer uses unit tests, feature tests, and all those things, it will be easier for themselves to develop the software, but if I can get a container and even if it was, say, developed with terrible software practices, as long as it works for me and nucleotide.es will test it and make sure it works, then do you care?"

The answer is unambiguously yes, of course. Here's today's anecdote:

We've spent a few months working with a fairly widely used software package in bioinformatics. We found a really obvious but rather minor bug in it a month or two ago, and in tracking down the location of the bug, we spent a lot of time looking at the source code.

Because we looked at it, we no longer trust the package.

The source code is long, complicated, poorly modularized, and (in the place we found the bug) is largely incomprehensible. It comes with no test suite. In order to try to fix the bug, we would have to first create a bunch of our own tests of the larger code base. We are therefore declining to fix the bug (although it's obvious enough that we will probably contact the developers).

Did I mention that it's widely used?


So, in response to Michael:

You can containerize bad code all you want, and now you'll have nice, easy-to-install code that is producing wrong answers.

Yay! Wrong answers more quickly!!!


We need to recognize that most code is broken, and that bad code is *much* more likely to be wrong. If you're not using three of the most obvious lines of defenses against broken code (small functions, version control, and automated tests) then your code is probably wrong. Full stop.


Back to the software with the bug:

I'm honestly not sure what to do next. Our options are --

  1. File a bug report, stop using the package, and be on our merry way.
  2. Dive into the code base, build a test suite, and put the time into verifying someone else's code so that we and others can use it reliably.
  3. Dive into the code base, find more bugs, and then blog about how unreliable the software is - the "name them and shame them" approach. (Doing this without finding more actual bugs would be unfair IMO; at the moment it's just an intuition, albeit a strong one.)

I suppose I could contact the developers and have a frank and robust discussion with them about how they need to test their software, but I don't know them very well and think it would turn into a big time suck.

But I am also concerned about taking time away from other things to work on this. The time tradeoffs are considerable - I have a spouse & children, meetings with postdocs/grad students/collaborators, training workshops, and exercise that occupy essentially all my time. Should I spend what little time I have to think and write and code on this?

A fourth option that I am leaning towards is "testing-lite" - test the key bits that we care about, and then do hack-y or low-performance reimplementations of the bits that we really care about, and then compare results.

A fifth approach (that would probably take too much time) would be to combine 2 and 3 above with refactoring of the code base itself, followed by re- or co-publication. This is probably the right approach.


More generally, what do we do about this kind of thing as a community? What do people think about 1-5 above? Are there approaches that I'm missing here?

--titus

p.s. I'd appreciate it if those who know what software I'm talking about could avoid outing it; thanks.

by C. Titus Brown at January 13, 2016 11:00 PM

January 12, 2016

Pierre de Buyl

The threefry random number generator

Pseudo-random number generation has many applications in computing. The use of random numbers in parallel is becoming necessary, and a recent solution is the use of so-called block cipher algorithms. In this post, I review the mechanism behind the Threefry random number generator, which is based on the Threefish block cipher. The article below is actually a Jupyter notebook and can be downloaded from my GitHub account: see the notebook | download the notebook (a right-click may help). The code is 3-clause BSD and the content CC-BY.

As a teaser: Threefry is Crush-resistant, easy to implement, and suitable for parallel applications, and a working pure Python implementation is presented below! I was motivated to understand this topic because the nano-dimer program by Peter Colberg uses it. Requirements for a Pseudorandom Number Generator (PRNG) include good performance and low memory usage (critical for GPUs). A good overview of these considerations is given in Salmon et al., doi:10.1145/2063384.2063405.

Threefish

The Threefish block cipher is part of Skein, a candidate in the NIST SHA-3 hash function competition. A block cipher is a reversible transformation of a sequence of bytes that depends on a user-chosen key. Threefish (see the Wikipedia page and the official specification) is built from several simple components:

  • a mix function that operates on words
  • a permutation function $\pi$ that operates on several consecutive words
  • a key schedule, that defines how the information in the key is applied within the algorithm

Here, a word is defined as a sequence of 64 bits. In practice, it is stored as an unsigned 64-bit integer. All sums are computed modulo $2^{64}$.

The full algorithm, going from a series of "plain text" words $p_i$ to the corresponding "cipher" words $c_i$, consists of applying the following steps for $N_r$ rounds:

  1. every fourth round, add the key to the current value of the data
  2. apply the mix function to two consecutive words at a time
  3. permute the words according to $\pi$

After the last round, a final addition of the key is applied and the full data is XOR-ed with the initial value $p$. In Threefish, the key schedule is modified by a tweak, an extra set of two words. The specification provides:

  • a set of rotation constants for the mix function
  • the constant C240 that is part of the key schedule
  • the permutation function $\pi$
  • the number of rounds.

Threefish is specified for blocks of 256, 512 and 1024 bits. Below is a plain implementation of Threefish-256, operating on 4-word blocks, in Python (the code is intended for Python 3). The test reproduces the example found on the official NIST CD release of Skein. Intermediate steps are printed by adding the argument "debug=True" and follow the step-by-step content of the file KAT_MCT/skein_golden_kat_internals.txt.

The code reproduces section 3.3 of the skein specification, with identical or close notation for all symbols, and is hopefully self-explaining.

In [62]:
NW = 4
C240 = 0x1BD11BDAA9FC1A22
N_ROUNDS=72
MASK = 2**64-1
pi = (0, 3, 2, 1)
R_256 = ((14, 16), (52, 57), (23, 40), (5, 37), (25, 33), (46, 12), (58, 22), (32, 32))

def rotl_64(x, d):
    return ((x << d) | (x >> (64-d))) & MASK

def mix(x0, x1, R):
    y0 = (x0+x1) & MASK
    y1 = rotl_64(x1, R) ^ y0
    return y0, y1

def key_schedule(K, TW, s):
    return (K[(s)%(NW+1)] & MASK,
              (K[(s+1)%(NW+1)] + TW[s%3]) & MASK,
              (K[(s+2)%(NW+1)] + TW[(s+1)%3]) & MASK,
              (K[(s+3)%(NW+1)] + s) & MASK)

def threefish(p, K, TW, debug=False):
    K = (K[0], K[1], K[2], K[3], C240^K[0]^K[1]^K[2]^K[3])
    TW = (TW[0], TW[1], TW[0]^TW[1])
    v = list(p)
    for r in range(N_ROUNDS):
        e = [0]*NW
        if r%4 == 0:
            ksi = key_schedule(K, TW, r//4)
            for i in range(NW):
                e[i] = (v[i] + ksi[i]) & MASK
            if debug: print('key injection   ', list(map(hex, e)))
        else:
            e = v
        f = [0]*NW
        f[0], f[1] = mix(e[0], e[1], R_256[r%8][0])
        f[2], f[3] = mix(e[2], e[3], R_256[r%8][1])
        if (r%2 == 0) and debug: print('end of round %03i' % (r+1), list(map(hex, f)))
        for i in range(NW):
            v[i] = f[pi[i]]
        if (r%2 == 1) and debug: print('end of round %03i' % (r+1), list(map(hex, v)))

    ksi = key_schedule(K, TW, N_ROUNDS//4)
    v = [((x + k)^pp) & MASK for x, k, pp in zip(v, ksi, p)]
    if debug: print(list(map(hex, v)))

    return v
In [64]:
# Test run with parameters from the NIST CD, file KAT_MCT/skein_golden_kat_internals.txt

K = (0x0,)*NW
TW = (0x0,0x0)

c = threefish((0x0,)*NW, K, TW)
print([hex(x) for x in c])

K = (0x1716151413121110, 0x1F1E1D1C1B1A1918, 0x2726252423222120, 0x2F2E2D2C2B2A2928)
TW = (0x0706050403020100, 0x0F0E0D0C0B0A0908)

c = threefish((0xF8F9FAFBFCFDFEFF, 0xF0F1F2F3F4F5F6F7, 0xE8E9EAEBECEDEEEF, 0xE0E1E2E3E4E5E6E7), K, TW)
print([hex(x) for x in c])
['0x94eeea8b1f2ada84', '0xadf103313eae6670', '0x952419a1f4b16d53', '0xd83f13e63c9f6b11']
['0x277610f5036c2e1f', '0x25fb2add1267773e', '0x9e1d67b3e4b06872', '0x3f76bc7651b39682']

Block ciphers, random numbers and Threefry

The motivation for this post (and the corresponding code) is to generate random numbers. The use of cryptographic protocols as keyed counter-based pseudorandom number generators is presented in Salmon et al., "Parallel Random Numbers: As Easy as 1, 2, 3", doi:10.1145/2063384.2063405, Proceedings of SC'11. The use of the Skein algorithm (based on Threefish) to generate random numbers is already mentioned in its specification, but the paper by Salmon et al. goes further by specializing the algorithm (and others as well) for use in a PRNG context. I present here their specialization of Threefish: Threefry. Threefry is based on the Threefish block cipher, with three shortcuts: the number of rounds is reduced, the tweak that influences the key schedule is removed, and the final XOR-ing of the output with the input data is removed. This latter shortcut is not explicit in the manuscript, as far as I understand. I concentrate here on Threefry-2x64, that is, two 64-bit words and 20 rounds.

Whereas the prototypical PRNG steps from integer value to integer value using (more or less) complex iteration schemes, a counter based PRNG gives a random value directly for any point in the sequence at a fixed computational cost. The purpose of a cryptographic algorithm is to make it hard to guess the input from the output. Changing the input value from $i$ to $i+1$ will output an uncorrelated value.

The algorithm depends on a key of two words. One of them can be used as the seed and the other to identify one stream (i.e. a compute unit in a parallel computing environment). The position in the random stream is given as the "plain text" word.

Practically, it means that instead of keeping a state for the PRNG it is possible to request random_number(counter, key) where random_number is a pure function, counter is simply the iteration number that is requested and key is used to identify the stream (including a stream "id" and a seed).

The authors of the paper provide an implementation at http://www.thesalmons.org/john/random123/ or http://www.deshawresearch.com/resources_random123.html

As I wanted to understand the algorithm and have a working Threefish above, here is a Python version of Threefry-2x64.

In [34]:
NW_THREEFRY = 2
C240 = 0x1BD11BDAA9FC1A22
N_ROUNDS_THREEFRY = 20
MASK = 2**64-1
R_THREEFRY = (16, 42, 12, 31, 16, 32, 24, 21)

def key_schedule_threefry(K, s):
    return (K[(s)%(NW_THREEFRY+1)] & MASK,
              (K[(s+1)%(NW_THREEFRY+1)] + s) & MASK)

def threefry(p, K, debug=False):
    K = (K[0], K[1], C240^K[0]^K[1])
    v = list(p)
    for r in range(N_ROUNDS_THREEFRY):
        if r%4 == 0:
            ksi = key_schedule_threefry(K, r//4)
            e = [ (v[0]+ksi[0]) & MASK, (v[1]+ksi[1]) & MASK ]
            if debug: print('key injection   ', list(map(hex, e)))
        else:
            e = v
        v[0], v[1] = mix(e[0], e[1], R_THREEFRY[r%8])
        if debug: print('end of round %03i' % (r+1), list(map(hex, v)))

    ksi = key_schedule_threefry(K, N_ROUNDS_THREEFRY//4)
    v = [(x + k) & MASK for x, k in zip(v, ksi)]
    if debug: print(list(map(hex, v)))

    return v

Test output from the Random123 distribution, in the file Random123-1.08/examples/kat_vectors.

In [57]:
print(list(map(hex, threefry((0x0,0x0), (0x0, 0x0)))))
ff = 0xffffffffffffffff
print(list(map(hex, threefry((ff,ff), (ff, ff)))))
print(list(map(hex, threefry((0x243f6a8885a308d3, 0x13198a2e03707344), (0xa4093822299f31d0, 0x082efa98ec4e6c89)))))
['0xc2b6e3a8c2c69865', '0x6f81ed42f350084d']
['0xe02cb7c4d95d277a', '0xd06633d0893b8b68']
['0x263c7d30bb0f0af1', '0x56be8361d3311526']

Using Threefry for random numbers

Below, I illustrate the use of Threefry as one would use a typical PRNG. A seed is chosen, from which the key K is built. Then a loop uses the threefry function to generate two 64-bit words per iteration. The data are collected and shown in a histogram and a next-point correlation plot. Also, the random numbers are used to estimate $\pi$ via the Monte Carlo method.

In [87]:
SEED = 0x1234
K = (0x0, SEED)
MULT = 1/2**64
data = []
total_count = 10000
pi_count = 0
for i in range(total_count):
    c0, c1 = threefry((i, 0x0), K)
    c0 *= MULT
    c1 *= MULT
    data.append(c0)
    if c0**2+c1**2<1:
        pi_count += 1

print("Estimate of pi", 4*pi_count/total_count)
Estimate of pi 3.1228
In [88]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
In [89]:
data = np.array(data)
plt.hist(data);
In [90]:
plt.scatter(data[0::2], data[1::2], s=3); plt.xlim(0,1); plt.ylim(0,1);

Final comments

By dissecting the algorithm and presenting a pure Python version, I hope I have made Threefry easier to understand for others who are not familiar with cryptography. The code presented here is purely illustrative in the sense that it provides no performance benefit; a C version will likely be added to my own random module. The stateless nature of Threefry makes it suited for OpenMP shared-memory parallelism, where one does not even need a thread-private state: only the thread ID needs to be given to the PRNG. If a state ends up being needed, only the sequence number needs to be tracked.
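To make the stateless usage concrete, here is a small sketch of my own (stream_value is a hypothetical helper built on the threefry function defined above): each worker derives its own stream from a stream id placed in the key, and any position in the stream is obtained directly from the counter.

SEED = 0x1234

def stream_value(counter, stream_id, seed=SEED):
    # pure function: no PRNG state to carry around, only (counter, key)
    c0, c1 = threefry((counter, 0x0), (stream_id, seed))
    return c0 / 2**64, c1 / 2**64  # map the two 64-bit words to [0, 1)

print(stream_value(0, stream_id=0))
print(stream_value(0, stream_id=1))  # same counter, different stream id: unrelated values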


by Pierre de Buyl at January 12, 2016 10:00 AM

January 10, 2016

Titus Brown

Docker workshop at BIDS - post-workshop report

We just finished the second day of a workshop on Docker at the Berkeley Institute for Data Science. I was invited to organize the workshop after some Berkeley folk couldn't make our Davis workshop in November, and so I trundled on down for two days to give it a try. About 15 people showed up, from all walks of research and engineering. (The materials are freely available under a CC0 license.) I did the intro stuff, and Luiz Irber and Carl Boettiger both taught sections on their more advanced uses of Docker.

tl;dr? I think the workshop went well, in the sense that the attendees all got some hands-on experience with Docker and engaged with its potential benefits (as well as some of its limitations).

A special thanks to BIDS for hosting me & putting me up overnight!


We delivered the workshop in a Software Carpentry style, with lots of command-line interaction and plenty of time for questions. People were welcome to follow along with what I was doing on the screen, or just watch; in either case, we tried to provide fairly thorough notes so that people could refer back to them in the future.

The workshop was attended by a variety of people, including software engineering folk, infrastructure folk, supercomputing folk, librarians, and grad students / postdocs. Some people had a fair bit of experience with Docker, while others had only vaguely heard of it; despite this spread in experience I am told everyone got something out of it. The experienced folk got to see how we used it, while the people new to Docker got a somewhat thorough overview of the technology and its potential applications.

The workshop focused on scientific applications of Docker, but we had to get through what Docker was first! Day one was mostly about the basics - Docker containers and images, docker-machine, building docker images, data volumes, directory mirroring, and exposing ports. Day two went a bit further afield and talked about mybinder, docker-compose and docker-swarm, and some potential workflows for data-intensive applications.

A few random thoughts --

  • I still don't really know how to teach Docker, because it's a mix of super obvious (for the sysadmin and CS-types) and super detailed (to, you know, actually use it). I think the hands-on approach has to beat more conceptual or theoretical approaches here :). The concept is clear enough for CS people, but the details of the implementation (and how it all actually works) are really at the heart of putting Docker to use and figuring out where it fits.

    This time, I started with local docker and then introduced docker-machine. I might introduce docker-machine first next time, to make it clear that local volume mapping is a super-special feature that you can't rely on, and to introduce the point that even on your Mac/Windows machine, the port mapping is going to be confusing!

  • Docker doesn't run on HPCs or other multi-tenant systems very well, which everyone recognizes as a problem. There was some discussion of Shifter, which is a Docker-like technology intended to address this deficit.

    (I'm happy to sign on to support a grant proposal if someone can credibly claim to be able to deal with the problems of Docker running as root. I can also broker introductions to people who might pay for it :).)

  • Docker still isn't great in terms of data management. Data volumes are poor vessels for communicating data. Carl Boettiger showed us an approach based on Flocker that "required only a little configuration." I should probably look into this...

  • We don't have a clear use case for docker-compose in science!

  • The ease with which you can get an RStudio Server or a Jupyter notebook running inside of Docker is just astonishing.

  • Everyone was pretty impressed with how easy and obvious mybinder.org is, and seemed to think that it would fit into an only slightly evolved data science workflow. I really want to explore the seams between long-running data-intensive tasks, and the all-data-fits-in-github short-run analysis supported by mybinder...

    I put together two demos for mybinder, one of a Software Carpentry data set and one showing how to do some significant software installs.

    It's a super cool service!

    (There are two other similar services/packages I want to look at - JupyterHub and Everware. Pointers to intro tutorials welcome!)

  • I'm really interested in suggesting Docker as a way to deal with installation issues in workshops - if only as a backup. The two things standing in the way of that are (a) older laptops and (b) my inexperience with running Docker on Windows.

    Some mixtures of jupyterhub/mybinder/everware might be an alternative solution. But then the network becomes an issue. Heck, at a recent workshop on training, we actually brought up the idea of carting around a pile of Raspberry Pis to run local compute clusters for training purposes...

--titus

by C. Titus Brown at January 10, 2016 11:00 PM

January 08, 2016

William Stein

Mathematics Graduate School: preparation for non-academic employment

This is about my personal experience as a mathematics professor whose students all have non-academic jobs that they love. This is in preparation for a panel at the Joint Mathematics Meetings in Seattle.

My students and industry
My graduated Ph.D. students:
  • 3 at Google
  • 1 at Facebook
  • 1 at CCR
My graduating student (Hao Chen):
  • Applying for many postdocs
  • But just did summer internship at Microsoft Research with Kristin. (I’ve had four students do summer internships with Kristin)
All my students:
  • Have done a lot of software development, maybe having little to do with math, e.g., “developing the Cython compiler”, “transition the entire Sage project to git”, etc.
  • Did a thesis squarely in number theory, with significant theoretical content.
  • Felt guilt (or guilty pleasure?) about spending time on some programming tasks instead of doing what they are “supposed” to do as math grad students.

Me: academia and industry

  • Math Ph.D. from Berkeley in 2000; many students of my advisor (Lenstra) went to work at CCR after graduating…
  • Academia: I’m a tenured math professor (since 2005) – number theory.
  • Industry: I founded a Delaware C Corp (SageMath, Inc.) one year ago to “commercialize Sage” due to VERY intense frustration trying to get grant funding for Sage development. Things have got so bad, with so many painful stupid missed opportunities over so many years, that I’ve given up on academia as a place to build Sage.
Reality check: Academia values basic research, not products. Industry builds concrete valuable products. Not understanding this is a recipe for pain (at least it has been for me).

Advice for students from students

Robert Miller (Google)

My student Robert Miller’s post on Facebook yesterday: “I LOVE MY JOB”. Why: “Today I gave the first talk in a seminar I organized to discuss this result: ‘Graph Isomorphism in Quasipolynomial Time’. Dozens of people showed up, it was awesome!”
Background: When he was my number theory student, working on elliptic curves, he gave a talk about graph theory in Sage at a Sage Days (at IPAM). His interest there was mainly in helping an undergrad (Emily Kirkman) with a Sage dev project I hired her to work on. David Harvey asked: “what’s so hard about implementing graph isomorphism”, and Robert wanted to find out, so he spent months doing a full implementation of Brendan McKay’s algorithm (the only other one). This had absolutely nothing to do with his Ph.D. thesis work on the Birch and Swinnerton-Dyer conjecture, but I was very supportive.

Craig Citro (Google)

Craig Citro did a Ph.D. in number theory (with Hida), but also worked on Sage a LOT as a grad student and postdoc. He’s done a lot of hiring at Google. He says: “My main piece of advice to potential Google applicants is ‘start writing as much code as you can, right now.’ Find out whether you’d actually enjoy working for a company like Google, where a large chunk of your job may be coding in front of a screen. I’ve had several friends from math discover (the hard way) that they don’t really enjoy full-time programming (any more than they enjoy full-time teaching?).”
“Start throwing things on github now. Potential interviewers are going to check out your github profile; having some cool stuff at the top is great, but seeing a regular stream of commits is also a useful signal.”

Robert Bradshaw (Google)

“A lot of mathematicians are good at (and enjoy) programming. Many of them aren’t (and don’t). Find out. Being involved in Sage is significantly more than just having taken a suite of programming courses or hacking personal scripts on your own: code reviews, managing bugs, testing, large-scale design, working with others’ code, seeing projects through to completion, and collaborating with others, local and remote, on large, technical projects are all important. It demonstrates your passion.”

Rado Kirov (Google)

"Robert Bradshaw said it before me, but I have to repeat. Large-scale software development requires exposure to a lot of tooling and process beyond just writing code - version control, code reviews, bug tracking, code maintenance, release process, coordinating with collaborators. Contributing to an active open-source project with a large number of contributors, like Sage, is a great way to experience all that and see if you would like to make it your profession. A lot of mathematicians write clever code for their research, but if fewer than 10 people see it and use it, it is not a realistic representation of what working as a software engineer feels like.

The software industry has a large demand for developers, and hiring straight from academia is very common. Before I got hired by Google, the only software development experience on my resume was the Sage graph editor. Along with a solid understanding of algorithms and data structures, that was enough to get in."

David Moulton (Google)

“Google hires mathematicians now as quantitative analysts = data engineers. Google is very flexible for a tech company about the backgrounds of its employees. We have a long-standing reading group on category theory, and we’re about to start one on Babai’s recent quasi-polynomial-time algorithm for graph isomorphism. And we have a math discussion group with lots of interesting math on it.”

My advice for math professors

Obviously, encourage your students to get involved in open source projects like Sage, even if it appears to be a waste of time or a distraction from their thesis work (this will likely feel very counterintuitive; you’ll hate it).
At Univ of Washington, a few years ago I taught a graduate-level course on Sage development. The department then refused to run it again as a grad course, which was frankly very frustrating to me. This is exactly the wrong thing to do if you want to increase the options of your Ph.D. students for industry jobs. Maybe we should quit trying to train our students to be only math professors, and instead give them a much wider range of options.

by William Stein (noreply@blogger.com) at January 08, 2016 10:25 PM

January 03, 2016

Titus Brown

Sustaining the development of research-focused software

Recently I was asked by someone at a funding organization about the term "hardening software"; I wrote a blog post asking others what they thought, and this got a number of great comments (as well as spurring Dan Katz to write a blog post of his own). I'd already written an initial response to my friend at the funder (which I'll publish separately), but the comments I got from the blogosphere made me think more broadly about how researchers mentally model software development.

I see two pretty divergent ways of thinking about software development in my research community. The first I would call "consumer product", and the second, "community project". (I'm excluding infrastructure and workflow projects from this discussion, because they don't provide domain specific research functionality.)

In "consumer product" software, a company or research group builds some software and releases it for the wider world to use, often through a publication or a preprint. The goal of the developers is to allow other people to use the software. They may be quite passionate about this, providing good documentation and test data sets along with help forums. However, there will generally be little to no developer documentation available: the goal is to recruit users, not developers.

In "community project" software, a company or research group builds some software, and (as above) releases it to the wider world to use. Here, the producers usually want to let people use the software directly, but there is an additional emphasis on the software being something malleable - something that others can work on and with, and mold or adapt themselves to fit their needs or the needs of a broader community.

With both kinds of projects, the software can be more or less mature, and can be targeted at novice or expert users, and can have a broad or a narrow technical vision, etc etc. The key question is whether the software is the product, or the software project is the product.

It's not hard to find consumer product-type software in the bioinformatics world. Most assemblers and mappers are essentially opaque blobs of software that users interact with via the command line, and bugs, issues, and feature requests are communicated to a core set of developers, who then deal (or not) with them.

It's quite a bit harder to find full community projects in bioinformatics; while I'm sure I'll think of many the moment I hit "post", I don't have a shining exemplar in mind, although velvet and samtools come close. This lack may be due to a lack of developers, but is perhaps also because of the way academic funding works. I'll talk more about this below.

Sustainability and software needs - why do we care, and what do we want?

Software is fast becoming a sine qua non in research, and researchers and funding agencies have been trying with increasing urgency to figure out how to support sustainable software development. There's a whole NSF program devoted to sustainable research software, and I think it's fair to say that one of the main subsurface goals of the BD2K program is to tackle the question of software (PSA: data isn't much use without software). The USDA and bio bits of DOE are also running smack into the problem that the research programs they fund are utterly dependent on software, yet investments in software are often secondary and rarely yield software that is usable past the end of the grant. It turns out that software development for research is a rather intransigent beast.

Fundamentally, stable research software is a way of accelerating research. I believe there to be a lot of interest in & need for the domain-specific software set - software that is more specific than (say) Linux, Python and R, or RStudio and IPython/Project Jupyter, or the various major R and Python packages for data analysis. These have large user bases and are generically quite useful, so while supporting them is still a challenge, their support options are more varied than domain software. Recently, the Moore, Sloan, and Helmsley Foundations have stepped up here with focused investments. But what about software closer to the domain?

It's been very hard to meet the tremendous need for sustained development of domain-specific software - for example, in bioinformatics, I understand that very few mappers or assemblers or parsing libraries are actively supported by funding agencies for more than one round. When they are, it's almost always for new methods development, not for supporting and sustaining the software; these goals are somewhat at odds, to say the least. (See The three porridge bowls of sustainable scientific software development for more on this conundrum.) The software that is supported directly usually is a "best of breed" software package that wins because it's first-to-market, and/or because of existing user bases, and/or because it's useful for biomedically-relevant missions; moreover, it's often produced by a few groups at well-known institutions. While these tools are often quite good in and of themselves, I don't think this approach serves tool and algorithm diversity. We need an ecosystem of tools, not just one tool in each domain that is "owned" by one or two actors.

So, to rephrase the question more specifically: how can we build and sustain an ecosystem of research-focused software to support and accelerate inquiry with flexible domain-specific computational tools?

The lesson we can take from the open source world is that, in the absence of a business model, only "community project" software is sustainable in the long term.

When I write that, it seems almost tautological - of course if a project doesn't have a business model, the project won't be sustainable without a larger community! But there is essentially no business model behind most research software. (The most common approach seems to be to assume that anything important will continue to be funded - a dangerous assumption in these days of 10% funding lines; the next most common approach is to build software to be at the core of a research program, and then get funding for the surrounding research program. Both of these approaches have obvious failure modes.) And so I think that we should be focusing on community project software.

In support of this thesis, I would point out that all of the more general packages I mentioned above (Linux, Python, R, RStudio, IPython/Project Jupyter, and things like scikit-learn) are full open source community projects, with a robust contribution process, a wide body of regular and part-time developers, and open and ongoing conversations about future directions. All of these packages have demonstrated long-term sustainability. I can't robustly conclude that this is a causal relationship, but I would strongly bet that it is.

Shoving software out into the community is no guarantee, but it's easier now than it used to be.

Lest people think that slapping a contributor process on software is enough to make it "community supported", let me just say that I know it's much more complicated than that. In addition to all the skills you had to have to write the software in the first place (the technical and algorithmic skills, the scientific background, and at least some software engineering knowledge) you also need to engage with the community online, solicit feedback and improvements, manage contributions and bug reports, and otherwise be responsive.

The transition to a successful community project seems challenging to me. From tracking this in the open source world (mostly in retrospect), you need some combination of luck, skill, and robust community-facing efforts. In the scientific world, I'm not sure how to guide funding towards projects where this transition is likely to be successful. My first thought would be to go with the NSF structure of SSE/SSI - fund new projects to get them started off, then provide transition grants to see if they can engage with a larger community, and then fund them as part of the community. One problem I see is that the time scale is so long - it's 5-10 years to take a project from nothing to something - and maintaining funding across that period is fraught with challenges.

One piece of good news is that it's become a lot easier to manage the technical infrastructure. GitHub or BitBucket can host your code, make it easy for people to work off the latest version, and provide tools to manage patch contributions and code review; ReadTheDocs and github and other sites can host your documentation; Travis-CI and other continuous integration platforms can run your tests, there are lots of places to host mailing lists, Twitter and blogs can help you gather your community 'round, and teleconferencing can help you coordinate developers.

The bad news is that the new skills that are required are community building and social interaction skills - something that many faculty are unprepared for, and probably won't have time for, and so they need to outsource them (just like most software engineering, or experimental work). This leads to a problem: community projects are bigger.

Community software projects are bigger, and therefore harder to fund

From the funding perspective, another problem is that community software projects are just... bigger, and hence more expensive. You need to pay attention not just to the research aspect of things, but also to the community and the software quality. This probably means that you need at least two full-time efforts for even a small project - a community liaison/release manager/testing manager, and a technical lead to drive the project's software engineering forward. (This is in addition to whatever science you're doing, too ;).)

From the applying-to-funder perspective, this makes life a bit of a nightmare. You have to successfully navigate continued software development, the building out of a community, and scientific progress, and then justify this all in an "ask" to a funding body, with a pretty low chance of making it. Add in the fact that very few programs will fund software development directly, and the difficulty rises.

I don't think anyone in industry is going to be surprised by the notion that you have to deliver value (here, "innovation") while building robust software. The contrast here is really that researchers often toss "grad ware" out the door as if it's good, robust, usable software, and we often have no funding plan beyond "publish it and then apply for a bunch of grants." This isn't sustainable, we shouldn't view it as sustainable, and the only surprise is that anyone ever thought it was sustainable. But the corollary may be that we need to figure out how to engage with a larger community, and support them, around developing individual research software packages, rather than trying to productize each and every one.

But if that engagement costs more, it's going to be difficult for funding agencies to support.

Focusing on community software projects solves a lot of problems

When people are asking for grant support to continue developing software, it's frequently difficult to figure out if the software is actually useful. But if there's a broad, robust community associated with it, it's probably useful. (I don't really know how to measure the size and quality of the community, note.)

We probably need fewer community projects overall. Maybe grad students could extend existing software, instead of writing a whole new package; I gather this is what happens in the VTK world.

Community projects will inevitably have less tolerance for crap software, at least on average. (I haven't thought through this thoroughly, but (a) no one wants to join your software project if they can't get it to run, and (b) your project will die if every release has more new bugs in it than were fixed.) So the poor quality, lousy adherence to community standards, and miserable packaging that afflicts many current bioinformatics projects would probably just go away in a model where only community-backed projects were given continued funding.

Community projects also have a built-in succession mechanism - when a principal investigator loses interest in a particular project, or retires, or lapses in funding, there is a decent chance that other PIs will be able to pick up the project and move forward with it.

Communities also channel software development in specific ways that are more democratic. If the project leads aren't responsible for implementing everything but have to rely on the community to implement and support features, then they are less likely to add useless features.

A more fundamental point is that I often think that we don't really know how to set our goals in research software development. Something that open source approaches excel at is channeling the needs of its users into functioning software. Explicitly acknowledging the community's role in deciding the future of a software project means that the project is committed to meeting the needs of its users, which I think probably fits with many goals of funders.

What does this all mean?

In sum, I think one way - perhaps the best way - to sustainably develop research software is to build a community around its development.

One way for funders to do this might be to provide support for software making the transition from a small project to a community project.

While funding might still be needed to maintain the core of the project (I think a full-time developer and a full-time community liaison are minimal requirements), this funding would be better leveraged by supporting a full-on community of developers, rather than just supporting a small team's product.

A corollary might be that software grants should be reviewed equally on their community engagement plan, not just on their innovation in methods.

Formalizing the notion that some research software isn't meant for further use is probably good. In particular, a community-based approach can provide some guidance for research software that IS meant for further use (see Please destroy this software after publication. kthxbye. for my thoughts on the rest of research software -- the "No success without quality" section of Gael Varoquaux's post is an excellent follow-up ;).

One interesting direction (suggested by Tracy Teal) is to think about ways that granting programs could separate funding for the community/maintenance parts of a project from the research parts. For example, communities could be awarded grants to hire community developers or liaisons who are somewhat self-directed. Alternatively, institutes like the Software Sustainability Institute could provide personnel through a granting program.

More generally, project structure starts to matter a lot more once you explicitly have the community involved. In retrospect, a lot of my internal tension around future directions in khmer comes from a divergence in perspective between what the community needs and what the research needs; having separate funding and decision making could have helped here.

Some problems with this perspective

I do worry a lot that we don't have the robust community of developers in biology to support the necessary software development. If all you have is users, who in the community develops the software??

Open source is hardly a panacea, and open source processes aren't bulletproof. Lots of open source software is really bad. I do think that the ones that attract a community are likely to be less bad, though ;).

Balancing innovation and stability is super hard. We need to think a lot harder about this. Matthew Turk pointed out that process may overtake innovation as projects become more stable; is this good, or bad? How can this be managed with a small community?

Open source isn't exactly noted for a diversity of perspectives (or participation). Academia has its problems here, too.

Community coalescence may be more strongly related to star power and social media savviness than technical excellence.

Scientists (at least in biology) don't get rewarded for community involvement and community development. On the flip side, maybe getting grants explicitly for doing that would provide a mechanism of reward.

Researchers still need to figure out how to do good software engineering, which is hard. Community projects can help here, too, by providing standards and guidelines and on-boarding documentation.

A concluding sentence or two

This might all be obvious to everyone already. If so, apologies :)

--titus


Thanks to Tracy Teal, Matthew Turk, and Ethan White for their helpful comments on a draft of this blog post!

by C. Titus Brown at January 03, 2016 11:00 PM

December 29, 2015

Matthew Rocklin

Disk Bandwidth

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: Disk read and write bandwidths depend strongly on block size.

Disk read/write bandwidths on commodity hardware vary from 10 MB/s (or slower) to 500 MB/s (or faster on fancy hardware). This variance can be characterized by the following rules:

  1. Reading/writing large blocks of data is faster than reading/writing small blocks of data
  2. Reading is faster than writing
  3. Solid state drives are faster than spinning disks, especially for many small reads/writes

In this notebook we experiment with the dependence of disk bandwidth on file size. The result of this experiment is the following image, which depicts the read and write bandwidths of a commercial laptop SSD as we vary block size:

Analysis

We see that this particular hard drive wants to read/write data in chunk sizes of 1-100 MB. If we can arrange our data so that we consistently pull off larger blocks of data at a time then we can read through data quite quickly at 500 MB/s. We can churn through a 30 GB dataset in one minute. Sophisticated file formats take advantage of this by storing similar data consecutively. For example, column stores keep all the data within a single column in single large blocks.

Difficulties when measuring disk I/O

Your file system is sophisticated. It will buffer both reads and writes in RAM even as you think you’re writing to disk. In particular, this buffering guards your disk somewhat against the “many small writes” regime of terrible performance. This is great - your file system does a fantastic job (bringing write numbers up from 0.1 MB/s to 20 MB/s or so) - but it makes it a bit tricky to benchmark properly. In the experiment above we fsync each file after writing to flush the write buffers, and we explicitly clear all buffers before entering the read section.
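As a rough sketch of such a measurement (my own code, not the notebook's; write_bandwidth is a hypothetical helper), the key point is to fsync before stopping the clock so the OS write buffers do not flatter the result:

import os
import time

def write_bandwidth(path, n_bytes, block=2**20):
    # write n_bytes in block-sized chunks, fsync, and report MB/s
    payload = b'x' * block
    start = time.time()
    with open(path, 'wb') as f:
        for _ in range(n_bytes // block):
            f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # force the data out of the OS write buffers
    return n_bytes / (time.time() - start) / 1e6

print(write_bandwidth('scratch.bin', 100 * 2**20))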

Anecdotally I also learned that my operating system caps write speeds at 30 MB/s when operating off of battery power. This anecdote demonstrates how particular your hard drive may be when controlled by a file system. It is worth remembering that your hard drive is a physical machine and not just a convenient abstraction.

December 29, 2015 12:00 AM

Data Bandwidth

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We list and combine common bandwidths relevant in data science

Understanding data bandwidths helps us to identify bottlenecks and write efficient code. Both hardware and software can be characterized by how quickly they churn through data. We present a rough list of relevant data bandwidths and discuss how to use this list when optimizing a data pipeline.

Name                                 Bandwidth (MB/s)
Memory copy                                     3000
Basic filtering in C/NumPy/Pandas               3000
Fast decompression                              1000
SSD Large Sequential Read                        500
Interprocess communication (IPC)                 300
msgpack deserialization                          125
Gigabit Ethernet                                 100
Pandas read_csv                                  100
JSON Deserialization                              50
Slow decompression (e.g. gzip/bz2)                50
SSD Small Random Read                             20
Wireless network                                   1

Disclaimer: all numbers in this post are rule-of-thumb and vary by situation

Understanding these scales can help you to identify how to speed up your program. For example, there is no need to use a faster network or disk if you store your data as JSON.

Combining bandwidths

Complex data pipelines involve many stages. The rule to combine bandwidths is to add up the inverses of the bandwidths, then take the inverse again: $\text{bandwidth}_{\text{total}} = \left( \sum_i \frac{1}{\text{bandwidth}_i} \right)^{-1}$

This is the same principle as adding conductances in series within electrical circuits. One quickly learns to optimize the slowest link in the chain first.

Example

When we read data from disk (500 MB/s) and then deserialize it from JSON (50 MB/s) our full bandwidth is about 45 MB/s: $\left( \frac{1}{500} + \frac{1}{50} \right)^{-1} \approx 45$ MB/s.

If we invest in a faster hard drive system that has 2 GB/s of read bandwidth then we get only a marginal performance improvement: $\left( \frac{1}{2000} + \frac{1}{50} \right)^{-1} \approx 49$ MB/s.

However if we invest in a faster serialization technology, like msgpack (125 MB/s), then we double our effective bandwidth: $\left( \frac{1}{500} + \frac{1}{125} \right)^{-1} = 100$ MB/s.

This example demonstrates that we should focus on the weakest bandwidth first. Cheap changes like switching from JSON to msgpack can be more effective than expensive changes, like purchasing expensive hardware for fast storage.
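As a tiny sketch of my own that encodes this rule (combined_bandwidth is a hypothetical helper, not from the post):

def combined_bandwidth(stage_bandwidths):
    # serial pipeline: add the inverses of the per-stage bandwidths, then invert
    return 1.0 / sum(1.0 / b for b in stage_bandwidths)

print(combined_bandwidth([500, 50]))   # disk + JSON: ~45 MB/s
print(combined_bandwidth([2000, 50]))  # faster disk, same JSON: ~49 MB/s
print(combined_bandwidth([500, 125]))  # disk + msgpack: 100 MB/s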

Overlapping Bandwidths

We can overlap certain classes of bandwidths. In particular we can often overlap communication bandwidths with computation bandwidths. In our disk+JSON example above we can probably hide the disk reading time completely. The same would go for network applications if we handle sockets correctly.

Parallel Bandwidths

We can parallelize some computational bandwidths. For example, on four cores we can parallelize JSON deserialization to quadruple the effective bandwidth: 50 MB/s * 4 = 200 MB/s. Typically communication bandwidths are not parallelizable per core.

December 29, 2015 12:00 AM

December 26, 2015

Juan Nunez-Iglesias

jnuneziglesias

All too often, I encounter published papers in which the code is “available upon request”, or “available in the supplementary materials” (as a zip file). This is not just poor form. It also hurts your software’s future. (And, in my opinion, when results depend on software, it is inexcusable.)

Given the numerous options for posting code online, there’s just no excuse to give code in a less-than-convenient format, upon publication. When you publish, put your code on Github or Bitbucket.

In this piece, I’ll go even further: put your code there from the beginning. Put your code there as soon as you finish reading this article. Here’s why:

No, you won’t get scooped

Reading code is hard. Ask any experienced programmer: most have trouble reading code they themselves wrote a few months ago, let alone someone else’s code. It’s extremely unlikely that someone will browse your code looking for a scoop. That time is better spent doing research.

It’s never going to be ready

Another thing I hear is that they want to post their code, but they want to clean it up first, and remove all the “embarrassing” bits. Unfortunately, science doesn’t reward time spent “cleaning up” your code, at least not yet. So the sad reality is that you probably will never actually get to the point where you are happy to post your code online.

But here’s the secret: everybody is in that boat with you. That’s why this document exists. I recommend you read it in full, but this segment is particularly important:

When it comes time to empirically evaluate new research with respect to someone else’s prior work, even a rickety implementation of their work can save grad-student-months, or even grad-student-years, of time.

Matt Might himself is as thorough and high-profile as you get in computer science, and yet, he has this to say about code clean-up:

I kept telling myself that I’d clean it all up and release it some day.

I have to be honest with myself: this clean-up is never going to happen.

Your code might not meet your standards, but, believe it or not, your code will help others, and the sooner it’s out there, the sooner they can be helped.

You will gain collaborators and citations

If anyone is going to be rifling through your code, they will probably end up asking for your help. This happens with even the best projects: have a look at the activity on the mailing lists for scikit-learn or NumPy, two of the best-maintained open-source projects out there.

When you have to go back and explain how a piece of code worked, that’s when you will actually take the time and clean it up. In the process, the person you help will be more likely to contribute to your project, either in code or in bug reports, improvement suggestions, or even citations.

In the case of my own gala project, I guess that about half of the citations it received happened because of its open-source code and open mailing list.

Your coding ability will automagically improve

I first heard this one from Albert Cardona. They say sunlight is the best disinfectant, and this is certainly true of code. Just the very idea that anyone can easily read their code will make most people more careful when programming. Over time, this care will become second nature, and you will develop a taste for nice, easy-to-read code.

In short, the alleged downsides of code-sharing are, at best, longshots, while there are many tangible upsides. Put your code out there. (And use a liberal open-source license!)


by Juan Nunez-Iglesias at December 26, 2015 11:17 PM

December 23, 2015

Titus Brown

What do you think about the term & practice of "hardening software"?

I just received an e-mail from someone in the funding world who thinks a lot about software, and they were interested in any thoughts I might have on the term "software hardening", and its practice. To quote,

This is about making research software more robust, more easily usable and possibly scalable.

This language has been seen in the wild - e.g. see Extended Development, Hardening and Dissemination of Technologies in Biomedical Computing, Informatics and Big Data Science (R01) -- although it's not defined specifically. Taking language from the announcement, I think they mean some subset of this:

First, contemporary software must be easy to modify and extend, and must be fully documented. Users who experience problems with software should be able to correct the problem with minimal effort and a mechanism must exist for incorporating these corrections into the software. As the needs of a community of users change, the software that supports their research efforts must be adaptable as well. The ability of software to be repaired and to evolve is particularly important because the scientific discovery process is open-ended and ever-changing.

I have my own thoughts about hardening that I will share in a few days, but I thought I'd see what other people thought first! Pointers to other such language would also be welcome...

--titus

p.s. Wikipedia sez: "hardening is usually the process of securing a system by reducing its surface of vulnerability."

by C. Titus Brown at December 23, 2015 11:00 PM

December 20, 2015

Titus Brown

Our training plans for Q1 & Q2, 2016

When I decided to move to UC Davis, I was also seizing the opportunity to dramatically expand my training efforts. At Davis, it was made clear to me that I could substitute skills training -- in whatever guise I chose -- for my for-credit teaching; in consequence, I'd asked for money for a training coordinator in my startup offer, to give me space to organize workshops while figuring out an organizational & recharge model.

I started at UCD in January 2015, and moved in June, and I hired a training coordinator, Jessica Mizzi (@jessicamizzi), in August. So far she and/or I have run about 15 events in 2015. These events have mostly been 1- or 2-day workshops, on top of the two-week summer workshop that we did at MSU; I've done about half the teaching, with volunteers and others doing the rest. I also taught my first Data Carpentry workshop at Caltech in November.

The events have been a bit of a mixed bag for me. While the content has been well received (people seem happy on assessments, and most of the workshops are oversubscribed), one- and two-day workshops take up a lot of time (not to mention the two week workshop :). While the time to do this is more or less OK from my point of view, it turns out that during the term grad students, postdocs, staff, and faculty have a tough time taking even a full day off of their regular job. We also didn't really account for the exhaustion factor - teaching for two days in a row pretty much takes me out of commission for another day. Finally, because we were new on campus, we didn't get much of a chance to plan ahead - finding rooms was sometimes a scramble. Our main priority was just figuring out what might work as a general strategy.

So the whole thing was pretty helter-skelter and not that well thought out, and some things didn't work.

But! We've adjusted and are going to try new things for Q1 and Q2, 2016!

First: we really liked the three half-day workshops that we piloted in December, and we're going to run 27 of them from January to June. Roughly half will be "local-only" workshops like the Shell workshop, and the other half will be "local+remote" workshops, like Emily Dolson's d3.js introduction. All of the workshops will run from 9:15am-12:15pm Pacific time. We think this will be a much more workable time frame and length for many people; our Python and Shell half-day workshops were heavily oversubscribed. Unlike most *Carpentry workshops, we aren't flying in instructors, so we aren't constrained by travel cost per workshop in the same way, either.

You can see the list of upcoming workshops here - it being the holidays, we haven't done too much yet, but we have made github issues for the upcoming remote ones so you can subscribe to those.

Second: in an attempt to build out the local community for data intensive biology and computational research "done right", we will be running "Data Therapy" sessions at the UC Davis Data Science Initiative, every Wed from 3-5pm, starting on Jan 13 (and going through the end of June). Exactly how these will work is still a bit up for grabs, but we would like them to be "co-working" hours at which we provide coffee or tea along with friendly expertise. We're hoping to avoid "help desk" problems by inviting only those who have taken one of our workshops, and we'll see how that goes. I'm thinking about starting with little 15-minute research vignettes on things like Amazon Web Services, Docker, ImpactStory, etc. etc., and then seeing where that goes.

And, apart from that, I'm publicly promising to limit my training events so that I can get on with my research :). Well, in addition to the week-long Intermediate workshop in February, that two-day Docker workshop at Berkeley in January, a week of Data Carpentry in March, and a lesson development workshop in June. But other than that I will stand firm. Probably.

--titus

by C. Titus Brown at December 20, 2015 11:00 PM

Jan Hendrik Metzen

Naive Bees Classifier for The Metis Challenge

My Naive Bees Classifier for The Metis Challenge

This is a documentation of my submission to the Naive Bees classification challenge, in which I ended up in second place (username frisbee). The challenge was to classify whether a bee is a honey bee (Apis) or a bumble bee (Bombus). According to the organizers, being able to identify bee species from images would ultimately allow researchers to collect field data more quickly and effectively. Pollinating bees have critical roles in both ecology and agriculture, and diseases like colony collapse disorder threaten these species. You can learn more about the challenge at http://www.drivendata.org/competitions/8/

My approach can be summarized as follows:

  • Train a deep convolutional neural network (DCNN) as classifier, based on the widely used VGG16 architecture. This decision followed an initial empirical comparison of different standard architectures (Inception, AlexNet, VGG16, VGG19), among which VGG16 performed best.
  • Initialize the DCNN with weights obtained by pretraining on ImageNet. This decision was based on the fact that the competition's dataset is relatively small (for training from scratch) and that ImageNet has considerable conceptual overlap with the competition (among others, a "bee" class).
  • Fine-tune the DCNN on the Naive Bees classification challenge. A small learning rate together with dropout and l2 regularization avoided overfitting the large net to the small dataset.
  • Online augmentation of the training dataset by randomly zooming, rotating, translating, flipping, and stretching the training images. This further reduced overfitting and increased the invariance of the model to these transformations.
  • Use test-time augmentation when generating predictions: 64 variations of each image are created by zooming, rotating, and flipping. This reduced the variance of the predictions and improved the relative ordering (AUC). Note that 8 variations would have been nearly as good and would reduce prediction time by a factor of 8.
  • Average the predictions of an ensemble (size: 4) of DCNNs trained as described above but with different random seeds. This further reduces the variance of the predictions (see the sketch after this list).
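
To make the last two points concrete, here is a minimal sketch of how test-time-augmented predictions from the ensemble members could be averaged. It assumes perturb, aug_params_test, and n_augmentations as defined further below in this notebook; models and predict_fct are placeholders for the trained ensemble members, not names that appear in the notebook, and preprocessing (mean subtraction, casting) is omitted.

def averaged_prediction(models, X_test):
    # models: one prediction function per trained ensemble member, each
    # mapping a batch of (preprocessed) images to class-1 ("Apis") probabilities
    all_preds = []
    for predict_fct in models:
        for _ in range(n_augmentations):
            # randomly perturb every test image and predict on that variation
            X_aug = np.array([perturb(x, aug_params_test) for x in X_test])
            all_preds.append(predict_fct(X_aug))
    # average over augmentations and ensemble members
    return np.mean(all_preds, axis=0)
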
In [1]:
# We expect that there is a "persistency" directory containing all the required data 
# and models with the following structure:
# * SubmissionFormat.csv
# * train_labels.csv
# * vgg16_[0,1,2,3].pkl  (optional, if pretrained models should be used for prediction)
# * images/train/... (train images)
# * images/test/... (test images)
# * modelzoo/vgg16.pkl (Get from https://s3.amazonaws.com/lasagne/recipes/pretrained/imagenet/vgg16.pkl)
# The path of the directory might be adapted:
persistency_dir = "/home/jmetzen/Temp/data/"

Let us first get some imports out of the way. You will need the following dependencies:

  • numpy
  • scipy
  • pandas
  • scikit-learn
  • theano
  • lasagne (using the revision 34fb8387e from github)
  • matplotlib (optional, only for plotting)
  • cuDNN (optional, otherwise the slower non-cuDNN convolutions and max-pooling from lasagne can be used)

This notebook has been executed successfully using the following versions (older and more recent versions might also work):

In [2]:
# Note: if the IPython extension "watermark" is not installed, the following lines can be removed safely
%load_ext watermark
%watermark -a "Jan Hendrik Metzen" -d -v -m -p numpy,scipy,pandas,scikit-learn,theano,lasagne,matplotlib
Jan Hendrik Metzen 15/12/2015 

CPython 2.7.10
IPython 4.0.0

numpy 1.10.1
scipy 0.16.0
pandas 0.17.0
scikit-learn 0.17
theano 0.7.0
lasagne 0.2.dev1
matplotlib 1.4.3

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 3.13.0-37-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 12
interpreter: 64bit
In [3]:
import os
import cPickle
import random

import numpy as np
from scipy.ndimage import imread
from scipy.misc import imresize
from scipy.special import logit, expit
import pandas as pd

from sklearn.metrics import roc_curve, auc
from sklearn.cross_validation import train_test_split

import theano
import theano.tensor as T

import lasagne
from lasagne.layers import InputLayer, DenseLayer, DropoutLayer, NonlinearityLayer
# If you don't have cuDNN installed, you may use:
# from lasagne.layers import MaxPool2DLayer as PoolLayer,  Conv2DLayer as ConvLayer
from lasagne.layers.dnn import MaxPool2DDNNLayer as PoolLayer,  Conv2DDNNLayer as ConvLayer
from lasagne.utils import floatX
from lasagne.nonlinearities import softmax
from lasagne.regularization import regularize_network_params, l2

%matplotlib inline
import matplotlib.pyplot as plt

# Try to make everything as reproducible as possible
np.random.seed(0)
random.seed(0)
Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 0: GeForce GTX 970 (CNMeM is disabled)

Now some parameters that need tuning:

In [4]:
# This batchsize is mainly due to GPU-memory restrictions (only 4GB RAM)
batch_size = 28  
# Learning rate was not heavily tuned; only such that the training does not diverge
learning_rate = 0.00005  
lr_decay_rate = 0.7
# The number of augmentations in test-time augmentation
# n_augmentations = 8 should give nearly as good results and speed up prediction by a factor of 8
n_augmentations = 64 
# The size of the ensemble. Larger values should increase performance due to reducing the variance in 
# predictions by averaging. This, however, comes at the cost of linearly increased training and prediction time
n_repetitions = 4 
# The number of training epochs
n_epochs = 35
# Weight decay strength (i.e. l2 penalty)
l2_regularization = 1e-3
# The ratio of the training data that is held out as validation data. Note that this data is 
# different for every repetition in ensembling
validation_size = 0.075
# Directory where all data (training, test data, learned models etc.) is loaded from and stored to
persistency_dir = "/home/jmetzen/Temp/data/"

Data

Now let us load the data:

In [5]:
# load the labels using pandas
labels = pd.read_csv("%strain_labels.csv" % persistency_dir,
                     index_col=0)

submission_format = pd.read_csv("%sSubmissionFormat.csv" % persistency_dir,
                                index_col=0)

sf = pd.read_csv("%sSubmissionFormat.csv" % persistency_dir, index_col=0)

print "Number of training examples is: ", labels.shape[0]
print "Number of testing examples is: ", sf.shape[0]
print "Predictions should be type:", labels.dtypes[0]
Number of training examples is:  3969
Number of testing examples is:  992
Predictions should be type: float64
In [6]:
data_X = np.empty((labels.index.shape[0], 3, 224, 224), dtype=np.uint8)
data_X_test = np.empty((sf.index.shape[0], 3, 224, 224), dtype=np.uint8)

def load(img_id, phase):
    image = imread(('%s' + os.sep + 'images' + os.sep + "%s" + os.sep + '%s.jpg') 
                   % (persistency_dir, phase, img_id))
    image = imresize(image, (224, 224))
    image = image[:, :, :3]  # chop off alpha
    image = image[:, :, ::-1]  # convert to BGR
    return np.swapaxes(np.swapaxes(image, 1, 2), 0, 1)  # change to channel-first

for i, img_id in enumerate(labels.index):
    data_X[i] = load(img_id, "train")

for i, img_id in enumerate(sf.index):
    data_X_test[i] = load(img_id, "test")
    
data_y = np.array(labels, dtype=np.uint8)[:, 0]

Let us plot sixteen training images:

In [7]:
def plot_image(img):
    plt.imshow(np.swapaxes(np.swapaxes(img, 0, 1), 1, 2)[:, :, ::-1])
    plt.xticks([])
    plt.yticks([])
    
plt.figure(figsize=(8, 8))
for i in range(16):
    plt.subplot(4, 4, i+1)
    plot_image(data_X[i+100])
    plt.title(["Bombus", "Apis"][data_y[i+100]])
plt.tight_layout()

We can see that there is considerable variation in the images and that it is difficult to tell the two classes apart.

Model

Now we create our deep neural network using the VGG16 architecture and initialize it with weights pretrained on ImageNet.

In [8]:
def create_model():
    # Create model using the VGG_16 architecture
    # See https://raw.githubusercontent.com/Lasagne/Recipes/master/modelzoo/vgg16.py
    
    # VGG-16, 16-layer model from the paper:
    # "Very Deep Convolutional Networks for Large-Scale Image Recognition"
    # Original source: https://gist.github.com/ksimonyan/211839e770f7b538e2d8
    # License: non-commercial use only
    # If the restriction to non-commercial use poses a problem, you can retrain
    # the net on ImageNet without having to load pretrained weights
    pad = 1
    net = {}
    net['input'] = InputLayer((None, 3, 224, 224))
    net['conv1_1'] = ConvLayer(net['input'], 64, 3, pad=pad)
    net['conv1_2'] = ConvLayer(net['conv1_1'], 64, 3, pad=pad)
    net['pool1'] = PoolLayer(net['conv1_2'], 2)
    net['conv2_1'] = ConvLayer(net['pool1'], 128, 3, pad=pad)
    net['conv2_2'] = ConvLayer(net['conv2_1'], 128, 3, pad=pad)
    net['pool2'] = PoolLayer(net['conv2_2'], 2)
    net['conv3_1'] = ConvLayer(net['pool2'], 256, 3, pad=pad)
    net['conv3_2'] = ConvLayer(net['conv3_1'], 256, 3, pad=pad)
    net['conv3_3'] = ConvLayer(net['conv3_2'], 256, 3, pad=pad)
    net['pool3'] = PoolLayer(net['conv3_3'], 2)
    net['conv4_1'] = ConvLayer(net['pool3'], 512, 3, pad=pad)
    net['conv4_2'] = ConvLayer(net['conv4_1'], 512, 3, pad=pad)
    net['conv4_3'] = ConvLayer(net['conv4_2'], 512, 3, pad=pad)
    net['pool4'] = PoolLayer(net['conv4_3'], 2)
    net['conv5_1'] = ConvLayer(net['pool4'], 512, 3, pad=pad)
    net['conv5_2'] = ConvLayer(net['conv5_1'], 512, 3, pad=pad)
    net['conv5_3'] = ConvLayer(net['conv5_2'], 512, 3, pad=pad)
    net['pool5'] = PoolLayer(net['conv5_3'], 2)
    net['fc6'] = DenseLayer(net['pool5'], num_units=4096)
    net['drop6'] = DropoutLayer(net['fc6'], p=0.5)
    net['fc7'] = DenseLayer(net['drop6'], num_units=4096)
    net['drop7'] = DropoutLayer(net['fc7'], p=0.5)
    net['fc8'] = DenseLayer(net['drop7'], num_units=1000)
    # The fc9 layer is additional to the original VGG16 architecture and 
    # allows combining the pre-softmax outputs for the 1000 ImageNet classes
    # (fc8) into the two class probabilities for this problem
    net['fc9'] = DenseLayer(net['fc8'], num_units=2, nonlinearity=None)
    net['prob'] = NonlinearityLayer(net['fc9'], softmax)

    # Load weights for VGG16 that have been pretrained on ImageNet. Those 
    # can be downloaded from  https://s3.amazonaws.com/lasagne/recipes/pretrained/imagenet/vgg16.pkl
    vgg_16 = cPickle.load(open('%smodelzoo/vgg16.pkl' % persistency_dir))
    lasagne.layers.set_all_param_values(net['fc8'], vgg_16['param values'])
    
    return net, np.array(vgg_16['mean value'], dtype=np.int16)
In [9]:
def get_train_eval(net):
    # Define the parameters to be adapted during training
    # Note that we fine-tune all layers except for 'conv1_1' and 'conv1_2'
    # The main reason for not adapting those layers is that this would have
    # required reducing the batch_size further (for my 4GB GPU) and did not
    # result in a further improvement. If you want to fine-tune all layers,
    # you may use: model_params = lasagne.layers.get_all_params(net['prob'], trainable=True)
    model_params = [net['conv2_1'].W, net['conv2_1'].b,
                    net['conv2_2'].W, net['conv2_2'].b,
                    net['conv3_1'].W, net['conv3_1'].b,
                    net['conv3_2'].W, net['conv3_2'].b,
                    net['conv3_3'].W, net['conv3_3'].b,
                    net['conv4_1'].W, net['conv4_1'].b,
                    net['conv4_2'].W, net['conv4_2'].b,
                    net['conv4_3'].W, net['conv4_3'].b,
                    net['conv5_1'].W, net['conv5_1'].b,
                    net['conv5_2'].W, net['conv5_2'].b,
                    net['conv5_3'].W, net['conv5_3'].b,
                    net['fc6'].W, net['fc6'].b, 
                    net['fc7'].W, net['fc7'].b, 
                    net['fc8'].W, net['fc8'].b, 
                    net['fc9'].W, net['fc9'].b]
    feature_layer = net['fc8']
    output_layer = net['prob']

    X = T.tensor4()
    y = T.ivector()
    sh_lr = theano.shared(lasagne.utils.floatX(learning_rate))

    # training output (deterministic=False, so dropout is active)
    output_train = lasagne.layers.get_output(output_layer, X, deterministic=False)
    # Our cost (loss) is the categorical crossentropy loss with weight decay (l2 penalty)
    cost = T.mean(T.nnet.categorical_crossentropy(output_train, y)) \
       + l2_regularization * regularize_network_params(output_layer, l2)
    # evaluation output (deterministic=True, so dropout is deactivated)
    output_eval, feature_eval = \
        lasagne.layers.get_output([output_layer, feature_layer], X, deterministic=True)
    # Use the ADAM optimizer for adapting network weights such that the loss is minimized 
    updates = lasagne.updates.adam(cost, model_params, learning_rate=sh_lr)

    # Define theano functions which can be used for training and evaluation of the
    # network
    train_fct = theano.function([X, y], [cost, output_train], updates=updates)
    eval_fct = theano.function([X], [output_eval, feature_eval])
    
    return train_fct, eval_fct, sh_lr
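
Before moving on to augmentation, here is a rough usage sketch of how the returned train_fct and the shared learning rate sh_lr might be driven in a training loop. This is not the notebook's actual training code; run_training is a hypothetical name, and the online augmentation and mean subtraction are omitted for brevity.

def run_training(train_fct, sh_lr, X_train, y_train):
    n_batches = X_train.shape[0] // batch_size
    for epoch in range(n_epochs):
        indices = np.random.permutation(X_train.shape[0])
        for b in range(n_batches):
            batch = indices[b * batch_size:(b + 1) * batch_size]
            # online augmentation and mean subtraction would be applied here
            X_batch = floatX(X_train[batch])
            y_batch = y_train[batch].astype(np.int32)
            cost, _ = train_fct(X_batch, y_batch)
        # decay the learning rate after every epoch via the shared variable
        sh_lr.set_value(floatX(sh_lr.get_value() * lr_decay_rate))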

Augmentation

We use data augmentation both in training and in validation/testing. The code for augmentation is heavily based on benanne's solution to the National Data Science Bowl 2015, in particular this file. Our code can be found here.
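
Since the augment module lives outside this notebook, the following is a simplified, hypothetical stand-in for its perturb(img, params) function, meant only to illustrate the kind of transformation applied; the real implementation is more efficient and also handles translation, shear, and stretching, which are omitted here.

from scipy.ndimage import rotate, zoom as nd_zoom

def perturb_sketch(img, params):
    # img has shape (channels, height, width); work channel-last for the spatial ops
    out = np.swapaxes(np.swapaxes(img, 0, 1), 1, 2).astype(np.float32)
    h, w = out.shape[:2]
    # rotate by one of the allowed angles
    angle = np.random.choice(params['rotation_range'])
    out = rotate(out, angle, reshape=False, order=1)
    # zoom in by a random factor and crop back to the original size
    z = np.random.uniform(*params['zoom_range'])
    out = nd_zoom(out, (z, z, 1), order=1)
    top, left = (out.shape[0] - h) // 2, (out.shape[1] - w) // 2
    out = out[top:top + h, left:left + w]
    # random horizontal flip
    if params['do_flip'] and np.random.rand() < 0.5:
        out = out[:, ::-1]
    # back to channel-first uint8
    return np.swapaxes(np.swapaxes(out, 1, 2), 0, 1).astype(np.uint8)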

In [10]:
from augment import perturb

# Parameters of augmentation used in testing (aug_params_test) have been
# tuned extensively on validation data and should be close to optimal for
# the data.

aug_params_test = {
    'zoom_range': (1, 1.5),  # We allow zooming in by up to a factor of 1.5
    'rotation_range': [0, 90, 180, 270], # we allow rotation by 0, 90, 180, and 270 degrees
    'shear_range': (0, 0), 
    'translation_range': (0, 0), 
    'do_flip': True,  # We allow flipping of data
    'allow_stretch': 1.0,
}

# Augmentation for validation is very similar to test augmentation;
# potentially aug_params_test would also be better suited for validation
# but there was a lack of time to rerun the training.

aug_params_valid = {
    'zoom_range': (1, 1.6), 
    'rotation_range': [0, 90, 180, 270],
    'shear_range': (0, 0), 
    'translation_range': (0, 0),
    'do_flip': True,
    'allow_stretch': 1 / 1.3,
}

# Augmentation for training was not heavily tuned (as it would be 
# computationally very expensive). In general, stronger variations
# in training than in testing seemed to be reasonable such that the
# CNN has to learn a greater range of invariances

aug_params_train = {
    'zoom_range': (1, 1.6), # We allow zooming in by up to a factor of 1.6
    'rotation_range': [0, 90, 180, 270], # we allow rotation by 0, 90, 180, and 270 degrees
    'shear_range': (0, 0),
    'translation_range': (-25, 25), # We allow translating the zoomed-in image by up to 25 pixels in each dimension
    'do_flip': True, # We allow flipping of data
    'allow_stretch': 1 / 1.3,  # We also allow stretching the image
}

Let us illustrate how (train) augmentation modifies a single training image (the upper left image shows the unmodified image):

In [11]:
plt.figure(figsize=(8, 8))
for i in range(16):
    plt.subplot(4, 4, i+1)
    if i == 0:
        plot_image(data_X[0])
    else:
        plot_image(perturb(data_X[0], aug_params_train))
plt.tight_layout()