## Overview

Barry Mazur and I spent over a decade writing a popular math book "Prime Numbers and the Riemann Hypothesis", which will be published by Cambridge Univeristy Press in 2016.  The book involves a large number of illustrations created using SageMath, and was mostly written using the LaTeX editor in SageMathCloud.

This post is meant to provide a glimpse into the writing process and also content of the book.

Intended Audience: Research mathematicians! Though there is no mathematics at all in this post.

The book is here: http://wstein.org/rh/
Download a copy before we have to remove it from the web!

Goal: The goal of our book is simply to explain what the Riemann Hypothesis is really about. It is a book about mathematics by two mathematicians. The mathematics is front and center; we barely touch on people, history, or culture, since there are already numerous books that address the non-mathematical aspects of RH.  Our target audience is math-loving high school students, retired electrical engineers, and you.

## Clay Mathematics Institute Lectures: 2005

The book started in May 2005 when the Clay Math Institute asked Barry Mazur to give a large lecture to a popular audience at MIT and he chose to talk about RH, with me helping with preparations. His talk was entitled "Are there still unsolved problems about the numbers 1, 2, 3, 4, ... ?"

See http://www.claymath.org/library/public_lectures/mazur_riemann_hypothesis.pdf

Barry Mazur receiving a prize:

Barry's talk went well, and we decided to try to expand on it in the form of a book. We had a long summer working session in a vacation house near an Atlantic beach, in which we greatly refined our presentation. (I remember that I also finally switched from Linux to OS X on my laptop when Ubuntu made a huge mistake pushing out a standard update that hosed X11 for everybody in the world.)

## Classical Fourier Transform

Going beyond the original Clay Lecture, I kept pushing Barry to see if he could describe RH as much as possible in terms of the classical Fourier transform applied to a function that could be derived via a very simple process from the prime counting function pi(x). Of course, he could. This led to more questions than it answered, and interesting numerical observations that are more precise than analytic number theorists typically consider.

Our approach to writing the book was to try to reverse engineer how Riemann might have been inspired to come up with RH in the first place, given how Fourier analysis of periodic functions was in the air. This led us to some surprisingly subtle mathematical questions, some of which we plan to investigate in research papers. They also indirectly play a role in Simon Spicer's recent UW Ph.D. thesis. (The expert analytic number theorist Andrew Granville helped us out of many confusing thickets.)

In order to use Fourier series we naturally have to rely heavily on Dirac/Schwartz distributions.

## SIMUW

University of Washington has a great program called SIMUW: "Summer Institute for Mathematics at Univ of Washington.'' It's for high school; admission is free and based on student merit, not rich parents, thanks to an anonymous wealthy donor!  I taught a SIMUW course one summer from the RH book.  I spent one very intense week on the RH book, and another on the Birch and Swinnerton-Dyer conjecture.

The first part of our book worked well for high school students. For example, we interactively worked with prime races, multiplicative parity, prime counting, etc., using Sage interacts. The students could also prove facts in number theory. They also looked at misleading data and tried to come up with conjectures. In algebraic number theory, usually the first few examples are a pretty good indication of what is true. In analytic number theory, in contrast, looking at the first few million examples is usually deeply misleading.

## Reader feedback: "I dare you to find a typo!"

In early 2015, we posted drafts on Google+ daring anybody to find typos. We got massive feedback. I couldn't believe the typos people found. One person would find a subtle issue with half of a bibliography reference in German, and somebody else would find a different subtle mistake in the same reference. Best of all, highly critical and careful non-mathematicians read straight through the book and found a large number of typos and minor issues that were just plain confusing to them, but could be easily clarified.

Now the book is hopefully not riddled with errors. Thanks entirely to the amazingly generous feedback of these readers, when you flip to a random page of our book (go ahead and try), you are now unlikely to see a typo or, what's worse, some corrupted mathematics, e.g., a formula with an undefined symbol.

## Designing the cover

Barry and Gretchen Mazur, Will Hearst, and I designed a cover that combined the main elements of the book: title, Riemann, zeta:

Then designers at CUP made our rough design more attractive according their tastes. As non-mathematician designers, they made it look prettier by messing with the Riemann Zeta function...

## Publishing with Cambridge University Press

Over years, we talked with people from AMS, Springer-Verlag and Princeton Univ Press about publishing our book. I met CUP editor Kaitlin Leach at the Joint Mathematics Meetings in Baltimore, since the Cambridge University Press (CUP) booth was directly opposite the SageMath booth, which I was running. We decided, due to their enthusiasm, which lasted more than for the few minutes while talking to them (!), past good experience, and general frustration with other publishers, to publish with CUP.

### What is was like for us working with CUP

The actual process with CUP has had its ups and downs, and the production process has been frustrating at times, being in some ways not quite professional enough and in other ways extremely professional. Traditional book publication is currently in a state of rapid change. Working with CUP has been unlike my experiences with other publishers.

For example, CUP was extremely diligent putting huge effort into tracking down permissions for every one of the images in our book. And they weren't satisfy with a statement on Wikipedia that "this image is public domain", if the link didn't work. They tracked down alternatives for all images for which they could get permissions (or in some cases have us partly pay for them). This is in sharp contrast to my experience with Springer-Verlag, which spent about one second on images, just making sure I signed a statement that all possible copyright infringement was my fault (not their's).

The CUP copyediting and typesetting appeared to all be outsourced to India, organized by people who seemed far more comfortable with Word than LaTeX. Communication with people that were being contracted out about our book's copyediting was surprisingly difficult, a problem that I haven't experienced before with Springer and AMS. That said, everything seems to have worked out fine so far.

On the other hand, our marketing contact at CUP mysteriously vanished for a long time; evidently, they had left to another job, and CUP was recruiting somebody else to take over. However, now there are new people and they seem extremely passionate!

## The Future

I'm particularly excited to see if we can produce an electronic (Kindle) version of the book later in 2016, and eventually a fully interactive complete for-pay SageMathCloud version of the book, which could be a foundation for something much broader with publishers, which addresses the shortcoming of the Kindle format for interactive computational books. Things like electronic versions of books are the sort of things that AMS is frustratingly slow to get their heads around...

## Conclusions

1. Publishing a high quality book is a long and involved process.
2. Working with CUP has been frustrating at times; however, they have recruited a very strong team this year that addresses most issues.
3. I hope mathematicians will put more effort into making mathematics accessible to non-mathematicians.
4. Hopefully, this talk will give provide a more glimpse into the book writing process and encourage others (and also suggest things to think about when choosing a publisher and before signing a book contract!)

## November 17, 2015

### Matthieu Brucher

#### Announcement: ATKStereoLimiter 1.0.0

I’m happy to announce the release of a stereo limiter based on the Audio Toolkit. They are available on Windows and OS X (min. 10.8) in different formats.

This stereo plugin limits both channels by getting the max of the instantaneous power of them and applying a limiting gain function on them. There is no oversampling inside the plugin, so the output signal can overshoot.

ATKStereoLimiter

The supported formats are:

• VST2 (32bits/64bits on Windows, 64bits on OS X)
• VST3 (32bits/64bits on Windows, 64bits on OS X)
• Audio Unit (64bits, OS X)

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

## November 16, 2015

### Continuum Analytics news

#### Drawing a Brain with Bokeh

Posted Wednesday, November 16, 2016
.bokeh table{ border:none; } .bokeh td { padding:0; }

This post by medical student and data scientist Gabi Maeztu originally appeared in his blog, Merqur.io and is reprinted with his permission.

I’m interning at Mint Labs for my final year dissertation. My main goal is to find associations between medical data and connectomes (map of neural connections in the brain). After lots of processing on the MRI Images, I end up having a square matrix with the estimated connectivity between the different brain areas.

A square matrix full of numbers is not the easiest thing to interpret for a human, so I decided that I needed a tool to be able to iterate over the brains in the different studies and quickly “understand” them. The classic approach that they suggested me was to do a correlation matrix with a heatmap. I find them difficult to understand so I wanted to copy some of the chord diagrams I saw before in some articles and posters. Most of the chord graphs visualizations I found on google were done with d3.js; which is not easy to integrate on a Python workflow.

Matplotlib, Seaborn or Bokeh are the libraries I’m used to work with, so I took it as a personal challenge and I started the adventure of creating a chord graph with one of those libraries. I like interactivity a lot, so I finally end up using Bokeh.

### Creating a circle of dots

The first idea I came up with was to assign each area a dot in a circle an then draw the connections between the areas. I made a function to create dinamically the x and y of each point in a circle given a n number of areas; as I might use different approaches sometimes.

from math import pi, cos, sin

part = 2 * pi / points
final_points = []

for point in range(points):
angle = part * point;
newX = centerX + radius * cos(angle)
newY = centerY + radius * sin(angle)
final_points.append((newX, newY))

return final_points


### Creating the lines

Now that I had each area as a point in the graph, I needed to add the lines from one area to the other using the x,y array I just got. In order to remove the connections that were very weak, I added a hardness parameter to the function that will ignore those area connections connected by less than 20 estimated connections.

import numpy as np
# Emulate a connectome
connectome = np.random.randint(0, 1000, size=(85,85))

def connectomme_lines(connectome, hardness=20):
number_of_areas = connectome.shape[0]
positions = draw_circle_points(number_of_areas)

final_positions = []
area_name = []

for index, area1 in enumerate(connectome):
commonXY = positions[index]
for index2, area2 in enumerate(area1):
newXY = positions[index2]
if area2 > hardness:
final_positions.append([[commonXY[0], newXY[0]], [commonXY[1], newXY[1]]])

return final_positions

lines = np.array(connectomme_lines(connectome)).T



The lines array was generated and it needed to be transposed to be easier to work with it.

### Let’s plot this !

Time to make Bokeh work. As I was going to plot lot of lines, curved lines (bezier lines) seemed a better option than straight lines. I gave a fixed curvature to all of them so I didn’t have to worry about that.

from bokeh.plotting import figure, output_notebook, show
from bokeh.models import Range1d

p = figure(title="Connectomme", toolbar_location="below")

p.x_range = Range1d(-1.1, 1.1)
p.y_range = Range1d(-1.1, 1.1)

p.bezier(x0=lines[0][0],
y0=lines[0][1],
cx0=lines[0][0]/2,
cy0=lines[0][1]/2,
cx1=lines[1][0]/2,
cy1=lines[1][1]/2,
x1=lines[1][0],
y1=lines[1][1])

show(p)



A preview:

So we had a bunch of lines in the form of a chord diagram, but nothing easy to understand. I thought that giving each line the color of the area it was originated from could be a great idea. Other thing that the graph was not considering at the moment was the number of connections between the areas; zero to thousands depending in the case. Standardization of the number of connections and plotting them as the difference in the width of a connection seemed like a good approach to quantify the strength of the connection.

### Counting the number of connections

To modify the width of the lines, we just need add one parameter to our bezier object.

def width_of_lines(c, hardness=20):
f = c.flatten()
flat = f[(f > hardness)]
return flat/flat.sum()



As the standardized values were too low for bokeh, I broadcasted a multiplication to be able to have some visible lines.

connections_standardized = width_of_lines(connectome) * 150


In order to create the glyphs information only once and reuse it in different plots, ColumnDataSource let’s you define the data in a variable and then use it when you create a glyph in a plot.

Now the plot, the glyphs and the data source were three different things defined independently.

from bokeh.models import ColumnDataSource
# The Data
beziers = ColumnDataSource({
'x0' : lines[0][0],
'y0' : lines[0][1],
'x1' : lines[1][0],
'y1' : lines[1][1],
'cx0' : lines[0][0]/1.5,
'cy0' : lines[0][1]/1.5,
'cx1' : lines[1][0]/1.5,
'cy1' : lines[1][1]/1.5
})

dots = ColumnDataSource(
data=dict(
x=positions[0],
y=positions[1]
)
)


# The Plot
p2 = figure(title="Connectomme", toolbar_location="below")
p2.x_range = Range1d(-1.1, 1.1)
p2.y_range = Range1d(-1.1, 1.1)


# The Glyphs
p2.bezier('x0', 'y0', 'x1', 'y1', 'cx0', 'cy0', 'cx1', 'cy1',
source=beziers,
line_cap='round',

p2.circle('x', 'y', size=8, fill_color="#6D6A75", line_color=None, source=dots)
show(p2)



When adding the curve glyphs to the plot line_width set the width of each line.

### Coloring the brain

Coloring the brain wasn’t something new for me, but in this case assigning so many areas a unique color required some research in color theory.

Creating random color values without too dark or too bright colors is not an easy task. There are different ways of defining a color but the most usual one is using RGB, that stands for red, green, blue. If you use RGB to create random color, the usual output is to have some awful colors that variate a lot in their brightness. I found out that using the HSV method (hue, saturation, value), you can set two of the parameters fixed and generate random colors in a similar color space just by modifying the hue.

from colorsys import hsv_to_rgb

def gen_color(h):
# Source: http://www.goldennumber.net/color/
golden_ratio = (1 + 5 ** 0.5) / 2
h += golden_ratio
h %= 1
return '#%02x%02x%02x' % tuple(int(a*100) for a in hsv_to_rgb(h, 0.55, 2.3))

area_colors = [gen_color(area/connectome.shape[0]) for area in range(connectome.shape[0])]



I needed now to iterate over the matrix to assign each line the correct color depending in the area it was originated from.

def lines_colors(connectome, area_colors, hardness=20):
colors = []
for index, area1 in enumerate(connectome):
area_hsv = area_colors[index]
for area2 in area1:
if area2 > hardness:
colors.append(area_hsv)
return colors

area_colors_lines = lines_colors(connectome, area_colors)



Now let’s plot the same chord diagram with colors

p3 = figure(title="Connectomme", toolbar_location="below")

p3.x_range = Range1d(-1.1, 1.1)
p3.y_range = Range1d(-1.1, 1.1)

p3.bezier('x0', 'y0', 'x1', 'y1', 'cx0', 'cy0', 'cx1', 'cy1',
source=beziers,
line_cap='round',
line_width=connections_standardized,
line_color=area_colors_lines
)

p3.circle('x', 'y', size=8, fill_color=area_colors, line_color=None, source=dots)


### Adding interactivity to the plot

Having such a cool plot needed a way to interact with it. Bokeh offers some tools like HoverTool for that. The hover tool displays informational tooltips whenever the cursor is directly over a glyph, in this case I used the circles to show the area it was representing.

import pandas as pd
area_names.columns = ["num", "names"]
area_names["names"] = area_names["names"].map(str.strip)


The dot’s data was modified again, as we needed to include the data to be shown when the mouse is hover the point.

from bokeh.models.tools import HoverTool
# Area's dots
dots_interactive = ColumnDataSource(data=dict(x=positions[0],
y=positions[1],
desc=list(area_names["names"]))
)

# Hover tools
hover_dots = HoverTool(
tooltips=[("Area", " @desc")],
point_policy='snap_to_data'
)


We plot again.

p4 = figure(title="Connectomme", toolbar_location="below")

p4.x_range = Range1d(-1.1, 1.1)
p4.y_range = Range1d(-1.1, 1.1)

p4.bezier('x0', 'y0', 'x1', 'y1', 'cx0', 'cy0', 'cx1', 'cy1',
source=beziers,
line_cap='round',
line_width=connections_standardized,
line_color=area_colors_lines
)

p4.circle('x', 'y', size=8, fill_color=area_colors, line_color=None, source=dots_interactive)


We include the new tool and Voilá !

p4.add_tools(hover_dots)
show(p4)


NOTE: The order of the labels in this specific case is not the correct one. I’ll correct it soon.

And this is so far what I did but this still work on progress and I’ll show more soon. Thanks for passing by !

## November 15, 2015

### Titus Brown

#### What should I teach about Jupyter?

What are the most concretely useful, interesting, awesome or neat things about Project Jupyter?

Over here at the Lab for Data Intensive Biology, we're putting on a workshop on Project Jupyter notebooks (the notebook system formerly known as IPython Notebook). This will be a two-day hands-on workshop, Carpentry-style, with the goal of introducing students to the concept and practice of notebook-based data narratives. We also want to showcase the technology and ecosystem a bit.

This puts me in a bit of a bind, because while I am strong proponent of IPython Notebook for reproducibility and teaching, I don't have an awful lot of experience with anything beyond the default Python kernel and Notebook interface.

So my question to the blogosphere: what, beyond the basics, would you suggest demonstrating to people? Here is the menu from which I am planning to choose 5-6 topics and demos - suggestions and pointers welcome! (Note that I'm not looking for particular notebooks but rather demonstrations and concepts that show off the possibilities!)

--titus

### Randy Olson

#### Introducing TPOT, the Data Science Assistant

Some of you might have been wondering what the heck I’ve been up to for the past few months. I haven’t been posting much on my blog lately, and I haven’t been working on important problems like solving Where’s Waldo? and optimizing road trips around the world. (I promise: I’ll get back to fun posts like that soon!) Instead, I’ve been working on something far geekier, and I’m excited to finally have something to show for it.

Over the summer, I started a new postdoctoral research position funded by the NIH at the University of Pennsylvania Computational Genetics Lab. During my first month there, I started looking for big problems in the field of data science to take on. Science (especially computer science) is often too incremental, and if I was going to stay in academia, I wanted to tackle a big problem. It was around that time that I started thinking about the process of machine learning and how we could let machines solve problems themselves rather than needing input from humans.

You see, machine learning is transforming the world as we know it. Google search engines were massively improved by machine learning, as were Gmail’s spam filters. Voice assistants like Siri — as silly as they can be — use machine learning to translate your voice into something the computer can understand. Stock market investors make millions every day using machine learning to predict when to buy and sell. And the list goes on and on…

Wonder how Facebook always knows who you are in your photos? They use machine learning.

The problem with machine learning is that building an effective model can require a ton of human input. Humans have to figure out the right way to transform the data before feeding it to the machine learning model. Then they have to pick the right machine learning model that will learn from the data best, and then there’s a whole bunch of model parameters to tweak that can make the difference between a dud and a Nostradamus-like model. Building these pipelines — i.e., sequences of steps that turn the raw data into a predictive model — can easily take weeks of tinkering depending on the difficulty of the problem. This is obviously a huge issue when machine learning is supposed to allow machines to learn on their own.

An example machine learning pipeline, and what parts of the pipeline TPOT automates.

Thus, the Tree-based Pipeline Optimization Tool (TPOT) was born. TPOT is a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming. Think of TPOT as your “Data Science Assistant”: TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines, then recommending the pipelines that work best for your data.

An example TPOT pipeline with two copies of the data set entering the pipeline.

Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there. As an added bonus, TPOT is built on top of scikit-learn, so all of the code it generates should look familiar… if you’re familiar with scikit-learn, anyway.

TPOT is still under active development and in its early stages, but it’s worked very well on the classification problems I’ve applied it to so far. (Research papers coming soon!)

Check out the TPOT GitHub repository to see the latest goings on. I’ll be working on TPOT and pushing the boundaries of machine learning pipeline optimization for the majority of my postdoc.

### How to install TPOT

TPOT is built on top of several existing Python libraries, including:

• NumPy
• SciPy
• pandas
• scikit-learn
• DEAP

Except for DEAP, all of the necessary Python packages can be installed via the Anaconda Python distribution, which I strongly recommend that you use. I also strongly recommend that you use of Python 3 over Python 2 if you’re given the choice.

NumPy, SciPy, pandas, and scikit-learn can be installed in Anaconda via the command:

conda install numpy scipy pandas scikit-learn


DEAP can be installed with pip via the command:

pip install deap


Finally to install TPOT, run the following command:

pip install tpot


Please get in touch if you run into installation problems.

### An example using TPOT

I wanted to make TPOT versatile, so it can be used on the command line or via Python scripts. You can look up the detailed usage instructions on the GitHub repository if you’re interested.

For this post, I’ve provided a basic example of how you can use TPOT to build a pipeline that classifies hand-written digits in the classic MNIST data set.

from tpot import TPOT
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
train_size=0.75)

tpot = TPOT(generations=5)
tpot.fit(X_train, y_train)
tpot.score(X_train, y_train, X_test, y_test)


After a couple minutes, TPOT will discover a pipeline that achieves >98% accuracy. In this case, TPOT will probably discover that a random forest classifier does very well on MNIST with only a little bit of tuning. If you give TPOT even more time by setting the “generations” parameter to a higher number, it will most likely find even better pipelines.

### “TPOT sounds cool! How can I get involved?”

TPOT is an open source project, and I’m happy to have you join our efforts to build the best tool possible. If you want to contribute some code, check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.

### tl;dr in image format

Justin Kiggins had a great summary of TPOT when I first tweeted about it:

Anyway, that’s what I’ve been up to lately. I’m looking forward to presenting TPOT at several research conferences in the coming months, and I’d really like to see what the machine learning community thinks about pipeline automation. In the meantime, give TPOT a try and let me know what you think.

## November 14, 2015

### Titus Brown

#### Using Hypothesis to put annotations on workshop web sites

I've often wanted to mark up arbitrary Web sites with annotations and reminders, and it's always been puzzling to me that this is missing from the Web. In recent years, I've heard more and more about a small non-profit called Hypothesis, which provides this general functionality via both a Chrome plugin and JavaScript (as well as a proxy).

A few weeks back, we invited Jon Udell from Hypothesis out to visit Davis, and my lab spent a couple of hours with him exploring Hypothesis. The short version is it looks totally awesome, and I think there are dozens of uses, but one specific use case that came up several times was to annotate Software Carpentry and Data Carpentry lessons with it.

This should be very do-able, but there's a catch -- the lessons are duplicated independently across workshops and taught many times over many years, with slight modifications. So we don't want to annotate just one site - we want the annotations to show up across all the copies of a lesson so that we don't have to remember which workshop version we annotated.

When we discussed this with Jon, he said "yep. Should work!" What he explained to us was this: in order to tie annotations to text even in the face of editing and textual updates, Hypothesis uses robust anchoring techniques to place each annotation within the document. Moreover, Hypothesis can tie things to canonical URLs, so if there's a single URL specified on all of the classes as the "one true location" then the annotation will appear on all pages with that specification.

So, today I tried it out. And it worked!

I have two sites, one from a workshop in May and one from a workshop in October (the October one was actually canceled before I changed any real content, so the sites are essentially the same ;). Here are the URLs:

First, I wanted to enable hypothesis on them. These are Sphinx Web sites, constructed from a combination of reStructuredText marked-up content and Jinja2 templates, so to add it to all the pages I just edited the base templates to include the Hypothesis JavaScript:

<script async defer src="//hypothes.is/embed.js"></script>


I added this to the May site (commit) and the October site (commit).

Next, I edited the October site so that its canonical URL was that of the May site (commit). This makes it so that Hypothesis understands that annotations should be applied to both sites, or, to be more precise, that the October site is not the master site - the May one is. (The "right" way to do this is probably to have a site that is always the latest set of tutorials and then to set the canonical URL to that, but I didn't want to do all that work for this test.)

Then, after the sites rebuilt on ReadTheDocs and Hypothesis turned on, I added three comments: one on text specific to the October site, one on text specific to the May site, and one on text that appears on both sites.

And voila, it works!

(If you want to see them yourself, click the "eye" icon, or otherwise fiddle about with the hypothesis stuff at the top right of the page.)

In summary,

• I can add annotations to common text on one site, and it will appear on the other.
• Annotations specific to text on one site or the other will only be tied to that site.
• All is well.

So, this appears to be something that I can just add into all my sites in the future. Huzzah!

For Software and Data Carpentry to use this robustly, they would need to specify a canonical URL for each lesson - this probably makes sense to do anyway ;). Note that the canonical URL probably doesn't need to actually resolve to anything, although probably it'd be bad form if it didn't...

SWC and DC don't have to enable Hypothesis on all (or any) of their lessons; the canonical URL will work just fine with the plugin or the proxy. But they could enable Hypothesis on some workshop pages and have it lurk quietly in the upper right of the page until it's needed - I envision this being really useful for instructors and TAs who want to make notes in the heat of the moment.

Note, since Hypothesis makes it straightforward to retrieve annotations by annotated URL (or at least I think it does - I can't figure out how to do it, but Jon showed us ;), it is also easy when revising lessons to go look at all the annotations on a page and use that to revise the page. If Hypothesis catches on amongst instructors, this could become part of the standard lesson update process.

I've got a bunch of ideas to try out with Hypothesis, so expect more. But this was a fun (and easy!) start!

--titus

## November 11, 2015

### Continuum Analytics news

#### Anaconda Brings High Performance Python to Innovative GPU Technology

Posted Monday, November 16, 2015

Compute Intensive Analytics Gets 2-150X Performance Gains

AUSTIN, TXNovember 16, 2015—Continuum Analytics, developer of Anaconda, the leading modern open source analytics platform powered by Python, today announced the availability of Anaconda on AMD’s Accelerated Processing Units (APU), giving Anaconda users a simple way to exploit AMD's latest technology innovation for the most demanding data science applications with the simplicity of Python.

This combined technology is useful for advanced analytic workloads where communication overhead has previously made GPU acceleration challenging. These situations come up often when working with financial trading analysis, implementing graph analytics algorithms, and creating other complex data processing workflows. With a 2-1000x performance gain, Anaconda will enable data scientists to more easily tackle these traditionally compute-bound problems.

“Up until now, high performance computing has been largely inaccessible, requiring knowledge of low-level programming languages and workarounds to enable high performance computing,” said Travis Oliphant, Continuum Analytics CEO and co-founder. “With this collaboration, Anaconda helps democratize high performance computing for Big Data by bringing AMD's APU to Python, the fastest growing open data science language.”

Continuum and AMD teamed up to enable Numba, a Continuum created just-in-time (JIT) Python compiler, within the Anaconda platform to support the AMD APU Heterogeneous System Architecture (HSA). The APU combines the specialized capabilities of both the central processing unit (CPU) and graphics processing unit (GPU) to deliver increased performance and capabilities, expanding throughput by sharing data in memory and eliminating data movement. Typical applications that benefit from the increased throughput include scientific computing applications such as geospatial analytics and facial recognition along with business applications that involve machine learning, deep learning and artificial intelligence.

“Anaconda opens up the HSA architecture to the rich Python community that wants to leverage high performance computing capabilities of the AMD APU and innovative Numba technology,” said AMD senior director Compute Technology Solutions, Gregory Stoner. “Anaconda, plus AMD, delivers high performance computing to data scientists and developers.”

Continuum Analytics develops Anaconda, the leading modern open source analytics platform powered by Python. Continuum’s Python-based platform and consulting services empower organizations to analyze, manage and visualize big data - turning massive datasets into actionable insights and business value. Built on proven open source technology and easily integrated within existing IT environments, Anaconda allows organizations to make critical business decisions based on their data quickly, easily, inexpensively, and with flexibility. Without having to worry about how to access their data, organizations can free up resources to solve actual problems. Continuum's founders and developers have created or contribute to some of the most popular data science technologies, including NumPy, SciPy, Pandas, Jupyter/IPython, and many others. To learn more about the Anaconda platform, training and consulting services, visit http://www.continuum.io.

###

Media Contact

Treble

Aaron De Lucia

(512) 485-3016

continuum@treblepr.com

## November 10, 2015

### Thomas Wiecki

#### MCMC sampling for dummies

When I give talks about probabilistic programming and Bayesian statistics, I usually gloss over the details of how inference is actually performed, treating it as a black box essentially. The beauty of probabilistic programming is that you actually don't have to understand how the inference works in order to build models, but it certainly helps.

When I presented a new Bayesian model to Quantopian's CEO, Fawce, who wasn't trained in Bayesian stats but is eager to understand it, he started to ask about the part I usually gloss over: "Thomas, how does the inference actually work? How do we get these magical samples from the posterior?".

Now I could have said: "Well that's easy, MCMC generates samples from the posterior distribution by constructing a reversible Markov-chain that has as its equilibrium distribution the target posterior distribution. Questions?".

That statement is correct, but is it useful? My pet peeve with how math and stats are taught is that no one ever tells you about the intuition behind the concepts (which is usually quite simple) but only hands you some scary math. This is certainly the way I was taught and I had to spend countless hours banging my head against the wall until that euraka moment came about. Usually things weren't as scary or seemingly complex once I deciphered what it meant.

This blog post is an attempt at trying to explain the intuition behind MCMC sampling (specifically, the Metropolis algorithm). Critically, we'll be using code examples rather than formulas or math-speak. Eventually you'll need that but I personally think it's better to start with the an example and build the intuition before you move on to the math.

## The problem and its unintuitive solution

Lets take a look at Bayes formula:

$$P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)}$$

We have $P(\theta|x)$, the probability of our model parameters $\theta$ given the data $x$ and thus our quantity of interest. To compute this we multiply the prior $P(\theta)$ (what we think about $\theta$ before we have seen any data) and the likelihood $P(x|\theta)$, i.e. how we think our data is distributed. This nominator is pretty easy to solve for.

However, lets take a closer look at the denominator. $P(x)$ which is also called the evidence (i.e. the evidence that the data x was generated by this model). We can compute this quantity by integrating over all possible parameter values: $$P(x) = \int_\Theta P(x, \theta) \, \mathrm{d}\theta$$

This is the key difficulty with Bayes formula -- while the formula looks innocent enough, for even slightly non-trivial models you just can't compute the posterior in a closed-form way.

Now we might say "OK, if we can't solve something, could we try to approximate it? For example, if we could somehow draw samples from that posterior we can Monte Carlo approximate it." Unfortunately, to directly sample from that distribution you not only have to solve Bayes formula, but also invert it, so that's even harder.

Then we might say "Well, instead lets construct a Markov chain that has as an equilibrium distribution which matches our posterior distribution". I'm just kidding, most people wouldn't say that as it sounds bat-shit crazy. If you can't compute it, can't sample from it, then constructing that Markov chain with all these properties must be even harder.

The surprising insight though is that this is actually very easy and there exist a general class of algorithms that do this called Markov chain Monte Carlo (constructing a Markov chain to do Monte Carlo approximation).

## Setting up the problem¶

First, lets import our modules.

In [1]:
%matplotlib inline

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import norm

sns.set_style('white')
sns.set_context('talk')

np.random.seed(123)


Lets generate some data: 100 points from a normal centered around zero. Our goal will be to estimate the posterior of the mean mu (we'll assume that we know the standard deviation to be 1).

In [2]:
data = np.random.randn(20)

In [3]:
ax = plt.subplot()
sns.distplot(data, kde=False, ax=ax)
_ = ax.set(title='Histogram of observed data', xlabel='x', ylabel='# observations');


Next, we have to define our model. In this simple case, we will assume that this data is normal distributed, i.e. the likelihood of the model is normal. As you know, a normal distribution has two parameters -- mean $\mu$ and standard deviation $\sigma$. For simplicity, we'll assume we know that $\sigma = 1$ and we'll want to infer the posterior for $\mu$. For each parameter we want to infer, we have to chose a prior. For simplicity, lets also assume a Normal distribution as a prior for $\mu$. Thus, in stats speak our model is:

$$\mu \sim \text{Normal}(0, 1)\\ x|\mu \sim \text{Normal}(x; \mu, 1)$$

What is convenient, is that for this model, we actually can compute the posterior analytically. That's because for a normal likelihood with known standard deviation, the normal prior for mu is conjugate (conjugate here means that our posterior will follow the same distribution as the prior), so we know that our posterior for $\mu$ is also normal. We can easily look up on wikipedia how we can compute the parameters of the posterior. For a mathemtical derivation of this, see here.

In [4]:
def calc_posterior_analytical(data, x, mu_0, sigma_0):
sigma = 1.
n = len(data)
mu_post = (mu_0 / sigma_0**2 + data.sum() / sigma**2) / (1. / sigma_0**2 + n / sigma**2)
sigma_post = (1. / sigma_0**2 + n / sigma**2)**-1
return norm(mu_post, np.sqrt(sigma_post)).pdf(x)

ax = plt.subplot()
x = np.linspace(-1, 1, 500)
posterior_analytical = calc_posterior_analytical(data, x, 0., 1.)
ax.plot(x, posterior_analytical)
ax.set(xlabel='mu', ylabel='belief', title='Analytical posterior');
sns.despine()


This shows our quantity of interest, the probability of $\mu$'s values after having seen the data, taking our prior information into account. Lets assume, however, that our prior wasn't conjugate and we couldn't solve this by hand which is usually the case.

## Explaining MCMC sampling with code¶

Now on to the sampling logic. At first, you find starting parameter position (can be randomly chosen), lets fix it arbitrarily to:

mu_current = 1.


Then, you propose to move (jump) from that position somewhere else (that's the Markov part). You can be very dumb or very sophisticated about how you come up with that proposal. The Metropolis sampler is very dumb and just takes a sample from a normal distribution (no relationship to the normal we assume for the model) centered around your current mu value (i.e. mu_current) with a certain standard deviation (proposal_width) that will determine how far you propose jumps (here we're use scipy.stats.norm):

proposal = norm(mu_current, proposal_width).rvs()


Next, you evaluate whether that's a good place to jump to or not. If the resulting normal distribution with that proposed mu explaines the data better than your old mu, you'll definitely want to go there. What does "explains the data better" mean? We quantify fit by computing the probability of the data, given the likelihood (normal) with the proposed parameter values (proposed mu and a fixed sigma = 1). This can easily be computed by calculating the probability for each data point using scipy.stats.normal(mu, sigma).pdf(data) and then multiplying the individual probabilities, i.e. compute the likelihood (usually you would use log probabilities but we omit this here):

likelihood_current = norm(mu_current, 1).pdf(data).prod()
likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()

# Compute prior probability of current and proposed mu
prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)

# Nominator of Bayes formula
p_current = likelihood_current * prior_current
p_proposal = likelihood_proposal * prior_proposal


Up until now, we essentially have a hill-climbing algorithm that would just propose movements into random directions and only accept a jump if the mu_proposal has higher likelihood than mu_current. Eventually we'll get to mu = 0 (or close to it) from where no more moves will be possible. However, we want to get a posterior so we'll also have to sometimes accept moves into the other direction. The key trick is by dividing the two probabilities,

p_accept = p_proposal / p_current


we get an acceptance probability. You can already see that if p_proposal is larger, that probability will be > 1 and we'll definitely accept. However, if p_current is larger, say twice as large, there'll be a 50% chance of moving there:

accept = np.random.rand() < p_accept

if accept:
# Update position
cur_pos = proposal


This simple procedure gives us samples from the posterior.

### Why does this make sense?

Taking a step back, note that the above acceptance ratio is the reason this whole thing works out and we get around the integration. We can show this by computing the acceptance ratio over the normalized posterior and seeing how it's equivalent to the acceptance ratio of the unnormalized posterior (lets say $\mu_0$ is our current position, and $\mu$ is our proposal):

$$\frac{\frac{P(x|\mu) P(\mu)}{P(x)}}{\frac{P(x|\mu_0) P(\mu_0)}{P(x)}} = \frac{P(x|\mu) P(\mu)}{P(x|\mu_0) P(\mu_0)}$$

In words, dividing the posterior of proposed parameter setting by the posterior of the current parameter setting, $P(x)$ -- that nasty quantity we can't compute -- gets canceled out. So you can intuit that we're actually dividing the full posterior at one position by the full posterior at another position (no magic here). That way, we are visiting regions of high posterior probability relatively more often than those of low posterior probability.

### Putting it all together

In [5]:
def sampler(data, samples=4, mu_init=.5, proposal_width=.5, plot=False, mu_prior_mu=0, mu_prior_sd=1.):
mu_current = mu_init
posterior = [mu_current]
for i in range(samples):
# suggest new position
mu_proposal = norm(mu_current, proposal_width).rvs()

# Compute likelihood by multiplying probabilities of each data point
likelihood_current = norm(mu_current, 1).pdf(data).prod()
likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()

# Compute prior probability of current and proposed mu
prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)

p_current = likelihood_current * prior_current
p_proposal = likelihood_proposal * prior_proposal

# Accept proposal?
p_accept = p_proposal / p_current

# Usually would include prior probability, which we neglect here for simplicity
accept = np.random.rand() < p_accept

if plot:
plot_proposal(mu_current, mu_proposal, mu_prior_mu, mu_prior_sd, data, accept, posterior, i)

if accept:
# Update position
mu_current = mu_proposal

posterior.append(mu_current)

return posterior

# Function to display
def plot_proposal(mu_current, mu_proposal, mu_prior_mu, mu_prior_sd, data, accepted, trace, i):
from copy import copy
trace = copy(trace)
fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, figsize=(16, 4))
fig.suptitle('Iteration %i' % (i + 1))
x = np.linspace(-3, 3, 5000)
color = 'g' if accepted else 'r'

# Plot prior
prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)
prior = norm(mu_prior_mu, mu_prior_sd).pdf(x)
ax1.plot(x, prior)
ax1.plot([mu_current] * 2, [0, prior_current], marker='o', color='b')
ax1.plot([mu_proposal] * 2, [0, prior_proposal], marker='o', color=color)
ax1.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2),
arrowprops=dict(arrowstyle="->", lw=2.))
ax1.set(ylabel='Probability Density', title='current: prior(mu=%.2f) = %.2f\nproposal: prior(mu=%.2f) = %.2f' % (mu_current, prior_current, mu_proposal, prior_proposal))

# Likelihood
likelihood_current = norm(mu_current, 1).pdf(data).prod()
likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()
y = norm(loc=mu_proposal, scale=1).pdf(x)
sns.distplot(data, kde=False, norm_hist=True, ax=ax2)
ax2.plot(x, y, color=color)
ax2.axvline(mu_current, color='b', linestyle='--', label='mu_current')
ax2.axvline(mu_proposal, color=color, linestyle='--', label='mu_proposal')
#ax2.title('Proposal {}'.format('accepted' if accepted else 'rejected'))
ax2.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2),
arrowprops=dict(arrowstyle="->", lw=2.))
ax2.set(title='likelihood(mu=%.2f) = %.2f\nlikelihood(mu=%.2f) = %.2f' % (mu_current, 1e14*likelihood_current, mu_proposal, 1e14*likelihood_proposal))

# Posterior
posterior_analytical = calc_posterior_analytical(data, x, mu_prior_mu, mu_prior_sd)
ax3.plot(x, posterior_analytical)
posterior_current = calc_posterior_analytical(data, mu_current, mu_prior_mu, mu_prior_sd)
posterior_proposal = calc_posterior_analytical(data, mu_proposal, mu_prior_mu, mu_prior_sd)
ax3.plot([mu_current] * 2, [0, posterior_current], marker='o', color='b')
ax3.plot([mu_proposal] * 2, [0, posterior_proposal], marker='o', color=color)
ax3.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2),
arrowprops=dict(arrowstyle="->", lw=2.))
#x3.set(title=r'prior x likelihood $\propto$ posterior')
ax3.set(title='posterior(mu=%.2f) = %.5f\nposterior(mu=%.2f) = %.5f' % (mu_current, posterior_current, mu_proposal, posterior_proposal))

if accepted:
trace.append(mu_proposal)
else:
trace.append(mu_current)
ax4.plot(trace)
ax4.set(xlabel='iteration', ylabel='mu', title='trace')
plt.tight_layout()
#plt.legend()


## Visualizing MCMC¶

To visualize the sampling, we'll create plots for some quantities that are computed. Each row below is a single iteration through our Metropolis sampler.

The first columns is our prior distribution -- what our belief about $\mu$ is before seeing the data. You can see how the distribution is static and we only plug in our $\mu$ proposals. The vertical lines represent our current $\mu$ in blue and our proposed $\mu$ in either red or green (rejected or accepted, respectively).

The 2nd column is our likelihood and what we are using to evaluate how good our model explains the data. You can see that the likelihood function changes in response to the proposed $\mu$. The blue histogram which is our data. The solid line in green or red is the likelihood with the currently proposed mu. Intuitively, the more overlap there is between likelihood and data, the better the model explains the data and the higher the resulting probability will be. The dotted line of the same color is the proposed mu and the dotted blue line is the current mu.

The 3rd column is our posterior distribution. Here I am displaying the normalized posterior but as we found out above, we can just multiply the prior value for the current and proposed $\mu$'s by the likelihood value for the two $\mu$'s to get the unnormalized posterior values (which we use for the actual computation), and divide one by the other to get our acceptance probability.

The 4th column is our trace (i.e. the posterior samples of $\mu$ we're generating) where we store each sample irrespective of whether it was accepted or rejected (in which case the line just stays constant).

Note that we always move to relatively more likely $\mu$ values (in terms of their posterior density), but only sometimes to relatively less likely $\mu$ values, as can be seen in iteration 14 (the iteration number can be found at the top center of each row).

In [6]:
np.random.seed(123)
sampler(data, samples=8, mu_init=-1., plot=True);


Now the magic of MCMC is that you just have to do that for a long time, and the samples that are generated in this way come from the posterior distribution of your model. There is a rigorous mathematical proof that guarantees this which I won't go into detail here.

To get a sense of what this produces, lets draw a lot of samples and plot them.

In [7]:
posterior = sampler(data, samples=15000, mu_init=1.)
fig, ax = plt.subplots()
ax.plot(posterior)
_ = ax.set(xlabel='sample', ylabel='mu');


This is usually called the trace. To now get an approxmation of the posterior (the reason why we're doing all this), we simply take the histogram of this trace. It's important to keep in mind that although this looks similar to the data we sampled above to fit the model, the two are completely separate. The below plot represents our belief in mu. In this case it just happens to also be normal but for a different model, it could have a completely different shape than the likelihood or prior.

In [8]:
ax = plt.subplot()

sns.distplot(posterior[500:], ax=ax, label='estimated posterior')
x = np.linspace(-.5, .5, 500)
post = calc_posterior_analytical(data, x, 0, 1)
ax.plot(x, post, 'g', label='analytic posterior')
_ = ax.set(xlabel='mu', ylabel='belief');
ax.legend();


As you can see, by following the above procedure, we get samples from the same distribution as what we derived analytically.

## Proposal width¶

Above we set the proposal width to 0.5. That turned out to be a pretty good value. In general you don't want the width to be too narrow because your sampling will be inefficient as it takes a long time to explore the whole parameter space and shows the typical random-walk behavior:

In [9]:
posterior_small = sampler(data, samples=5000, mu_init=1., proposal_width=.01)
fig, ax = plt.subplots()
ax.plot(posterior_small);
_ = ax.set(xlabel='sample', ylabel='mu');


But you also don't want it to be so large that you never accept a jump:

In [10]:
posterior_large = sampler(data, samples=5000, mu_init=1., proposal_width=3.)
fig, ax = plt.subplots()
ax.plot(posterior_large); plt.xlabel('sample'); plt.ylabel('mu');
_ = ax.set(xlabel='sample', ylabel='mu');


Note, however, that we are still sampling from our target posterior distribution here as guaranteed by the mathemtical proof, just less efficiently:

In [11]:
sns.distplot(posterior_small[1000:], label='Small step size')
sns.distplot(posterior_large[1000:], label='Large step size');
_ = plt.legend();