August 13, 2018

Thomas Wiecki

Hierarchical Bayesian Neural Networks with Informative Priors

(c) 2018 by Thomas Wiecki

Imagine you have a machine learning (ML) problem but only small data (gasp, yes, this does exist). This often happens when your data set is nested -- you might have many data points, but only a few per category. For example, in ad-tech you may want to predict how likely a user is to buy a certain product. There could be thousands of products but only a small number of measurements (e.g. purchase decisions) per product. Likely, there will be similarities between product categories, but they will all have individual differences too.

Using classic ML methods is unlikely to yield good results in this setting. Firstly, ML is happiest with lots of data, and if you want to be fancy and use deep learning, that usually requires even more data to give good performance. Moreover, classic methods can't really make use of the hierarchical structure (product groupings) we know to be present in our data. We can either ignore it and learn one model for all categories, which does not allow us to capture the differences between categories; or we can fit a separate ML model to each category, which does not exploit the similarities between categories. (Technically, you could add the category as a feature and fit a single model, but when I tried that on the data presented further below it performed at chance.)

Hierarchical Bayesian models work amazingly well in exactly this setting as they allow us to build a model that matches the hierarchical structure present in our data set. Check out my previous blog post The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3 for a refresher.

In this setting we could likely build a hierarchical logistic Bayesian model using PyMC3. However, what if our decision surface is actually more complex and a linear model would not give good performance?

In this blog post I explore how we can take a Bayesian Neural Network (BNN) and turn it into a hierarchical one. Once we have built this model, we will derive an informed prior from it that we can apply back to a simple, non-hierarchical BNN to get the same performance as the hierarchical one.

This problem also occurs in multi-task learning. Neural networks trained to play multiple Atari games have a hard time because they tend to quickly forget what they learned before. Hierarchical BNNs could provide an elegant solution to this problem by sharing the higher-order representations that are useful across tasks while still allowing every subnetwork to specialize in a single task.

In [1]:
%matplotlib inline
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings

import pymc3 as pm
from pymc3 import floatX
import theano
import theano.tensor as tt
from pymc3.theanof import set_tt_rng, MRG_RandomStreams

import sklearn
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons



cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
cmap_uncertainty = sns.cubehelix_palette(light=1, as_cmap=True)

layer_names = ['w_in_1_grp', 'w_1_2_grp', 'w_2_out_grp']

The data set we are using is our battle-tested half-moons, as it is simple, non-linear, and leads to pretty visualizations. This is what it looks like:

In [2]:
X, Y = make_moons(noise=0.3, n_samples=1000)
plt.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0')
plt.scatter(X[Y==1, 0], X[Y==1, 1], color='r', label='Class 1')
sns.despine(); plt.legend();

This is just to illustrate what the data-generating distribution looks like; we will use far fewer data points and create different subsets with different rotations.

In [3]:
def rotate(X, deg):
    theta = np.radians(deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.matrix([[c, -s], [s, c]])

    X = X.dot(R)
    return np.asarray(X)
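As a quick, self-contained sanity check of the rotation helper (the helper is restated here so the snippet runs on its own): rotating by some angle and then by its negative should recover the original points.

```python
import numpy as np

def rotate(X, deg):
    theta = np.radians(deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.matrix([[c, -s], [s, c]])
    return np.asarray(X.dot(R))

# Round-trip property: R(deg) followed by R(-deg) is the identity.
pts = np.array([[1.0, 0.0], [0.3, -0.7]])
roundtrip = rotate(rotate(pts, 37.0), -37.0)
assert np.allclose(roundtrip, pts)
```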
In [4]:

n_samples = 100
n_grps = 18
n_grps_sq = int(np.sqrt(n_grps))
Xs, Ys = [], []
for i in range(n_grps):
    # Generate data with 2 classes that are not linearly separable
    X, Y = make_moons(noise=0.3, n_samples=n_samples)
    X = scale(X)
    X = floatX(X)
    Y = floatX(Y)
    # Rotate the points randomly for each category
    rotate_by = np.random.randn() * 90.
    X = rotate(X, rotate_by)
    Xs.append(X)
    Ys.append(Y)
Xs = np.stack(Xs)
Ys = np.stack(Ys)

Xs_train = Xs[:, :n_samples // 2, :]
Xs_test = Xs[:, n_samples // 2:, :]
Ys_train = Ys[:, :n_samples // 2]
Ys_test = Ys[:, n_samples // 2:]
In [5]:
fig, axs = plt.subplots(figsize=(15, 12), nrows=n_grps_sq, ncols=n_grps_sq, 
                        sharex=True, sharey=True)
axs = axs.flatten()
for i, (X, Y, ax) in enumerate(zip(Xs_train, Ys_train, axs)):
    ax.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0')
    ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r', label='Class 1')
    sns.despine(); ax.legend()
    ax.set(title='Category {}'.format(i + 1), xlabel='X1', ylabel='X2')

As you can see, we have 18 categories that share a higher-order structure (the half-moons). However, in the raw data space, no single classifier will be able to do a good job here. Also, because we only have 50 training data points in each category, a NN will likely have a hard time producing robust results. But let's actually test this.

Classify each category separately

The code for the NN below is explained in my previous blog post on Bayesian Deep Learning.

In [6]:
def construct_flat_nn(ann_input, ann_output):
    """Function to create a flat BNN given data."""
    n_hidden = 5
    # Initialize random weights between each layer
    init_1 = floatX(np.random.randn(X.shape[1], n_hidden))
    init_2 = floatX(np.random.randn(n_hidden, n_hidden))
    init_out = floatX(np.random.randn(n_hidden))
    with pm.Model() as neural_network:
        # Weights from input to hidden layer
        weights_in_1 = pm.Normal('w_in_1', 0, sd=1, 
                                 shape=(X.shape[1], n_hidden), 
                                 testval=init_1)
        # Weights from 1st to 2nd layer
        weights_1_2 = pm.Normal('w_1_2', 0, sd=1, 
                                shape=(n_hidden, n_hidden), 
                                testval=init_2)
        # Weights from hidden layer to output
        weights_2_out = pm.Normal('w_2_out', 0, sd=1, 
                                  shape=(n_hidden,), 
                                  testval=init_out)
        # Build neural-network using tanh activation function
        act_1 = pm.math.tanh(, weights_in_1))
        act_2 = pm.math.tanh(, weights_1_2))
        act_out =, weights_2_out)
        # Binary classification -> Bernoulli likelihood
        out = pm.Bernoulli('out', logit_p=act_out, observed=ann_output)
    return neural_network

Next, we have our utility function that creates a BNN from a model-building function, samples from it, and generates predictions. You can skip this one.

In [98]:
def fit_and_eval_bnn(X_train, X_test, Y_train, Y_test, grid, dummy_out, bnn_func, bnn_kwargs=None, sample_kwargs=None):
    """Utility function to create a BNN from a function, sample from it, and create predictions."""
    if bnn_kwargs is None:
        bnn_kwargs = {}
    if sample_kwargs is None:
        sample_kwargs = {'chains': 1, 'progressbar': False}
    ann_input = theano.shared(X_train)
    ann_output = theano.shared(Y_train)
    model = bnn_func(ann_input, ann_output, **bnn_kwargs)
    with model:
        # fit model
        trace = pm.sample(**sample_kwargs)
        # sample posterior predictive
        ppc_train = pm.sample_ppc(trace, samples=500, progressbar=False) 
        # Use probability of > 0.5 to assume prediction of class 1
        pred_train = ppc_train['out'].mean(axis=0) > 0.5
        # Make predictions on test-set by swapping in the test data
        ann_input.set_value(X_test)
        ann_output.set_value(Y_test)
        ppc_test = pm.sample_ppc(trace, samples=500, progressbar=False)
        pred_test = ppc_test['out'].mean(axis=0) > 0.5
        # Evaluate classifier over grid
        ann_input.set_value(grid)
        ann_output.set_value(dummy_out)
        ppc_grid = pm.sample_ppc(trace, samples=500, 
                                 progressbar=False)['out']
    return pred_train, pred_test, ppc_grid, trace

Next, we loop over each category and fit a different BNN to each one. Each BNN has its own weights and there is no connection between them. But note that because we are Bayesians, we place priors on our weights. In this case these are standard normal priors that act as regularizers to keep our weights close to zero.
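To see why a standard normal prior acts as a regularizer, note that under $w_i \sim \mathcal{N}(0, 1)$ the unnormalized log posterior is

```latex
\log p(w \mid \mathcal{D}) = \log p(\mathcal{D} \mid w) - \frac{1}{2} \sum_i w_i^2 + \mathrm{const},
```

so the MAP estimate coincides with maximum-likelihood training under an L2 (weight decay) penalty; full posterior sampling additionally retains the uncertainty around that regularized solution.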

We use NUTS sampling here because the hierarchical model further below has a more complex posterior (see Why hierarchical models are awesome, tricky, and Bayesian for an explanation) and I wanted the results to be comparable. All simulations in here work with ADVI as well but the results don't look quite as strong.

In [ ]:
Ys_pred_train = []
Ys_pred_test = []
grid_eval = []

grid = pm.floatX(np.mgrid[-3:3:100j,-3:3:100j])
grid_2d = grid.reshape(2, -1).T
dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)

for X_train, Y_train, X_test, Y_test in zip(Xs_train, Ys_train, Xs_test, Ys_test):
    pred_train, pred_test, ppc_grid, trace_flat = \
        fit_and_eval_bnn(X_train, X_test, 
                         Y_train, Y_test, 
                         grid_2d, dummy_out, 
                         construct_flat_nn)
    Ys_pred_train.append(pred_train)
    Ys_pred_test.append(pred_test)
    grid_eval.append(ppc_grid)

Ys_pred_train = np.stack(Ys_pred_train)
Ys_pred_test = np.stack(Ys_pred_test)
ppc_grid_single = np.stack(grid_eval)
In [9]:
print("Train accuracy = {:.2f}%".format(100 * np.mean(Ys_pred_train == Ys_train)))
Train accuracy = 86.78%
In [10]:
print("Test accuracy = {:.2f}%".format(100 * np.mean(Ys_pred_test == Ys_test)))
Test accuracy = 83.78%

OK, that doesn't seem so bad. Now let's look at the decision surfaces -- i.e. what the classifier thinks about each point in the data space.

In [11]:
fig, axs = plt.subplots(figsize=(15, 12), nrows=n_grps_sq, ncols=n_grps_sq, sharex=True, sharey=True)
axs = axs.flatten()
for i, (X, Y_pred, Y_true, ax) in enumerate(zip(Xs_train, Ys_pred_train, Ys_train, axs)):
    contour = ax.contourf(grid[0], grid[1], ppc_grid_single[i, ...].mean(axis=0).reshape(100, 100), cmap=cmap)
    ax.scatter(X[Y_true == 0, 0], X[Y_true == 0, 1], label='Class 0')
    ax.scatter(X[Y_true == 1, 0], X[Y_true == 1, 1], color='r', label='Class 1')
    sns.despine(); ax.legend()

That doesn't look all that convincing. We know from the data-generating process, as well as from the previous blog post Bayesian Deep Learning using the same data set, that the true decision surface is "Z"-shaped. What happens is that we don't have enough data to properly estimate the non-linearity in every category.

Hierarchical Bayesian Neural Network

Can we do better? You bet!

It's actually quite straightforward to turn this into one big hierarchical model for all categories, rather than many individual ones. Let's call the weight connecting neuron $i$ in layer 1 to neuron $j$ in layer 2 in category $c$ $w_{i, j, c}$ (I omit the layer index for notational simplicity). Rather than placing a fixed prior as we did above (i.e. $ w_{i, j, c} \sim \mathcal{N}(0, 1^2)$), we will assume that each weight comes from an overarching group distribution: $ w_{i, j, c} \sim \mathcal{N}(\mu_{i, j}, \sigma^2)$. The key is that we will estimate $\mu_{i, j}$ and $\sigma$ simultaneously from data.

Why not allow for a different $\sigma_{i,j}^2$ per connection, you might ask? Mainly to make our life simpler, and because it works well enough.

Note that we create a very rich model here. Every individual weight has its own hierarchical structure: a single group mean parameter and 18 per-category weights distributed around it. While this creates a large number of group distributions (as many as the flat NN had weights), there is no problem with this per se, although it might be a bit unusual. One might argue that this model is quite complex, and while that's true, in terms of degrees of freedom this model is simpler than the unpooled one above (more on this below).

As for the code, we stack the weights along a 3rd dimension to get separate weights for each group. That way, through the power of broadcasting, the linear algebra works out almost the same as before.
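This broadcasting can be illustrated with plain numpy (the shapes here are chosen to match the data above): stacking per-category weight matrices along a leading axis turns the per-category dot products into one batched matrix product, which is what tt.batched_dot computes on the theano side.

```python
import numpy as np

n_grps, n_data, n_in, n_hidden = 18, 50, 2, 5
Xs = np.random.randn(n_grps, n_data, n_in)   # one data block per category
W = np.random.randn(n_grps, n_in, n_hidden)  # one weight matrix per category

# Batched matrix product: for each category c, act[c] = Xs[c] @ W[c].
act = np.matmul(Xs, W)
assert act.shape == (n_grps, n_data, n_hidden)

# Same result as looping over categories explicitly:
act_loop = np.stack([Xs[c].dot(W[c]) for c in range(n_grps)])
assert np.allclose(act, act_loop)
```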

In [12]:
def construct_hierarchical_nn(Xs, Ys):
    n_hidden = 5
    n_grps = Xs.shape[0].eval()
    n_data = Xs.shape[2].eval()
    # Initialize random weights between each layer
    init_1 = floatX(np.random.randn(n_data, n_hidden))
    init_2 = floatX(np.random.randn(n_hidden, n_hidden))
    init_out = floatX(np.random.randn(n_hidden))
    with pm.Model() as neural_network:
        # Group mean distribution for input to hidden layer
        weights_in_1_grp = pm.Normal('w_in_1_grp', 0, sd=1, 
                                     shape=(n_data, n_hidden), 
                                     testval=init_1)
        # Group standard-deviation
        weights_in_1_grp_sd = pm.HalfNormal('w_in_1_grp_sd', sd=1.)
        # Group mean distribution for weights from 1st to 2nd layer
        weights_1_2_grp = pm.Normal('w_1_2_grp', 0, sd=1, 
                                    shape=(n_hidden, n_hidden), 
                                    testval=init_2)
        weights_1_2_grp_sd = pm.HalfNormal('w_1_2_grp_sd', sd=1.)
        # Group mean distribution from hidden layer to output
        weights_2_out_grp = pm.Normal('w_2_out_grp', 0, sd=1, 
                                      shape=(n_hidden,), 
                                      testval=init_out)
        weights_2_out_grp_sd = pm.HalfNormal('w_2_out_grp_sd', sd=1.)
        # Separate weights for each different model, just add a 3rd dimension
        # of weights
        weights_in_1_raw = pm.Normal('w_in_1', 0, sd=1., 
                                     shape=(n_grps, n_data, n_hidden))
        # Non-centered specification of hierarchical model
        weights_in_1 = weights_in_1_raw * weights_in_1_grp_sd + weights_in_1_grp
        weights_1_2_raw = pm.Normal('w_1_2', 0, sd=1., 
                                    shape=(n_grps, n_hidden, n_hidden))
        weights_1_2 = weights_1_2_raw * weights_1_2_grp_sd + weights_1_2_grp
        weights_2_out_raw = pm.Normal('w_2_out', 0, sd=1., 
                                      shape=(n_grps, n_hidden))
        weights_2_out = weights_2_out_raw * weights_2_out_grp_sd + weights_2_out_grp
        # Build neural-network using tanh activation function
        # tt.batched_dot just calls .dot along an axis
        act_1 = pm.math.tanh(tt.batched_dot(Xs, weights_in_1))
        act_2 = pm.math.tanh(tt.batched_dot(act_1, weights_1_2))
        act_out = tt.batched_dot(act_2, weights_2_out)
        # Binary classification -> Bernoulli likelihood
        out = pm.Bernoulli('out', logit_p=act_out, observed=Ys)
    return neural_network
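The non-centered trick used above can be illustrated outside of PyMC3: instead of sampling a weight directly from N(mu, sd), we sample a standard normal "raw" variable and deterministically shift and scale it. Both parameterizations describe the same distribution, but the latter decorrelates the group-level parameters from the per-category weights, which makes the posterior geometry much easier for NUTS.

```python
import numpy as np

rng = np.random.RandomState(0)
mu, sd = 1.5, 0.3

# Centered: draw directly from N(mu, sd).
w_centered = rng.normal(mu, sd, size=100000)

# Non-centered: draw standard normals, then shift and scale.
raw = rng.normal(0., 1., size=100000)
w_noncentered = raw * sd + mu

# Both yield the same distribution (up to Monte Carlo error).
assert abs(w_centered.mean() - w_noncentered.mean()) < 0.02
assert abs(w_centered.std() - w_noncentered.std()) < 0.02
```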
In [68]:
sample_kwargs = dict(init='advi+adapt_diag', tune=5000, chains=1, 
                     nuts_kwargs={'target_accept': 0.9})
grid_3d = np.repeat(grid_2d[None, ...], n_grps, axis=0)
dummy_out_3d = np.ones((n_grps, grid_2d.shape[0]), dtype=np.int8)

Ys_hierarchical_pred_train, Ys_hierarchical_pred_test, ppc_grid, trace_hier = \
    fit_and_eval_bnn(Xs_train, Xs_test, 
                     Ys_train, Ys_test, 
                     grid_3d, dummy_out_3d, 
                     construct_hierarchical_nn, 
                     sample_kwargs=sample_kwargs)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Convergence achieved at 54100
Interrupted at 54,099 [27%]: Average Loss = 522.66
Sequential sampling (1 chains in 1 job)
NUTS: [w_2_out, w_1_2, w_in_1, w_2_out_grp_sd, w_2_out_grp, w_1_2_grp_sd, w_1_2_grp, w_in_1_grp_sd, w_in_1_grp]
Only one chain was sampled, this makes it impossible to run some convergence checks
In [69]:
print('Train accuracy = {:.2f}%'.format(100*np.mean(Ys_hierarchical_pred_train == Ys_train)))
Train accuracy = 91.11%
In [70]:
print('Test accuracy = {:.2f}%'.format(100*np.mean(Ys_hierarchical_pred_test == Ys_test)))
Test accuracy = 89.78%

Great -- we get higher train and test accuracy. Let's look at what the classifier has learned for each category.

In [88]:
fig, axs = plt.subplots(figsize=(15, 12), nrows=n_grps_sq, ncols=n_grps_sq, sharex=True, sharey=True)
axs = axs.flatten()
for i, (X, Y_pred, Y_true, ax) in enumerate(zip(Xs_train, Ys_hierarchical_pred_train, Ys_train, axs)):
    contour = ax.contourf(grid[0], grid[1], ppc_grid[:, i, :].mean(axis=0).reshape(100, 100), cmap=cmap)
    ax.scatter(X[Y_true == 0, 0], X[Y_true == 0, 1], label='Class 0')
    ax.scatter(X[Y_true == 1, 0], X[Y_true == 1, 1], color='r', label='Class 1')
    sns.despine(); ax.legend()

Awesome! By (partially) pooling the data for each individual category we actually manage to retrieve the non-linearity. This is the strength of hierarchical models: we model the similarities of individual categories and their differences, sharing statistical power to the degree it's useful.

Of course, as we are in a Bayesian framework we also get the uncertainty estimate in our predictions:

In [89]:
fig, axs = plt.subplots(figsize=(15, 12), nrows=n_grps_sq, ncols=n_grps_sq, sharex=True, sharey=True)
axs = axs.flatten()
for i, (X, Y_pred, Y_true, ax) in enumerate(zip(Xs_train, Ys_hierarchical_pred_train, Ys_train, axs)):
    contour = ax.contourf(grid[0], grid[1], ppc_grid[:, i, :].std(axis=0).reshape(100, 100), 
                          cmap=cmap_uncertainty)
    ax.scatter(X[Y_true == 0, 0], X[Y_true == 0, 1], label='Class 0')
    ax.scatter(X[Y_true == 1, 0], X[Y_true == 1, 1], color='r', label='Class 1')
    sns.despine(); ax.legend()

Further analysis

There are a couple of things we might ask at this point. For example, how much does each layer specialize its weights per category? To answer this, we can look at the group standard deviations, which tell us how much each weight is allowed to deviate from its group mean.

In [90]:
pm.traceplot(trace_hier, varnames=['w_in_1_grp_sd', 'w_1_2_grp_sd', 'w_2_out_grp_sd']);

Interestingly, it seems that the specialization of the individual category sub-models is happening at the last layer where weights change most strongly from their group mean (as group variance is highest). I had assumed that this would happen at the first layer, based on what I found in my earlier blog post Random-Walk Bayesian Deep Networks: Dealing with Non-Stationary Data where the first layer acted as a rotation-layer.

Another interesting property of hierarchical models reveals itself here. As the group standard deviation is small for the weights in layers 1 and 2, these weights will stay close to their group mean, reducing the effective number of parameters (degrees of freedom) of the model. This is different from the separate models, where no similarities could be exploited to reduce the effective number of parameters. So from this perspective, the hierarchical model is simpler than the sum of the separate ones above.
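The shrinkage behind this reduction in degrees of freedom can be sketched with the conjugate normal case (all numbers here are made up for illustration): the smaller the group standard deviation, the harder per-category estimates get pulled toward the group mean.

```python
import numpy as np

group_mu = 0.0
category_mles = np.array([-2.0, -0.5, 0.5, 2.0])  # hypothetical unpooled estimates

def shrink(estimates, group_sd, obs_sd=1.0):
    # Posterior mean under a N(group_mu, group_sd) prior and one observation
    # with noise obs_sd: a precision-weighted average (conjugate normal case).
    w_prior = 1.0 / group_sd**2
    w_data = 1.0 / obs_sd**2
    return (w_prior * group_mu + w_data * estimates) / (w_prior + w_data)

loose = shrink(category_mles, group_sd=1.0)  # weights free to deviate
tight = shrink(category_mles, group_sd=0.1)  # weights pinned to the group mean

# Smaller group sd -> stronger pull toward the group mean.
assert np.all(np.abs(tight) < np.abs(loose))
```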

Finally, I wondered what the group-model actually learned. To get at that, we can use the trace of $\mu_{i,j}$ from the hierarchical model and pass it to the non-hierarchical model as if we trained these weights directly on a single data set.

In [91]:
trace_flat._straces[0].samples['w_in_1'] = trace_hier['w_in_1_grp']
trace_flat._straces[0].samples['w_1_2'] = trace_hier['w_1_2_grp']
trace_flat._straces[0].samples['w_2_out'] = trace_hier['w_2_out_grp']

ann_input = theano.shared(X_train)
ann_output = theano.shared(Y_train)
neural_network = construct_flat_nn(ann_input, ann_output)

with neural_network:
    # Evaluate classifier over grid
    ann_input.set_value(grid_2d)
    ann_output.set_value(dummy_out)
    ppc_grid_hier2 = pm.sample_ppc(trace_flat, samples=500, 
                                   progressbar=False)['out']
In [106]:
contour = plt.contourf(grid[0], grid[1], ppc_grid_hier2.mean(axis=0).reshape(100, 100), cmap=cmap)
plt.xlabel('X1'); plt.ylabel('X2'); plt.title('Decision surface of the group weights');

It seems like the group model is representing the Z-shape in a fairly noisy way, which makes sense because the decision surface for every sub-model looks quite different.

Correlations between weights

Usually we estimate NNs with MLE and BNNs with mean-field variational inference (like ADVI), both of which ignore correlations between weights. As we used NUTS here, I was curious whether there are meaningful correlations. Here, we look at the correlations in the first layer of the group distribution.

In [93]:
sns.clustermap(np.corrcoef(trace_hier[layer_names[0]].reshape((trace_hier[layer_names[0]].shape[0], -1)).T), 
               vmin=-1, center=0, vmax=1)

It indeed seems like point or mean-field estimates miss a lot of higher-order structure.

Informative priors for Bayesian Neural Networks

Informative priors are a powerful concept in Bayesian modeling. Any expert information you encode in your priors can greatly improve your inference.

The same should hold true for BNNs, but this raises the question of how to define informative priors over weights, which live in an abstract space that is very difficult to reason about (understanding the learned representations of neural networks is an active research topic).

While I don't know how to answer that generally, we can nonetheless explore this question with the techniques we have developed thus far. The group distributions from our hierarchical model provide structured regularization for the subnetworks. But there is no reason we can only use the group distributions in a hierarchical network: we can take the inferred group structure and reapply it in the form of informative priors to individual, flat networks.

For this, we must first estimate the group distribution as it appears to the subnetworks. The easiest approach is to take sample $l$ from the group posterior distributions (giving $\mu_l$ and $\sigma_l$) and, using that sample, draw realizations $x$ from the resulting distribution: $x \sim \mathcal{N}(\mu_l, \sigma_l^2)$. This is essentially sampling from the group posterior predictive distribution.

In [94]:
from collections import defaultdict
samples_tmp = defaultdict(list)
samples = {}

for layer_name in layer_names:
    for mu, sd in zip(trace_hier.get_values(layer_name, chains=0),
                      trace_hier.get_values(layer_name+'_sd', chains=0)):
        for _ in range(20): # not sure why the `size` kwarg doesn't work
            samples_tmp[layer_name].append(stats.norm(mu, sd).rvs())
    samples[layer_name] = np.asarray(samples_tmp[layer_name])

While there is no guarantee that this distribution is normal (technically it is a mixture of normals, so it could look much more like a Student-T), it is a good enough approximation in this case. As the correlation structure of the group distributions seems to play a key role, as we've seen above, we use MvNormal priors.
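Turning posterior samples into MvNormal prior parameters then amounts to flattening each weight matrix and computing a mean vector and covariance matrix across draws. A small numpy sketch (the draws here are fabricated purely to show the shapes):

```python
import numpy as np

rng = np.random.RandomState(42)
# Pretend these are 2000 posterior-predictive draws of a (2, 5) weight matrix.
draws = rng.randn(2000, 2, 5) * 0.5 + 1.0

n_draws = draws.shape[0]
flat = draws.reshape((n_draws, -1))  # flatten to capture cross-weight correlations

prior_mu = flat.mean(axis=0)   # length-10 mean vector
prior_cov = np.cov(flat.T)     # 10 x 10 covariance matrix

assert prior_mu.shape == (10,)
assert prior_cov.shape == (10, 10)
```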

Note that this code just creates a single, non-hierarchical BNN.

In [95]:
def construct_flat_prior_nn(ann_input, ann_output, prior_1_mu=None, prior_1_cov=None, 
                            prior_2_mu=None, prior_2_cov=None, prior_out_mu=None, prior_out_cov=None):
    n_hidden = 5
    with pm.Model() as neural_network:
        # In order to model the correlation structure between the 2D weights,
        # we flatten them first. Here we have to reshape them back to their
        # original 2D shape.
        weights_in_1 = (pm.MvNormal('w_in_1', prior_1_mu.flatten(), 
                                    cov=prior_1_cov, 
                                    shape=prior_1_cov.shape[0])
                        .reshape((X.shape[1], n_hidden)))
        # Weights from 1st to 2nd layer
        weights_1_2 = (pm.MvNormal('w_1_2', prior_2_mu.flatten(), 
                                   cov=prior_2_cov, 
                                   shape=prior_2_cov.shape[0])
                       .reshape((n_hidden, n_hidden)))
        # Weights from hidden layer to output
        weights_2_out = pm.MvNormal('w_2_out', prior_out_mu.flatten(), 
                                    cov=prior_out_cov, 
                                    shape=prior_out_cov.shape[0])
        # Build neural-network using tanh activation function
        act_1 = pm.math.tanh(, weights_in_1))
        act_2 = pm.math.tanh(, weights_1_2))
        act_out =, weights_2_out)
        # Binary classification -> Bernoulli likelihood
        out = pm.Bernoulli('out', logit_p=act_out, observed=ann_output)
    return neural_network

Again, we just loop over the categories in our data set and create a separate BNN for each one. This is identical to our first attempt above; however, now we are setting the priors to those estimated from the group posterior of our hierarchical model.

In [ ]:
Ys_pred_train = []
Ys_pred_test = []
grid_eval = []

for X_train, Y_train, X_test, Y_test in zip(Xs_train, Ys_train, Xs_test, Ys_test):
    n_samples = samples['w_in_1_grp'].shape[0]
    # Construct informative priors from previous hierarchical posterior
    bnn_kwargs = \
        dict(prior_1_mu=samples['w_in_1_grp'].mean(axis=0),
             prior_1_cov=np.cov(samples['w_in_1_grp'].reshape((n_samples, -1)).T),
             prior_2_mu=samples['w_1_2_grp'].mean(axis=0),
             prior_2_cov=np.cov(samples['w_1_2_grp'].reshape((n_samples, -1)).T),
             prior_out_mu=samples['w_2_out_grp'].mean(axis=0),
             prior_out_cov=np.cov(samples['w_2_out_grp'].reshape((n_samples, -1)).T))
    pred_train, pred_test, ppc_grid, _ = \
        fit_and_eval_bnn(X_train, X_test, 
                         Y_train, Y_test, 
                         grid_2d, dummy_out, 
                         construct_flat_prior_nn, 
                         bnn_kwargs=bnn_kwargs)
    Ys_pred_train.append(pred_train)
    Ys_pred_test.append(pred_test)
    grid_eval.append(ppc_grid)

Ys_info_pred_train = np.asarray(Ys_pred_train)
Ys_info_pred_test = np.asarray(Ys_pred_test)
grid_eval = np.asarray(grid_eval)

Drum roll

In [100]:
fig, axs = plt.subplots(figsize=(15, 12), nrows=n_grps_sq, ncols=n_grps_sq, sharex=True, sharey=True)
axs = axs.flatten()
for i, (X, Y_pred, Y_true, ax) in enumerate(zip(Xs_train, Ys_pred_train, Ys_train, axs)):
    contour = ax.contourf(grid[0], grid[1], grid_eval[i, ...].mean(axis=0).reshape(100, 100), cmap=cmap)
    ax.scatter(X[Y_true == 0, 0], X[Y_true == 0, 1], label='Class 0')
    ax.scatter(X[Y_true == 1, 0], X[Y_true == 1, 1], color='r', label='Class 1')
    sns.despine(); ax.legend()
In [101]:
print('Train accuracy = {:.2f}%'.format(100*np.mean(Ys_info_pred_train == Ys_train)))
Train accuracy = 91.56%
In [102]:
print('Test accuracy = {:.2f}%'.format(100*np.mean(Ys_info_pred_test == Ys_test)))
Test accuracy = 87.11%

Holy mackerel, it actually worked!

As demonstrated, informed priors can help NNs a lot. But what if we don't have hierarchical structure, or it would be too expensive to estimate? We could attempt to construct priors by deriving them from pre-trained models. For example, if I wanted to train an object recognition model on my own custom categories, I could start with a model like ResNet trained on the CIFAR data set, derive priors from its weights, and then train a new model on my custom data set, which could then get by with fewer images than if trained from scratch.


In this blog post I showed how we can borrow ideas from Bayesian statistics (hierarchical modeling and informative priors) and apply them to deep learning to boost accuracy when our data set is nested and we may not have a huge amount of data.

If you want to play around with this notebook yourself, download it here.


If you enjoyed this blog post, please consider supporting me on Patreon.

Thanks to Adrian Seyboldt and Chris Chatham for useful discussions and feedback. Thanks also to my patrons, particularly Jonathan Ng and Vladislavs Dovgalecs.

by Thomas Wiecki at August 13, 2018 02:00 PM

August 12, 2018


NeuroML2/LEMS is moving into Neural Mass Models and whole brain networks

In the last months, as part of the Google Summer of Code 2018, I have been working on a project that aimed to implement neuronal models representing averaged population activity in NeuroML2/LEMS. The project was supported by the INCF, and my mentor, Padraig Gleeson, and I had 3 months to shape and bring to life all the ideas we had in our heads. This blog post summarises the core motivation of the project, the technical challenges, what I have done, and future steps.

NeuroML version 2 and LEMS were introduced in order to standardise the description of neuroscience computational models and facilitate the shareability of results among different research groups1. However, so far, NeuroML2/LEMS have focused on modelling spiking neurons and how information is exchanged between them in networks. With the introduction of neural mass models, NeuroML2/LEMS can be extended to study interactions between large-scale systems such as cortical regions and indeed whole brain dynamics. To achieve this, my project consisted of implementing the basic structures needed in NeuroML2/LEMS to simulate Neural Mass Models and compare the results with previously published papers.

What I did
During the project I focused on the implementation of three previously described Neural Mass Models into NeuroML2/LEMS:

1. del Molino et al., 20172: This study analyses the dynamics and interaction between excitatory neurons and three types of interneurons (parvalbumin (PV), somatostatin (SST) and vasoactive intestinal peptide (VIP) expressing).  It first looks at the interactions between single units representing each population and then it scales up to analyse the interaction between a network with multiple interacting units in each population. A detailed description of the model that I have implemented in NeuroML and an illustrative Jupyter notebook that reproduces the main findings from the paper can be found at this git repository.

Overview of the del Molino et al., 2017 model implemented in NeuroML2/LEMS. The scheme on the left illustrates how the different populations are connected and the entry point of the top-down modulatory input. Once the dynamics of the interaction between single units have been analysed, we scale up to look at the interaction of a network with multiple interacting units in each population. The network population is illustrated on the right.

2. Wilson and Cowan, 19723: This classic model describes the interaction between populations of excitatory and inhibitory neurons. In this project, I implemented a NEURON interpretation of the Wilson and Cowan model in NeuroML and validated it by comparing the dynamics of the generated results. The repository with the Wilson and Cowan simulations can be found here.

Illustration of the population modelled with Wilson and Cowan simulation and how the dynamics over time change with (Drive) and without (No Drive) an additional external input current  
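As a sketch of what such a rate model computes (the gain function and all coupling constants below are illustrative and not taken from the NeuroML implementation), the Wilson-Cowan equations can be integrated with forward Euler in a few lines of Python:

```python
import numpy as np

def sigmoid(x, a=1.2, theta=2.8):
    # Logistic gain function; a and theta are made-up values for the sketch.
    return 1.0 / (1.0 + np.exp(-a * (x - theta)))

def wilson_cowan(T=100.0, dt=0.01, P=1.25,
                 w_ee=12.0, w_ei=4.0, w_ie=13.0, w_ii=11.0,
                 tau_e=1.0, tau_i=1.0):
    # Forward-Euler integration of the excitatory (E) and inhibitory (I)
    # population rates; P is an external drive to the excitatory population.
    n = int(T / dt)
    E = np.zeros(n)
    I = np.zeros(n)
    for t in range(n - 1):
        dE = (-E[t] + sigmoid(w_ee * E[t] - w_ei * I[t] + P)) / tau_e
        dI = (-I[t] + sigmoid(w_ie * E[t] - w_ii * I[t])) / tau_i
        E[t + 1] = E[t] + dt * dE
        I[t + 1] = I[t] + dt * dI
    return E, I

E, I = wilson_cowan()
```

Varying the drive P (as in the "Drive" vs "No Drive" comparison above) changes whether the populations settle into a fixed point or oscillate.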

3. Mejias et al., 20164: analyses the dynamics across multiple scales of the primate cortex (intralaminar, interlaminar, interareal and whole cortex). This git repo so far implements the intralaminar and interlaminar simulations of the cortex in Python and provides the methods needed for analysing the results from the NeuroML2/LEMS simulation.  It also contains a first model of the interlaminar simulation using NeuroML2/LEMS that will be further extended to simulate the firing rate at interlaminar, interareal and whole cortex level.

Illustration of the Mejias et al., 2016 models implemented so far: at the intralaminar level, the dynamics of the excitatory (red) and inhibitory (blue) populations are analysed for each layer independently, while at the interlaminar level the interactions between supragranular (Layer 2/3) and infragranular (Layer 5/6) layers are taken into account.

The technical challenges 

In order to be able to simulate Neural Mass Models, we had to extend previously defined  NeuroML2 components used to simulate spiking models. To this end we defined two new core components:

  • baseRateUnit: extends baseCellMembPot but, instead of exposing the membrane potential, exposes the population’s firing rate.
  • rateSynapse: in the spiking models, a change in current is only triggered if the cell membrane potential exceeds a specific threshold. In a rate-based model, however, there is a continuous transmission of currents between the populations. We therefore extended the baseSynapse component so that it allows continuous transmission of currents between populations using continuous connections.

The detailed implementation of the two components can be found here.

The future
The projects I have worked on during these 3 months were a proof of concept to explore how NeuroML2/LEMS can be extended to simulate Neural Mass Models. They provided valuable insight into the necessary components to extend NeuroML2/LEMS to large-scale dynamics and a proof of concept that the generated signal is comparable to those generated with other tools.

These are, however, just the first steps into a very interesting direction: just imagine how incredible it would be to extend this data to a whole brain simulation. One possible candidate would be to use mouse data for the simulation (e.g. from the Allen Institute mouse connectivity datasets). So stay tuned for future updates!

The experience
Working on this project was not only a great way to learn the intricacies of NeuroML and get a better understanding of Neural Mass Models, it was also a great opportunity to get my hands dirty with the code. It was very satisfying to produce something from beginning to end in a short time. In addition, it was my first contact with the open source community. Thank you very much, Padraig, for the help and the guidance during these months!


by Jessica Dafflon at August 12, 2018 12:25 PM

August 09, 2018

Titus Brown

"Labor" and "Engaged effort"

A few blog posts back, I suggested that "effort" is a common pool resource to be managed by communities building open online resources, such as open source projects or other "digital commons" projects. The logic was roughly thus: there is clearly a common pool resource at work in these projects - but what is it? A common pool resource must be both non-excludable and rivalrous: you cannot easily limit who consumes it, and consumption by one person reduces what is left for others. Timber in a shared forest, irrigation water shared amongst many farmers, and grass in a commons are all examples.

But when you start looking at data, source code, and other resources managed in open online projects, it is clear that they are more like public goods: they can be consumed by anyone, and the costs of distribution are minimal. And yet we know from experience that there are costs associated with these projects, and it's fairly clear that "maintenance" is one source of costs (but not the only source). "Effort" was my first attempt to name this.

One of the comments I've gotten from multiple people is that "labor" seems like a better word. After all, isn't that what "effort" means? Someone is putting in labor?

Labor and effort and investment

Let's start by pointing out that I know very little about economics :). So, reader beware.

To me, the term "labor" implies a significant degree of fungibility: anyone can do the work, and the challenge is to find and support someone. If labor is a resource to be managed, then it seems to me we are saying (1) there is some fixed pool of labor available, either directly (funds already allocated) or indirectly (fixed pool of funds); (2) the challenge is to allocate that fixed pool of labor among the tasks to be done; and (3) it is not necessarily hard to quickly find someone to perform the labor.

In contrast, when I look at open source projects, I see that while there is some pool of labor available, it is not necessarily fixed (because there is volunteer labor available, given enthusiasm or engagement); and that while we do wish to allocate it amongst the tasks to be done, the labor involved in deciding the allocation is a big part of the challenge; and that it is not easy to quickly find someone to perform the labor, since often rather specialized skills and experience are required.

Open source projects solve these challenges by inviting enthusiasm and supporting self-determination of task allocation; communicating in a bottom-up way about tasks that need to be done; and trying to grow the pool of people with specialized skills available to the project over time. This places the pool of "labor" itself in charge of the project's direction, which I think is connected to the self-governance aspect of managing CPRs.

So I'm using "effort" to differentiate that kind of "engaged" labor from other labor. It seems like a significant distinction to me.

Engaged effort and maintenance

In Ostrom's book, there are a number of riveting examples of maintenance effort being applied by locally organized efforts. To quote from one (emphasis mine):

The Bacarra-Vinter federation of zanjeras constructs and maintains a 100-meter-long brush dam that spans the Bacarra-Vinter river [in the Philippines]. ... During the rainy season each year, the river destroys the federation's dam ... During some years the dam will be destroyed three or four times.

[ ... details of how the dam is rebuilt ... ]

In terms of the contemporary schedule of 5 days per week, this amounts to two months of work supplied without direct monetary payment.

This is a pretty significant investment of labor! And while there is probably some sociology going on here around ownership - people are more willing to invest labor in things they have ownership in - this level of labor must be tied to pretty strong economic incentives. (In this case, it's that the dams are a critical part of the irrigation system that the zanjeras depend upon for farming: no dams, no irrigation, no crops.)

This, to me, is akin to what goes on in open source projects. There is an understanding in the community that a common problem must be solved, and so the community solves it - because they gain some kind of utility from the continuation of the project.

Is effort non-excludable on open online projects?

Another question that I've been thinking about: is "effort" really non-excludable on open online projects? That would mean that anyone in the project's community would be able to appropriate the effort of anyone else on the project. Is that true? If so, how does that work?

Let's back up: one of the implications of claiming that "effort" is a common pool resource in open online projects is that it's non-excludable: it can be appropriated by other people on the project. As a common resource, it would need to be managed by the project rather than managed individually.

Is this true?

I think it actually is. This in fact corresponds to one of the unwritten rules about how open source projects work: when a bug is discovered and posted to the issue tracker, an open source project will immediately evaluate and prioritize working on it (or not). In turn, someone who is devoting their time to the project will pick it up and work on it; the community thus appropriates that person's effort. The fact that it's all "voluntary" isn't a relevant distinction, since the culture of open source projects virtually guarantees that someone will pick up and fix important bugs. This bears some interesting resemblances to how the zanjeras above work: a committee figures out if the dam needs to be rebuilt, and then the community of contributors goes forth and rebuilds it.

This culture corresponds directly with the problem of "maintainer burnout" on open source projects, where the maintenance load becomes a crushing burden on the core contributors. In effect, a project that doesn't steward the common pool resource of engaged effort properly ends up with a depleted pool of that effort. (This doesn't explain the part where maintainers continue to contribute effort well past the point of personal sustainability; I think that could be explained by some aspect of the community engagement around solving shared problems, but I need to think more about that.)

Interestingly, another parallel that emerges from this perspective is the intersection of community governance with community labor: those who put effort in to prioritizing the bug and fixing it are those who help decide which bugs are going to be fixed next and which features are going to be added. Or, in other words, those who contribute effort become part of the community that decides how to allocate that effort in the future.

What are the implications, good sir?

If this perspective is correct, I think there are lots of implications for the active design and growth of open online communities, and more specifically for governance design. One obvious one is that the pool of engaged effort on a project is the central resource to be preserved against encroachment, and that maintaining and growing this effort is a key to sustainability.

Comments and thoughts welcome as always!


p.s. Thanks especially to Carly Strasser for her quizzical looks and probing questions on this topic, Josh Greenberg for pointing me at club goods and this topic more generally, and Adam Resnick, Stan Ahalt, and Matthew Trunnell for the initial insight!

by C. Titus Brown at August 09, 2018 10:00 PM

Continuum Analytics

Deploying Machine Learning Models is Hard, But It Doesn’t Have to Be

With free, open source tools like Anaconda Distribution, it has never been easier for individual data scientists to analyze data and build machine learning models on their laptops. So why does deriving actual business value from machine learning remain elusive for many organizations? Because while it’s easy for data scientists to build powerful models on …
Read more →

The post Deploying Machine Learning Models is Hard, But It Doesn’t Have to Be appeared first on Anaconda.

by Rory Merritt at August 09, 2018 04:32 PM

August 07, 2018

Matthew Rocklin

Building SAGA optimization for Dask arrays

This work is supported by ETH Zurich, Anaconda Inc, and the Berkeley Institute for Data Science

At a recent Scikit-learn/Scikit-image/Dask sprint at BIDS, Fabian Pedregosa (a machine learning researcher and Scikit-learn developer) and Matthew Rocklin (Dask core developer) sat down together to develop an implementation of the incremental optimization algorithm SAGA on parallel Dask datasets. The result is a sequential algorithm that can be run on any dask array, and so allows the data to be stored on disk or even distributed among different machines.

It was interesting both to see how the algorithm performed and to see the ease and challenges of running a research algorithm on a distributed Dask dataset.


We started with an initial implementation that Fabian had written for NumPy arrays using Numba. The following code solves a finite-sum optimization problem of the form min_x ∑_i f(a_iᵀ x, b_i), here with f the logistic loss:

import numpy as np
from numba import njit
from sklearn.linear_model.sag import get_auto_step_size
from sklearn.utils.extmath import row_norms

def deriv_logistic(p, y):
    # derivative of logistic loss
    # same as in lightning (with minus sign)
    p *= y
    if p > 0:
        phi = 1. / (1 + np.exp(-p))
        exp_t = np.exp(p)