December 03, 2013

Philip Herron

GCC Rust

So today I announced on the GCC mailing list that I am treating this as a project to take my compiler development pretty seriously. I believe that Rust would suit a GCC front-end very well, perhaps better than LLVM: I have a feeling that the high-level optimisers might have a good chance at completely optimising the crap out of Rust.

There is some concern from the Rust community over the duplication of effort, and over how the two compilers might end up with inconsistencies. But in my opinion this is where Rust needs to stabilize, and it should even consider publishing something of a high-level language specification; nothing really formal, because those are never helpful.

There are really 3 options for continuing development:

• Reuse libsyntax from rustc and add a GCC backend option alongside LLVM
• Continue with what I am doing now
• Copy a bunch of code from rustc and bootstrap the GCC front-end with rustc

The only viable option here is to continue with what I am doing now, since the other options have licensing problems and technical issues around reusing code from rustc. And having a GCC backend switch alongside LLVM really depends on David Malcolm's work on gccjit.

I tend to think of gcc-rust as being to Rust what gccgo is to Go.

Luis Pedro Coelho

luispedro

Python “lists” are not lists. They are arrays.

§

This seems like the sort of rookie mistake that starting scientists make: they use terms imprecisely; they write like journalists, not scientists.

I can totally imagine myself telling a student starting out: a list is a list and an array is an array; these are well defined computer science terms, and you cannot use them interchangeably (and I can imagine their mental reaction: this guy is a pedantic prick [1]).

I tell the students to think like a politician: everything you write must still be defensible even when taken out of context [2].

Calling an array a list is wrong.
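
To make the distinction concrete, here is a minimal sketch of my own (not from the original post): indexing a Python list takes constant time regardless of position, which is what you expect from an array, not from a linked list.

import timeit

xs = list(range(10000000))

# Access time is independent of position: O(1) indexing, the hallmark of an array.
t_front = timeit.timeit(lambda: xs[0], number=1000000)
t_back = timeit.timeit(lambda: xs[-1], number=1000000)
print("front: %.3f  back: %.3f" % (t_front, t_back))  # roughly equal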

§

I wondered whether they had been lists in the past, so I went back and checked the earliest version of Python available, which happens to be Python 1.0.1.

After a minor bug fix [3], it actually compiled and ran on my Ubuntu 13.04 machine! It is somewhat crashy on errors, but otherwise works.

§

First find: already in Python 1.0.1, lists were arrays! This was already a year-old code base, so maybe at an even earlier point men were men and lists were lists. But by 1994, lists were arrays.

§

A surprising thing about the code: if you’re familiar with recent Python code, this will feel very familiar. Here is how you set a list member:

int
setlistitem(op, i, newitem)
	register object *op;
	register int i;
	register object *newitem;
{
	register object *olditem;
	if (!is_listobject(op)) {
		XDECREF(newitem);
		return -1;
	}
	if (i < 0 || i >= ((listobject *)op) -> ob_size) {
		XDECREF(newitem);
		err_setstr(IndexError, "list assignment index out of range");
		return -1;
	}
	olditem = ((listobject *)op) -> ob_item[i];
	((listobject *)op) -> ob_item[i] = newitem;
	XDECREF(olditem);
	return 0;
}

This feels very similar to recent Python.

§

Even more surprising, here is the inner loop:

switch (opcode) {

case POP_TOP:
	v = POP();
	DECREF(v);
	break;

case ROT_TWO:
	v = POP();
	w = POP();
	PUSH(v);
	PUSH(w);
	break;

and so on… This is almost exactly the same as the most recent Python. We are still using fundamentally the same implementation of Python that was out 20 years ago.

 [1] Student: I meant list as a general one-thing-after-another-in-order. Wise older researcher: You mean a sequence. Student: Yeah, that. Wise older researcher: Then you should write sequence. At this point, the student attains enlightenment and starts applying for jobs in industry.
 [2] A common complaint I have heard after several MS thesis defenses is "the jury member took this one sentence from my introduction out of context and made a big deal out of it. It wasn't even that related to my work."
 [3] There is a function getline() in the standard library which was not there in the early 1990s. So, Python’s use of an internal function with the same name gets the compiler confused. Renaming the internal function to python_getline fixes it.

December 01, 2013

Jake Vanderplas

Kernel Density Estimation in Python

Last week Michael Lerner posted a nice explanation of the relationship between histograms and kernel density estimation (KDE). I've made some attempts in this direction before (both in the scikit-learn documentation and in our upcoming textbook), but Michael's use of interactive javascript widgets makes the relationship extremely intuitive. I had been planning to write a similar post on the theory behind KDE and why it's useful, but Michael took care of that part. Instead, I'm going to focus here on comparing the actual implementations of KDE currently available in Python. If you're unsure what kernel density estimation is, read Michael's post and then come back here.

There are several options available for computing kernel density estimates in Python. The question of the optimal KDE implementation for any situation, however, is not entirely straightforward, and depends a lot on what your particular goals are. Here are the four KDE implementations I'm aware of in the SciPy/Scikits stack:

• Scipy: gaussian_kde in scipy.stats
• Statsmodels: KDEUnivariate in statsmodels.nonparametric.kde
• Statsmodels: KDEMultivariate in statsmodels.nonparametric.kernel_density
• Scikit-learn: KernelDensity in sklearn.neighbors

Each has advantages and disadvantages, and each has its area of applicability. I'll start with a table summarizing the strengths and weaknesses of each, before discussing each feature in more detail and running some simple benchmarks to gauge their computational cost:

                            Bandwidth           Available        Multi-     Heterogeneous  FFT-based    Tree-based
                            selection           kernels          dimension  data           computation  computation

Scipy                       Scott & Silverman   one (Gaussian)   yes        no             no           no
Statsmodels KDEUnivariate   Scott & Silverman   seven            1D only    no             yes          no
Statsmodels KDEMultivariate normal reference,   seven            yes        yes            no           no
                            cross-validation
Scikit-learn                none built-in;      6 kernels x      yes        no             no           Ball Tree
                            cross val. avail.   12 metrics                                              or KD Tree

Comparing the Implementations

The four implementations mentioned above have very different interfaces. For the sake of the examples and benchmarks below, we'll start by defining a uniform interface to all four, assuming one-dimensional input data. The following functions should make clear how the interfaces compare:

In [1]:
from sklearn.neighbors import KernelDensity
from scipy.stats import gaussian_kde
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.kernel_density import KDEMultivariate

def kde_scipy(x, x_grid, bandwidth=0.2, **kwargs):
    """Kernel Density Estimation with Scipy"""
    # Note that scipy weights its bandwidth by the covariance of the
    # input data.  To make the results comparable to the other methods,
    # we divide the bandwidth by the sample standard deviation here.
    kde = gaussian_kde(x, bw_method=bandwidth / x.std(ddof=1), **kwargs)
    return kde.evaluate(x_grid)

def kde_statsmodels_u(x, x_grid, bandwidth=0.2, **kwargs):
    """Univariate Kernel Density Estimation with Statsmodels"""
    kde = KDEUnivariate(x)
    kde.fit(bw=bandwidth, **kwargs)
    return kde.evaluate(x_grid)

def kde_statsmodels_m(x, x_grid, bandwidth=0.2, **kwargs):
    """Multivariate Kernel Density Estimation with Statsmodels"""
    kde = KDEMultivariate(x, bw=bandwidth * np.ones_like(x),
                          var_type='c', **kwargs)
    return kde.pdf(x_grid)

def kde_sklearn(x, x_grid, bandwidth=0.2, **kwargs):
    """Kernel Density Estimation with Scikit-learn"""
    kde_skl = KernelDensity(bandwidth=bandwidth, **kwargs)
    kde_skl.fit(x[:, np.newaxis])
    # score_samples() returns the log-likelihood of the samples
    log_pdf = kde_skl.score_samples(x_grid[:, np.newaxis])
    return np.exp(log_pdf)

kde_funcs = [kde_statsmodels_u, kde_statsmodels_m, kde_scipy, kde_sklearn]
kde_funcnames = ['Statsmodels-U', 'Statsmodels-M', 'Scipy', 'Scikit-learn']

print "Package Versions:"
import sklearn; print "  scikit-learn:", sklearn.__version__
import scipy; print "  scipy:", scipy.__version__
import statsmodels; print "  statsmodels:", statsmodels.__version__

Package Versions:
scikit-learn: 0.14.1
scipy: 0.13.1
statsmodels: 0.5.0



Because several of these are newer functionalities (in particular, the KernelDensity estimator was added in version 0.14 of Scikit-learn), I added an explicit print-out of the versions used in running this notebook.

Now that we've defined these interfaces, let's look at the results of the four KDE approaches. We'll start with the normal matplotlib backend command, and then plot visualizations of the four results on the same 1 dimensional bimodal data:

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from scipy.stats.distributions import norm

# The grid we'll use for plotting
x_grid = np.linspace(-4.5, 3.5, 1000)

# Draw points from a bimodal distribution in 1D
np.random.seed(0)
x = np.concatenate([norm(-1, 1.).rvs(400),
                    norm(1, 0.3).rvs(100)])
pdf_true = (0.8 * norm(-1, 1).pdf(x_grid) +
            0.2 * norm(1, 0.3).pdf(x_grid))

# Plot the four kernel density estimates
fig, ax = plt.subplots(1, 4, sharey=True,
                       figsize=(13, 3))

for i in range(4):
    pdf = kde_funcs[i](x, x_grid, bandwidth=0.2)
    ax[i].plot(x_grid, pdf, color='blue', alpha=0.5, lw=3)
    ax[i].fill(x_grid, pdf_true, ec='gray', fc='gray', alpha=0.4)
    ax[i].set_title(kde_funcnames[i])
    ax[i].set_xlim(-4.5, 3.5)

from IPython.display import HTML
HTML("<font color='#666666'>Gray = True underlying distribution</font><br>"
     "<font color='#6666ff'>Blue = KDE model distribution (500 pts)</font>")

Out[3]:
Gray = True underlying distribution
Blue = KDE model distribution (500 pts)

The results are essentially identical, as we'd expect: all four algorithms are effectively computing the same result by different means.

Features of the Algorithms

Given that the results of the algorithms (for 1-dimensional data, at least) are essentially equivalent, why would you use one over another? The answer lies a bit deeper in the theory of how KDE is computed and applied. Above I showed a table that summarizes some of the advantages and disadvantages of each algorithm; here I'll discuss a few of those features in a bit more detail:

Bandwidth selection

The selection of bandwidth is an important piece of KDE. For the same input data, different bandwidths can produce very different results:

In [4]:
fig, ax = plt.subplots()
for bandwidth in [0.1, 0.3, 1.0]:
    ax.plot(x_grid, kde_sklearn(x, x_grid, bandwidth=bandwidth),
            label='bw={0}'.format(bandwidth), linewidth=3, alpha=0.5)
ax.hist(x, 30, fc='gray', histtype='stepfilled', alpha=0.3, normed=True)
ax.set_xlim(-4.5, 3.5)
ax.legend(loc='upper left')

Out[4]:
<matplotlib.legend.Legend at 0x108b8c3d0>


Using different bandwidths can lead to entirely different ideas of the underlying nature of the data! Given the importance of bandwidth, how might you determine the optimal bandwidth for any given problem?

There are two classes of approaches to this problem: in the statistics community, it is common to use reference rules, where the optimal bandwidth is estimated from theoretical forms based on assumptions about the data distribution. A common reference rule is Silverman's rule, which is derived for univariate KDE and included within both the Scipy and Statsmodels implementations. Other potential reference rules are ones based on Information Criteria, such as the well-known AIC and BIC.
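
For concreteness, here is a rough sketch of Silverman's rule of thumb for one-dimensional data (my own illustration, not part of the original notebook); the silverman_bandwidth helper is purely illustrative:

import numpy as np

def silverman_bandwidth(data):
    # Silverman's rule of thumb: h = 0.9 * min(std, IQR / 1.34) * n ** (-1/5)
    data = np.asarray(data)
    n = data.size
    std = data.std(ddof=1)
    iqr = np.subtract(*np.percentile(data, [75, 25]))
    return 0.9 * min(std, iqr / 1.34) * n ** (-0.2)

# scipy can also apply its own Silverman factor directly:
#   gaussian_kde(x, bw_method='silverman')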

In the Machine Learning world, the use of reference rules is less common. Instead, an empirical approach such as cross validation is often used. In cross validation, the model is fit to part of the data, and then a quantitative metric is computed to determine how well this model fits the remaining data. Such an empirical approach to model parameter selection is very flexible, and can be used regardless of the underlying data distribution.

Because the various reference rules generally depend on (often dubious) assumptions about the underlying distribution of the data, bandwidth selection based on cross-validation can produce more trustworthy results for real-world datasets. A leave-one-out cross-validation scheme is built in to the Statsmodels KDEMultivariate class. For large datasets, however, leave-one-out cross-validation can be extremely slow. Scikit-learn does not currently provide built-in cross validation within the KernelDensity estimator, but the standard cross validation tools within the module can be applied quite easily, as shown in the example below.

Bandwidth Cross-Validation in Scikit-Learn

Using cross validation within Scikit-learn is straightforward with the GridSearchCV meta-estimator:

In [5]:
from sklearn.grid_search import GridSearchCV
grid = GridSearchCV(KernelDensity(),
{'bandwidth': np.linspace(0.1, 1.0, 30)},
cv=20) # 20-fold cross-validation
grid.fit(x[:, None])
print grid.best_params_

{'bandwidth': 0.19310344827586207}



According to the cross-validation score (i.e. the maximum likelihood), the best bandwidth is around 0.19. Let's plot the result:

In [6]:
kde = grid.best_estimator_
pdf = np.exp(kde.score_samples(x_grid[:, None]))

fig, ax = plt.subplots()
ax.plot(x_grid, pdf, linewidth=3, alpha=0.5, label='bw=%.2f' % kde.bandwidth)
ax.hist(x, 30, fc='gray', histtype='stepfilled', alpha=0.3, normed=True)
ax.legend(loc='upper left')
ax.set_xlim(-4.5, 3.5);


We see that the cross-validation yields a bandwidth which is close to what we might choose by eye, and the resulting density estimate closely reflects the distribution of the underlying data.

Kernels

Above we've been using the Gaussian kernel, but this is not the only available option. KDE can be used with any kernel function, and different kernels lead to density estimates with different characteristics.

The Scipy KDE implementation contains only the common Gaussian Kernel. Statsmodels contains seven kernels, while Scikit-learn contains six kernels, each of which can be used with one of about a dozen distance metrics, resulting in a very flexible range of effective kernel shapes.
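
As a brief aside (a sketch of my own, not from the original notebook), combining a non-default kernel with a non-Euclidean metric in Scikit-learn looks like this, reusing the x and x_grid arrays defined above:

kde_tophat = KernelDensity(bandwidth=0.2, kernel='tophat', metric='manhattan')
kde_tophat.fit(x[:, np.newaxis])
pdf_tophat = np.exp(kde_tophat.score_samples(x_grid[:, np.newaxis]))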

Here is a quick visualization of the six kernel forms available in Scikit-learn. For clarity, the plot_kernels() function used here is defined at the end of the notebook:

In [8]:
plot_kernels()


The various kernel shapes lead to estimators with very different characteristics. For some examples of these in action, see the Scikit-learn documentation or the AstroML examples.

Heterogeneous Data

One advantage that Statsmodels' KDEMultivariate has over the other algorithms is its ability to handle heterogeneous data, i.e. a mix of continuous, ordered discrete, and unordered discrete variables. All of the other implementations require homogeneous datasets. Though the problem of heterogeneous data is interesting, I won't discuss it more here. For more details, see the KDEMultivariate documentation.
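
For readers curious what the heterogeneous case looks like, here is a hedged sketch of my own (not from the original notebook): one continuous column and one unordered discrete column, with the var_type string telling KDEMultivariate which is which.

mixed = np.column_stack([np.random.normal(size=500),
                         np.random.randint(0, 3, size=500)])
kde_mixed = KDEMultivariate(mixed, var_type='cu', bw='normal_reference')
print(kde_mixed.pdf(mixed[:5]))   # density evaluated at the first five samples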

FFT-based computation

For large datasets, a kernel density estimate can be computed efficiently via the convolution theorem using a fast Fourier transform. This requires binning the data, so the approach quickly becomes inefficient in higher dimensions. Of the four algorithms discussed here, only Statsmodels' KDEUnivariate implements an FFT-based KDE. As we'll see below, the FFT provides some computational gains for large numbers of points, but in most situations is not as effective as tree-based KDE implementations.
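
As a minimal sketch (not part of the original notebook, reusing the x and x_grid arrays defined above), switching the FFT path on and off in KDEUnivariate looks roughly like this; the two paths should agree closely, but the FFT version scales better with the number of points:

kde_fft = KDEUnivariate(x)
kde_fft.fit(bw=0.2, fft=True)       # gridded, FFT-based evaluation

kde_direct = KDEUnivariate(x)
kde_direct.fit(bw=0.2, fft=False)   # direct summation over all points

# maximum pointwise difference between the two evaluations
print(np.max(np.abs(kde_fft.evaluate(x_grid) - kde_direct.evaluate(x_grid))))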

Tree-based computation

For M evaluations of N points, the KDE computation naively requires $$\mathcal{O}[MN]$$ computations (i.e. a distance computation between each input/output pair). This puts KDE in the same category as Nearest Neighbors, N-point correlation functions, and Gaussian Process Regression, all of which are examples of Generalized N-body problems which can be efficiently computed using specialized data structures such as a KD Tree (I discussed spatial trees in the context of nearest neighbors searches in a previous blog post).

The main idea is this: if you can show that a query point $$Q_0$$ is geometrically far from a set of training points $$\{T_i\}$$, then you no longer need to compute every kernel weight between $$Q_0$$ and the points in $$\{T_i\}$$: it is sufficient to compute one kernel weight at the average distance, and use that as a proxy for the others. With careful consideration of the bounds on the distances and the maximum error tolerance for the final result, it is possible to greatly reduce the number of required operations for a KDE computation.

For the 0.14 release of Scikit-learn, I wrote an efficient KDE implementation built on a KD Tree and a Ball Tree. By setting the parameters rtol (relative tolerance) and atol (absolute tolerance), it is possible to compute very fast approximate kernel density estimates at any desired degree of accuracy. The final result $$p$$ is algorithmically guaranteed to satisfy

$${\rm abs}\left(p - p_{true}\right) < {\tt atol} + {\tt rtol} \cdot p_{true}$$

This scheme effectively sets up a tradeoff between computation time and accuracy. As we'll see below, even marginal reduction in accuracy (say, allowing errors of 1 part in $$10^8$$) can lead to impressive gains in computational efficiency.
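
As a quick illustration of this trade-off (my own sketch, reusing x and x_grid from above), compare an effectively exact evaluation against one that allows a relative error of one part in 10^8:

from time import time

for rtol in [0.0, 1E-8]:
    kde = KernelDensity(bandwidth=0.2, rtol=rtol).fit(x[:, np.newaxis])
    t0 = time()
    kde.score_samples(x_grid[:, np.newaxis])
    print("rtol = %g: %.4f sec" % (rtol, time() - t0))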

Computational Efficiency

Next comes the fun part: comparing the computational efficiency of the various algorithms. Here we'll look at the computation time as a function of the number of points in the distribution for a 1-dimensional case. As noted above, several of the algorithms also implement multi-dimensional computations: the scaling with the number of points should not change appreciably in the multi-dimensional case.

Here and throughout, we will compute the KDE for 5000 query points. For clarity, the plot_scaling function used here is defined at the end of the notebook: if you download and run this notebook, please scroll down and execute that cell first.

Scaling with the Number of Points

First we'll look at the scaling with the number of points in the input distribution, spread from 10 points to 10,000 points:

In [9]:
plot_scaling(N=np.logspace(1, 4, 10),
             kwds={'Statsmodels-U': {'fft': False}});


The SciPy algorithm (red line) exhibits the expected $$\mathcal{O}[N]$$ scaling for a naive KDE implementation. The tree-based implementation in scikit-learn is slightly better for a large number of points (for small datasets, the overhead of building the tree dominates). Both statsmodels implementations are appreciably slower: in particular, the KDEMultivariate implementation displays a relatively large computational overhead.

However, these benchmarks are not entirely fair to the Statsmodels Univariate algorithm or to the Scikit-learn algorithm. For KDEUnivariate, we have not used the FFT version of the calculation; for Scikit-learn, we've set rtol and atol to zero, effectively asking the algorithm for a perfectly precise computation which allows very few tree nodes to be trimmed from the calculation.

Let's change this, and use the FFT computation for statsmodels, and set rtol=1E-4 for scikit-learn. The latter setting says that we want a faster computation, at the expense of accepting a 0.01% error in the final result:

In [10]:
plot_scaling(N=np.logspace(1, 4, 10),
             rtol=1E-4,
             kwds={'Statsmodels-U': {'fft': True}});


The FFT computation significantly speeds the statsmodels univariate computation in the case of large numbers of points. The real winner here, though, is the Scikit-learn implementation: by allowing errors of 1 part in 10,000, we've sped the computation for the largest datasets by an additional factor of 5 or so, making it an order of magnitude faster than the next best algorithm!

Dependence on rtol

You might wonder what the effect of rtol is on the speed of the computation. In most situations for the scikit-learn estimator, increasing rtol will directly lead to an increase in computational efficiency, as distant tree nodes are able to be trimmed earlier in the computation.

To illustrate this, we'll plot the computation time as a function of rtol, from $$1$$ part in $$10^{16}$$ (effectively the floating point precision on 64-bit machines) all the way up to $$1$$ part in $$10$$ (i.e. 10% error on the results):

In [11]:
plot_scaling(N=1E4,
             rtol=np.logspace(-16, -1, 10),
             bandwidth=0.2);


This plot displays the direct tradeoff between precision and efficiency enabled by Scikit-learn's tree-based algorithms. Any variation in the timing for the other algorithms is simply statistical noise: they cannot take advantage of rtol.

Dependence on Bandwidth

One somewhat counter-intuitive effect of the tree-based approximate KDE is that the bandwidth becomes an important consideration in the time for computation. This is because the bandwidth effectively controls how "close" points are to each other.

When points are very far apart in relation to the kernel size, their contribution to the density is very close to zero. In this range, whole groups of such points can be removed from the computation. When points are very close together in relation to the kernel size, the distance is effectively zero, and whole groups of such points in the tree can be considered, as a group, to contribute the maximal kernel contribution.

The tree-based KDE computation in Scikit-learn takes advantage of these situations, leading to a strong dependence of computation time on the bandwidth: for very small and very large bandwidths, it is fast. For bandwidths somewhere in the middle, it can be slower than other algorithms, primarily due to the computational overhead of building and traversing the tree:

In [12]:
plot_scaling(N=1E4, rtol=1E-4,
             bandwidth=np.logspace(-4, 3, 10));


Fortunately the inefficient bandwidths are generally too large to be useful in practice, and bandwidths in the faster range are favorable for most problems.

Notice that only the Scikit-learn results depend strongly on the bandwidth: the other implementations have constant computation time to within random errors. This is due to the tree-based KDE implementation used by Scikit-learn, for the reasons discussed above. The optimal bandwidth near 0.15 lies in a region of relatively fast computation, especially compared to the alternate algorithms.

Dependence on Kernel

As you might expect, the same principles that lead to the dependence of computation time on bandwidth also lead to a dependence of computation time on the kernel shape used. For faraway points, some kernels have weights much closer to zero than others: in the case of kernels with "hard" cutoffs (such as the tophat kernel), distant points contribute exactly zero to the density, and thus the speedup will be realized even if rtol and atol are zero.

At the opposite extreme, for points which are very close compared to the kernel size, Kernels which are very "flat" (e.g. the tophat kernel) will allow whole groups of points to be considered at once, while kernels which are less flat (e.g. the linear or exponential kernel) will not admit such efficiencies.

We can see this below: here we'll plot the computation time as a function of kernel width for the Scikit-learn implementation, using several kernels:

In [13]:
plot_scaling_vs_kernel(kernels=['tophat', 'linear', 'exponential', 'gaussian'],
                       bandwidth=np.logspace(-4, 3, 10),
                       N=1E4, rtol=1E-4);


Notice the two regions of interest: for very small bandwidths, kernels with a hard cutoff (tophat, linear) out-perform kernels with a broad taper (gaussian, exponential). And tapered kernels which fall off more quickly (gaussian, with $$p \sim \exp(-d^2)$$) are more efficiently computed than kernels which fall off more slowly (exponential, with $$p \sim \exp(-d)$$).

At the other end, kernels with very flat profiles near zero (tophat, gaussian) show improvement for large bandwidths, while kernels with very steep profiles near zero (linear, exponential) show no improvement: they reach the asymptotic limit in which all of the $$\mathcal{O}[MN]$$ distances must be computed.

For good measure, here are the scalings of the computations with rtol and with N:

In [14]:
plot_scaling_vs_kernel(kernels=['tophat', 'linear', 'exponential', 'gaussian'],
                       bandwidth=0.15, N=1E4, rtol=np.logspace(-16, -1, 10));


As we'd expect, for a reasonable kernel size, rtol is not significant for kernels with a hard cutoff, and becomes more significant the wider the "wings" of the kernel are. The scaling with N is also as we'd expect:

In [15]:
plot_scaling_vs_kernel(kernels=['tophat', 'linear', 'exponential', 'gaussian'],
                       bandwidth=0.15, rtol=1E-4, N=np.logspace(1, 4, 10));


This dependence of computation time on bandwidth and kernel shape is an issue to keep in mind as you choose your KDE algorithm: in the case of tree-based approaches, the bandwidth and kernel can matter to the tune of several orders of magnitude in computation time!

Conclusion

I hope all this has been helpful to you: my main takeaway is that under most situations that are applicable to my own research, the Scikit-learn KDE is far superior to the other implementations that are available. There are situations where other choices are appropriate (namely, Statsmodels' KDEMultivariate when your data is heterogeneous, or Scipy's gaussian_kde for exact results with fewer than about 500 points), but scikit-learn will be much faster for most other relevant settings.

Finally, here are some of the functions I used to generate the above plots. If you download this notebook, you'll have to execute this cell before running any of the above cells that use these:

In [7]:
import matplotlib
from collections import defaultdict
from time import time

functions = dict(zip(kde_funcnames, kde_funcs))

def plot_scaling(N=1000, bandwidth=0.1, rtol=0.0,
                 Nreps=3, kwds=None, xgrid=None):
    """
    Plot the time scaling of KDE algorithms.
    Either N, bandwidth, or rtol should be a 1D array.
    """
    if xgrid is None:
        xgrid = np.linspace(-10, 10, 5000)
    if kwds is None:
        kwds = dict()
    for name in functions:
        if name not in kwds:
            kwds[name] = {}
    times = defaultdict(list)

    # broadcast the three parameters against each other; exactly one of
    # them should be a 1D array (checked by the assert below)
    B = np.broadcast(N, bandwidth, rtol)
    assert len(B.shape) == 1

    for N_i, bw_i, rtol_i in B:
        x = np.random.normal(size=int(N_i))
        kwds['Scikit-learn']['rtol'] = rtol_i
        for name, func in functions.items():
            t = 0.0
            for i in range(Nreps):
                t0 = time()
                func(x, xgrid, bw_i, **kwds[name])
                t1 = time()
                t += (t1 - t0)
            times[name].append(t / Nreps)

    fig, ax = plt.subplots(figsize=(8, 6),
                           subplot_kw={'axisbg': '#EEEEEE',
                                       'axisbelow': True})
    ax.grid(color='white', linestyle='-', linewidth=2)
    plot_kwds = {'linewidth': 3, 'alpha': 0.5}

    if np.size(N) > 1:
        for name in kde_funcnames:
            ax.loglog(N, times[name], label=name, **plot_kwds)
        ax.set_xlabel('Number of points')
    elif np.size(bandwidth) > 1:
        for name in kde_funcnames:
            ax.loglog(bandwidth, times[name], label=name, **plot_kwds)
        ax.set_xlabel('Bandwidth')
    elif np.size(rtol) > 1:
        for name in kde_funcnames:
            ax.loglog(rtol, times[name], label=name, **plot_kwds)
        ax.set_xlabel('Relative Tolerance')

    for spine in ax.spines.values():
        spine.set_color('#BBBBBB')
    ax.legend(loc=0)
    ax.set_ylabel('time (seconds)')
    ax.set_title('Execution time for KDE '
                 '({0} evaluations)'.format(len(xgrid)))

    return times

def plot_scaling_vs_kernel(kernels, N=1000, bandwidth=0.1, rtol=0.0,
                           Nreps=3, kwds=None, xgrid=None):
    """
    Plot the time scaling for Scikit-learn kernels.
    Either N, bandwidth, or rtol should be a 1D array.
    """
    if xgrid is None:
        xgrid = np.linspace(-10, 10, 5000)
    if kwds is None:
        kwds = dict()
    times = defaultdict(list)

    # as above, broadcast the parameters; exactly one should be a 1D array
    B = np.broadcast(N, bandwidth, rtol)
    assert len(B.shape) == 1

    for N_i, bw_i, rtol_i in B:
        x = np.random.normal(size=int(N_i))
        for kernel in kernels:
            kwds['kernel'] = kernel
            kwds['rtol'] = rtol_i
            t = 0.0
            for i in range(Nreps):
                t0 = time()
                kde_sklearn(x, xgrid, bw_i, **kwds)
                t1 = time()
                t += (t1 - t0)
            times[kernel].append(t / Nreps)

    fig, ax = plt.subplots(figsize=(8, 6),
                           subplot_kw={'axisbg': '#EEEEEE',
                                       'axisbelow': True})
    ax.grid(color='white', linestyle='-', linewidth=2)
    plot_kwds = {'linewidth': 3, 'alpha': 0.5}

    if np.size(N) > 1:
        for kernel in kernels:
            ax.loglog(N, times[kernel], label=kernel, **plot_kwds)
        ax.set_xlabel('Number of points')
    elif np.size(bandwidth) > 1:
        for kernel in kernels:
            ax.loglog(bandwidth, times[kernel], label=kernel, **plot_kwds)
        ax.set_xlabel('Bandwidth')
    elif np.size(rtol) > 1:
        for kernel in kernels:
            ax.loglog(rtol, times[kernel], label=kernel, **plot_kwds)
        ax.set_xlabel('Relative Tolerance')

    for spine in ax.spines.values():
        spine.set_color('#BBBBBB')
    ax.legend(loc=0)
    ax.set_ylabel('time (seconds)')
    ax.set_title('Execution time for KDE '
                 '({0} evaluations)'.format(len(xgrid)))

    return times

def plot_kernels():
    """Visualize the KDE kernels available in Scikit-learn"""
    fig, ax = plt.subplots(figsize=(8, 6),
                           subplot_kw={'axisbg': '#EEEEEE',
                                       'axisbelow': True})
    ax.grid(color='white', linestyle='-', linewidth=2)
    for spine in ax.spines.values():
        spine.set_color('#BBBBBB')

    X_src = np.zeros((1, 1))
    x_grid = np.linspace(-3, 3, 1000)

    for kernel in ['gaussian', 'tophat', 'epanechnikov',
                   'exponential', 'linear', 'cosine']:
        log_dens = KernelDensity(kernel=kernel).fit(X_src).score_samples(x_grid[:, None])
        ax.plot(x_grid, np.exp(log_dens), lw=3, alpha=0.5, label=kernel)
    ax.set_ylim(0, 1.05)
    ax.set_xlim(-2.9, 2.9)
    ax.legend()


I hope you found this post useful - Happy coding!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

November 20, 2013

Continuum Analytics

Bokeh 0.3 Released

We are pleased to announce the release of version 0.3 of the Bokeh plotting library. This release sees the BokehJS browser runtime merged into the Bokeh project for much simplified building and deployment. We also took the opportunity to fix several bugs, add more examples, and improve the documentation.

November 19, 2013

NeuralEnsemble

PyNN 0.8 beta 1 released

We're very happy to announce the first beta release of PyNN 0.8.

For PyNN 0.8 we have taken the opportunity to make significant, backward-incompatible
changes to the API. The aim was fourfold:

•   to simplify the API, making it more consistent and easier to remember;
•   to make the API more powerful, so more complex models can be expressed with less code;
•   to allow a number of internal simplifications so it is easier for new developers to contribute;
•   to prepare for planned future extensions, notably support for multi-compartmental models.

For a list of the main changes between PyNN 0.7 and 0.8, see the release notes for the 0.8 alpha 1 release.

For the changes in this beta release see the release notes.

The biggest change with this beta release is that we now think the PyNN 0.8 development branch is stable enough to do science with. If you have an existing project using an earlier version of PyNN, you might not want to update, but if you're starting a new project, we recommend using this beta release.

The source package is available from the INCF Software Center

What is PyNN?

PyNN (pronounced 'pine') is a simulator-independent language for building neuronal network models.

In other words, you can write the code for a model once, using the PyNN API and the Python programming language, and then run it without modification on any simulator that PyNN supports (currently NEURON, NEST and Brian).

Even if you don't wish to run simulations on multiple simulators, you may benefit from writing your simulation code using PyNN's powerful, high-level interface. In this case, you can use any neuron or synapse model supported by your simulator, and are not restricted to the standard models.

The code is released under the CeCILL licence (GPL-compatible).

November 16, 2013

Philip Herron

Type Inferencing GCC Rust

As of now, GCC Rust has type inference, so code such as:

fn test () -> int { 1 }

fn main () {
    let x = test ();
    println ("fuck yeah!");
}


gets turned into Dot IL:

fn test ( void ) -> int {
    _rust_retval:
    1;
}

fn main ( void ) -> void {
    let [_final_] x -> [__infer_me] = test ();
    println ("fuck yeah!");
}


And it correctly infers the type:

fn test ( void ) -> int {
    _rust_retval:
    1;
}

fn main ( void ) -> void {
    let [_final_] x -> [int] = test ();
    println ("fuck yeah!");
}


And then finally we get GIMPLE

test ()
{
    signed int <retval>;

    <retval> = 1;
    return <retval>;
}

main ()
{
    void <retval>;
    const signed int x;

    x = test ();
    __GRUST_println ("fuck yeah!");
}


Then some good old asm

	.text
.globl _test
_test:
LFB0:
pushq	%rbp
LCFI0:
movq	%rsp, %rbp
LCFI1:
	movl	$1, %eax
	popq	%rbp
LCFI2:
	ret
LFE0:
	.const
LC0:
	.ascii "fuck yeah!"
	.space 1
	.text
.globl _main
_main:
LFB1:
	pushq	%rbp
LCFI3:
	movq	%rsp, %rbp
LCFI4:
	subq	$16, %rsp
call	_test
movl	%eax, -4(%rbp)
leaq	LC0(%rip), %rdi
call	___GRUST_println
leave
LCFI5:
ret
LFE1:
.section __TEXT,__eh_frame,coalesced,no_toc+strip_static_syms+live_support
EH_frame1:
.set L$set$0,LECIE1-LSCIE1
.long L$set$0
LSCIE1:
.long	0
.byte	0x1
.ascii "zR\0"
.byte	0x1
.byte	0x78
.byte	0x10
.byte	0x1
.byte	0x10
.byte	0xc
.byte	0x7
.byte	0x8
.byte	0x90
.byte	0x1
.align 3
LECIE1:
LSFDE1:
.set L$set$1,LEFDE1-LASFDE1
.long L$set$1
LASFDE1:
.long	LASFDE1-EH_frame1
.set L$set$2,LFE0-LFB0
.quad L$set$2
.byte	0
.byte	0x4
.set L$set$3,LCFI0-LFB0
.long L$set$3
.byte	0xe
.byte	0x10
.byte	0x86
.byte	0x2
.byte	0x4
.set L$set$4,LCFI1-LCFI0
.long L$set$4
.byte	0xd
.byte	0x6
.byte	0x4
.set L$set$5,LCFI2-LCFI1
.long L$set$5
.byte	0xc
.byte	0x7
.byte	0x8
.align 3
LEFDE1:
LSFDE3:
.set L$set$6,LEFDE3-LASFDE3
.long L$set$6
LASFDE3:
.long	LASFDE3-EH_frame1
.set L$set$7,LFE1-LFB1
.quad L$set$7
.byte	0
.byte	0x4
.set L$set$8,LCFI3-LFB1
.long L$set$8
.byte	0xe
.byte	0x10
.byte	0x86
.byte	0x2
.byte	0x4
.set L$set$9,LCFI4-LCFI3
.long L$set$9
.byte	0xd
.byte	0x6
.byte	0x4
.set L$set$10,LCFI5-LCFI4
.long L$set$10
.byte	0xc
.byte	0x7
.byte	0x8
.align 3
LEFDE3:
.subsections_via_symbols


November 15, 2013

Matthew Rocklin

Wordcounting and Verbosity

In a blogpost last month I announced PyToolz, a Python implementation of the functional standard library. Today I want to discuss the wordcounting example in more depth, highlighting the differences between simple/verbose and complex/concise code.

tl;dr: Library code reduces code-length at the cost of universal comprehension. Libraries simplify code for a subset of programmers while alienating others. This is behind the common complaint that functional programming is hard to read. We use word-counting as a case study.

Verbose solution with simple terms

My standard wordcounting function looks like the following:

def stem(word):
    """ Stem word to primitive form

    >>> stem("Hello!")
    'hello'
    """
    return word.lower().rstrip(",.!)-*_?:;$'-\"").lstrip("-*'\"(_$'")

def wordcount(string):
    """ Count the number of occurrences of each word in a string

    >>> sentence = "This cat jumped over this other cat!"
    >>> wordcount(sentence)
    {'cat': 2, 'jumped': 1, 'other': 1, 'over': 1, 'this': 2}
    """
    words = string.split()

    stemmed_words = []
    for word in words:
        stemmed_words.append(stem(word))

    counts = dict()
    for word in stemmed_words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

    return counts


While long/verbose, this solution is straightforward and comprehensible to all moderately experienced Python programmers.

Concise solution with complex terms

Using the definition for stem above and the frequencies function from toolz we can condense wordcount into the following single line.

>>> from toolz import frequencies

>>> frequencies(map(stem, sentence.split()))
{'cat': 2, 'jumped': 1, 'other': 1, 'over': 1, 'this': 2}


While dense, this solution solves the problem concisely using pre-existing functionality.

Increasing readability with pipe

The functional solution above with frequencies(map(stem, sentence.split())) is concise but difficult for many human readers to parse. The reader needs to traverse a tree of parentheses to find the innermost element (sentence) and then work outwards to discover the flow of computation. We improve the readability of this solution with the pipe function from toolz.

To introduce pipe we consider the abstract process of doing laundry:

>>> wet_clothes = wash(clothes)
>>> dry_clothes = dry(wet_clothes)
>>> result = fold(dry_clothes)


This pushes the data, clothes, through a pipeline of functions: wash, dry, and fold. Pushing data through a pipeline of functions is a common pattern. We encode this pattern into toolz.pipe.

>>> from toolz import pipe

>>> result = pipe(clothes, wash, dry, fold)


Pipe pushes data (first argument) through a sequence of functions (rest of the arguments) from left to right. Here is another example.

from toolz import pipe

# Simple example
def double(x):
    return 2 * x

>>> pipe(3, double, double, str)
'12'


Using pipe we can rearrange our functional wordcounting solution to the following form

>>> from toolz.curried import map

>>> # frequencies(map(stem, sentence.split()))
>>> pipe(sentence, str.split, map(stem), frequencies)
{'cat': 2, 'jumped': 1, 'other': 1, 'over': 1, 'this': 2}


This code reads like a story from left to right. We take a sentence, split it into words, stem each word, and then count frequencies. This is sufficiently simple so that I am confident in the correctness of the result after a brief review of the code. There is little room for error.

note: here we used a curried version of map. See the toolz docs for more info.

Discussion

The first solution uses lots of simple words. The second solution uses a few complex words. Just as in natural language there are benefits and drawbacks to a rich vocabulary. The choice of suitable vocabulary largely depends on the audience.

Long solutions of simple words are universally understandable but require reader effort to construct meaning. Most Python programmers can understand the first solution without additional training but will need to expend effort to deduce its meaning. This is like the approach taken by Simple English Wikipedia.

Concise solutions of complex words are not universally understandable but do convey meaning more quickly if the terms are already known by the reader. Additionally if the terms themselves are well tested then these solutions are less prone to error.

A good vocabulary can concisely express most relevant problems with a small number of terms. The functional standard library (e.g. map, groupby, frequencies, …) is such a set. Understanding a relatively small number of terms (around 10-20) enables the concise expression of most common programming tasks. This set was developed and refined across decades of language evolution.
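
To make that concrete, here is a small illustration of my own (not from the original post), reusing the stem function defined above on the same sentence:

from toolz import frequencies, groupby

sentence = "This cat jumped over this other cat!"
words = list(map(stem, sentence.split()))

print(frequencies(words))    # {'cat': 2, 'jumped': 1, 'other': 1, 'over': 1, 'this': 2}
print(groupby(len, words))   # the same words, grouped by their length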

November 14, 2013

Continuum Analytics

Painless Streaming Plots with Bokeh

Bokeh is quickly proving to be the easiest plotting toolkit for web-enabled graphs. Continuum recently released version 0.2, with 0.3 coming shortly. One of the distinguishing features of Bokeh compared to other web-based plotting toolkits is that streaming, as a feature, is not an afterthought. Animated as well as streaming realtime plots on the web should be nearly as easy as static plots — and with Bokeh, it really is that easy. It’s good to remind yourself that Bokeh is built with Python as a first-class citizen — in other words, all of the plotting commands are in Python and NOT Javascript. Bokeh is open-source and our documentation has loads of beautiful interactive plots.

November 13, 2013

Fernando Perez

An ambitious experiment in Data Science takes off: a biased, Open Source view from Berkeley

Today, during a White House OSTP event combining government, academia and industry, the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation announced a $37.8M funding commitment to build new data science environments. This caps a year's worth of hard work for us at Berkeley, and even more for the Moore and Sloan teams, led by Vicki Chandler, Chris Mentzel and Josh Greenberg: they ran a very thorough selection process to choose three universities to participate in this effort. The Berkeley team was led by Saul Perlmutter, and we are now thrilled to join forces with teams at the University of Washington and NYU, respectively led by Ed Lazowska and Yann LeCun. We have worked very hard on this in private, so it's great to finally be able to publicly discuss what this ambitious effort is all about.

Most of the UC Berkeley BIDS team, from left to right: Josh Bloom, Cathryn Carson, Jas Sekhon, Saul Perlmutter, Erik Mitchell, Kimmen Sjölander, Jim Sethian, Mike Franklin, Fernando Perez. Not present: Henry Brady, David Culler, Philip Stark and Ion Stoica (photo credit: Kaja Sehrt, VCRO).

As Joshua Greenberg from the Sloan Foundation says, "What this partnership is trying to do is change the culture of universities to create a data science culture." For us at Berkeley, the whole story has two interlocking efforts:

1. The Moore and Sloan foundations are supporting a cross-institution initiative, where we will tackle the challenges that the rise of data-intensive science is posing.

2. Spurred by this, Berkeley is announcing the creation of the new Berkeley Institute for Data Science (BIDS), scheduled to start full operations in Spring 2014 (once the renovations of the Doe 190 space are completed). BIDS will be the hub of our activity in the broader Moore/Sloan initiative, as a partner with the UW eScience Institute and the newly minted NYU Center for Data Science.

Since the two Foundations, Berkeley and our university partners will provide ample detail elsewhere (see link summary at the bottom), I want to give my own perspective. This process has been, as one can imagine, a complex one: we were putting together a campus-wide effort that was very different from any traditional grant proposal, as it involved not only a team of PIs from many departments who normally don't work together, but also serious institutional commitment. But I have seen real excitement in the entire team: there is a sense that we have been given the chance to tackle a big and meaningful problem, and that people are willing to move way out of their comfort zone and take risks. The probability of failure is non-negligible, but these are the kinds of problems worth failing on. The way I think about it, we're a startup in the Eric Ries sense, a "human institution designed to deliver a new product or service under conditions of extreme uncertainty". Our outcomes are new forms of doing science in academic settings, and we're going against the grain of large institutional priors. We've been given nearly $38M of VC funding and a 5 year runway, now it's our job to make it beyond this.

Saul Perlmutter speaks during our internal launch event on Oct 25, at the location of the newly created BIDS in the campus Doe Library. This area, when renovated, will be set up in a similar fashion for seminars and workshops (photo credit: Kaja Sehrt, VCRO).

Why is "Data Science" a different problem?

The original mandate from our funders identified a set of challenges brought about by the rise of data intensive research to the modern university. I would summarize the problem as: the incentive mechanisms of academic research are at sharp odds with the rising need for highly collaborative interdisciplinary research, where computation and data are first-class citizens. The creation of usable, robust computational tools, and the work of data acquisition and analysis must be treated as equal partners to methodological advances or domain-specific results.

Here are, briefly stated, what I see as the main mechanisms driving this problem in today's academic environment:

• An incentive structure that favors individualism, hyper-specialization and "novelty" to a toxic extreme. The lead-author-paper is the single currency of the realm, and work that doesn't lead to this outcome is ultimately undertaken as pure risk. On the altar of "novelty", intellectual integrity is often sacrificed.

• Collaboration, tool reuse and cross-domain abstractions are punished.

• The sharing of tools that enable new science is discouraged/punished: again, it's much better to put out a few more "novel" first-author papers than to work with others to amplify their scientific outcomes.

• Building a CV spread across disciplines is typically career suicide: no single hiring/tenure committee will understand it, despite the potentially broad and significant impact.

• Most scientists are taught to treat computation as an afterthought. Similarly, most methodologists are taught to treat applications as an afterthought (thanks to Bill Howe, one of our UW partners, for pointing out this mirror problem).

• Rigor in algorithmic usage, application and implementation by practitioners isn't taught anywhere: people grab methods like shirts from a rack, to see if they work with the pants they are wearing that day. Critical understanding of algorithmic assumptions and of the range of validity and constraints of new methods is a rarity in many applied domains.

• Not to fault only domain practitioners, methodologists tend to only offer proof-of-concept, synthetic examples, staying largely shielded from real-world concerns. This means that reading their papers often leaves domain users with precious little guidance on what to actually do with the method (and I'm just as guilty as the rest).

• Computation and data skills are all of a sudden everybody's problem. Yet, many academic disciplines are not teaching their students these tools in a coherent way, leaving the job to a haphazard collage of student groups, ad-hoc workshops and faculty who create bootcamps and elective seminar courses to patch up the problem.

• While academia punishes all the behaviors we need, industry rewards them handsomely (and often in the context of genuinely interesting problems). We have a new form of brain drain we had never seen before at this scale. Jake van der Plas' recent blog post articulates the problem with clarity.

Two overarching questions

The above points frame, at least for me, the "why do we have this problem?" part. As practitioners, for the next five years we will tackle it in a variety of ways. But all our work will be inscribed in the context of two large-scale questions: an epistemological one about the nature of science and research, and a very practical one about the structure of the institutions where science gets done.

1: Is the nature of science itself really changing?

"Data Science" is, in some ways, an over-abused buzzword that can mean anything and everything. I'm roughly taking the viewpoint nicely summarized by Drew Conway's well-known Venn Diagram:

It is not clear to me that a new discipline is arising, or at least that we should approach the question by creating yet another academic silo. I see this precisely as one of the questions we need to explore, both through our concrete practice and by taking a step back to reflect on the problem itself. The fact that we have an amazing interdisciplinary team, that includes folks who specialize in the philosophy and anthropology of knowledge creation, makes this particularly exciting. This is a big and complex issue, so I'll defer more writing on it for later.

2: How do the institutions of science need to change in this context?

The deeper epistemological question of whether Data Science "is a thing" may still be open. But at this point, nobody in their sane mind challenges the fact that the praxis of scientific research is under major upheaval because of the flood of data and computation. This is creating the tensions in reward structure, career paths, funding, publication and the construction of knowledge that have been amply discussed in multiple contexts recently. Our second "big question" will then be, can we actually change our institutions in a way that responds to these new conditions?

This is a really, really hard problem: there are few organizations more proud of their traditions and more resistant to change than universities (churches and armies might be worse, but that's about it). That pride and protective instinct have in some cases good reasons to exist. But it's clear that unless we do something about this, and soon, we'll be in real trouble. I have seen many talented colleagues leave academia in frustration over the last decade because of all these problems, and I can't think of a single one who wasn't happier years later. They are better paid, better treated, less stressed, and actually working on interesting problems (not every industry job is a boring, mindless rut, despite what we in the ivory tower would like to delude ourselves with).

One of the unique things about this Moore/Sloan initiative was precisely that the foundations required that our university officials were involved in a serious capacity, acknowledging that success on this project wouldn't be measured just by the number of papers published at the end of five years. I was delighted to see Berkeley put its best foot forward, and come out with unambiguous institutional support at every step of the way. From providing us with resources early on in the competitive process, to engaging with our campus Library so that our newly created institute could be housed in a fantastic, central campus location, our administration has really shown that they take this problem seriously. To put it simply, this wouldn't be happening if it weren't for the amazing work of Graham Fleming (our Vice Chancellor for Research) and Kaja Sehrt, from his office.

We hope that by pushing for disruptive change not only at Berkeley, but also with our partners at UW and NYU, we'll be able to send a clear message to the nation's universities on this topic. Education, hiring, tenure and promotion will all need to be considered as fair game in this discussion, and we hope that by the end of our initial working period, we will have already created some lasting change.

Open Source: a key ingredient in this effort

One of the topics that we will focus on at Berkeley, building off strong traditions we already have on that front, is the contribution of open source software and ideas to the broader Data Science effort. This is a topic I've spoken at length about at conferences and other venues, and I am thrilled to have now an official mandate to work on it.

Much of my academic career has been spent living a "double life", split between the responsibilities of a university scientist and trying to build valuable open source tools for scientific computing. I started writing IPython when I was a graduate student, and I immediately got pulled into the birth of the modern Scientific Python ecosystem. I worked on a number of the core packages, helped with conferences, wrote papers, taught workshops and courses, helped create a Foundation, and in the process met many other like-minded scientists who were committed to the same ideas.

We all felt that we could build a better computational foundation for science that would be, compared to the existing commercial alternatives and traditions:

• Technically superior, built upon a better language.
• Openly developed from the start: no more black boxes in science.
• Based on validated, tested, documented, reusable code whose problems are publicly discussed and tracked.
• Built collaboratively, with everyone participating as equal partners and worrying first about solving the problems, not about who would be first author or grant PI.
• Licensed as open source but in a way that would encourage industry participation and collaboration.

In this process, I've learned a lot, built some hopefully interesting tools, made incredible friends, and in particular I realized that in many ways, the open source community practiced many of the ideals of science better than academia.

What do I mean by this? Well, in the open source software (OSS) world, people genuinely build upon each other's work: we all know that citing an opaque paper that hides key details and doesn't include any code or data is not really building off it, it's simply preemptive action to ward off a nasty comment from a reviewer. Furthermore, the open source community's work is reproducible by necessity: they are collaborating across the internet on improving the same software, which means they need to be able to really replicate what each other is doing to continually integrate new work. So this community has developed a combination of technical tools and social practices that, taken as a whole, produces a stunningly effective environment for the creation of knowledge.

But a crucial reason behind the success of the OSS approach is that its incentive structure is aligned to encourage genuine collaboration. As I stated above, the incentive structure of academia is more or less perfectly optimized to be as destructive and toxic to real collaboration as imaginable: everything favors the lead author or grant PI, and therefore every collaboration conversation tends to focus first on who will get credit and resources, not on what problems are interesting and how to best solve them. Everyone in academia knows of (or has been involved in) battles that leave destroyed relations, friendships and collaborations because of concerns over attribution and/or control.

It's not that the OSS world is a utopia of perfect harmony, far from it. "Flame wars" on public mailing lists are famous, projects split in sometimes acrimonious ways, nasty things get said in public and in private, etc. Humanity is still humanity. But I have seen first hand the difference between a baseline of productive engagement and the constant mistrust of acrid competitiveness that is academia. And I know I'm not the only one: I have heard many friends over the years tell me how much more they enjoy scientific open source conferences like SciPy than their respective, discipline-specific ones.

We want to build a space to bring the strengths of the university together with the best ideas of the OSS world. Despite my somewhat stern words above, I am a staunch believer in the fundamental value to society of our universities, and I am not talking about damaging them, only about helping them adapt to a rapidly changing landscape. The OSS world is distributed, collaborative, mostly online and often volunteer; it has great strengths but it also often produces very uneven outcomes. In contrast, universities have a lasting presence, physical space where intense human, face-to-face collaboration can take place, and deep expertise in many domains. I hope we can leverage those strengths of the university together with the practices of the OSS culture, to create a better future of computational science.

For the next few years, I hope to continue building great open source tools for scientific computing (IPython and much more), but also to bring these questions and ideas into the heart of the debate about what it means to be an academic scientist. Ask me in five years how it went :)

Partnerships at Berkeley

In addition to our partners at UW and NYU, this effort at Berkeley isn't happening in a vacuum, quite the opposite. We hope BIDS will not only produce its own ideas, but that it will also help catalyze a lot of the activity that's already happening. Some of the people in these partner efforts are also co-PIs in the Moore/Sloan project, so we have close connections between all:

As this initiative was announced, there was a lot of press about it. I've summarized below the main links for convenience. I also keep a bitly bundle with all of them, updated with new content:

Note: as always, this post was written as an IPython Notebook, which can be obtained from my github repository. If you are reading the notebook version, the blog post is available here.

Philip Herron

Language Design and Typing

I’ve done quite a bit of compiler work in the last 5 years, and working on a Rust front-end to GCC I’ve developed some strong opinions on language design.

I love Python, but I just can’t bring myself to write anything like a distributed system in it. I’m not complaining about speed; I tend to agree with Guido that Python’s speed is fine, but people tend to use threading wrongly in such systems. Where Python seriously excels is as glue for C/C++ systems. At work I work on a trading platform; it’s all shared memory and distributed systems, and shared-memory databases are annoying as hell to work with. Never mind that writing tools to do anything simple and nice takes ages. Using Cython and Python we can do _programmatic_ introspection on shared memory at runtime, which is amazing. It allowed me to develop a programmatic regression framework with no annoying Perl scripts parsing STDOUT and STDERR, which is just pain. For example, I am writing regression tests as simply as:

from UTPSuperTest import pyccg
from UTPSuperTest import SuperTest

import datetime

@SuperTest.quickTest
def test_datesSorted ():
    """
    Make sure all dates are in descending order.
    """
    table = pyccg.CCG_attachMarketCalendarTable ()
    tlength = len (table)
    i = 1
    while i < tlength:
        # get each date from the table
        idate1 = map (int, table [i - 1]['Date'].split ('/'))
        idate2 = map (int, table [i]['Date'].split ('/'))
        # turn it into a full python date object
        rd1 = datetime.date (idate1[0], idate1 [1], idate1 [2])
        rd2 = datetime.date (idate2[0], idate2 [1], idate2 [2])
        # let the python date compare do this compare for us and assert
        SuperTest.assertTrue (rd1 >= rd2)
        i += 1


This is a basic example but then we can do more stuff such as:

@SuperTest.requireTest (pristine = True, generateCSV = _generator)

It sorts out system state and how it is loaded, and it allows for shared regression tests with no dependency on the global environment, which is otherwise a nightmare. So everyone can use it — whereas we still have an old system of Perl scripts that only one of us can run, because the environment only works on his machine.
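SuperTest is an in-house tool, so purely as an illustration of the decorator-registry pattern it appears to use, here is a minimal, hypothetical sketch. Only the names quickTest and assertTrue come from the examples above; everything else is invented and is certainly not the real implementation.

_REGISTRY = []

def quickTest (fn):
    # register fn as a test and return it unchanged
    _REGISTRY.append (fn)
    return fn

def assertTrue (cond, msg = "assertion failed"):
    if not cond:
        raise AssertionError (msg)

def runAll ():
    # run every registered test, reporting failures as they happen
    for fn in _REGISTRY:
        try:
            fn ()
            print ("PASS: " + fn.__name__)
        except AssertionError as e:
            print ("FAIL: " + fn.__name__ + ": " + str (e))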

The thing is, Python is truly amazing for all this work on small pieces of code. Then I compare it to a project I wrote at my last company — the base code is here: drdoom. It’s a funny project to explain; it’s easier to just see it. In the end it was mostly just an interesting idea that needs work, and not a great solution to anything, although it is still used at my last company. It basically allows you to write jobs in pure Python, such as:

from slave import BuilderHost, BuilderFactory

class GnuHelloWorld (BuilderHost.LinuxBuilder):
    def __init__ (self):
        steps = { 'setup' : self.setup,
                  'build' : self.build }
        BuilderHost.LinuxBuilder.__init__ (self, steps,
            order = [('setup',
                      ['http://ftp.gnu.org/gnu/hello/hello-2.7.tar.gz']),
                     ('build', None),
                     ])

    def setup (self, args=None):
        self.info ("trying to build gnu hello world from tarball!")
        if args != None:
            self.sourceDir = self.prepareBuild (self.tarball, "BUILDS",
                                                self.SOURCE_TYPE ['TAR_GZ'])
            self.info (self.sourceDir)
        else:
            raise BuilderFactory.BuildException ("builder requires argument url to gnu hello.tar.gz")

    def build (self, args=None):
        self.currDir = self.getWD ()
        self.chdir (self.sourceDir)
        self.evalScript (self.sourceDir, "configure", None)
        self.runShellCommand ("make")
        self.chdir (self.currDir)
        self.info ("looks like build is successful")

__BUILD_JOB = GnuHelloWorld ()
if __name__ == "__main__":
    __BUILD_JOB.start ()
    __BUILD_JOB.__AUTO__ ()
    __BUILD_JOB.running = False


This is what happens when you tell the slave to connect to a master — the master could be another bot, or it could be you. It’s a completely peer-to-peer system, and all of the communication is asynchronous XMPP. A bot could talk to you if you specified yourself as an XMPP client; you could give it commands, the commands happened to be these jobs, and the granularity was such that you could call these functions in a really high-level way and it would do things for you.

It’s a really hackery idea and I can’t set it up here to show you, but believe me it works: my last company used it as a distributed build system. You passed a version to the master bot with a command and it built all the Subversion packages on any platform under the sun, and the error handling worked really well. It was pretty powerful but really hard to maintain, never mind that the code could be a pain in the ass. It was just kind of cool to see all these XMPP bots talking together, building software and reporting back to a master. I was even working on a Flask front-end to the whole project, but then I stopped development on it because writing large Python applications was a pain and the code I wrote was really naive and bad — so don’t use it, lol.

Writing asynchronous code is hard, and it is compounded by having no compile-time type analysis, which causes a lot of issues around testing. Yes, I could use decorators to do type checking, but that is a pain. What I have found is that in glue for C/C++ projects and scripting APIs for native systems (like games), Python really does excel at its job and dynamic typing is great there. Compare this with what Rust lets you write:

let x;
if true {
    x = 1
}
else {
    x = 7
}


And the compiler can do the dataflow analysis — GCC Rust is starting to do this now; it needs more bug fixing, but I will be pushing to master this week with a lot of cool code. What’s good about this example is that the compiler optimizes it and knows that x is an integer. Inference really just saves you a few keystrokes; you can also specify the type, such as:

let myVar:int = 1


So you can still do static typing. What’s interesting is that I believe this is much better than Go’s approach to typing: in Go, type inference is something the user has to be aware of, in that they must use := (from what I remember) to tell the compiler to infer the type. Go is a great language and I really do love it, but I believe there are a few too many ways to do the same thing, which annoys me — never mind the whole GOHOME stuff and how imports work by fetching from GitHub, etc.

I just think new languages really need to consider that type inference is about the only sensible way to develop a type system. Dynamic typing is really bad from a compiler perspective because you just can’t assume anything, to the point that it doesn’t even help programmers: plenty of Python programmers would, in the end, love even a simple variable declaration keyword. Being truly dynamic is nice, but it is also a complete pain sometimes.

November 11, 2013

Philip Herron

My dot emacs and screenshot

So I just thought I would share my dot emacs file. This is my work machine, which is cut off from the internet, so I don’t use Marmalade, and I’ve been tending to prefer this flat setup; the only difference is that I have elpy and rust-mode. I kind of prefer setting them up manually because I don’t really use that many features. I just love Emacs c-mode and the indentation, etc.

(custom-set-variables
 ;; custom-set-variables was added by Custom.
 ;; If you edit it by hand, you could mess it up, so be careful.
 ;; Your init file should contain only one such instance.
 ;; If there is more than one, they won't work right.
 '(custom-safe-themes (quote ("cdf06f979414532f1486b2b9d52218ae12899fc0c914d5c86fd36a2324d5017b" default)))
 '(inhibit-startup-screen t)
 '(initial-scratch-message ";; Happy Hacking redbrain!!")
 '(scroll-bar-mode nil))

;; disable startup message
(setq inhibit-startup-message t)

(require 'auto-complete)
(require 'auto-complete-config)
(ac-config-default)

(require 'cython-mode)

(global-font-lock-mode 1)
(setq font-lock-maximum-decoration t)

;; parentheses highlighting
(show-paren-mode 1)

;; highlight markings
(transient-mark-mode 1)

;; highlight during query
(setq query-replace-highlight t)
;; highlight incremental search
(setq search-highlight t)

(setq-default truncate-lines t)

(setq auto-mode-alist
      (cons '("SConstruct" . python-mode) auto-mode-alist))
(setq auto-mode-alist
      (cons '("SConscript" . python-mode) auto-mode-alist))

(setq c-default-style "linux"
      c-basic-offset 4)
(setq-default indent-tabs-mode nil)

(require 'color-theme)
(color-theme-initialize)

;;(global-hl-line-mode 1)
;; To customize the background color
;;(set-face-background 'hl-line "#330")  ;; Emacs 22 Only

;; Enable line numbers
(global-linum-mode t)
(defface linum-leading-zero
  `((t :inherit 'linum
       :foreground ,(face-attribute 'linum :background nil t)))
  "Face for displaying leading zeroes for line numbers in display margin."
  :group 'linum)

(defun linum-format-func (line)
  (let ((w (length
            (number-to-string (count-lines (point-min) (point-max))))))
    (concat
     (propertize (make-string (- w (length (number-to-string line))) ?0)
                 'face 'linum-leading-zero)
     (propertize (number-to-string line) 'face 'linum))))

(setq linum-format 'linum-format-func)

(setq linum-format "%4d|")
(setq mode-line-format
      (list
       ;; value of `mode-name'
       "%m: "
       ;; value of current buffer name
       "buffer %b, "
       ;; value of current line number
       "line %l "
       "-- user: "
       ;; value of user
       (getenv "USER")))

(custom-set-faces
 ;; custom-set-faces was added by Custom.
 ;; If you edit it by hand, you could mess it up, so be careful.
 ;; Your init file should contain only one such instance.
 ;; If there is more than one, they won't work right.
 '(default ((t (:background "color-234"))))
 '(cursor ((t (:foreground "red"))))
 '(linum ((t (:background "color-237" :foreground "yellow")))))

Then it looks like this:

November 09, 2013

Philip Herron

Just thought I would share a shell prompt built on libreadline from Rust.

#[link_args = "-lreadline"]
extern {
    fn readline (p: *std::libc::c_char) -> *std::libc::c_char;
}

#[fixed_stack_segment]
#[inline(never)]
pub fn rust_readline (prompt: &str) -> Option<~str> {
    let mut retval:Option<~str> = None;
    let cprmt = prompt.to_c_str ();
    do cprmt.with_ref |c_buf| {
        unsafe {
            let ret = c_str::CString::new (readline (c_buf), true);
            let cast = ret.as_str ();
            match cast {
                None => { retval = None }
                _ => {
                    let tmp = cast.unwrap ();
                    retval = Some (tmp.to_str ());
                }
            }
        }
    }
    retval
}


Then to use it in an interpreter loop:

loop {
    let val = rust_readline (">>> ");
    match val {
        None => { break }
        _ => {
            let input = val.unwrap ();
            context.evaluate (input);
        }
    }
}


Rust’s Option<~str> is really clever, and I just love the match statement when writing a lexer/parser.

November 08, 2013

Philip Herron

Coverity and GCC/LLVM

So I just wanted to share a quick note on something I’ve found at work recently. If your company is rich enough to afford a Coverity license and you’re doing C/C++ development, you are literally throwing money down the toilet.

Just this morning I found that almost all — 99% (it misses some dead-code cases) — of the Coverity bugs in our code base (which is _huge_, by the way) can be found by compiling the code with -Wall under the _latest_ stable GCC and LLVM C/C++ compilers, using both. So we have Jenkins jobs that compile our code with both compilers and surface all the warnings, and the cost is ZERO.

I think these old-school ideas of paying insane license fees for Perforce and Coverity are misguided: saving that money, you could use Git, GCC, and LLVM for _free_ and hire more developers with what you save each year!

Anyone who thinks Perforce scales better than Git is probably using Git wrongly.

I was going to show the errors that LLVM and GCC catch compared to Coverity, but after talking to my team I am a little afraid to, since our code is proprietary and I might get sued or something daft.

I swear monolithic proprietary software vendors will die off eventually — the code quality is generally so poor and uncreative that an open-source competitor will eventually come along and destroy them. It baffles me that people can’t see this these days.

</rant>

October 31, 2013

Philip Herron

Compiler Bootstrapping

Compiler bootstrapping is probably a corner of compilers that is deeply mystified, because most of the writing on the subject is by people with a strong computer-science background. So what really is it?

For example, the Haskell compiler is written in Haskell, Rust is written in Rust, and so on. But hang on — how can a compiler for a new language, which has no previous implementation, be written in itself?

If you really dig into it and think about it, the answer stares you in the face. When you compile Rust or the Haskell compiler, they each “bootstrap”. In Haskell’s case there is some C code to bootstrap a mini, restricted Haskell (IIRC), and Rust downloads previous versions of itself, etc. So initially there _had_ to be a version of the language implemented with existing technologies, usually C for portability. GCC, for instance, is written in C/C++ — it now requires a C++ compiler — but you can bootstrap previous versions with a seriously crappy C compiler.

I personally tend to think writing a compiler in its own language is a mistake from a development perspective, at least in the early versions of a language, for two main reasons:

1. It puts people off trying to help out with development; they have to gain a deeper appreciation of the language before they can even understand the code.
2. It’s potentially dangerous: you have to be very strict about whether your compiler uses newer features of the language that previous implementations didn’t have — what do you do then?

I 100% agree with implementing as much of the run-time support code as possible in the target language, because that just makes sense. But, for example, I’ve spent about 3 weeks learning Rust and I am dying to help out with the compiler, yet I just can’t until I learn more of Rust in general. I’ve found writing GCC Rust much simpler because I know C just works; I don’t worry about whether my code will behave correctly.

I just fail to see the practical, pragmatic benefit of going down this route, beyond the admittedly cool point that it proves your language works and gives the designers better introspection into emerging patterns in code generation. I tend to think those patterns should be found by users of the language, which would keep the dialogue going between users and developers. Having said all of this, I am 100% dedicated to working on gcc-python and gcc-rust in my spare time. I am dying to write a few libraries in Rust — I have started, but I need more time to learn Rust before I want to put them on GitHub. I started a basic Lisp interpreter (http://en.wikipedia.org/wiki/Lisp_(programming_language)) in Rust, and I want a rust-xmpp implementation.

October 29, 2013

Continuum Analytics

Python 3 support in Anaconda

Did you know that Anaconda has been supporting Python 3 for more than half a year? While the Anaconda installers use Python 2.7, it is quite easy to create new additional environments that provide Python 3.3 (and soon Python 3.4), as well as many scientific packages. This blog explains how this is done in detail.

October 28, 2013

Luis Pedro Coelho

byteswap0

A while back, I got a bug report for mahotas-imread [1]:

PNG reads in inverted: This image loads in imread 0.3.1 incorrectly. It looks right if I xor it with 255, but I don’t think that’s all that’s wrong.

The first thing I noticed was that this was a 16 bit image. If you’ve been coding for as long as I have, you’ll immediately think: it’s a byte-endianness issue [2]. So, I tested:

imshow(im.byteswap())

and saw the following:

Not good. Then I looked at the hint that the original poster provided, and it did seem to be true: imshow(~f) worked reasonably well. My working hypothesis was thus that there is a flag whereby the PNG data needs to be interpreted after a bit inversion. I also noticed another thing, though:

max_16bit_value = 2**16-1
imshow(max_16bit_value - f)

Also looks decent.

The TIFF format does allow you to specify whether zero is supposed to be white or black. Maybe PNG has a similar “feature.”

I read through the libpng documentation (which is not great), a bit of its source, and online descriptions of the PNG format. Along the way, I noticed that converting the image to TIFF (with ImageMagick) and loading it with imread also gave the wrong result. Perhaps the TIFF reader had the same bug, or perhaps ImageMagick did [3].

Eventually, I realised that PNG files are in network order (i.e., in big-endian format) and the code did not convert them to little-endian. Thus, my initial intuition had been right: it was a byte-swapping error!
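As an aside (my own illustration, not code from the original post), here is a tiny numpy example of how much byte order matters when interpreting raw 16-bit data:

import numpy as np

# The same two bytes read as big-endian vs. little-endian 16-bit integers:
raw = bytes([0x01, 0x00])
print(np.frombuffer(raw, dtype='>u2')[0])  # 256  (network / big-endian order)
print(np.frombuffer(raw, dtype='<u2')[0])  # 1    (little-endian order)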

But in this case, why did imshow(f.byteswap()) result in a mangled image?

I started to suspect that matplotlib had a bug. I tried:

imshow(f.byteswap() / 2.**16)

and it resulted in the correct image being shown.

As it turned out, matplotlib does not do the right thing when given 16 bit files. Thus, I had this figured out. I fixed imread and released version 0.3.2 and closed the bug report.

§

A single bug is often easy to debug, but when you have multiple bugs interacting, it is much more than twice as hard.

§

Hairy details: You may want to stop reading now.

Consider the following identities:

255 == 0xff == -1 (mod 256)
-f == ~f + 1 (two's complement)
2**16 - f == -f + 2**16 == -f (because of overflow)

Thus, it should not be surprising that flipping the bits and subtracting the image gave similar results: in two’s complement, ~f is roughly -f. Not exactly, but close enough that, by eye, it is hard to tell the two apart.

Finally, it all makes sense when you realise that matplotlib assumes that non-8 bit images are floating point and does:

final_image = (input_image * 255)
final_image = final_image.astype(np.uint8)

Because what is multiplying by 255 and keeping only the low 8 bits? It’s the same as multiplying by -1! Thus, matplotlib would effectively multiply by -1 and then take the low-order bits. So it showed a correct-looking image if you pre-multiplied it by -1 (or flipped the bits) and gave it a byteswapped image!
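To make the arithmetic concrete, here is a small numpy sketch (my own illustration, not code from the post): multiplying by 255 and truncating to 8 bits is the same as negating modulo 256.

import numpy as np

f = np.array([0, 1, 1000, 40000, 65535], dtype=np.uint16)

# ~f == -f - 1 (mod 2**16): bit-flipping is almost the same as negation
assert np.all((~f).astype(np.int64) == (2**16 - f.astype(np.int64)) - 1)

# What matplotlib effectively did with a 16-bit image: multiply by 255,
# then keep the low 8 bits. 255 == -1 (mod 256), so this negates the image.
final = (f * 255).astype(np.uint8)
assert np.all(final == ((-f.astype(np.int64)) % 256).astype(np.uint8))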

 [1] People don’t always appreciate how valuable good bug reports are. Seriously, they are a huge help: you are testing the software for me. Unfortunately, either shyness or past bad experiences will often cause people who see something wrong to not report it.
 [2] I now have over 15 years of experience coding (having had a relatively late start [I didn't care much about computers until I was close to college age], I’ve caught up). If there is an area where I really feel that my experience shines through, it is debugging: I’ve seen enough mistakes and errors that my guesses as to what a bug is are more and more accurate (this is true even in code I have not seen).
 [3] One of the reasons I started mahotas-imread was that I had not found a single library that could read the formats I wanted without a lot of bugs. So, I trust no one on this. In this case, the paranoia was unwarranted, as we’ll see.

October 26, 2013

Jake Vanderplas

The Big Data Brain Drain: Why Science is in Trouble

Regardless of what you might think of the ubiquity of the "Big Data" meme, it's clear that the growing size of datasets is changing the way we approach the world around us. This is true in fields from industry to government to media to academia and virtually everywhere in between. Our increasing abilities to gather, process, visualize, and learn from large datasets is helping to push the boundaries of our knowledge.

But where scientific research is concerned, this recently accelerated shift to data-centric science has a dark side, which boils down to this: the skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry. While academia, with typical inertia, gradually shifts to accommodate this, the rest of the world has already begun to embrace and reward these skills to a much greater degree. The unfortunate result is that some of the most promising upcoming researchers are finding no place for themselves in the academic community, while the for-profit world of industry stands by with deep pockets and open arms.

The Unreasonable Effectiveness of Data

In 1960, the physicist Eugene Wigner published his famous essay, The Unreasonable Effectiveness of Mathematics in the Natural Sciences. It expounds on the surprising extent to which abstract mathematical concepts seem to hold validity in contexts far beyond those in which they were developed. After all, who would have guessed that Riemann's 19th century studies in non-Euclidean geometry would form the basis of Einstein's rethinking of gravitation, or that a codification of the rotation groups of abstract solids might eventually lead physicists to successfully predict the existence of the Higgs Boson?

Echoing this, in 2009 Google researchers Alon Halevy, Peter Norvig, and Fernando Pereira penned an article under the title The Unreasonable Effectiveness of Data. In it, they describe the surprising insight that given enough data, often the choice of mathematical model stops being as important — that particularly for their task of automated language translation, "simple models and a lot of data trump more elaborate models based on less data."

If we make the leap and assume that this insight can be at least partially extended to fields beyond natural language processing, what we can expect is a situation in which domain knowledge is increasingly trumped by "mere" data-mining skills. I would argue that this prediction has already begun to pan-out: in a wide array of academic fields, the ability to effectively process data is superseding other more classical modes of research.

Now, I'm not arguing here that domain understanding is entirely obsolete; after all the 10GB/second produced by the Large Hadron Collider (LHC) would be virtually useless apart from a solid theoretical understanding of the particle interactions that produce them, just as the 15TB/night of raw image data produced by the Large Synoptic Survey Telescope (LSST) would have very little to tell us about cosmology absent our theoretical insight into the physical processes driving the expansion of the Universe. But the LHC and LSST reflect the increasingly common situation where scientific results are entirely dependent upon the use of sophisticated methods to analyze large datasets. Indeed, we're finding that even when the data don't quite qualify as "Big", progress in science is increasingly being driven by those with the skills to manipulate, visualize, mine, and learn from data.

The New Breed of Scientist

In some senses, this data-driven research is simply a continuation of past trends. Since we shed Aristotelianism in the 16th-17th centuries, scientific progress has been largely based on empirical experiment and observation. It was Tycho Brahe's unprecedented 16th-century survey of the sky, after all, that led to Kepler's 17th century Laws of planetary motion, and paved the way for Newton's Universal Law of Gravitation and eventually Einstein's General Theory of Relativity. Scientists have always grappled with data; the difference is that today this act of grappling is increasingly central to the scientific process.

The increasing data-centeredness of science, however, is already leading to new approaches to problems: in the era of the LHC and LSST, the most exciting research is being driven by those who have the expertise to apply high-performance data-parallel statistical algorithms to ask interesting questions of huge, community-generated datasets. It is driven by the application of new statistical approaches, of new machine learning algorithms, and of new and faster codes to repeat classic analyses at a previously unattainable scale. In short, the new breed of scientist must be a broadly-trained expert in statistics, in computing, in algorithm-building, in software design, and (perhaps as an afterthought) in domain knowledge as well. From particle physics to genomics to biochemistry to neuroscience to oceanography to atmospheric physics and everywhere in-between, research is increasingly data-driven, and the pace of data collection shows no sign of abating.

The Fundamental Role of Scientific Software

The common thread here is that of scientific software: none of this work happens without the writing of code. Unless that code is well-written, well-documented, and shared openly with the community, the reproducibility paramount to the scientific process will be threatened. Much has been written about the current crisis of irreproducibility in science, about the need for new forms of publication, and new openness of access to research, code, and data. I won't dwell on those issues here.

What I will dwell on is the central role of optimized, specialized software in the analysis and visualization of large datasets, and the direct translation of that to its central role in modern scientific research. My collaborator Gael Varoquaux and his colleagues recently published an editorial arguing this point (see Gael's short summary here), and making the case that open, well-documented, and well-tested scientific code is essential not only to reproducibility in modern scientific research, but to the very progression of research itself. New research cannot build upon past results if those results are simply mentioned in a paper, with the actual process of producing them trapped in undocumented code hidden somewhere in somebody's laptop. As Buckheit and Donoho have written,

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

Making code public might seem like an afterthought, but in general simply releasing code is not enough. As Brandon Rhodes put it at his RuPy 2013 talk, "The moment a program works, it's better to say that it barely works". Making scientific code useful to those beyond the research group that generated it takes a significant amount of investment. This is the incredible value of projects like NumPy, SciPy, Scikit-learn, and others: they give researchers a framework in which their code is shared, peer-reviewed on github, and released for the benefit of the research community.

This brings us to Academia's core problem: despite the centrality of well-documented, well-written software to the current paradigm of scientific research, academia has been singularly successful at discouraging these very practices that would contribute to its success. In the "publish-or-perish" model which dominates most research universities, any time spent building and documenting software tools is time spent not writing research papers, which are the primary currency of the academic reward structure. As a result, except in certain exceptional circumstances, those who focus on reproducible and open software are less likely to build the resume required for promotion within the academic system. And those poor souls whose gifts lie in scientific software development rather than the writing of research papers will mostly find themselves on the margins of the academic community.

To an extent, disconnects like this have always existed. The academic system has always rewarded some skills at the expense of others: teaching is a classic example of an essential skill which is perennially marginalized. But there are two main differences that make the current discussion more worrying:

1. As I've mentioned, the skills now slipping through the cracks of the academic reward structure are the very skills required for the success of modern research.

2. With virtually the entire world utilizing the tools of data-intensive discovery, the same skills academia now ignores and devalues are precisely the skills which are most valued and rewarded within industry.

The result of this perfect storm is that skilled researchers feel an insidious gradient out of research and into industry jobs. While software-focused jobs do exist within academia, they tend to be lower-paid positions without the prestige and opportunity for advancement found in the tenure track. Industry is highly attractive: it is addressing interesting and pressing problems; it offers good pay and benefits; it offers a path out of the migratory rat-wheel of temporary postdoctoral positions, and often even encourages research and publication in fundamental topics. Most importantly, perhaps, industry offers positions with a real possibility for prestige and career advancement. It's really a wonder that any of us stay in the academy at all.

I particularly worry about this in my own field of astronomy and astrophysics. The LSST project is ramping up for first-light toward the end of this decade. Its goal of real-time processing of 30TB of data per night over the course of a decade is incredibly ambitious. To handle this volume of data, the project will likely be looking to hire several dozen data-focused astronomical researchers over the coming years. Given the required skill set, along with the current compensation level and career outlook of engineering-oriented positions in academia, I have some serious doubts about whether the project will be able to attract a sufficient pool of applicants for these positions.

I'm in no way the only person thinking about these issues. I've discussed pieces of this topic with many folks from around the country and the world, and I know that there are policy-makers and funding agencies thinking about these very problems. But the practical question of how to address these concerns looms large. Complaining about the culture of academia seems to be a common past-time of academics: some of what I'm saying here echoes Deirdre McCloskey's Law of Academic Prestige: "the more useful the field, the lower its prestige". Though this was originally coined in lamentation of the low status of essential topics like freshman writing and composition, it seems readily applicable to the current subject.

I would argue that the concept of prestige is the key: the solution to the problem lies in taking deliberate measures within academia to catch-up with industry and increase the prestige of those who work to develop the software tools essential to current data-driven scientific research. There are a few specific things that researchers, funding agencies, and policy leaders can do to promote this. Here are a few ideas:

1. Continue to press the importance of reproducibility in academic publication. Not only is this absolutely essential to the scientific process itself, but reproducibility depends on open, well-documented, and well-written code. Making this code an essential part of the publication process will make those with software skills an essential part of the academic community.

2. Push for a new standard for tenure-track evaluation criteria: one which considers the creation and maintenance of open software along with more traditional activities like publication and teaching. This will remove a main disincentive against investing time producing clean, well-documented, and open code.

3. Create and fund a new academic employment track, from graduate and post-doctoral fellowships to teaching, research, and tenure-track faculty. These positions should particularly emphasize and reward the development of open, cross-disciplinary scientific software tools. Positions like these would present a viable academic career path for those interested in building and maintaining the essential software used by themselves and their colleagues.

4. Increase the pay of post-doctoral scientific research positions. Some might find this idea controversial, but the current situation is absolutely unsustainable. The base postdoctoral salary for NIH positions is under $40,000 per year for someone who has just completed a PhD in their field. This is generously increased to about $50,000 per year after seven (!) years of post-doctoral experience. Those with the skills mentioned in this article could easily ask for several times that compensation in a first-year industry job, and would find themselves working on interesting problems in a setting where their computational skills are utilized and valued.

I fear that without these sorts of changes in the culture of academia itself, the progress of scientific research will be severely handicapped in the coming years.

We live in an exciting time, where the depth and breadth of our scientific understanding of the world around us are being driven by an ever accelerating ability to gather, store, process, and learn from datasets of unprecedented size. To keep up this pace of discovery, the best researchers need incentives to stay within the research community. It's not an easy problem to address, but with a little effort, we can assure the health and sustainability of the scientific research community into the future.

I'm indebted to many colleagues for discussions which prompted these thoughts, in particular Bill Howe and Fernando Perez. Thanks also to my good friend Will Mari (@willthewordguy) for reading over this post and giving helpful feedback.

Filipe Saraiva

Cantor backends for Python2 and Scilab merged in master

A quick update: the Cantor backends for Python 2 and Scilab have now been merged into the master branch. I will do more polishing before the stable release in KDE 4.12. You can follow the current status of development by compiling and testing Cantor from the master branch.

In a related topic, the KDE Edu sprint in A Coruña, Spain, has begun and runs through October 30th. Unfortunately I cannot participate this time, but I expect to go to the next meeting (maybe at Akademy 2014). =)

Luis Pedro Coelho

luispedro

Last few weeks have been a bit hectic and my post queue dwindled, but here are some Saturday links:

1. The I Quit Academia genre. I like the meta-idea. I also notice that the people leaving academia often use the word “corporatization” (or some equivalent) to describe what they dislike about the public or non-profit institutions they are leaving, in contrast to the less bureaucratic for-profit enterprises they are now joining. Should we conclude that corporatization means “becoming less like a 21st century corporation and more like a 20th century bureaucracy”?

2. Do things, write about it. This is about blogging, but the title could apply to science too.

3. Nice colour schemes: http://colorbrewer2.org/ with a Python implementation. Pretty nice.

4. Why Python uses 0-based indices. It’s the closed-open interval that matters, the 0 vs 1 discussion is actually a distraction.

This is one of those things where there is a solution (zero-based) which is not intuitive at first but is actually more intuitive in the long run. With 1-based indices and inclusive slicing, you need to keep track of +1s and -1s all over the place.
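A quick illustration of the point (my own example, not from the linked post): with zero-based indices and half-open slices, splits compose with no +1/-1 bookkeeping.

xs = list(range(10))

# Splitting at any k loses nothing and duplicates nothing:
k = 4
assert xs[:k] + xs[k:] == xs

# The length of a slice is simply stop - start:
assert len(xs[2:7]) == 7 - 2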

October 25, 2013

Philip Herron

Python and me

This is sort of a rant post and I don’t mean to annoy anyone, but I’ve kind of had it with the Python compiler community. I’ve never disliked a community before, but I cannot believe how badly a minority have treated me, and it has really soured me on a lot of things. I think I am just going to carry on out of my own interest and not give a toss about anything else.

Some people believe my compiler is useless because, unlike it, projects such as Shed Skin do some dataflow analysis to try to generate code with C types directly. If a user writes:

x = 1
y = x + 1


You could optimize this as:

int x = 1
int y = x + 1


This sounds great on the whole, but it really does fall apart once you get into things like:

x = 1
y = x + 1
z = [ x, y ]


The problem is that nobody really writes code like the first example in Python (outside of mathematics, where you should use numpy anyway), and when it comes to doing something like putting those values into a list, you cannot optimize it in a sane way: once you start mixing types, you hit a really slow path where you have to convert everything back into Python objects anyway.

I can already hear someone saying: what about generating a vector at run time with element type int?

struct myIntVec {
    size_t len, size;
    int *vector;
};

int x = 1;
int y = x + 1;
struct myIntVec z;

memset (&z, 0, sizeof (struct myIntVec));
vec_push (&z, x);
vec_push (&z, y);


This doesn’t work because where does the optimization end when you want to do something like:

x = 1
y = x + 1
def test (): pass
z = [ x, y , test]


Python lets you do crazy things like this, but you cannot then map that onto an int on the C stack — it doesn’t work. In the end you are only adding to the amount of type conversion that Python has to do behind the scenes. And where does this kind of optimization stop?

But then I hear: you could namespace things into mangled names on the C stack, so a variable x could have multiple declarations, one for each type that gets assigned to it, like:

int x_prime1 = 1
void test (void): pass
PyObject * x_prime2 = fold_static_method (&test);


Yeah, you could do this, but you really are getting into mental gymnastics, and the code to implement all of it gets really out of hand. It is impossible to fully static-type Python ahead of time. There was a paper published on this a while back, but it basically meant running the program several times and doing data-flow analysis at runtime to generate the code, and in the end that isn’t feasible either.

Consider the trivial example of:

def add (x, y):
    return x + y


Then, to show you how undetermined Python’s types can be, look at:

pherron@daxtsutp01 {~} $ python
Python 2.7.5 (default, Oct 16 2013, 10:09:34)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> def add (x, y):
...     return x + y
...
>>> add (1,2)
3
>>> add ('str', 'ing')
'string'
>>> add ([1,2], [3,4])
[1, 2, 3, 4]

What this shows is that you cannot infer any typing for this function at any stage. Python is full of this kind of code, so with the kind of optimizations you get by compiling to native types you will actually slow everything down in the end, because you end up doubling up on the type conversions you have to do at runtime. Most people don’t seem to see this, and I can understand why: it sounds great in the blog posts you read, but in reality it doesn’t work on everyday code outside of toy examples, and this is a crucial distinction I make between gccpy and Nuitka/Shed Skin/PyPy. I am not about optimizing these crazy cases; I am about making the simplest path for making your code just work.

I hear you cry “JITs” now, but I argue this isn’t as powerful as you might think. Optimizing a language at runtime is hard to do, and on top of that PyPy spend most of their time writing their custom backends and target-CPU optimizations rather than language optimizations — LLVM and GCC already know these. So PyPy’s optimizations aren’t really about Python as much as you might want to think. Sure, at some point they can notice that a certain function only ever seems to get called with ints and re-compile it accordingly, but then what happens when a new function is introduced at runtime that it didn’t know about (and this does happen in Python)? It re-compiles back to what it was. So if you think the benchmarks on their blog mean anything outside of toy projects, you have to ask yourself whether they really care about that, or whether they just like the attention they get at the moment. Outside of toy benchmarks, it will not be faster.

Optimizing Python with static typing isn’t really a valid way to optimize a dynamic language. You can do it to an extent, as I do with classes, but that’s about all you can do. So, back to classes: what I do is, if you have a class:

class myClass:
    def __init__ (self):
        self.x = 1
        self.y = [ x, x + 1 ]
        self.z = "string"

    def someMember (self, x):
        self.y.append (x)
        # You get the idea: some random class

gccpy compiles this into a full type:

struct myClass {
    object x
    object y
    object z
    object __init__
    object someMember
    object extras_list  # implemented for the "class myclass: pass" kind of case,
                        # members added on the fly
};

This means that in certain cases — mostly with __init__ and the field initialization — all data access is pure offset access, just as fast as access on the C stack, as if you were writing C. Python hides this, but so much of variable access is dictionary lookup (with some special cases on locals, which are offset-based arrays); struct and C-stack access blows this out of the water. And remember, unlike virtual machines or interpreters, I don’t maintain a parser at runtime and there is no eval loop, which all virtual machines — Java, PyPy, CPython — have:

while (running) {
    switch (BYTE_CODE_TYPE (code)) {
        case ACCESS_LOCAL:
            do_something ();
            break;
        ....
        default:
            error ("unknown bytecode");
            break;
    }
}

You would be surprised at the amount of jumping around that has to happen before _your_ code gets executed.
Think about loops or function calls — ever wonder why Python has a recursion limit? Before your function gets called, loads of C functions get called first. When you want to loop from 1 to 10, the interpreter goes through a whole bunch of bytecodes to figure out and set all of this up before it actually starts looping, whereas compiled loops and constructs like for and while are implemented as just a bunch of jmp and cmp instructions. I can’t wait until a year from now to show benchmarks on _REAL_ code. PyPy love to benchmark toy Python code against C code, compile the C with _no_ optimizations, and then say “look, we’re faster than C” because they implement a tiny optimization that only applies to a handful of people’s code in the world.

I’m kind of hostile to PyPy, mostly because I don’t care anymore; I’ve kept this to myself for years. When my project started, a PyPy developer (fijal) came onto the GCC IRC channel and said three things to me and other GCC developers:

• “Can’t believe you let this project be part of GCC”
• “You simply don’t know enough about Python you will fail”
• “You should stop what your doing and join a real project PyPy because this is a dead end”

I can’t get over the self-indulgence of saying something like that to someone you know nothing about. Yes, he may be more experienced; at the time I wasn’t sure what to say, to be honest, but now I feel I know what I am talking about. I’ve written my own interpreter for a toy language that looked like make and Python (with type inference), I’ve written my own JIT and virtual machine for it, and I’ve written a full compiler with backend and optimizations for Modula. I think I know compilers to some extent now, so I don’t feel afraid of that kind of thing anymore. And as much as I criticise PyPy, I do look up to them: I believe JITs are the way forward, and nobody has demonstrated otherwise, so there is no solid proof either way.

There are some areas where gccpy will shine, because there are several restrictions on the Python code that people will need to know about, and this is where it is different. Classes will work as normal in user code, but under the hood there is a lot of work to generate a type so that data access is very fast. The caveat is when you do something like:

class myClass: pass

x = myClass ()
x.member1 = 1
x.member2 = "string"

This is slower because no type can be generated, so member access isn’t as fast as it could be. It will still work (at the moment it doesn’t, but it will). So users should write:

class myClass:
    member1 = None
    member2 = None

x = myClass ()
x.member1 = 1
x.member2 = "string"

This is better code anyway: I seriously doubt people write empty class declarations in production code, because it is just bad coding — use a dict if you are going to do that, it has the same effect. The thing is, after all of this I am starting to think Python really should have a var statement for variable declaration; it is very difficult to read code and fix bugs in big Python projects because of how dynamic Python can be. And I am not sure dynamic typing is really that useful to programmers. Like Go and Rust, I think type inference is the way forward for programming languages: you don’t have to spell out your types, because the compiler can figure them out from what you are assigning.
Type inference also allows for a lot more aggressive optimization. And anyway, so many people — including me — dislike that in Python you cannot enforce types on function parameters. How often do you get the error that a function wanted a string or an int and you got a dictionary? Either you write a huge pile of code to extract data out of those types, or you just fail; it really isn’t elegant. You could argue this is where unit testing should take over, but a problem like that wouldn’t arise in the first place with parameter types. So, for example, I am planning a GNU extension to Python in the future such that you could write:

def myfunction (Int x, String y): ...

You could implement this quite easily, so at run time you would get better errors — how often does your code somehow magically work but mess everything up, and it is not clear why? Something like this could be really useful for a Python API. It would be nice to give a hint about what a function returns, but you can’t really do anything with that information at compile time or runtime, so specifying a return type is sort of pointless; having it in the code documentation, though, should be mandatory.

Another thing: eval and exec will not be implemented. I think it is a good design decision, as they are not really part of the language, which most people online don’t seem to understand; if your code relies on them, you are doing something wrong. Eval and exec are developer functions that interpreter and virtual machine people use to write unit tests for their systems. You can do some funky things with them, I know, but it is seriously bad programming, a security risk, very dangerous, and it can lead to a lot of undefined behaviour, so I am just not going to implement them, period. Partly because of AOT compilation I am not going to JIT the code in (I could with GNU lightning, but it is mostly pointless to implement a runtime parser). And in the end I bet 90% of people using Python don’t know about or use eval or exec.

That’s it — native modules in the Python API won’t work without a compatibility layer, but that is way off in the future. Currently I am taking a break from gccpy for a few weeks, and I am going to write as much as I can of a Rust front-end to GCC in 3 weeks. A lot of it is there already; GCC is a really well-suited compiler for Rust, and I am very excited about the results of the 2-and-a-bit weeks of work left on it. I mostly want to see how long it takes me to get a lot of it compiling, minus Rust’s pre-processor.

October 23, 2013

NeuralEnsemble

Mozaik - an integrated workflow framework for large scale neural simulations (0.1 release)

Mozaik is intended to improve the efficiency of computational neuroscience projects by relieving users from writing boilerplate code for simulations involving complex heterogeneous neural network models, complex stimulation and experimental protocols, and subsequent analysis and plotting. Mozaik integrates the model, experiment and stimulation specification, simulation execution, data storage, data analysis and visualization into a single automated workflow, ensuring that all relevant metadata are available to all workflow components. It is based on several widely used tools, including PyNN, Neo and Matplotlib. It offers a declarative way of specifying models and recording configurations, using hierarchically organized configuration files.
To install the stable 0.1 version run: pip install mozaik

The code repository with the latest development version is at https://github.com/antolikjan/mozaik

The Mozaik homepage, with full documentation, is http://neuralensemble.org/mozaik/

October 22, 2013

Enthought

PyQL and QuantLib: A Comprehensive Finance Framework

Authors: Kelsey Jordahl, Brett Murphy

Earlier this month at the first New York Finance Python User’s Group (NY FPUG) meetup, Kelsey Jordahl talked about how PyQL streamlines the development of Python-based finance applications using QuantLib. There were about 30 people attending the talk at the Cornell Club in New York City. We have a recording […]

October 19, 2013

William Stein

Jason Grout's description of the Sagemath Cloud

Jason Grout's description of the Sagemath Cloud:

William Stein, the lead developer of Sage, has been developing a new online interface to Sage, the Sage Cloud at https://cloud.sagemath.com. Currently in beta status, it is already a powerful computation and collaboration tool. Work is organized into projects which can be shared with others. Inside a project, you can create any number of files, folders, Sage worksheets, LaTeX documents, code libraries, and other resources. Real-time collaborative editing allows multiple people to edit and chat about the same document simultaneously over the web. The LaTeX editor features near real-time preview, forward and reverse search, and real-time collaboration. Also, it is easy to have Sage do computations or draw figures and have those automatically embedded into a LaTeX document using the SageTeX package (for example, after including the sagetex package, typing \sageplot{plot(sin(x))} in a TeX document inserts the plot of sin(x)). A complete Linux terminal is also available from the browser to work within the project directory. Snapshots are automatically saved and backed up every minute to ensure work is never lost. William is rapidly adding new features, often within days of a user requesting them.

October 18, 2013

Luis Pedro Coelho

nr_lines_python

Why Python is Better than Matlab for Scientific Software

This is an argument I made at EuBIAS when arguing for the Python numpy-stack for bioimage informatics. I had a few conversations around it and decided to write a longer post summarizing the whole argument.

0. My argument mostly applies to new projects

If you have legacy code in MATLAB, then it may make sense to just continue using it. If it works, don’t fix it. However, if your Matlab code keeps causing you pain, Python might be a solution. Note too that porting code is not the same as writing from scratch: you can often convert code from MATLAB to Python in a small fraction of the time it would take you to start from scratch.

1. Python has caught up with Matlab and is in the process of overtaking it

This is my main argument: the momentum is in Python’s direction. Even two or three years ago, Python was behind. Now, it’s sailing past Matlab. This graph shows the number of lines of code in some important projects for bioimage informatics/science in general. As you can see, the base projects at the top (numpy and matplotlib) have been stable for some years, while the more applied packages at the bottom have exploded in recent years. Depending on what you are doing, Python may already support it better.
It is now Matlab which is playing catch-up with open source software (for example, Matlab is now introducing its own version of the Dataframe, which Python has through Pandas [itself a version of R's Dataframe object]). The Python projects are also newer and tend, therefore, to be programmed in a more modern way: it is typical to find automated testing and excellent, complete documentation, &c. This ain’t your grandfather’s open source with a dump on SourceForge and a single README file full of typos. As an example of the large amount of activity going on in the Python world, just this week Yhat released ggplot for Python [1]. So, while last week I was still pointing to plotting as one of the weaknesses of Python, that might no longer be true.

2. Python is a real programming language

Matlab is not; it is a linear algebra package. This means that if you need to add some non-numerical capabilities to your application, it gets hairy very fast. For scientific purposes, when writing a small specialized script, Python may often be the second best choice: for linear algebra, Matlab may have nicer syntax; for statistics, R is probably nicer; for heavy regular expression usage, Perl (ugh) might still be nicer; if you want speed, Fortran or C(++) may be a better choice. To design a webpage, perhaps you want node.js. Python is not perfect for any of these, but it is acceptable for all of them. In every area, specialized languages are the best choice, but Python is the second best in more areas [2].

3. Python can easily interface with other languages

Python can interface with any language that can be reached through C, which is most languages. There is a missing link to some important Java software, but some work is being done to address that too. Technically, the same is true of Matlab. However, the Python community, and especially the scientific Python community, has been extremely active in developing tools to make this as easy as you’d like (e.g., Cython). Therefore, many tools/libraries in C-based languages are already wrapped in Python for you.

4. With Python, you can have a full open-source stack

This means that you are allowed to, for example, ship a whole virtual machine image with your paper. You can also look at all of the code that your computation depends on. No black boxes.

5. Matlab licensing issues are a pain. And expensive.

Note that I left the word expensive to the end, although in some contexts it may be crucial. Besides the basic MATLAB licenses, you will often need to buy a few more licenses for specific toolboxes. If you need to run the code on a cluster, often that will mean more licenses. However, even when you do have the money, this does not make the problem go away: now you need to administer the licensing. When I was at CMU, we had campus-wide licenses and, yet, it took a while to configure the software on every new user’s computer (with some annoying minor issues like the fact that the username on your private computer needed to match the username you had on campus); you couldn’t run it outside the network (unless you set up a VPN, but this still means you need network access to run a piece of software), &c. Every so often, the license server would go down and stop everybody’s work. These secondary costs can be as large as the licensing costs. Furthermore, using Python means you can more easily collaborate with people who don’t have access to Matlab.
It also protects you from a future version of yourself who decides to economize on Matlab licenses (or, if the number of users shrinks and your institution decides to drop the campus licensing, you will not be one of the few groups now forced to buy it out of pocket). By the way, if you do want support, there are plenty of options for purchasing it [3]: larger companies as well as individual consultants available in any city. The issue of support is orthogonal to the licensing of the software itself.

§

Python does have weaknesses, of course; but that’s for another post.

 [1] Yhat is a commercial company releasing open-source software, by the way; for those keeping track at home.
 [2] I remember people saying this about C++ back in the day.
 [3] However, because science is a third-world economy, it may be easier to spend 10k on Matlab licenses which come with phone support than to spend 1k on a local Python consultant.

October 17, 2013

Matthew Rocklin

Introducing PyToolz

The PyToolz project extends itertools and functools to provide a set of standard functions for iterators, functions, and dictionaries.

tl;dr – PyToolz provides good functions for core data structures. These functions work together well. Here is a partial API:

groupby, unique, isiterable, intersection, frequencies, get, concat, isdistinct, interleave, accumulate, first, second, nth, take, drop, rest, last, memoize, curry, compose, merge, assoc

Why?

Two years ago I started playing with functional programming. One powerful feature of functional languages oddly stuck out as having very little to do with FP in general: modern functional languages often have really killer standard libraries for dealing with iterators, functions, and dictionaries. This standard function set doesn’t depend on macros, monads, or any other mind-bending language feature understandable only to LISP-ers or Haskell-ites. It only requires higher-order functions and lazy iterators, both of which Python does quite well.

This is well known. The libraries itertools and functools are supposed to fill this niche in the Python ecosystem. Personally I’ve found these libraries to be useful but often incomplete (although the Python 3 versions are showing signs of improvement). To fill these gaps we started hacking together the libraries itertoolz and functoolz, which were modeled largely after the Clojure standard library. These projects were eventually merged into a single codebase, named toolz, which is available for your hacking pleasure at http://github.com/pytoolz/toolz/.

Official

The official description of Toolz from the docs is as follows:

The Toolz project provides a set of utility functions for iterators, functions, and dictionaries. These functions are designed to interoperate well, forming the building blocks of common data analytic operations. They extend the standard libraries itertools and functools and borrow heavily from the standard libraries of contemporary functional languages.

Toolz provides a suite of functions which have the following virtues:

• Composable: They interoperate due to their use of core data structures.
• Pure: They don’t change their inputs or rely on external state.
• Lazy: They don’t run until absolutely necessary, allowing them to support large streaming data sets.

This gives developers the power to write powerful programs to solve complex problems with relatively simple code which is easy to understand without sacrificing performance.
Toolz enables this approach, commonly associated with functional programming, within a natural Pythonic style suitable for most developers.

This project follows in the footsteps of the popular projects Underscore.js for JavaScript and Enumerable for Ruby.

Examples

Word counting is a common example used to show off data processing libraries. The Python version that leverages toolz demonstrates how the algorithm can be deconstructed into the three operations of splitting, stemming, and frequency counting:

>>> from toolz import *

>>> def stem(word):
...     """ Stem word to primitive form """
...     return word.lower().rstrip(",.!:;'-\"").lstrip("'\"")

>>> wordcount = compose(frequencies, partial(map, stem), str.split)  # Function
>>> sentence = "This cat jumped over this other cat!"                # Data
>>> wordcount(sentence)
{'this': 2, 'cat': 2, 'jumped': 1, 'over': 1, 'other': 1}

There are many solutions to the word-counting problem. What I like about this one is that it breaks the problem down into a composition of three fundamental operations:

1. Splitting a text into words – (str.split)
2. Stemming those words to a base form so that 'Hello!' is the same as 'hello' – (partial(map, stem))
3. Counting occurrences of each base word – (frequencies)

Toolz provides both common operations for iterators (like frequencies for counting occurrences) and common operations for functions (like compose for function composition). Using these together, programmers can describe a number of data analytic solutions clearly and concisely.

Here is another example performing analytics on the following directed graph:

>>> from toolz.curried import *
>>> a, b, c, d, e, f, g = 'abcdefg'
>>> edges = [(a, b), (b, a), (a, c), (a, d), (d, a), (d, e),
...          (e, f), (d, f), (f, d), (d, g), (e, g)]

>>> # Nodes
>>> set(concat(edges))
{'a', 'b', 'c', 'd', 'e', 'f', 'g'}

>>> # Out degree
>>> countby(first, edges)
{'a': 3, 'b': 1, 'd': 4, 'e': 2, 'f': 1}

>>> # In degree
>>> countby(second, edges)
{'a': 2, 'b': 1, 'c': 1, 'd': 2, 'e': 1, 'f': 2, 'g': 2}

>>> # Out neighbors
>>> valmap(compose(list, map(second)),
...        groupby(first, edges))
{'a': ['b', 'c', 'd'], 'b': ['a'], 'd': ['a', 'e', 'f', 'g'],
 'e': ['f', 'g'], 'f': ['d']}

>>> # In neighbors
>>> valmap(compose(list, map(first)),
...        groupby(second, edges))
{'a': ['b', 'd'], 'b': ['a'], 'c': ['a'], 'd': ['a', 'f'],
 'e': ['d'], 'f': ['e', 'd'], 'g': ['d', 'e']}

Learning a small set of higher-order functions like groupby, map, and valmap gives a surprising amount of leverage over this kind of data. Additionally, the streaming nature of many (but not all) of the algorithms allows toolz to perform well even on datasets that do not fit comfortably into memory. I routinely process large network datasets at my work and find toolz to be invaluable in this context.

For More Information

October 16, 2013

Enthought

Exploring NumPy/SciPy with the “House Location” Problem

Author: Aaron Waters

I created a Notebook that describes how to examine, illustrate, and solve a geometric mathematical problem called “House Location” using Python mathematical and numeric libraries. The discussion uses symbolic computation, visualization, and numerical computations to solve the problem while exercising the NumPy, SymPy, Matplotlib, IPython and SciPy packages. I hope that this […]

October 12, 2013

William Stein

"A Symphony of Cursors" (guest post by Jason Grout)

Today's post is from guest blogger Jason Grout, lead developer of the Sage Cell Server.
The other day some students and I met to do some development on the Sage cell server. We each opened up our shared project on cloud.sagemath.com on our own laptops, and started going through the code. We had a specific objective. The session went something like this:

Jason: Okay, here's the function that we need to modify. We need to change this line to do X, and we need to change this other line to do Y. We also need to write this extra function and put it here, and change this other line to do Z.

James: can you do X?

David: can you look up somewhere on the net how to do Y and write that extra function? I'll do Z.

Then in a matter of minutes, cursors scattering out to the different parts of the code, we had the necessary changes written. I restarted the development sage cell server running inside the cloud account and we were each able to test the changes. We realized a few more things needed to be changed, we divided up the work, and in a few more minutes each had made the necessary changes.

It was amazing: watching all of the cursors scatter out into the code, each person playing a part to make the vision come true, and then quickly coming back together to regroup, reassess, and test the final complete whole. Forgive me for waxing poetic, but it was like a symphony of cursors, each playing their own tune in their lines of the code file, weaving together a beautiful harmony.

This fluid syncing William wrote takes distributed development to a new level. Thanks!

October 10, 2013

Luis Pedro Coelho

road_signs

I changed the regular schedule of the posts because I wanted to write down these ideas. A few days ago, in a panel at EuBIAS, I argued again that scientists should learn how to program. I also argued that usability of bioimage analysis was a false hope.

Now, to be sure: usability is great, but usability does not mean usable without programming skills. Good usable programming environments can be the most usable way to achieve something [1]. I find the Python environment one of the most usable for data analysis currently, although there is still a lot of work which could improve it.

§

We can build communication systems without words, but only if the vocabulary is very limited. Otherwise, people need to learn how to read [2]. I think this is a good analogy for non-programming environments.

§

The problem is that image analysis (or data analysis) is not a closed goal. Whatever we are doing today will probably be packaged into simple-to-use tools, but the problems will grow in size and complexity. For a fixed target, like sending email or writing a blog, we can build nice tools that don’t require programming. Any modern email client basically does email well enough. There is probably only a small set of behaviours we want our blogs to support (like scheduling a post) and I think we can get a small set of features that covers 90%+ of uses (small here might mean a few thousand plugins). There is no constant pressure to do 10 times more.

But data analysis is not in the same category as sending email. It’s an open-ended problem, which will grow continuously, which has been growing continuously. Only a full-blown artificial intelligence system will be able to deal with the sort of analysis that we will want to do in 10 years (or that we already want to do, but are unable to because we do not yet have the technology).

§

If anything, I have felt more and more of a need to think in low-level terms [3]. A few years ago, push-button analysis was sufficient for most problems.
Load your data into Excel, select the rows, and plot. Fit a line, compute some stats. Stata gave you a bit more power if Excel did not suffice. Now, the problems have grown and push-button solutions do not scale. Not only do we have more data, we have more complex, more unstructured data.

A few years ago, people were writing things like “feel free to use interpreted languages, it doesn’t matter that you’re losing performance compared to C; computers are super-fast, waste them”. Now, there is much more interest in building implementations that are as fast as C (normally using just-in-time compilation). This will not get better, and just saying that tools should be easier for non-programmers is missing the point.

§

Programming is like writing: a general purpose technological skill which transforms all activities. And this means that, eventually, it becomes useful (or even necessary) for many activities which are outside the core of programming (who’d have thought a salesperson would have to know how to read and write? A firefighter?). Almost any job that does not require programming is one which can be done by a robot. Except entertainment and those jobs that Tyler Cowen, for lack of a better word, calls marketing. Tyler calls them marketing, but prostitution might be just as accurate, as it is about providing not a specific service or product, which could be provided by a machine, but the general positive feeling that comes from human contact [4].

Related

Bayes and Big Data by Cosma Shalizi

 [1] If you wish, read scripting for programming. I never cared much for this division.

 [2] If you google for traffic signs you’ll see that actually most images have at least one sign with words or images.

 [3] The need to manage parallelism (as our cores multiply, but do not get faster) and memory access patterns (as data grows faster than RAM) has forced me to think about exactly what is happening in my machines.

 [4] Obviously, Tyler is right to use the word marketing even if it’s not a good fit. Prostitution has a strong negative charge.

October 07, 2013

Enthought

Advanced Cython Recorded Webinar: Typed Memoryviews

Author: Kurt Smith

Typed memoryviews are a new Cython feature for accessing memory buffers, such as NumPy arrays, without any Python overhead. This makes them very useful for manipulating blocks of memory in Cython directly without calling into the Python-C API. Typed memoryviews have a clean declaration syntax and have a NumPy-like look and feel, […]

Luis Pedro Coelho

luispedro

This week, I’m in Barcelona to talk about mahotas at the EuBIAS 2013 Meeting. You can get a preview of my talk here. The title is “Mahotas and the Python Scientific Ecosystem for Bioimage Analysis” and you’ll see that it does not talk exclusively about mahotas, but about the whole ecosystem. Comments are welcome (especially if they come in the next 24 hours).

In preparation, I released new versions of mahotas (and the sister package mahotas-imread) yesterday: there are a few bugfixes and small improvements throughout.

October 04, 2013

William Stein

Backing up the Sagemath Cloud

The terms of usage of the Sagemath Cloud say "This free service is not guaranteed to have any uptime or backups." That said, I do actually care a huge amount about backing up the data stored there, and ensuring that you don't lose your work.

Bup

I spent a lot of time building a snapshot system for user projects on top of bup.
Bup is a highly efficient de-duplicating compressed backup system built on top of git; unlike other approaches, you can store arbitrary data, huge files, etc. I looked at many open source options for making efficient de-duplicated distributed snapshots, and I think bup is overall the best, especially because the source code is readable. Right now https://cloud.sagemath.com makes several thousand bup snapshots every day, and it has practically saved people many, many hours in potentially lost work (due to them accidentally deleting or corrupting files). You can access these snapshots by clicking on the camera icon on the right side of the file listing page.

Some lessons learned when implementing the snapshot system

• Avoid creating a large number of branches/commits -- creating an almost-empty repo, but with say 500 branches, even with very little in them, makes things painfully slow, e.g., due to an enormous number of separate calls to git. When users interactively get directory listings, it should take at most about 1 second to get a listing, or they will be annoyed. I made some possibly-hackish optimizations -- mainly caching -- to offset this issue, which are here in case anyone is interested: https://github.com/williamstein/bup (I think they are too hackish to be included in bup, but anybody is welcome to them.)

• Run a regular test of how long it takes to access the file listing in the latest commit, and if it gets above a threshold, create a new bup repo. So in fact the bup backup daemons really manage a sequence of bup repos. There are a bunch of these daemons running on different computers, and it was critical to implement locking, since in my experience bad things happen if you try to back up an account using two different bups at the same time. Right now, typically a bup repo will have about 2000 commits before I switch to another one.

• When starting a commit, I wrote code to save information about the current state, so that everything could be rolled back in case an error occurs, due to files moving, network issues, the snapshot being massive due to a nefarious user, power loss, etc. This was critical to avoid the bup repo getting corrupted, and hence broken.

• In the end, I stopped using branches, due to complexity and inefficiency, and just make all the commits in the same branch. I keep track of what is what in a separate database. Also, when making a snapshot, I record the changed files (as output by the command mentioned above) in the database with the commit, since this information can be really useful, and is impossible to get out of my backups, due to using a single branch, the bup archives being on multiple computers, and also there being multiple bup archives on each computer. NOTE: I've been recording this information for cloud.sagemath for months, but it is not yet exposed in the user interface; it will be soon.

Availability

The snapshots are distributed around the Sagemath Cloud cluster, so failure of single machines doesn't mean that backups become unavailable. I also have scripts that automatically rsync all of the snapshot repositories to machines in other locations, and keep offsite copies as well. It is thus unlikely that any file you create in cloud.sagemath could just get lost. For better or worse, it is also impossible to permanently delete anything. Given the target audience of mathematicians and math students, and the terms of usage, I hope this is reasonable.
October 01, 2013

Luis Pedro Coelho

luispedro

I would highly recommend the book “Building Machine Learning Systems with Python”, on Packtpub or on Amazon.

September 28, 2013

Arink Verma

Summer of 2013 with Numpy and Google

GSoC is a learning process - we learn to work together to solve a big problem by breaking it down into smaller chunks. This year, 2013, I got the chance to work as a student contributor at Numpy, and the experience was worth every bit of effort. I joined Numpy, a community that's quite friendly and helpful. I passed the final evaluation, but I value the experience of being part of this program more than the outcome (of course, passing makes me happy). Nothing could be more inspiring and motivational than working with people from almost every part of the globe, connected together just because of their love and passion for code and for developing amazing things. I also got the chance to learn from such experienced and enthusiastic developers.

I have written the summary of the project in two posts: Profiling and Bottle-necks. Together with my mentor Charles R Harris and the whole community, we successfully achieved what we planned. Project information and the final code can be seen here [https://google-melange.appspot.com/gsoc/project/google/gsoc2013/arinkverma/79001]

What did I gain this year? The most challenging aspect was working with people across different geographical regions. People from different regions think and act differently, as they may have different concerns. Coming out with a product which is accepted by everyone is itself a strenuous task. Working in a team with huge diversity in timezones and ways of thinking was a colorful experience. This program and Numpy follow a development workflow very well. I learned:

• Git: to develop socially, and the concept of PRs
• The Python and Numpy C API: making Python modules in C
• To write good code; numpy follows a strict and good practice of writing beautiful code.
• Profiling and visualization of data to figure out performance bottlenecks

There are lots of things to make Numpy better, which I would love to do, even after GSoC.

September 25, 2013

Philip Herron

Legacy code C vs C++

So this is something I've learned in the last few months: personally, I am seriously starting to dislike C++ in a very strong way. Well, this is quite a broad statement; I don't mind its use in a well defined manner, such as GCC's move to C++ or LLVM's code base, which are both really quite nice now. But let's look at what most C++ code is like: big proprietary legacy systems made in 2000-ish. There is no real standard of what to use, so you end up having const char * vs C++ string all over the place and loads of annoying bla.c_str().

Another thing I really hate in C++ is the whole reference-passing bullshit it has, and memory allocation via new and delete. Consider the C code:

struct myStruct {
  int someVal;
  unsigned char *somePointer;
};

void myFunction (struct myStruct *x)
{
  x->someVal = 7;
  x->somePointer = NULL;  /* the original snippet had "x->someVal = somePointer;", which does not compile */
}

int main (int argc, char **argv)
{
  struct myStruct s;
  myFunction (&s);
  /* ... */
  return 0;
}

This is a straightforward piece of code and follows a good pattern for C development to keep memory allocation in a pure sense: the caller allocates memory and the callee _never_ cares about this and doesn't have to assume or look at global state to control memory.
Whereas if you come from a Java background you would be very tempted to write this function as:

struct myStruct *myFunction (void)
{
  struct myStruct *x = malloc (sizeof (struct myStruct));
  memset (x, 0, sizeof (*x));
  x->someVal = 7;
  x->somePointer = NULL;  /* again, the original "x->someVal = somePointer;" was a typo */
  return x;
}

int main (int argc, char **argv)
{
  struct myStruct *s = myFunction ();
  /* ... */
  return 0;
}

This is bad code, but you see it all the time in legacy code in loads of places. I feel a lot of people hacking in C/C++ really don't understand stack vs heap at all. If you follow this idiom of allocating memory in the callee, the caller has to worry about a lot of stuff. What allocator is it using? How is it meant to be managed if it's doing some funky stuff to my structure? And hell, malloc is slow: avoid it! And avoid bad code and memory leaks/corruption. If you do the first example, where we declared the struct and passed the pointer, there is nothing to worry about: the allocation of memory was 'pure', as I tend to think of it. It doesn't fiddle with your pointers and avoids memory corruption.

But here comes my real pet peeve with C++: it allows for this syntax.

struct myStruct {
  int someVal;
  unsigned char *somePointer;
};

void myFunction (struct myStruct& x)
{
  x.someVal = 7;
  x.somePointer = NULL;
}

int main (int argc, char **argv)
{
  struct myStruct s;
  myFunction (s);

  // or

  struct myStruct *ss = malloc (sizeof (struct myStruct));
  memset (ss, 0, sizeof (*ss));
  myFunction (*ss);

  return 0;
}

This sucks because at the call site you cannot tell whether the callee is working on stack-allocated data or just accessing it. The compiler is basically optimizing this to a function

void myFunction (struct myStruct *);

and the passing of the data is passing the address. But it's just so much syntactic sugar, for something that is a bad idea.

Then don't get me started on templates, portability and namespaces. So much of all that in C++ is a good idea in principle, but the code that ends up being created is enough to make my cats not want ham (bad simile, but hey…). The thing is, C++ is taken so badly out of context: "awh sure, you've done Java, you'll be dead on at C++". If projects were given stricter guidelines on C++ use, or just used C, major errors would come up a lot sooner. Arguing that the productivity of C++ is greater than C is a bad argument, since so much of the code I've seen in my time has been badly implemented from-scratch libraries.

I am not saying all C++ code is bad and that you shouldn't use it, but in open source your vanity is on the line: if you write bad code, people notice and think you're an idiot. But if you write C, your stupidity slaps you in the face and tells you to do it right. And it forces a lot more simplicity.

Then there's the whole string crap and the STL. I like the STL, it's good, but I am so used to writing my own data structures for my C development that I don't really find the STL the great help and amazing tool that people think it is.

I am starting to feel as though no one really knows how to program anymore. I just believe so strongly that Java and C# in universities have destroyed systems-level software engineering as we know it. How many people do you know who write their own kernel or compilers any more? What happens when that knowledge becomes even more scarce? Because it really is becoming that way. Awh yeah, sure, go buy a book, but tangible experience is getting more and more rare.
kthanksbye

Luis Pedro Coelho

luispedro

A little Python thing I like to do, but have never seen in other people's code, is to remove a prefix like this:

s = 'some string'
if s.startswith('some '):
    s = s[len('some '):]

I like the s[len('some '):] approach as I find it both error-robust (as opposed to typing the actual number like s[5:]) and self-documenting. For example, consider:

from glob import glob
files = glob('datadir/experiment/*.txt')
ids = [f[len('datadir/'):] for f in files]

It is pretty clear that what I want to do is remove the datadir/ prefix. It works for suffixes too:

without_ext = filename[:-len('.txt')]
combined = filename[len('datadir/experiment/'):-len('.txt')]

This is much better than [1]:

combined = filename[18:-4]

§

(One may be tempted to write filename.replace('.txt', '') to get rid of a suffix, but this is wrong! It does not work with 'datadir/experiments/datafiles.txt/filename.txt', which is perfectly legal.)

§

It is slightly inefficient because the Python interpreter will actually create a string, then compute its length. [2] However, this is generally in code where it does not matter that much. If it did, I'd be doing it in C(++) or using some other method.

 [1] It should have been filename[19:-4], but it's hard to see immediately. In any case, writing a number always makes me think, and code should not make you think too much.

 [2] It is not allowed to just replace it by the result statically because you may have redefined the function len. It could have a check for the common case, I suppose.

Surya Kasturi (scipy-central GSOC)

Minor Fixes

While upgrading to Django 1.5, I found a few issues which were worthwhile to address. Each of them has been raised individually as a PR:

1. Providing clickjacking middleware
2. Providing django transactions middleware
3. Fixing minor design errors
4. Fixing image upload
5. Fixing revision creation of submissions
6. Using paginator class
7. DVCS error fix

[5] requires testing and reviewing. Some of the above issues have been merged while some are not yet.

Why needed:

1. clickjacking middleware just adds a protection layer. Please read the clickjacking middleware documentation in the Django docs.
2. django transactions middleware can be considered a feature enhancement, but it should largely be considered a bug fix. Whenever we create submissions and an error is raised in the middle on the server side, the submission objects are still stored, but incompletely. To stop this we need this middleware.
3. During some recent changes in the codebase, [3], [4] and [5] were raised. Those fixes are done now.
4. [6] is just improving the code. The paginator class was not present when SciPy Central was first developed. There are some model methods which return previous and next objects (submissions) to display on the site. The Django paginator class is used here as a better implementation.
5. [7] was only observed in a Windows environment, which was noticed lately. Thanks to Pauli for suggesting the fix.

Django 1.5 Upgrade

Upgrading the supporting framework on the backend of the site has become a very pressing issue, since our site was developed with v1.3 while the current stable version is v1.5. There have been many improvements brought into the framework, along with new features.

When I started upgrading to Django 1.5, I was surprised to see that the existing codebase was already very close to the current stable version. So, there were no huge tweaks, which I had anticipated in the beginning. The following are the major changes in our codebase:
1. The URL template tag syntax changed to {% url 'view_name' %}
2. UserProfile is deprecated in Django 1.5. This is where I anticipated a few tweaks around the database, but it wasn't required. There was a one-to-one relation between the User and UserProfile models from the beginning. Some of the model methods to access the UserProfile object from the User object are now not available. Since we luckily never used such functionality in the codebase, we didn't have to make huge changes here.

The minor changes to note:

1. The slugify() implementation was changed, which caused us to make changes in the tests module.
2. There were some additional features for using datetime objects. These upgrade changes have been removed from the django-1.5 PR and will be moved into a new PR.

We also had to upgrade some dependencies:

1. django-registration
2. django-haystack
3. xapian-haystack

There have been a few other changes that can be viewed in the commits. Also, the comments and thumbs functionality has not yet been upgraded; that will happen shortly.

Django-1.5 branch URL: https://github.com/ksurya/SciPyCentral/commits/django-1.5

Arink Verma

GSoC'13 Project Summary-1 : Numpy's profiling

Small numpy arrays are very similar to Python scalars, but numpy incurs a fair amount of extra overhead for simple operations. For large arrays this doesn't matter, but for code that manipulates a lot of small pieces of data, it can be a serious bottleneck. For example:

In [1]: x = 1.0
In [2]: numpy_x = np.asarray(x)
In [3]: timeit x + x
10000000 loops, best of 3: 61 ns per loop
In [4]: timeit numpy_x + numpy_x
1000000 loops, best of 3: 1.66 us per loop

This project involved

• profiling simple operations like the above
• determining possible bottlenecks
• devising improved algorithms to solve them, with the goal of getting the numpy time as close as possible to the Python time.

Profiling tools

The very first objective when looking for bottlenecks is profiling, for time or space. During the project I used a few tools for profiling and visualizing data about numpy's execution flow.

Google profiling tools

This is a suite of different tools provided by Google. It includes TCMalloc, heap-checker, heap-profiler and cpu-profiler. As the aim of the project was to reduce time, the CPU profiler was used.

Setting up Gperftools

The following are the steps used to set up the Python C-level profiler on Ubuntu 13.04 (for any other system and options, see [1]):

1. Make sure to build it from source. Clone the svn repository from http://gperftools.googlecode.com/svn/trunk/
2. In order to build gperftools checked out from the subversion repository, you need to have autoconf, automake and libtool installed.
3. First, run the ./autogen.sh script, which generates ./configure and other files. Then run ./configure.
4. 'make check', to run any self-tests that come with the package. The check is optional but recommended.
5. After all tests pass, type 'sudo make install' to install the programs and any data files and documentation.

Running the CPU profiler

I invoked the profiler manually before running the sample code. Consider that the Python code to be profiled is in the file num.py:

$ CPUPROFILE=num.py.prof LD_PRELOAD=/usr/lib/libprofiler.so python num.py

Alternatively, include the profiler in the code as follows:
import ctypes
import timeit

profiler = ctypes.CDLL("libprofiler.so")
profiler.ProfilerStart("num.py.prof")
timeit.timeit('x+y', number=10000000,
              setup='import numpy as np; x = np.asarray(1.0); y = np.asarray(2.0);')
profiler.ProfilerStop()

$ pprof --gv ./num.py num.py.prof

Callgraph generated by gperftools. Each block represents a method with its local and cumulative percentage.

Oprofile

OProfile is a system-wide profiler for Linux systems, capable of profiling all running code at low overhead. OProfile is released under the GNU GPL.

Setting up Oprofile

1. Access the source via Git: git clone git://git.code.sf.net/p/oprofile/oprofile
2. Automake and autoconf are needed.
3. Run autogen.sh before attempting to build as normal.

Running the CPU profiler

$ opcontrol --callgraph=16
$ opcontrol --start
$ python num.py
$ opcontrol --stop
$ opcontrol --dump
$ opreport -cgf | gprof2dot.py -f oprofile | dot -Tpng -o output.png

The callgraph is visualized with the help of the script gprof2dot.py.

Perf from linux-tools

Perf provides rich generalized abstractions over hardware-specific capabilities. Among others, it provides per-task, per-CPU and per-workload counters, sampling on top of these, and source code event annotation.

Setting up perf

$ sudo apt-get install linux-tools-common
$ sudo apt-get install linux-tools-<kernel-version>

Running the profiler and visualizing the data as a flame graph

$ perf record -a -g -F 1000 ./num.py
$ perf script | ./stackcollapse-perf.pl > out.perf-folded
$ cat out.perf-folded | ./flamegraph.pl > perf-numpy.svg
The first command runs perf in sampling mode (polling) at 1000 Hertz (-F 1000; more on this later) across all CPUs (-a), capturing stack traces so that a call graph (-g) of function ancestry can be generated later. The samples are saved in a perf.data file.

The script to visualize the above flame graph is at https://github.com/brendangregg/FlameGraph.

Scope for improvement in _extract_pyvals

In my last post, I mentioned how two functions, _extract_pyvals and PyUFunc_GetPyValues, together use >12% of the time. But the major culprit is *errobj = Py_BuildValue("NO", PyBytes_FromString(name), retval); in _extract_pyvals. It alone takes 10% of the time, on every operation.

Improvement

Caching Py_BuildValue

The first approach I took was to cache the result of Py_BuildValue in thread-local storage or a dict. With this, the time consumption of _extract_pyvals dropped from 12% to 4%.
But TLS is a bit unreliable and risky, so @juliantaylor advised that it should be avoided. Even after many fixes, the commit for this didn't manage to pass all test cases.

Py_BuildValue with PyTuple_Pack

For a quick optimization, Py_BuildValue is replaced with PyTuple_Pack, and PyInt_AsLong is replaced with PyInt_AS_LONG, which does not do error checking. This helps to improve _extract_pyvals by 3%.
Callgraph: Py_BuildValue replaced by PyTuple_Pack

This optimization has been made in PR #3686.

GSoC'13 Project Summary-2 : Numpy's Bottlenecks and its Removal

In my last post, I mentioned the tools which I used for profiling numpy. The call-graph made it a bit simpler to detect time consuming methods. But the time taken only indicates the possibility of a bottleneck, as many of the time consuming methods are highly optimized. Hence, critically reading the code along with the call-graph is the trick.

Identifying bottlenecks

The following types of bottlenecks were observed.

• Numpy used to release the Global Interpreter Lock (GIL) for all one- and two-operand loops. But for short arrays, this produced relative overhead instead.
• So, releasing the GIL for small operations was not beneficial at all.

Redundant code doing extra work

• PyArrayCanCastArrayTo used to check even if newtype is Null. In PyArrayFromArray, it was found that if the argument newtype is Null, it takes the value of oldtype. So, technically it checks whether casting is possible to the same type, which is useless.
• Numpy used to convert the Python scalar into its matching NumPy scalar (e.g. PyLong -> int32) and then extract the C value from the NumPy scalar.
• Clearing the error flags at every function call, then checking them. This happens unconditionally, even if there is no need to do it.

Improper structure

• For every single operation call, numpy has to extract the values of buffersize, errormask and name to pack and build an error object. The two functions involved, _extract_pyvals and PyUFunc_GetPyValues, together use significant time. This time is wasted because it is all spent looking up entries in a Python dict, extracting them, and converting them into C-level data, not once but again and again on every operation, and the result remains unused if no error occurs.
• The loop selection method for scalar operations is inefficient and consumes time. It checks through all the associated dtypes of a function one by one from the types array.

Not so effective memory allocation

• When allocating the return value, numpy allocates memory twice: once for the array object itself, and a second time for the array data.
• Also, shapes + strides together have 2*ndim elements, but to hold them numpy allocates a memory region sized to hold 3*ndim elements.

Removing bottlenecks

I really appreciate the effort and mentoring by Charles Harris, which provided the leads to remove the above-mentioned bottlenecks, making numpy a bit quicker for smaller arrays too.

Removing known overhead for smaller arrays

Avoiding GIL release for smaller arrays
NPY_BEGIN_THREADS_THRESHOLDED(loop_size), a variant of the NPY_BEGIN_THREADS macro, has been created in ndarraytypes.h with a threshold. The threshold has conservatively been taken as 500, and the cost of guessing it too low can be neglected.
#define NPY_BEGIN_THREADS_THRESHOLDED(loop_size) do { \
    if (loop_size > 500) { \
        _save = PyEval_SaveThread(); \
    } \
} while (0);

Avoid doing extra work, remove redundancy

In PyArrayCanCastArrayTo, it is better to bypass the check when newtype is Null and flags is 0. The following improvement was made:
oldtype = PyArray_DESCR(arr);
if (newtype == NULL) {
    /* Check if object is of array with Null newtype.
     * If so return it directly instead of checking for casting.
     */
    if (flags == 0) {
        Py_INCREF(arr);
        return (PyObject *)arr;
    }
    newtype = oldtype;
    Py_INCREF(oldtype);
}

Avoiding conversion of known dtypes to NumPy scalars

When trying to extract the underlying C value from a Python integer, numpy converts the Python scalar into its matching NumPy scalar (e.g. PyLong -> int32) and then extracts the C value from the NumPy scalar. Raul Cota did this optimization for floats by avoiding the conversion. I extended it to integers, which was a bit messy because different OSes handle int differently. For known integer types I avoid the conversion and extract the value directly, like this:
For byte, short, int, long
#if PY_VERSION_HEX >= 0x03000000
    if (PyLong_CheckExact(a)) {
        *arg1 = PyLong_AsLong(a);
        return 0;
    }
#else
    if (PyInt_CheckExact(a)) {
        *arg1 = PyInt_AS_LONG(a);
        return 0;
    }
#endif

Avoiding heavy fp clear flag

Nathaniel pointed out something mysterious with clearing the floating point error flags.

UFUNC_CHECK_STATUS is a single macro which does both checking and clearing the error flags. It clears the error flags every time after checking. We should avoid the clear operation when it is not needed, as it is a bit expensive and takes a significant amount of time. I avoid the clear when it is not needed, to save time: before each ufunc loop, PyUFunc_clearfperr() now checks the error flags and only clears them if necessary. Also, the check result in the macro no longer gets ignored, unlike before.

Re-factoring

Deferred the creation of the error object to when an error actually occurs. A huge amount of time used to be consumed by PyUFunc_GetPyValues just to build this error object.

Replacing the loop by specialized conditions. Most of the functions share identical signatures, e.g. the sets (add, subtract) and (arccos, arcsin, arctan, arcsinh, arccosh) each share the same signature array. It turns out there are only 32 distinct signature arrays. I wrote a code generator to identify them and emit a specialized condition for each distinct signature array.
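
As a rough illustration of how much sharing there is, the signature arrays are visible from Python; this little snippet (purely illustrative, not the code generator from the pull request, and the exact count depends on the NumPy version) groups ufuncs by their type-signature arrays:

from collections import defaultdict
import numpy as np

groups = defaultdict(list)
for name in dir(np):
    obj = getattr(np, name)
    if isinstance(obj, np.ufunc):
        groups[tuple(obj.types)].append(name)   # obj.types is e.g. ['ff->f', 'dd->d', ...]

print(len(groups), "distinct signature arrays")
print(max(groups.values(), key=len))            # the largest group of ufuncs sharing one array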

I also tried to collapse the two memory allocations for the return value into one, but it would take a huge amount of refactoring. Hence it moved out of scope for GSoC.

Something Interesting...

I came to know about tinyarray, which is similar to a NumPy array but optimized for small sizes. Christoph Groth, who developed it, used a more compact object: for example, the array object is a PyVarObject and not a PyObject, and the array data is held in the same structure as the Python object itself, so only a single memory allocation per array is done. This could be a good start if someone wants to optimize memory allocation.

And more...

It is not possible to write about all the coding and pull requests in this post, but all those modifications have been written up in a Google doc: http://goo.gl/M3vrnz

September 24, 2013

Luis Pedro Coelho

luispedro

Another review of our book:

Give yourself a time, get this book, study and use it with your data: suddenly many things that were obscure or abstruse, will became clearer. This books is a investment which worths every penny.

September 23, 2013

Blake Griffith

Wrapping Up Google Summer of Code

My Summer has been over for a while, but GSoC is officially ending this week. And what do I have to show?

• SciPy's sparse matrices now support boolean data.
• Sparse matrices now support boolean operations (!=, <=, etc.); a short example follows this list.
• NumPy universal functions (ufuncs) can now be overridden for compatibility.
• Sparse matrices use this ufunc override functionality to do sensible things when acted on by ufuncs.
• Support fancy indexing.
• Axis arguments for min and max methods.
• Various other things: speedups, small features, test coverage, bug fixes, code cleanups, and a few backports.
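
A tiny illustration of the boolean support mentioned above (assuming a sufficiently recent SciPy, roughly 0.13 or later; the dense conversions are only there to make the output readable):

import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[0, 1], [2, 0]]))
B = sparse.csr_matrix(np.array([[0, 1], [0, 3]]))

mask = (A != B)                   # elementwise comparison gives a boolean sparse matrix
print(mask.toarray())
print(A.astype(bool).toarray())   # sparse matrices can now hold boolean data directly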

But I think that I have benefited personally from this experience much more than SciPy has. I learned so much this summer, a short list of things I know vastly more about than I did before this summer:

• PYTHON. Jeez looking at code I wrote before this summer is painful.
• The Python and NumPy C API, before this summer I was a total C noob. Now I can scratch together a python API in C.
• SciPy.
• NumPy, numpy.frompyfunc is awesome.
• Git, before this summer I could commit things to my own repos and that was pretty much it. Git hooks are awesome.
• In general, development workflow.

I've grown attached to SciPy and am excited that I can make meaningful contributions that are actually helping people. But contributing more will have to wait until after the GRE, grad school applications, and my senior thesis. There are a few things I would like to add down the road to incorporate into SciPy 1.0, e.g. 64-bit indices, and changing sparse matrices to emulate numpy.array rather than numpy.matrix. I would also like to see if I can apply ufunc overrides in a few other places that mix numpy.array and scipy.sparse, like scikit-learn.

Here is a link to git log -p upstream/master --author="Blake Griffith" in numpy and scipy.

Luis Pedro Coelho

luispedro

Over the last few posts, I described my nuclear segmentation paper.

It has a reproducible research archive.

§

If you now download that code, that is not the code that was used for the paper!

In fact, the version that generates the tables in the paper does not run anymore, because it only runs with old versions of numpy!

In order for it to reproduce the computation in the paper, I had to update the code. In order to run the code exactly as used for the paper, you need to get old versions of the software.

§

To some extent, this is due to numpy’s frustrating lack of forward compatibility [1]. The issue at hand was the changed semantics of the histogram function.

In the end, I think I completely avoided that function in my code for a few years as it was toxic (when you write libraries for others, you never know which version of numpy they are running).
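
For what it's worth, one way to sidestep np.histogram's semantics entirely is to count bins explicitly; this is a generic sketch, not the code from the paper:

import numpy as np

def count_in_bins(data, edges):
    """Count values falling into each half-open bin [edges[i], edges[i+1]),
    independently of whichever semantics np.histogram has this year."""
    idx = np.digitize(data, edges) - 1                  # bin index for every value
    idx = idx[(idx >= 0) & (idx < len(edges) - 1)]      # drop values outside the range
    return np.bincount(idx, minlength=len(edges) - 1)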

§

But as much as I can gripe about numpy breaking code between minor versions, they would eventually be justified in changing their API with the next major version change.

In the end, the half-life of code is such that each year, it becomes harder to reproduce older papers even if the code is available.

 [1] I used to develop for the KDE Project, where you did not break users' code, ever, and so I find it extremely frustrating to have to explain that you should not change an API on aesthetic grounds in between minor versions.

September 19, 2013

Gaël Varoquaux

Publishing scientific software matters

Christophe Pradal, Hans Peter Langtangen, and myself recently edited a version of the Journal of Computational Science on scientific software, in particular those written in Python. We wrote an editorial defending writing and publishing open source scientific software that I wish to summarize here. The full text preprint is openly available in my publications list as always. It includes, amongst other things, references.

Software is a central part of modern scientific discovery. Software turns a theoretical model into quantitative predictions; software controls an experiment; and software extracts from raw data evidence supporting or rejecting a theory. As of today, scientific publications seldom discuss software in depth, maybe because it is both highly technical and a recent addition to scientific tools. But times are changing. More and more scientific investigators are developing software and it is important to establish norms for publication of this work. Producing scientific software is an important part of the landscape of research activities. Very visible scientific software is found in products developed by private companies, such as Mathwork’s Matlab or Wolfram’s Mathematica, but let us not forget that these build upon code written by and for academics. Scientists writing software contribute to the advancement of Science via several factors.

First, software developed in one field, if written in a sufficiently general way, can often be applied to advance a different field if the underlying mathematics is common. Modern scientific software development has a strong emphasis on generality and reusability by taking advantage of the general properties of the mathematical structures in the problem. This feature of modern software helps close the gap between fields and accelerates scientific discovery by packaging mathematical theories in a directly applicable way.

Second, the public availability of code is a cornerstone of the scientific method, as it is a requirement for reproducing scientific results: “if it’s not open and verifiable by others, it’s not science, or engineering, or whatever it is you call what we do.” (V. Stodden, The scientific method in practice). Emphasizing code to an extreme, Buckheit and Donoho have challenged the traditional view that a publication was the valuable outcome of scientific research: “an article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment [...]”.

It is important to keep in mind that going beyond replication of results requires reusable software tools: code that is portable, comes with documentation, and, most of all, is maintained throughout the years. Indeed, software development is a major undertaking that must build upon best practices and a quality process. Reversing Buckheit and Donoho’s argument, publications about scientific software play an increasingly important part in the scientific methodology. First, in the publish-or-perish academic culture, such publications give an incentive to software production and maintenance, because good software can lead to highly-cited papers. Second, the publication and review process are the de facto standards of ensuring quality in the scientific world. As software is becoming increasingly more central to the scientific discovery process, it must be subject to these standards. We have found that writing an article on software leads the authors to better clarify the project vision, technically and scientifically, the prior art, and the contributions. Last but not least, scientists publishing new results based on a particular software need an informed analysis of the validity of that software. Unfortunately, much of the current practice for adopting research software relies on ease of use of the package and reputation of the authors.

[...]

Today, software is to scientific research what Galileo’s telescope was to astronomy: a tool, combining science and engineering. It lies outside the central field of principal competence among the researchers that rely on it. Like the telescope, it also builds upon scientific progress and shapes our scientific vision. Galileo’s telescope was a leap forward in optics, a field of investigation that is now well established, with its own high-impact journals and scholarly associations. Similarly, we hope that visibility and recognition of scientific software development will grow.

Luis Pedro Coelho

luispedro

This week, I committed to mahotas-imread, some code to allow for setting options when saving:

from imread import imsave
image = ...
imsave('file.jpeg', image, opts={ 'jpeg:quality': 95 })

This saves the image array to file file.jpeg with quality 95 (out of 100).

§

This is only available in the version from github (at the moment), but I will probably put up a new release soon.

September 16, 2013

Filipe Saraiva

Python Backend for Cantor: Append Plot Images in Cantor Workspace

Another feature implemented in the Python backend for Cantor in the last few weeks was “append plot image to Cantor Workspace”.

In other backends you can, optionally, generate a plot image and have this image appended to the Cantor workspace, instead of opening a separate window for the picture.

Below we have a command to generate a plot image in python using matplotlib and pyplot:

Now we have the result appended in Cantor workspace:

In Python, to save a picture using pyplot, we type the command pyplot.savefig(). But if a picture is saved, it cannot also be shown in a separate window; conversely, if a picture is shown in a separate window, it cannot then be saved to a file.

To solve this problem, the Python backend changes the show() command to savefig(), with a random name for the picture. The image is saved in a temporary file and loaded into the Cantor workspace.
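
A minimal sketch of that idea (this is not Cantor's actual code; the "embed" marker and the file handling are made up for illustration):

import tempfile
import matplotlib
matplotlib.use('Agg')                   # render off-screen; no window needed
import matplotlib.pyplot as pyplot

def _show_to_file(*args, **kwargs):
    # save the current figure to a randomly named temporary file
    path = tempfile.mkstemp(suffix='.png')[1]
    pyplot.savefig(path)
    print("cantor-embed-image: " + path)   # hypothetical marker for the frontend to pick up

pyplot.show = _show_to_file             # intercept show(), as the backend does

pyplot.plot([0, 1, 2], [0, 1, 4])
pyplot.show()                           # now writes a PNG instead of opening a window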

The option to load the figure into the Cantor workspace or to use a separate window is configured in the Python backend configuration screen. The default is to use a separate window, because matplotlib/pyplot offers several additional features in its image window.

I would like to see some feedback from you, especially if you are a Python developer. The code is hosted in the python-backend branch of the Cantor repository.

September 14, 2013

Arink Verma

Speedup UFUNC_CHECK_STATUS by avoiding heavy clearing flag

UFUNC_CHECK_STATUS is a single macro which does both checking and clearing the error flags. It clears the error flags every time after checking. We should avoid the clear operation when it is not needed, as it is a bit expensive and takes a significant amount of time.

The way numpy detects divide-by-zero, overflow, underflow, etc., is that before each ufunc loop it clears the FP error flags, and then after the ufunc loop it checks whether any have become set, and clears them again. I have avoided the clear when it is not needed, to save time.

Improvement

Before each ufunc loop, PyUFunc_clearfperr() now checks the error flags first and only clears them if necessary, and the check result in the macro no longer gets ignored, unlike before. Earlier, the time taken by PyUFunc_clearfperr() and PyUFunc_getfperr() combined was around 10%; it has now dropped to 1% for operations which don't raise any error.

Callgraph comparing performance for x = np.asarray([1]); x + x

More

PR for this change is #3739

September 13, 2013

William Stein

IPython Notebooks in the Cloud with Realtime Synchronization and Support for Collaborators

I spent the last two weeks implementing hosted IPython notebooks with sync for https://cloud.sagemath.com. Initially I had just planned to simplify the port forwarding setup, since using multiple forward and reverse port forwards seemed complicated. But then I became concerned about multiple users (or users with multiple browsers) overwriting each other's notebooks; this is a real possibility, since projects are frequently shared between multiple people, and everything else does realtime sync. I had planned just to add some very minimal merge-on-save functionality to avoid major issues, but somehow got sucked into implementing full realtime sync (even with the other person's cursor showing).

Here's how to try it out

• Go to https://cloud.sagemath.com and make an account; this is a free service hosted on computers at University of Washington.
• Create a new project.
• Click +New, then click "IPython"; alternatively, paste in a link to an IPython notebook (e.g., anything here http://nbviewer.ipython.org/ -- you might need to get the actual link to the ipynb file itself!), or upload a file.
• An IPython notebook server will start, the given .ipynb file should load in a same-domain iframe, and then some of the IPython notebook code and iframe contents are monkey patched, in order to support sync and better integration with https://cloud.sagemath.com.
• Open the ipynb file in multiple browsers, and see that changes in one appear in the other, including moving cells around, creating new cells, editing markdown (the rendered version appears elsewhere), etc.
Since this is all very new and the first (I guess) realtime sync implementation on top of IPython, there are probably a lot of issues. Note that if you click the "i" info button to the right, you'll get a link to the standard IPython notebook server dashboard.

IPython development

Regarding the monkey patching mentioned above, the right thing to do would be to explain exactly what hooks/changes in the IPython html client I need in order to do sync, etc., make sure these make sense to the IPython devs, and send a pull request. As an example, in order to do sync efficiently, I have to be able to set a given cell from JSON -- it's critical to do this in place when possible, since the overhead of creating a new cell is huge (due probably to the overhead of creating CodeMirror editors); however, the fromJSON method in IPython assumes that the cell is brand new -- it would be nice to add an option to make a cell fromJSON without assuming it is empty. The ultimate outcome of this could be a clean well-defined way of doing sync for IPython notebooks using any third-party sync implementation. IPython might provide their own sync service and there are starting to be others available these days -- e.g., Google has one, and maybe Guido van Rossum helped write one for Dropbox recently?

How it works

Earlier this year, I implemented Neil Fraser's differential synchronization algorithm, since I needed it for file and Sage worksheet editing in https://cloud.sagemath.com. There are many approaches to realtime synchronization, and Fraser makes a good argument for his. For example, Google Wave involved a different approach (Operational Transforms), whereas Google Drive/Docs uses Fraser's approach (and code -- he works at Google), and you can see which succeeded. The main idea of his approach is an eventually stable iterative process that involves heuristically making and applying patches on a "best effort" basis; it allows for all live versions of the document to be modified simultaneously -- the only locking is during the moment when a patch is applied to the live document. He also explains how to handle packet loss gracefully. I did a complete implementation from scratch (except for using the beautiful Google diff/patch/match library). There might be a Python implementation of the algorithm as part of mobwrite.
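
As a toy illustration of the patch-making primitive involved (using the Python port of Google's diff-match-patch library; the real cloud.sagemath implementation is, of course, far more involved):

from diff_match_patch import diff_match_patch   # pip install diff-match-patch

dmp = diff_match_patch()

server = "print 'hello'\nprint 'world'\n"
client = "print 'hello'\nprint 'world!'\nprint 'bye'\n"

patches = dmp.patch_make(server, client)             # what one side would send
merged, results = dmp.patch_apply(patches, server)   # best-effort application
print(merged)
print(results)   # one boolean per patch: whether it applied cleanly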

The hardest part of this project was using Fraser's algorithm, which is designed for unstructured text documents, to deal with IPython's notebook format, which is a structured JSON document. I ended up defining another less structured format for IPython notebooks, which gets used purely for synchronization and nothing else. It's a plain text file whose first line is a JSON object giving metainformation; all other lines correspond, in order, to the JSON for individual cells. When patching, it is in theory possible in edge cases involving conflicts to destroy the JSON structure -- if this happens, the destruction is isolated to a single cell, and that part of the patch just gets rejected.
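
In rough pseudocode, the document representation is something like this (assuming, for simplicity, a notebook dict with a flat 'cells' list; the real format and field names differ in details):

import json

def notebook_to_syncdoc(nb):
    # first line: metainformation; every following line: one cell as JSON
    lines = [json.dumps({k: v for k, v in nb.items() if k != 'cells'})]
    lines.extend(json.dumps(cell) for cell in nb.get('cells', []))
    return "\n".join(lines)

def syncdoc_to_notebook(doc):
    lines = doc.split("\n")
    nb = json.loads(lines[0])
    nb['cells'] = [json.loads(line) for line in lines[1:] if line]
    return nb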

The IPython notebook is embedded as an iframe in the main https://cloud.sagemath.com page, but with exactly the same domain, so the main page has full access to the DOM and Javascript of the iframe. Here's what happens when a user makes changes to a synchronized IPython notebook (and at least 1 second has elapsed):
• The outer page notices that the notebook's dirty flag is set for some reason, which could involve anything from typing a character, deleting a bunch of cells, output appearing, etc.
• Computes the JSON representation of the notebook, and from that the document representation (with 1 line per cell) described above. This takes a couple of milliseconds, even for large documents, due to caching.
• The document representation of the notebook gets synchronized with the version stored on the server that the client connected with. (This server is one of many node.js programs that handles many clients at once, and in turn synchronizes with another server that is running in the VM where the IPython notebook server is running.  The sync architecture itself is complicated and distributed, and I haven't described it publicly yet.)
• In the previous step, we in fact get a patch that we apply -- in a single automatic operation (so the user is blocked for a few milliseconds) -- to our document representation of the notebook in the iframe. If there are any changes, the outer page modifies the iframe's notebook in place to match the document. My first implementation of this update used IPython's notebook.fromJSON, which could easily take 5 seconds (!!) or more on some of the online IPython notebook samples. I spent about two days just optimizing this step. The main ideas are:
1. Map each of the lines of the current document and the new document to a unicode character,
2. Use diff-patch-match to find an efficient sequence of deletions, insertions, swaps to transforms one document to the other (i.e., swapping cells, moving cells, etc.) -- this is critical to do,
3. Change cells in place when possible.
With these tricks (and more can be done), modifying the notebook in place takes only a few milliseconds in most cases, so you don't notice this as you're typing.
• Send a broadcast message about the position of your cursor, so the other clients can draw it.  (Symmetrically, render the cursor on receiving a broadcast message.)

September 11, 2013

Arink Verma

Improvement in _extract_pyvals

The line *errobj = Py_BuildValue("NO", PyBytes_FromString(name), retval); in _extract_pyvals alone takes 10% of the time, on every operation. It is better to avoid packing the string; instead, assign the pointers to a struct and stash it in thread-local storage. Hence no allocations would be needed for the error object every time, given that it almost always goes unused.

Implementation

The error object includes the name of the ufunc and the callback pointer.
typedef struct {
    PyObject_HEAD
    char *name;
    PyObject *retval;
} PyErrObject;

As a fast path for consecutive identical operations, it is better to use caching:
errvalues = PyDict_GetItem(thedict, PyUFunc_PYVALS_ERROR);
if (errvalues == NULL) {
    errvalues = PyObject_New(PyErrObject, &PyErrObject_Type);
    errvalues->name = name;
    errvalues->retval = retval;
    PyDict_SetItem(thedict, PyUFunc_PYVALS_ERROR, (PyObject *)errvalues);
}
errvalues->name = name;
errvalues->retval = retval;
*errobj = errvalues;
Py_INCREF(errvalues);

Improvement

Callgraph with PyErrObject for x = np.asarray([5.0, 1.0]); x + x
The time consumption of _extract_pyvals dropped from 9.3% to 2%.

More

PR for this enhancement is #3686

September 10, 2013

Continuum Analytics news

Continuum Analytics Releases Anaconda 1.7

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, announced today the release of the latest version of Anaconda, its premium collection of libraries for Python. Anaconda enables big data management, analysis, and cross-platform visualization for business intelligence, scientific analysis, engineering, machine learning, and more. The latest release, version 1.7, adds VTK, Mayavi, and Bokeh to the distribution, as well as updates IPython to 1.0 and MatPlotLib to 1.3.

September 07, 2013

Surya Kasturi (scipy-central GSOC)

Testing Thumbs

In this post I will try to explain what exactly is happening in Thumbs app in SciPy Central.
Firstly, the Thumbs app is located in the /scipy_central/thumbs/ directory. It handles voting on objects. Right now, Revision and Comment objects are available for voting, while Comment objects only support up-votes.

Broadly, Thumbs app does the following
1. Create thumb object
2. Get/ update reputation
3. Get/update user profile reputation
4. Set wilson confidence score

We also changed the way Revision objects are ordered. They are now ordered based on their Wilson score, while before they were ordered based on their creation date.
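
For reference, the Wilson score used here is (the lower bound of) the Wilson score confidence interval for the fraction of positive votes; a standard formulation looks like this (not necessarily the exact code used in SciPy Central):

from math import sqrt

def wilson_lower_bound(upvotes, total, z=1.96):
    """Lower bound of the Wilson score interval for the proportion of
    up-votes; z=1.96 corresponds to roughly 95% confidence."""
    if total == 0:
        return 0.0
    phat = upvotes / float(total)
    denom = 1.0 + z * z / total
    centre = phat + z * z / (2.0 * total)
    spread = z * sqrt((phat * (1.0 - phat) + z * z / (4.0 * total)) / total)
    return (centre - spread) / denom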

I noticed two typical ways of implementing Thumbs considering the above mentioned points.
Trying to understand from the front-end of the site, the following are the areas where thumb-objects data is influenced.
1. When we open a Submission (Snippet, Link, Package..) page
2. When we open a User profile page
3. When we make a search query

[1] When we open a submission page, the following are required for us
1. Total reputation the object got
2. submit-vote form

[2] When we open user profile page, we need to list out all the objects that user created which also include the total reputation of each object (that's what we are interested here)

[3] When we make a search query, we are ordering based on the wilson score confidence levels. These levels are changed upon each vote submission

Now, trying to implement the above actions, we can do in two ways. The below are those including some reasons to support.

"Thumb" objects are generic in nature and can be found through a reverse generic relationship from any attached object. We can compute the above parameters, reputation and wilson score, in two ways:

1. calculate reputation/ wilson score every time they are needed to show on page
2. calculate reputation/ wilson score only when a vote is made and store in database.

Why [1]:

At first I implemented the Thumbs app in this way! This implementation completely removes admin maintenance overhead. For instance, if moderators need to manually moderate a user's vote, the total reputation and wilson score are automatically kept up to date.

Why [2]:

Well, with implementation [1] we are calculating the reputation of objects each time it is needed, based on the thumb-objects they have received. This is fine as long as we have 10-20 thumb-objects, considering the performance of the servers we have and the amount of parallelism and performance the database (PostgreSQL) gives us.

What if we got 1000 thumb-objects for a submission object? I don't think that's anything uncommon these days. We can typically find many questions on Stack Overflow with 500 reputation (let's guess 650 up and 150 down -- this may not happen but we can assume as much). So, I thought that computing the total reputation each time, querying all 1000 objects (approximately), is not a good bet.

Consider a search query where 30 objects are returned: calculating the reputation for each object dynamically before displaying it would be really bad, especially if each has 500 reputation!

It would also look bad if a user has lots of contributions on the site and opening their profile page takes lots of database queries!

What [2] does is create reputation and wilson-score fields. Since these only change when a thumb-object is created or changed, i.e., when someone votes or a moderator makes changes, we can simply update those fields when a vote is made. The rest of the time, we just query those small integer fields.

This might sound good enough (I felt so, at least), but the following points are to be noted.

On every vote a user makes we are additionally doing the following things
1. Calculate new reputation and update field
2. Calculate new wilson score and update field
3. Calculate user profile reputation and update field

Considering that voting on objects is less frequent than search queries or page requests (opening a submission page), we can accept this overhead, since a lot of performance can be gained in the latter.

You might very well see that my opinion is more favourable towards implementation [2] :) However, this implementation requires quite a bit of work on the maintenance side. For instance, if a moderator manually deletes a vote made by a user (makes it None), the reputation and wilson score fields have to be updated. Of course, we also need to consider cases where lots of other people are trying to vote on the same object while we are moderating it. We can try to compensate here with some custom admin actions which update those fields!

Coming to testing the two situations described above, we need to look at two parts in each:
1. Time taken to vote
2. Time taken to query or open a submission page
For case [2], the second part only involves the time taken to display the stored reputation value

I generated 1000 User objects and 1000 Submission (Link type) objects to test these situations. To be precise, 998 objects were created, since SQLite can't handle more than 1000 at once (this way, deleting all the objects at once stays easy).

Submit Vote (Case A) (finding reputation dynamically each time it is required)
Each of the 1000 users is made to vote on 50 randomly selected Submission objects (out of 1000). I logged about 49,950 thumb objects along with the time elapsed for each vote.

The total time accounts only for the execution of the following steps:
1. create thumb object
2. calculate reputation (dynamically)
3. calculate wilson score

On average, the execution time was between 0.1 and 0.2 sec (increasing slowly). This increase is probably due to the fact that the number of votes keeps growing. While monitoring the iterations, I noticed that on average each revision object's total reputation was around 20-40 (I don't know the total number of votes).

Below is the plot over 20,000 iterations. The x-axis represents the iteration number; the y-axis represents the execution time at that iteration.

 Case A - submit_vote execution time distribution

Submit Vote (Case B) (Finding reputation only when vote is made)

The total time accounts only for the execution of the following steps:
1. create thumb object
2. calculate reputation based on previous value, store it
3. calculate wilson score, store it
4. calculate user profile reputation, store it

On average, the execution time was around 0.2-0.35 sec, with each object's total reputation roughly in the range -20 to 40. The plot over 20,000 iterations is shown below. The x-axis represents the iteration number, while the y-axis represents the execution time at that iteration.

 Case B - submit_vote execution time distribution

Here we should note an interesting trend in the plots. In case B, the average execution time over 20,000 iterations was roughly constant, while in case A it was slowly increasing. This suggests that the behaviour of case A with very many votes is going to be considerably worse than case B!

However, the average execution time in case B is higher by 0.1 to 0.15 seconds. This can be explained by the fact that we are additionally calculating the user-profile reputation on top of the other work. Also, this information is stored in the database, which probably adds a bit of slowness too.

Page Request (Case A)

Here the 1000 revision objects which were previously voted on randomly (as in case A) are taken and iterated over, to check the execution time of calculating reputation.

The trend seems quite surprising at first, as the execution-time distribution looks constant on average while I was expecting more frequent spikes. This can be explained by the total number of votes per object being roughly constant; in that case this kind of trend can occur. However, there are a few sudden spikes in the plot where the number of thumb objects attached to an object appears to be higher.

For instance, if we had more objects with a much higher reputation than the rest, we might see more spikes.

 Page-request case A - execution time calculation
Page Request (Case B)

This case obviously has an execution time of 0 (technically it is not 0, but very near to it).
In this case, all we are doing is querying the saved reputation field.

 Page Request Case B - execution time calculation
Search Query (case A)

Here we assume that some objects are queried and displayed on the page, so we also need to calculate reputation. I have assumed that a search query returns 20 objects each time and measured the time taken to calculate the reputation of all those objects!

For each iteration, 20 objects are iterated over, and all 20 are timed together. The x-axis represents the iteration number; the y-axis represents the execution time.

Again, a surprising trend, as I was expecting spikes. However, it is taking nearly 1.5 seconds to calculate!
 Search Query - Case A - execution time to calculate reputation
Search Query (case B)

The same situation as in case A above is used here, but we only return the saved reputation field instead of calculating it dynamically. The trend seems quite convincing, but there are some unexpected spikes in the middle.

On average, the execution time for each iteration of 20 objects was around 0.001 sec, 0.0005 sec, and sometimes even 0 sec. This is far less!

Since the iterations are so fast, I would need a somewhat larger data set to measure this case properly; in case A, even 200 iterations took quite a lot of time.

The source code used for testing can be found at https://gist.github.com/ksurya/6473160

September 06, 2013

NeuralEnsemble

Spyke Viewer 0.4.0 released

We are pleased to announce the release of spykeutils and Spyke Viewer 0.4.0, available on PyPI, NeuroDebian and as a binary version.

Spyke Viewer is a multi-platform GUI application for navigating, analyzing and visualizing data from electrophysiological experiments or neural simulations.  It is based on the Neo library, which enables it to load a wide variety of data formats used in electrophysiology. At its core, Spyke Viewer includes functionality for navigating Neo object hierarchies and performing operations on them. spykeutils is a Python library containing analysis functions and plots for Neo objects.

A central design goal of Spyke Viewer is flexibility. For this purpose, it includes an embedded Python console for exploratory analysis, a filtering system, and a plugin system. Filters are used to semantically define data subsets of interest. Spyke Viewer comes with a variety of plugins implementing common neuroscientific plots (e.g. rasterplot, peristimulus time histogram, correlogram, and signal plot). Custom plugins for other analyses or plots can be easily created and modified using the integrated Python editor or external editors. Documentation and installation instructions are at http://spyke-viewer.readthedocs.org/en/0.4.0/index.html

In addition, the Spyke Repository is now online. Find Spyke Viewer extensions (e.g. plugins, startup script snippets, IO plugins etc.) or share your own at http://spyke-viewer.g-node.org. Among the extensions currently hosted at the site are plugins for spike detection and spike sorting.

Some highlights from the changelogs since 0.2.0 (available for spykeutils and Spyke Viewer):
• spykeutils & Spyke Viewer: Support for lazy loading features in current Neo development version
• spykeutils & Spyke Viewer: New features for spike waveform and correlogram plots
• Spyke Viewer: Better support for starting plugins remotely: Progress bar and console output available
• Spyke Viewer: Additional data export formats
• Spyke Viewer: A startup script and an API enable more configuration and customization
• Spyke Viewer: Forcing a specific IO class is now possible. Graphical options for IOs with parameters
• Spyke Viewer: Filters are automatically deactivated on loading a selection if they prevent parts of it from being shown
• Spyke Viewer: Modified plugins are saved automatically before starting the plugin
• Spyke Viewer: Plugin configurations are now restored when saving or refreshing plugins and when restarting the program
• Spyke Viewer: Added context menu for navigation. Includes entries for removing objects and an annotation editor.
• Spyke Viewer: The editor now has search and replace functionality (accessed with Ctrl+F and Ctrl+H)
• spykeutils: Fast and well tested implementations for many spike train metrics (contributed by Jan Gosmann)

The new version can be installed over older versions and will keep previous configuration, filters and plugins.

Continuum Analytics

CUDA Performance: Maximizing Instruction-Level Parallelism

CUDA programming can be difficult because of the unfamiliar hardware architecture. CUDA is new enough that not many people have the experience to say what the best approach to writing performant code is. In this post, we will revisit Vasily Volkov’s talk on Better Performance at Lower Occupancy to show the importance of instruction-level parallelism (ILP).

September 05, 2013

Continuum Analytics news

PyData NYC Call for Speakers

PyData New York 2013 will be held November 8th - 10th at One Chase Manhattan Plaza. PyData is a user and developer conference centered around Python for data analysis, BI, and big data. Speaker proposals are now being accepted until September 16th. Past conference topics and directions on how to submit proposals can be found here.

September 04, 2013

Surya Kasturi (scipy-central GSOC)

Fixing bugs

During the Django 1.5 upgrade, I started reading through the code base more seriously than before, looking for fixes and old pieces to be updated with better code.

1. There is a Windows-specific error that occurs while creating repos upon snippet submission. It has mostly gone unnoticed because Windows is not used by many of us
2. Some minor bugs corresponding to typo-like errors in the code
3. Improving the "next_submission" and "previous_submission" model methods using the paginator class (a rough sketch follows this list)
4. Adding the clickjacking and transactions middleware to our site. These are quite important.
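
Regarding point 3, here is a minimal sketch, not the actual scipy-central code, of how "next_submission" could be expressed with Django's Paginator by treating the ordered queryset as one object per page.

from django.core.paginator import Paginator

def next_submission(current, queryset):
    """Return the submission after `current` in `queryset`, or None."""
    ordered = queryset.order_by('pk')
    pks = list(ordered.values_list('pk', flat=True))
    paginator = Paginator(ordered, 1)                  # one submission per "page"
    page = paginator.page(pks.index(current.pk) + 1)   # page numbers are 1-based
    if page.has_next():
        return paginator.page(page.next_page_number()).object_list[0]
    return None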

Firstly, to make a note:

When we try to submit a snippet (or any other submission) and hit a server error (or cause one during development and testing), there have been cases where the submission data was partially stored in the database and could even be displayed on the site.

To prevent these mistakes, the objects should not be "stored" in the database until no exceptions have occurred and the view returns successfully with a 200 response. This is where the Django transactions middleware is required.

Even though the Django transactions middleware is deprecated in Django 1.6, we still need it for the moment, until we come up with our own middleware to solve these issues.
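
For reference, both middlewares mentioned above ship with Django itself and are enabled from settings.py, along these lines (the rest of the middleware list is omitted here):

# settings.py (Django 1.5)
MIDDLEWARE_CLASSES = (
    # ... the existing middleware classes ...
    'django.middleware.clickjacking.XFrameOptionsMiddleware',  # sends X-Frame-Options to block clickjacking
    'django.middleware.transaction.TransactionMiddleware',     # wraps each request in a database transaction
)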

Improving Thumbs

It is quite exciting that I have tried more than one way of building the thumbs (reputation) system for the site. Firstly, the Thumbs app is generic, which means less work to attach the reputation system to any other objects in the future. Essentially, most of the thumbs functionality has been built for the "Revision" object and is simply extended to "Comment" objects!

Firstly, below is the terminology for understanding this post -- the keywords I have used can look a bit confusing.

1. "Reputation" is just a field name or in general referred as the functionality we are building
2. "Thumbs" is app name for reputation functionality [1].

Each "Thumb" object allows "up-vote", "down-vote", "None". None is used to un-vote! For comments objects, "down-vote" is disabled at views, template layers! There is no specific validation for comment objects at model layer. Probably we might need it.

There have been considerable changes at the core of the Thumbs app, so I feel it is necessary to recap what it was like before and what has changed now.

Previous:

Thumbs are available for Revision and Comment objects. Since the Thumbs model is generic, Revision or Comment objects are attached to each Thumb object and are retrieved using reverse generic relationships.
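
As a rough illustration, the generic setup looks something like the following; the field and method names are approximate, not the exact scipy-central code:

from django.db import models
from django.contrib.auth.models import User
from django.contrib.contenttypes import generic
from django.contrib.contenttypes.models import ContentType

class Thumb(models.Model):
    user = models.ForeignKey(User)
    points = models.IntegerField(null=True)        # +1, -1 or None (un-voted)
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey('content_type', 'object_id')

class Revision(models.Model):
    # ... the actual submission fields ...
    thumbs = generic.GenericRelation(Thumb)        # reverse generic relationship

    def reputation(self):
        # aggregate of all votes, recomputed on every call (the "previous" approach)
        return sum(t.points or 0 for t in self.thumbs.all())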

On every page request for a submission/revision, the "reputation", i.e. the aggregate of all the votes an object has received, is calculated from scratch using model methods and presented. Yes, on every page request. This sounds a bit odd, but it has its own benefits:

1. We don't have to worry about keeping a stored total reputation in sync with all the thumb objects.
2. Any changes in the admin interface or at the database level require no extra database maintenance; all effects can be observed immediately
3. Right now, there is a scale with settings that assign points to each kind of vote (defined in scale.py) -- any change to these settings requires no extra database maintenance

There are drawbacks too:
1. We are calculating the total reputation on every page request, and database queries are not cheap
2. Page loading may become too slow. What if we have 1000 votes? Then, on every page request, 1000 iterations have to be done.

New changes:

This is an old idea, but one I have only adopted lately. I created a new field, "reputation", which contains the total reputation of the object. This field has to be explicitly updated when a vote is made. On every page request, we then only read a small integer rather than querying ~1000 objects and calculating.

Benefits:
1. Possible reduction in database load

Drawbacks:
1. Lots of maintenance on the admin side

To reduce the admin maintenance, corresponding admin actions have been created which automatically update the reputation field of the affected objects wherever required. Still, there is a bit of uncertainty.
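
As an example of the kind of admin action meant here (a hypothetical sketch, not the exact code), a moderator could re-sync the stored field for the selected objects after editing or deleting votes:

from django.contrib import admin

def recalculate_reputation(modeladmin, request, queryset):
    # Recompute the stored "reputation" field from the attached Thumb objects.
    for obj in queryset:
        obj.reputation = sum(t.points or 0 for t in obj.thumbs.all())
        obj.save()
recalculate_reputation.short_description = "Recalculate reputation from votes"

class RevisionAdmin(admin.ModelAdmin):
    actions = [recalculate_reputation]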

Things yet to be done:

There are several other pros and cons to each of the methods mentioned above, which will require manual tests in order to select the best option for us. (This is still to be done.)

Matthieu Brucher

Book: Building Machine Learning Systems in Python

I recently had the opportunity to be a technical reviewer for the new Building Machine Learning Systems in Python. As I took part in the book, I won’t write a review like I did for other books.

First, I have to say that I was impressed by the quality of the content. Although I had some things that I thought were not excellent (I still need to check how my reviews changed the book), it’s the best book I’ve read from Packt so far. It has a good balance between code and comprehension, which is an equilibrium that is rarely achieved.

I don’t think it is possible to write a better book on Machine Learning in Python, unless the ecosystem evolves with new algorithms. Which it will, and it will mean a new edition of the book! Neat!

September 03, 2013

Arink Verma

Almost 12% of useless time consumption by PyUFunc_GetPyValues

For every single operation call, numpy has to extract the values of buffersize, errormask and name to pack and build the error object. These two functions, _extract_pyvals and PyUFunc_GetPyValues, together use >12% of the time.

 Callgraph showing contribution of PyUFunc_GetPyValues

I think this time is wasted because it is all spent looking up entries in a Python dict, extracting them, and converting them into C-level data -- not once, but again and again on every operation. Instead, these values should be converted once, at load time or when they are first set.

September 02, 2013

William Stein

Status report: integrating IPython into https://cloud.sagemath.com -- my approach

I'm still working on the IPython notebook integration into https://cloud.sagemath.com right now. This will be a valuable new feature for users, since there's a large amount of good content out there being developed as IPython notebooks, and the IPython notebook itself is fast and rock solid.

I spent the last few days (it took longer than expected) creating a generic way to *securely* proxy arbitrary http-services from cloud projects, which is now done. I haven't updated the page yet, but I implemented code so that
   https://cloud.sagemath.com/[project-id]/port/[port number]/...
gets all http requests automatically proxied to the given port in the indicated project. Only logged-in users with write access to that project can access this url -- with a lot of work, I think I've set things up so that one can safely create password-less, non-SSL web services for a group of collaborators, with all the authentication just piggybacking on cloud.sagemath accounts and projects: it's SSL-backed (with a valid cert) security almost for free, which solves what I know to be a big problem for users.

The above approach is also nice, since I can embed IPython notebooks via an iframe in cloud.sagemath pages, and the url is exactly the same as cloud.sagemath's, which avoids subtle issues with firewalls, the same-origin policy, etc. For comparison, here's what the iframe that contains a single ipynb worksheet looks like for wakari.io:
      <iframe class="notebookiframe" id="" src="https://prod-vz-10.wakari.io:9014/auto_login/acd84627972f91a0838e512f32e09c9823782ec0?next=/notebook_relative/Listing 2.ipynb">
and here's what it's going to look like in cloud.sagemath:
       <iframe class="notebookiframe" id="" src="https://cloud.sagemath.com/70a37ef3-4c3f-4bda-a81b-34b894c89701/port/9100/Listing 2.ipynb">
With the wakari.io approach, some users will find that notebooks just don't work, e.g., students at the University of Arizona, at least if their wifi still doesn't allow connecting to nonstandard ports, as it didn't when I tried to set up a Sage notebook server there once for a big conference. By having exactly the same page origin and no nonstandard ports, the way I set things up, the parent page can also directly call javascript functions in the iframe (and vice versa), which is potentially very useful.

IPython notebook servers will be the first to use this framework, then I'll use something similar to serve static files directly out of projects. I'll likely also add sage cell server and the classic sage notebook as well at some point, and maybe wiki's, etc.