## July 17, 2018

### Continuum Analytics

#### New Release of Anaconda Enterprise features Expanded GPU and Container Usage

Anaconda, Inc. is thrilled to announce the latest release of Anaconda Enterprise, our popular AI/ML enablement platform for teams at scale. The release of Anaconda Enterprise 5.2 adds capabilities for GPU-accelerated, scalable machine learning and cloud-native model management, giving enterprises the power to respond at the speed required by today’s digital interactions.  Anaconda Enterprise—An AI/ML …

The post New Release of Anaconda Enterprise features Expanded GPU and Container Usage appeared first on Anaconda.

### Matthew Rocklin

#### Dask Development Log, Scipy 2018

This work is supported by Anaconda Inc

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Last week many Dask developers gathered for the annual SciPy 2018 conference. As a result, very little work was completed, but many projects were started or discussed. To reflect this change in activity this blogpost will highlight possible changes and opportunities for readers to further engage in development.

The dask-jobqueue project was a hit at the conference. Dask-jobqueue helps people launch Dask on traditional job schedulers like PBS, SGE, SLURM, Torque, LSF, and others that are commonly found on high performance computers. These schedulers are very common among scientific, research, and high performance machine learning groups, but they are often a bit hard to use with anything other than MPI.
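To make the idea concrete, here is a minimal sketch of what launching a Dask worker through a batch scheduler roughly amounts to: render a job script that asks the scheduler for resources and then starts a worker pointed at a running Dask scheduler. This is an illustration only, not dask-jobqueue's actual template or API. The function name `render_slurm_worker_script`, the scheduler address, and the resource values are all hypothetical.

```python
def render_slurm_worker_script(scheduler_address, cores, memory_gb,
                               walltime="01:00:00", job_name="dask-worker"):
    """Return the text of a SLURM batch script that starts one Dask worker.

    Hypothetical helper for illustration; not part of dask-jobqueue.
    """
    lines = [
        "#!/usr/bin/env bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cores}",
        f"#SBATCH --mem={memory_gb}G",
        f"#SBATCH --time={walltime}",
        "",
        # The worker connects back to an already-running Dask scheduler.
        f"dask-worker {scheduler_address} --nthreads {cores} --memory-limit {memory_gb}GB",
    ]
    return "\n".join(lines) + "\n"


# Made-up scheduler address and resources
script = render_slurm_worker_script("tcp://10.0.0.1:8786", cores=4, memory_gb=16)
print(script)
```

A tool like dask-jobqueue takes care of this kind of templating and of submitting the jobs, so users only have to state how many cores and how much memory each job should have.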

This project came up in the Pangeo talk, lightning talks, and the Dask Birds of a Feather session.

During sprints a number of people came up and we went through the process of configuring Dask on common supercomputers like Cheyenne, Titan, and Cori. This process usually takes around fifteen minutes and will likely be the subject of a future blogpost. We published known-good configurations for these clusters on our configuration documentation.

Additionally, there is a JupyterHub issue to improve documentation on best practices to deploy JupyterHub on these machines. The community has done this well a few times now, and it might be time to write up something for everyone else.

### Get involved

If you are an administrator on a supercomputer you might consider helping to build a configuration file and place it in /etc/dask for your users. You might also want to get involved in the JupyterHub on HPC conversation.

Olivier Grisel and Tom Augspurger prepared and delivered a great talk on the current state of the new Dask-ML project.

## MyBinder and Bokeh Servers

Not a Dask change, but Min Ragan-Kelley showed how to run services through mybinder.org that are not only Jupyter. As an example, here is a repository that deploys a Bokeh server application with a single click.

I think that by composing with Binder, Min effectively just created a free-to-use hosted Bokeh server service. Presumably this same model could be adapted to other applications just as easily.

## Dask and Automated Machine Learning with TPOT

Dask and TPOT developers are discussing parallelizing the automatic-machine-learning tool TPOT.

TPOT uses genetic algorithms to search over a space of scikit-learn style pipelines to automatically find a decently performing pipeline and model. This involves a fair amount of computation which Dask can help to parallelize out to multiple machines.
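As a rough illustration of why this kind of search parallelizes well, here is a toy genetic search in plain Python. It is not TPOT's algorithm or API. The "candidates" are just pairs of integers standing in for pipeline settings, `score` is a made-up fitness function, and a standard-library thread pool stands in for a Dask cluster. The point is only that every candidate in a generation can be scored independently.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def score(candidate):
    """Toy fitness: higher is better, and (3, 7) is the best possible candidate."""
    depth, width = candidate
    return -((depth - 3) ** 2 + (width - 7) ** 2)


def mutate(candidate, rng):
    """Nudge each setting by -1, 0, or +1."""
    return tuple(value + rng.choice((-1, 0, 1)) for value in candidate)


def evolve(generations=30, population_size=20, keep=5, seed=0):
    rng = random.Random(seed)
    population = [(rng.randint(0, 10), rng.randint(0, 10))
                  for _ in range(population_size)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(generations):
            # Scoring is independent per candidate, so it can run in parallel.
            scores = list(pool.map(score, population))
            ranked = sorted(zip(scores, population), reverse=True)
            survivors = [candidate for _, candidate in ranked[:keep]]
            # Refill the population with mutated copies of the survivors.
            children = [mutate(rng.choice(survivors), rng)
                        for _ in range(population_size - keep)]
            population = survivors + children
    return max(population, key=score)


best = evolve()
print(best, score(best))
```

In a real setting each call to `score` would fit and cross-validate a full pipeline, which is where the expensive computation lives. Dask's distributed `Client` offers a similar `map`/`submit` interface, so the same pattern can spread across machines.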

### Get involved

Trivial things work now, but to make this efficient we’ll need to dive in a bit more deeply. Extending that pull request to dive within pipelines would be a good task if anyone wants to get involved. This would help to share intermediate results between pipelines.

Among various features, Scikit-optimize offers a BayesSearchCV object that is like Scikit-Learn’s GridSearchCV and RandomSearchCV, but is a bit smarter about how to choose new parameters to test given previous results. Hyper-parameter optimization is a low-hanging fruit for Dask-ML workloads today, so we investigated how the project might help here.

So far we’re just experimenting using Scikit-Learn/Dask integration through joblib to see what opportunities there are. Discussion among Dask and Scikit-Optimize developers is happening here:

## Centralize PyData/Scipy tutorials on Binder

We’re putting a bunch of the PyData/Scipy tutorials on Binder, and hope to embed snippets of Youtube videos into the notebooks themselves.

This effort lives here:

### Motivation

The PyData and SciPy community delivers tutorials as part of most conferences. This activity generates both educational Jupyter notebooks and explanatory videos that teach people how to use the ecosystem.

However, this content isn’t very discoverable after the conference. People can search on Youtube for their topic of choice and hopefully find a link to the notebooks to download locally, but this is a somewhat noisy process. It’s not clear which tutorial to choose and it’s difficult to match up the video with the notebooks during exercises. We’re probably not getting as much value out of these resources as we could be.

To help increase access we’re going to try a few things:

1. Produce a centralized website with links to recent tutorials delivered for each topic
2. Ensure that those notebooks run easily on Binder
3. Embed sections of the talk on Youtube within each notebook so that the explanation of the section is tied to the exercises

### Get involved

This only really works long-term under a community maintenance model. So far we’ve only done a few hours of work and there is still plenty to do in the following tasks:

1. Find good tutorials for inclusion
2. Ensure that they work well on mybinder.org
• are self-contained and don’t rely on external scripts to run
• have an environment.yml or requirements.txt
• don’t require a lot of resources
3. Find video for the tutorial
4. Submit a pull request to the tutorial repository that embeds a link to the youtube talk at the top cell of the notebook at the proper time for each notebook
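For the last step, a small helper can turn a timestamp in the talk into an embed link that starts at the right moment. This is only a sketch using the standard library. The video id, notebook names, and offsets below are placeholders, and the helper names are made up for this example.

```python
from urllib.parse import urlencode


def to_seconds(timestamp):
    """Convert 'H:MM:SS' or 'MM:SS' into a number of seconds."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def youtube_embed_url(video_id, start_seconds=0):
    """Build an embed URL that starts playback at the given offset."""
    query = urlencode({"start": int(start_seconds)})
    return f"https://www.youtube.com/embed/{video_id}?{query}"


# Placeholder video id and section offsets, for illustration only.
sections = {"01-arrays.ipynb": "0:12:30", "02-dataframes.ipynb": "1:05:00"}
for notebook, timestamp in sections.items():
    print(notebook, youtube_embed_url("VIDEO_ID", to_seconds(timestamp)))
```

Inside a notebook, I believe `IPython.display.YouTubeVideo` accepts the same `start` offset as a keyword argument, which avoids building the URL by hand.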

I really enjoyed the talk on Ray, another distributed task scheduler for Python. I suspect that Dask will steal ideas for actors for stateful operation. I hope that Ray takes on ideas for using standard Python interfaces so that more of the community can adopt it more quickly. I encourage people to check out the talk and give Ray a try. It’s pretty slick.

Dask and Scikit-learn developers had the opportunity to sit down again and raise a number of issues to help plan near-term development. This focused mostly around building important case studies to motivate future development, and identifying algorithms and other projects to target for near-term integration.

### Get involved

We could use help in building out case studies to drive future development in the project. There are also several algorithmic places to get involved. Dask-ML is a young and fast-moving project with many opportunities for new developers to get involved.

## Dask and UMAP for low-dimensional embeddings

Leland McInnes gave a great talk, Uniform Manifold Approximation and Projection for Dimensionality Reduction, in which he lays out a well-founded algorithm for dimensionality reduction, similar to PCA or t-SNE, but with some nice properties. He worked with some Dask developers, and together we identified some challenges due to Dask array slicing with random-ish slices.
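One plausible way to see why random-ish slices are awkward for a chunked array is to count how many chunks a selection touches. The sketch below is plain Python with made-up sizes, not Dask code, and it is only a guess at the flavor of the problem rather than a description of the actual issue that was identified.

```python
import random


def chunks_touched(indices, chunk_size):
    """Return the set of chunk numbers that hold the requested rows."""
    return {index // chunk_size for index in indices}


# Made-up sizes: one million rows stored in chunks of ten thousand rows
n_rows, chunk_size, n_samples = 1_000_000, 10_000, 500
rng = random.Random(0)

contiguous = range(0, n_samples)                  # like x[:500]
scattered = rng.sample(range(n_rows), n_samples)  # like x[random_indices]

print("contiguous slice touches", len(chunks_touched(contiguous, chunk_size)), "chunk(s)")
print("random-ish slice touches", len(chunks_touched(scattered, chunk_size)), "chunk(s)")
```

A contiguous slice stays within one chunk here, while the same number of scattered rows pulls from many chunks, each of which contributes only a handful of rows.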

A proposal to fix this problem lives here, if anyone wants a fun problem to work on:

If you use Dask and want to share your story we would absolutely welcome your experience. Having people like yourself share how they use Dask is incredibly important for the project.

## July 16, 2018

### Matthew Rocklin

This work is supported by Anaconda Inc

People often ask general questions like “Who uses Dask?” or more specific questions like the following:

1. For what applications do people use Dask dataframe?
2. How many machines do people often use with Dask?
3. How far does Dask scale?
4. Does dask get used on imaging data?
5. Does anyone use Dask with Kubernetes/Yarn/SGE/Mesos/… ?
6. Does anyone in the insurance industry use Dask?

This yields interesting and productive conversations in which new users can dive into historical use cases, which inform their choices about whether and how to use the project in the future.

New users can learn a lot from existing users.

To further enable this conversation we’ve made a new tiny project, dask-stories. This is a small documentation page where people can submit how they use Dask and have that published for others to see.

To seed this site six generous users have written down how their group uses Dask. You can read about them here:

We’ve focused on a few questions, available in our template, that emphasize problems over technology and ask for negative as well as positive feedback to get a complete picture.

1. Who am I?
2. What problem am I trying to solve?
4. What pain points did I run into with Dask?
5. What technology do I use around Dask?

### Easy to Contribute

Contributions to this site are simple Markdown documents submitted as pull requests to github.com/dask/dask-stories. The site is then built with ReadTheDocs and updated immediately. We tried to make this as smooth and familiar to our existing userbase as possible.

This is important. Sharing real-world experiences like this is probably more valuable than code contributions to the Dask project at this stage. Dask is more technically mature than it is well-known. Users look to other users to help them understand a project (think of every time you’ve Googled for “some tool in some topic”).

If you maintain another project you might consider implementing the same model. I hope that this proves successful enough for other projects in the ecosystem to reuse.

## July 13, 2018

### Continuum Analytics

#### Deep Learning with GPUs in Anaconda Enterprise

AI is a hot topic right now. While a lot of the conversation surrounding advanced AI techniques such as deep learning and machine learning can be chalked up to hype, the underlying tools have been proven to provide real value. Even better, the tools aren’t as hard to use as you might think. As Keras …

The post Deep Learning with GPUs in Anaconda Enterprise appeared first on Anaconda.

## July 08, 2018

### Titus Brown

#### The Open Source Anti-Sisyphean League

(This title commonsed from Cory Doctorow)

I’ve been thinking about the design principles for sustainable open online resources a lot lately, and I really like a phrase that Cory Doctorow came up with: “an open source anti-Sisyphean league.” And I am wondering if this is one of the major motivations for community formation around open online resources.

Whence “Sisyphean”? Sisyphus is a figure from Greek mythology; to quote Wikipedia, "He was punished for his self-aggrandizing craftiness and deceitfulness by being forced to roll an immense boulder up a hill only for it to roll down when it nears the top, repeating this action for eternity. [...] tasks that are both laborious and futile are therefore described as Sisyphean."

When I was pitching the Common Pool Resource framework to Cory, and trying to relate it to the shared community labor meme in Walkaway, one of the points that came up is that there is a non-trivial amount of maintenance labor that simply needs to be done by somebody to keep the project going. In open source projects this can range from keeping the continuous integration running to curating new issues to dealing with tests broken by some dependency upgrade; in wikis, spam removal and link fixing qualify; and in the Carpentries, lesson maintenance is an ongoing burden. Riffing on this idea, Cory said, “It’s like we need an ‘effin open source anti-Sisyphean League!” to handle these laborious and never-ending issues in common.

So perhaps one organizing principle in communities that sustain open online resources is that they are partly organized around these maintenance issues?

I think Cory is taking it further, tho. I took the “league” term as referring to the idea that, in open source, another organizing principle is that there is some number of common goals that simply need to be done by someone. In a perfectly spherical world where IP restrictions didn’t exist, knowledge and code could flow freely from project to project, so when someone solved a problem like, oh, say, building a system to link and distribute content via a distributed network of servers, that solution could be reused and remixed by everyone. (It's just crazy enough to work!) And, while there might be different trial solutions to any given problem, the communities behind those solutions would be able to learn from each other and iterate and maybe eventually converge to a small set of high quality solutions.

This bears a close resemblance to how my favorite open source communities work. When I look at the Python world, I see a plethora of small experimental Python modules that solve various problems, and a much smaller, higher quality collection of maintained modules. It’s relatively rare for me to have to choose between two Python libraries for a particular task, because usually there is only one well maintained one. When I do have to choose, it’s either because it’s in an expanding area (Web dev, back when), or because the Python stdlib has an old library that is kept around for reasons of backwards compatibility, or because there are still really strong differing opinions on how to handle that particular use case (see: argparse and the plethora of command line option parsers).

I’d guess that in any open community, there are several sources of friction that prevent more rapid convergence to a single solution. Poor awareness and/or bad communication are definitely one set of reasons - sometimes it’s hard to find the right search keywords to discover a project that does what you need, or you find that the project didn’t document itself properly. Another source of anti-convergence friction is stability: some people are going to go with a suboptimal solution that has been around for a while, because the existing community and documentation is so good, or maybe just because they’re familiar with it. Another obvious friction is ego and personality, where people refuse to adopt another community’s approach because key people in one community don’t like key people in the other community at an interpersonal level. And, of course, there may actually be several near-optimal solutions to any given problem, in which case multiple stable solutions may exist. Perhaps another is honest disagreement on approach - for example, while I am certainly aware of the Carpentry genomics lessons, we have chosen a somewhat different style and delivery format for our ANGUS 2018 genomics lessons, because we find it fits our needs better. (But here, I am hoping that we eventually converge, and we’re doing experiments to figure out what does and doesn’t work in both sets of lessons.)

(Interestingly, academia has failed quite spectacularly in the area of converging solutions. The plethora of virtually identical bioinformatics solutions to any given problem (mapping! annotation!) largely exists because in academia we are incentivized more for the appearance of knowledge production than for actual progress on hard problems. Many of my colleagues persist in working on the really hard problems out of idealism, but it’s a long and somewhat thankless road! In academia, we have adopted many of the bad approaches above: we communicate about solutions poorly with high latency (publications anyone?), we have little incentive to shift to new & better solutions, and ego and self-promotion run rampant within academic circles.)

On the flip side, you can see an amazing convergence in many places in the more practical side of computing. The convergence on R and Python as the data science lingua francas has been amazing - yes, there are still two languages, but even there I am starting to notice that approaches are converging, helped along by cross-language interlocutors. (Any bets on how long it will take for the #rstats folk to converge on vega for viz?) The rise of Docker and convergence on Kubernetes has been astounding. The speed with which the bioinformatics community seems to have adopted bioconda is astonishing - my entire lab shifted to using it overnight, AFAICT.

Returning to the Common Pool Resource framework, if we view "effort" or “labor” as the common pool resource being managed in the creation and maintenance of open online resources, then what we are trying to do here is minimize redundant labor being used to solve problems of collective interest. To again borrow from Cory, "We only want to push that effin' rock up the hill once, if we can manage it."

If you were to say “but this is all obvious!”, I would agree that this anti-Sisyphean organizing principle is entirely obvious and clear to me in retrospect.

Moreover, it can be rephrased in several ways (practical utility suggests that fewer solutions are better, all other things being equal! ecosystem principles suggest that a winner-take-all approach is likely when information flows freely!)

But in some sense that’s the point of writing this blog post: we should articulate the obvious stuff when we’re thinking about design principles for open online resources! This is for two reasons. First, if it doesn’t fit, then best to find counterexamples early. Second, if it does fit, then by articulating it we are not only communicating it clearly to others but also building a foundation for the next steps of inquiry.

For example, I have questions! How does a community coalesce around recognition of a common goal? How do we distinguish between global common goals (“how do we enable flexible viz in data science?”) vs more local goals (“my local bioinformatics community really needs a library to help me visualize this type of data”)? What is the community lifecycle beyond creation, e.g. maintenance, sustainability, merging, and forking? What are the unique properties of open online resources (or digital public goods, as Nadia Eghbal calls them) that make them different - is it “just” the ease with which digital resources can spread in a networked environment, or is there more to it?

Inquiring minds want to know!

best, —titus

p.s. A really interesting question is, are there common organizing principles around building these open source anti-Sisyphean leagues? Gosh I sure hope someone's asking that!

p.p.s. I will abandon the tortured “open online resources” term soon, I promise - I have been discovering a whole new world of terminology that fits this idea much better, and it seems only appropriate to try to coalesce around some common terminology in this space. See the entire above post for reasons. :)

p.p.p.s. Nadia Eghbal pointed me at this fascinating post by Jill Carlson: "Free Company: the Decentralized Future of Work", which comes at the same question from a different angle.

### Continuum Analytics

Building powerful machine learning models often requires more computing power than a laptop can provide. Although it’s fairly easy to provision compute instances in the cloud these days, all the computing power in the world won’t help you if your machine learning library cannot scale. Unfortunately, popular libraries like scikit-learn, XGBoost, and TensorFlow don’t offer …

### Matthew Rocklin

This work is supported by Anaconda Inc

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Current efforts for June 2018 in Dask and Dask-related projects include the following:

1. Yarn Deployment
2. More examples for machine learning
3. Incremental machine learning
4. HPC Deployment configuration

### Yarn deployment

Most Hadoop/Spark/Hive clusters are actually Yarn clusters. Yarn is the most common cluster manager for clusters that run Hadoop/Spark/Hive jobs, including any cluster purchased from a vendor like Cloudera or Hortonworks. If your application can run on Yarn then it can be a first-class citizen here.

Unfortunately Yarn has really only been accessible through a Java API, and so has been difficult for Dask to interact with. That’s changing now with a few projects, including:

• skein: an easy way to launch generic services on Yarn clusters (this is primarily what backs dask-yarn)
• conda-pack: an easy way to bundle a conda environment into a redeployable archive, which is useful when launching Python applications on Yarn

This work is all being done by Jim Crist who is, I believe, currently writing up a blogpost about the topic at large. Dask-yarn was soft-released last week though, so people should give it a try and report feedback on the dask-yarn issue tracker. If you ever wanted direct help on your cluster, now is the right time because Jim is working on this actively and is not yet drowned in user requests so generally has a fair bit of time to investigate particular cases.

```python
from dask.distributed import Client
from dask_yarn import YarnCluster

# Create a cluster where each worker has two cores and eight GB of memory
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GB")
# Scale out to ten such workers
cluster.scale(10)

# Connect to the cluster
client = Client(cluster)
```


### More examples for machine learning

Dask maintains a Binder of simple examples that show off various ways to use the project. This allows people to click a link on the web and quickly be taken to a Jupyter notebook running on the cloud. It’s a fun way to quickly experience and learn about a new project.

Previously we had a single example for arrays, dataframes, delayed, machine learning, etc.

Now Scott Sievert is expanding the examples within the machine learning section. He has submitted the following two so far:

I believe he’s planning on more. If you use dask-ml and have recommendations or want to help, you might want to engage in the dask-ml issue tracker or dask-examples issue tracker.

### Incremental training

The incremental training mentioned as an example above is also new-ish. This is a Scikit-Learn style meta-estimator that wraps around other estimators that support the partial_fit method. It enables training on large datasets in an incremental or batchwise fashion.

#### Before

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(...)

# Load one file at a time and update the model with each batch
for filename in filenames:
    X, y = ...
    sgd.partial_fit(X, y)
```


#### After

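```python
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(...)
inc = Incremental(sgd)  # wrap any estimator that supports partial_fit

X, y = ...
inc.fit(X, y)
```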


#### Analysis

From a parallel computing perspective this is a very simple and un-sexy way of doing things. However my understanding is that it’s also quite pragmatic. In a distributed context we leave a lot of possible computation on the table (the solution is inherently sequential) but it’s fun to see the model jump around the cluster as it absorbs various chunks of data and then moves on.
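To show why the solution is inherently sequential, here is a toy stand-in for such a wrapper written in plain Python. It is not dask-ml's implementation. `RunningMean` and `SequentialIncremental` are hypothetical names made up for this sketch, and the "blocks" are ordinary lists rather than chunks of a Dask array.

```python
class RunningMean:
    """Toy estimator with a scikit-learn style partial_fit method."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def partial_fit(self, X, y=None):
        # Update the model state using only this batch of data.
        self.count += len(X)
        self.total += sum(X)
        return self

    @property
    def mean_(self):
        return self.total / self.count


class SequentialIncremental:
    """Hypothetical wrapper: feed blocks to partial_fit one after another."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, blocks):
        # Each block updates the same model, so the calls must run in order.
        for X in blocks:
            self.estimator.partial_fit(X)
        return self


blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
inc = SequentialIncremental(RunningMean()).fit(blocks)
print(inc.estimator.mean_)
```

Only one block is needed in memory at a time, which is the pragmatic part. The cost is that block `k + 1` cannot start until block `k` has finished updating the model.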

There’s ongoing work on how best to combine this with other work like pipelines and hyper-parameter searches to fill in the extra computation.

This work was primarily done by Tom Augspurger with help from Scott Sievert.

Dask developers are often asked “Who uses Dask?”. This is a hard question to answer because, even though we’re inundated with thousands of requests for help from various companies and research groups, it’s never fully clear who minds having their information shared with others.

We’re now trying to crowdsource this information in a more explicit way by having users tell their own stories. Hopefully this helps other users in their field understand how Dask can help and when it might (or might not) be useful to them.

We originally collected this information in a Google Form but have since then moved it to a Github repository. Eventually we’ll publish this as a proper web site and include it in our documentation.

If you use Dask and want to share your story this is a great way to contribute to the project. Arguably Dask needs more help with spreading the word than it does with technical solutions.

### HPC Deployments

The Dask Jobqueue package for deploying Dask on traditional HPC machines is nearing another release. We’ve changed around a lot of the parameters and configuration options in order to improve the onboarding experience for new users. It has been going very smoothly in recent engagements with new groups, but will mean a breaking change for existing users of the sub-project.

## July 04, 2018

### Randy Olson

#### Does batting order matter in Major League Baseball? A simulation approach

If you’ve ever watched Major League Baseball, one of the feature points of the sport is the batting line-up that each team decides upon before each game. Traditional baseball logic tells us that speedy, reliable hitters like Trea Turner should

## July 02, 2018

### Continuum Analytics

#### Anaconda and Full Spectrum Analytics Partner to Deliver Enterprise Data Science to Banks, Lenders, and Investments Firms

Anaconda, Inc., the most popular Python data science platform provider with 2.5 million downloads per month, is pleased to announce a new partnership with Full Spectrum Analytics, a data science consultancy that applies vast industry experience and advanced analytics capabilities to help lending businesses and retail banks leverage their own data to grow resiliently and …

## July 01, 2018

### Titus Brown

#### A framework for thinking about Open Source Sustainability?

I just revisited Nadia Eghbal’s wonderful post on “the tragedy of the commons” and her thoughts of an alternate ending for it, based on Elinor Ostrom's work on Common Pool Resources, and it resonated with some thinking I’d been doing in another context, and I wanted to share.

Nadia has been exploring the open source sustainability problem (ref), in which a good deal of our important open source software is maintained by a relatively small number of people without much in the way of guaranteed funding. The size and scope of the problem vary depending on who you talk to, but there was a pretty shocking picture of the scientific Python computing ecosystem in which there were only half a dozen maintainers for numpy. Since quite a bit of the Python scientific computing ecosystem relies on numpy, it seems critically challenging to have so few maintainers. For an excellent detailed discussion on a specific instance of the general challenges around software development, see "The Astropy problem", Muna et al., 2016.

(The discussion below is mostly focused on scientific software, but I think it might apply much more broadly.)

I myself work mostly in bioinformatics, where the field uses a constantly frothing mixture of software packages that are maintained (or not ;) by a wide variety of people - some graduate students, some postdocs, some faculty, some staff. We develop software in my lab (mostly khmer and sourmash along with some other things) and over the years we’ve developed tips and tricks for keeping the software going, mostly revolving around testing and continuous integration. But there is always something that isn’t working, and even with automation maintenance is a constant low-level burden. Luckily very few people use our software compared to projects like Jupyter, so we are not particularly deluged by bug reports and maintenance issues.

That having been said, the constant need to maintain our open source software affects us quite a bit. It is rare that a week goes by where some piece of software we maintain isn’t revealed to have a bug, or need some evolution to add a new function or command line flag. If I and the other people in the lab are on a research kick (vs coding kick), then we may not get to the problem for a while.

The same is true of training materials. We run an annual two-week workshop on sequence analysis, and every year we need to evolve the lessons a bit to account for new methods, new software, and new data types. While some of the lessons from 2010 may still work, my guess is that most of them have undergone bitrot.

From my own experience as well as from observing quite a few packages over the years, I’ve come to the firm conclusion that open online projects (including software and training materials) that aren’t actively maintained quickly decay. And, if people are actively using a project, they will invariably find bugs and problems that need to be fixed. This completely ignores the people that (lord forfend) actually want to help you improve your online projects and submit pull requests on GitHub that need to be reviewed and merged or rejected, or (if you’re really successful) the companies that want to join your project and merge in their own code.

This need for constant attention to projects, the sprawling ecosystem of amazing scientific software packages, and the relatively small community of actual maintainers, when combined, lead to the open source sustainability problem in science: we do not have the person power to keep it all running without heroic efforts. And when you couple this with the lack of clear career paths for software maintenance in science, it is clear that we cannot ethically and sustainably recruit more people into open source maintainership.

Recently a group of colleagues and I were brainstorming about another open online project (more on that later) and trying to frame it as a common pool resource problem. We were looking for this framing because we knew that success would present a sustainability problem and were hoping to use the common pool resource framework to think about sustainability.

## Common Pool Resources, the Tragedy of the Commons, and Design Principles for Sustainability

Virtually everything I know about the common pool resource framework comes from Elinor Ostrom's excellent book, Governing the Commons. This is a tremendously readable book that outlines the problem in a very general way, and discusses how it has been solved by communities in the past.

Briefly, in the 60s and 70s, Elinor Ostrom and her collaborators noted that the so-called "tragedy of the commons", in which common pool resources were over-utilized by selfish actors, was not an inevitable end; that government regulation and corporatization were not the only solution to managing commonses; and that, in fact, many communities had figured out how to manage common pool resources locally. From her own and others' case studies, Ostrom extracted eight "design principles" for sustainability of common pool resources.

Nadia does a great rundown of this in her blog post, so I will just point you at the eight design principles on Wikipedia, which are very readable.

For this work, Ostrom received the 2009 Nobel prize.

## Back to open online projects

So, my colleagues and I wondered, how would this framework apply to an open online project in which we were working with digital resources? After all, digital resources are not consumable in the same way as physical resources, and (to borrow from the open source musings above) it’s not like someone using our project’s source code is consuming that source code in such a way that it would not be usable by others.

During this conversation, we realized the answer: effort. The common pool resource in open online projects is effort.

When a contributor to a project adds a feature, what are they doing? Applying effort. When a contributor files a bug report? They’re applying effort. When they file a good bug report? More effort. When they write documentation? When they test a feature? When they suggest a new feature? They’re applying effort, all of it.

But it goes deeper than that. When you bring a new contributor into a project, you’re growing the available pool of effort. When you engage a new investor in supplying funding for an open source project, often that funding goes to increasing the amount of dedicated effort that is being applied to the project.

Of course, not all contributions are positive in their effect on effort, as I wrote about here. Some contributions (new feature proposals, or bad bug reports) cost the project more net effort than they bring. Significant feature additions that don’t come with contributions to the underlying maintenance of the project can be very costly to the core project maintainers, if only in terms of reviewing and rejecting. And underpinning all of this is the low susurration of maintenance needs: as I outline above, maintenance needs act as a net drag on project effort.

At #GCCBOSC, Fernando Perez riffed on this same idea a bit, and pointed out that there are other extractive approaches being used by people who recruit from open source projects. Many companies recruit out of open source communities, and in a simple sense they are mining the effort that the open source community put into training the people in question.

If you look at the list of eight design principles for a sustainable common pool resource, and define “effort” as the common pool resource in question, you see that they apply more or less directly to the way open source projects have evolved:

1. Who is a contributor to an open source project is clearly defined.
2. Effort in open online projects is applied locally, to the needs of the project.
3. Many open source projects follow the rule that those who contribute participate in design decisions.
4. People who contribute significantly are often invited to join the project officially in more significant decision making roles.
5. There is often a range of sanctions for contributors who violate community rules.
6. Most conflicts are handled internally to the project, rather than being escalated to the legal system.
7. Most conflicts are handled by lightweight methods and discussions.
8. Many open source contributors contribute to multiple projects, e.g. in the Python ecosystem there are many projects to which the same people contribute. In this sense the Python ecosystem can be considered a larger-scale CPR of effort with many locally articulated CPRs like "core CPython dev" and "numerical computing/numpy library".

It seems likely to me that this will generalize to open communities of many kinds.

## So, uhh, what does this all mean?

Since my colleagues and I started thinking this way, and I started looking at open source projects and other online community resources through this lens, it has proven to be a really nice, simple framework for thinking about open source sustainability. The entire “How open is too open?” blog post came directly out of this thinking! It also gives a straightforward explanation for why recruiting more people to your project is viewed so positively: you’re increasing the pool of effort available for allocation to the project's needs. This further explains why Codes of Conduct and contributor guidelines are becoming so important: they enable you to recruit and retain effort over the long term.

By itself, this perspective doesn’t solve any problems. But it does tie into a really nice collection of case studies, and a lot of deep thinking from the CPR literature about how to sustainably manage community resources.

More specifically, in the context of generic open online projects, it suggests a few points of consideration.

First, the pool of effort available to an open online project needs to be preserved against encroachment. For successful projects, this means that potential contributions should be evaluated in terms of their net likely impact on the pool of effort. While this principle is already enshrined for technical contributions (see: “technical debt”), it should also be considered for bug reports and feature suggestions. (Many projects already do this, of course!)

Second, the cost of the constant maintenance needs (code, documentation, installation, etc.) on the pool of available effort needs to be taken into account. Contributions of new features that do not come with effort applied to maintenance should be carefully considered - is this new contributor likely to stick around? Can they and will they devote some effort to maintenance? If not, maybe those contributions should be deferred in favor of contributions that add maintenance effort to the project, e.g. via partnerships.

Third, training and nurturing new contributors should be considered in the cold hard light of increasing the available effort over the long term. But contributor psychology is tricky, so it may not be simple to predict who will stick around. Some projects have excellent incubators, like the Python Core Mentorship Program, where people who are interested in applying their effort to recruiting new contributors can do so. I suspect that considerations like creating a friendly environment and laying out expectations like “we’re happy to help you get up to speed on both adding new features AND FIXING BUGS so that you contribute to our maintenance effort” might help point new contributors in the right direction. In the long term, the health of the community is the health of the project.

Fourth, there are some interesting governance implications around allowing all or most of the resource appropriators to participate in decision making. I need to dig more into this, but, briefly, I think projects should formally lay out what level of investment and contribution is rewarded with what kind of operational, policy making, and constitutional decision making authority.

Fifth, defining maturity metrics may help with setting funder expectations and obtaining investments. From experience, the primary goal of many funders is to chart a path to project sustainability. I think the above design principles (and the case studies from CPR) can serve as a foundation for a set of project maturity rubrics that connect to sustainability. If a project is writing a funding proposal, they could articulate which design principles they’re targeting for improvement and how that ties into a broader framework of sustainability. For example, a project could say “right now we are worried about our ability to onboard many new contributors, and we’re starting to get some inquiries from companies about contributing on a larger scale; we’d like to build out our governance principles and improve our guidance for contributors, so that new contributors and investors have clear expectations about what level of investment is expected from them.” I suspect funders would welcome this level of clarity :).

## Does the Common Pool Resource framework for “effort” really fit open source projects, and open online projects generally?

Good question! I am by no means knowledgeable about CPR, and I have a lot of reading ahead of me. I can see a few disconnects with the CPR framework, and I need to work through those; but I’m really, really excited by how well it fits with my intuition about how open source projects work. Having a conceptual framework like CPR is letting me revisit my observations and fit them into a neater picture and maybe reach different conclusions about fixes. Again, check out the “How open is too open?” blog post for an example of this.

Something else I really want to try is to engage in case studies of open source projects to see how practices in real living sustained open source projects do and don’t fit this framework. I have a sabbatical opportunity coming up in a few years… hmm….

One of the things I really like about this framework is that it divorces open source project thinking from the fuzzy woo that I used to value so much about free and open source software -- “we should be a big happy family and all work together!” starts to pale in the face of maintainers' lives getting destroyed by their commitment to open. This intersects with a comment that Fernando Perez made: open source projects really do run a significant part of the world these days, and we cannot afford to think about any part of them - including sustainability - in the informal way we are used to. We should hold ourselves accountable to the goal of building sustainable open projects, and lay out a realistic and hard-nosed framework by which the investors with money (the tech companies and academic communities that depend on our toolkits) can help foster that sustainability. To be clear, in my view it should not be the job of (e.g.) Google to figure out how to contribute to our sustainability; it should be our job to tell them how they can help us, and then follow through when they work with us. Many important projects are really incapable of doing this at the moment, however, and in any case there are no simple solutions in this space. But maybe CPR is a framework for thinking about it. We shall see!

On a personal level, it's interesting to look back at my various open source project efforts (none of which are sustainable/sustained :) and see why they fail to meet the design principles laid out by Ostrom. That may be another blog post down the road...

--titus

P.S. I'd like to especially thank Cameron Neylon and Michael Nielsen for pointing Elinor Ostrom's work out to me a few years back! I would also like to thank Nadia Eghbal for all of her explorations of this topic - the framing she provided in her blog posts is really instrumental to this post, and I hope to be fellow travelers going forward :)

P.P.S. While I'm dropping names, I got excellent feedback and suggestions on this topic from many people, including Luiz Irber, Katy Huff, Katie Mack, Cory Doctorow, Jake VanderPlas, Tracy Teal, Fernando Perez, Michael Crusoe, and Matthew Turk - thank you all! Several of these conversations were enabled by #scifoo18 and #gccbosc, so thanks, SciFoo and BOSC!

## June 27, 2018

### Continuum Analytics

#### Scalable Machine Learning in the Enterprise with Dask

You’ve been hearing the hype for years: machine learning can have a magical, transformative impact on your business, putting key insights into the hands of decision-makers and driving industries forward. But many organizations today still struggle to extract value from their machine learning initiatives. Why? Building & Training Models on Your Laptop is No Longer …

The post Scalable Machine Learning in the Enterprise with Dask appeared first on Anaconda.

## June 26, 2018

### Matthew Rocklin

This work is supported by Anaconda Inc.

## History

For the first year of Dask’s life it focused exclusively on single node parallelism. We felt then that efficiently supporting 100+GB datasets on personal laptops or 1TB datasets on large workstations was a sweet spot for productivity, especially when avoiding the pain of deploying and configuring distributed systems. We still believe in the efficiency of single-node parallelism, but in the years since, Dask has extended itself to support larger distributed systems.

After that first year, Dask focused equally on both single-node and distributed parallelism. We maintain two entirely separate schedulers, one optimized for each case. This allows Dask to be very simple to use on single machines, but also scale up to thousand-node clusters and 100+TB datasets when needed with the same API.

Dask’s distributed system has a single central scheduler and many distributed workers. This is a common architecture today that scales out to a few thousand nodes. Roughly speaking Dask scales about the same as a system like Apache Spark, but less well than a high-performance system like MPI.

## An Example

Most Dask examples in blogposts or talks are on modestly sized datasets, usually in the 10-50GB range. This, combined with Dask’s history with medium data on single nodes, may have given people a more humble impression of Dask than is appropriate.

As a small nudge, here is an example using Dask to interact with 50 36-core nodes on an artificial terabyte dataset.

This is a common size for a typical modestly sized Dask cluster. We usually see Dask deployment sizes either in the tens of machines (usually with Hadoop style or ad-hoc enterprise clusters), or in the few-thousand range (usually with high performance computers or cloud deployments). We’re showing the modest case here just due to lack of resources. Everything in that example should work fine scaling out a couple extra orders of magnitude.

## Challenges to Scaling Out

For the rest of the article we’ll talk about common causes that we see today that get in the way of scaling out. These are collected from experience working both with people in the open source community and on private contracts.

### Simple Map-Reduce style

If you’re doing simple map-reduce style parallelism then things will be pretty smooth out to a large number of nodes. However, there are still some limitations to keep in mind:

1. The scheduler will have at least one, and possibly a few connections open to each worker. You’ll want to ensure that your machines can have many open file handles at once. Some Linux distributions cap this at 1024 by default, but it is easy to change.

2. The scheduler has an overhead of around 200 microseconds per task. So if each task takes one second then your scheduler can saturate 5000 cores, but if each task takes only 100ms then your scheduler can only saturate around 500 cores, and so on. Task duration imposes an inversely proportional constraint on scaling.

If you want to scale beyond this, each task will need to do more work so that the scheduler overhead is amortized. Often this involves moving inner for loops into tasks rather than spreading them out across many tasks.
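As a rough sketch of the arithmetic above, the snippet below estimates how many cores a single scheduler can keep busy for a given task duration, assuming the approximate 200 microsecond per-task overhead quoted earlier. The names `max_saturated_cores`, `batch`, and `process_batch` are illustrative and not part of Dask; the batching helper only shows the general idea of grouping many small inputs into fewer, larger tasks.

```python
SCHEDULER_OVERHEAD = 200e-6  # seconds per task, the approximate figure quoted above


def max_saturated_cores(task_duration, overhead=SCHEDULER_OVERHEAD):
    """Rough upper bound on the cores one scheduler can keep busy."""
    return task_duration / overhead


def batch(items, size):
    """Group items into lists of at most `size`, so one task handles a whole batch."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_batch(values):
    """Hypothetical task body: the inner for loop now lives inside the task."""
    return [v * 2 for v in values]


if __name__ == "__main__":
    for duration in (1.0, 0.1, 0.01):
        cores = max_saturated_cores(duration)
        print(f"{duration:>5} s tasks -> roughly {cores:,.0f} cores")

    inputs = list(range(10_000))
    batches = batch(inputs, 1_000)  # 10 larger tasks instead of 10,000 tiny ones
    results = [process_batch(b) for b in batches]
    print(len(batches), "tasks,", sum(len(r) for r in results), "results")
```

In a real deployment each batch would be submitted as one task, so the per-task overhead is paid once per batch rather than once per element.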

### More complex algorithms

If you’re doing more complex algorithms (which is common among Dask users) then many more things can break along the way. High performance computing isn’t about doing any one thing well, it’s about doing nothing badly. This section lists a few issues that arise for larger deployments:

1. Dask collection algorithms may be suboptimal.

The parallel algorithms in Dask-array/bag/dataframe/ml are pretty good, but as Dask scales out to larger clusters and its algorithms are used by more domains we invariably find that small corners of the API fail beyond a certain point. Luckily these are usually pretty easy to fix after they are reported.

2. The graph size may grow too large for the scheduler

The metadata describing your computation has to all fit on a single machine, the Dask scheduler. This metadata, the task graph, can grow big if you’re not careful. It’s nice to have a scheduler process with at least a few gigabytes of memory if you’re going to be processing million-node task graphs. A task takes up around 1kB of memory if you’re careful to avoid closing over any unnecessary local data.

3. The graph serialization time may become annoying for interactive use

Again, if you have million-node task graphs you’re going to be serializing them and passing them from the client to the scheduler. This is fine, assuming they fit at both ends, but can take up some time and limit interactivity. If you press compute and nothing shows up on the dashboard for a minute or two, this is what’s happening.

4. The interactive dashboard plots stop being as useful

Those beautiful plots on the dashboard were mostly designed for deployments with 1-100 nodes, but not 1000s. Seeing the start and stop time of every task of a million-task computation just isn’t something that our brains can fully understand.

This is something that we would like to improve. If anyone out there is interested in scalable performance diagnostics, please get involved.

5. Other components that you rely on, like distributed storage, may also start to break

Dask provides users more power than they’re accustomed to. It’s easy for them to accidentally clobber some other component of their systems, like distributed storage, a local database, the network, and so on, with too many requests.

Many of these systems provide abstractions that are very well tested and stable for normal single-machine use, but that quickly become brittle when you have a thousand machines acting on them with the full creativity of a novice user. Dask provides some primitives like distributed locks and queues to help control access to these resources, but it’s on the user to use them well and not break things.
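Returning to the graph-size point in item 2 above, a back-of-the-envelope estimate can help when sizing the scheduler machine. This is only a sketch: it assumes the rough figure of 1kB per task mentioned there, and `graph_memory_gb` is a hypothetical helper, not a Dask function. Tasks that close over local data can be much larger than this.

```python
BYTES_PER_TASK = 1_000  # ~1 kB per task, the rough figure from item 2 (an assumption)


def graph_memory_gb(n_tasks, bytes_per_task=BYTES_PER_TASK):
    """Approximate scheduler memory, in decimal GB, needed to hold the task graph."""
    return n_tasks * bytes_per_task / 1e9


if __name__ == "__main__":
    for n_tasks in (10_000, 1_000_000, 10_000_000):
        print(f"{n_tasks:>10,} tasks -> roughly {graph_memory_gb(n_tasks):.2f} GB")
```

Under these assumptions a million-task graph needs on the order of a gigabyte for metadata alone, which is consistent with the advice to give the scheduler at least a few gigabytes of memory.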

## Conclusion

Dask scales happily out to tens of nodes, like in the example above, or to thousands of nodes, which I’m not showing here simply due to lack of resources.

Dask provides this scalability while still maintaining the flexibility and freedom to build custom systems that has defined the project since it began. However, the combination of scalability and freedom makes it hard for Dask to fully protect users from breaking things. It’s much easier to protect users when you can constrain what they can do. When users stick to standard workflows like Dask dataframe or Dask array they’ll probably be ok, but when operating with full creativity at the thousand-node scale some expertise will invariably be necessary. We try hard to provide the diagnostics and tools necessary to investigate issues and control operation. The project is getting better at this every day, in large part due to some expert users out there.

## A Call for Examples

Do you use Dask on more than one machine to do interesting work? We’d love to hear about it either in the comments below, or in this online form.

## June 25, 2018

### Titus Brown

#### How open is too open?

When I look at open source projects, I divide the people involved into three categories: the investors, the contributors, and the users. The contributors do the work on the project, while the investors (if any) support the contributors in some way. The users are those who simply use the project without contributing to it.

For example, in sourmash, the investors are (primarily) the Moore Foundation, because they support most of the people working on the project via the Moore grant that I have. There are the contributors - myself, Luiz Irber, and many others in and out of my lab - who have submitted code, documentation, tutorials, or bug reports. And then there are the users, who have used the project and not contributed to it. (Projects can have many investors, many contributors, and many users, of course.)

I consider anybody who used sourmash and then contacted us - with a bug report, a question, or a suggestion - as a contributor. They may have made a small contribution, but it is a contribution nonetheless. I should add that those who cite us or build on us are contributing back in a reasonably significant way, by providing a formal indication that they found our code useful. This is a good signal of utility that is quite helpful when discussing new investments.

Users are interesting, because they contribute nothing to the project but also cost us nothing. If someone downloads sourmash, installs it, runs it, and gets a result, but for whatever reason never publishes their use or cites us, then they are a zero-cost user. If they file a bug report, that’s potentially a small burden on the project (someone has to pay attention to it), but - especially if they file a good bug report that makes it easy to track down the bug - then I think they are contributing back to the project, by helping us meet our long-term goals of less-buggy / more correct code.

Some (rare) contributors are more burden than help. They are the contributors who discover an interesting project, try it out, find that it doesn’t quite fit their needs, and then ask the developers to adjust it for them without putting any effort into it. Or, they ask many questions via private e-mail, consuming the time and energy of developers in private without contributing to the public discussion of the software’s scope and functionality. Or, they argue passionately about planned features without putting any other time into the project themselves. I call these extractive contributors.

These extractive contributors are far more of a burden than you might think. They consume the effort of the project with no gain to the project. Sometimes feature requests, questions, and high-energy discussions lead the project in new, worthwhile directions, but quite often they’re simply a waste of time and energy for everyone involved. (We don’t have any such contributors in sourmash, incidentally, but I’ve seen them in many other projects - the more well known and useful your project is, the more likely you are to have people who demand things of the project.) Quote from a friend: “They don’t contribute much code, but boy do they have strong opinions!”

You could certainly imagine an extractive contributor who implements some big new feature and then dumps it on the project with a request to merge (these are often called “code bombs”). If the feature was discussed beforehand and aligns with the direction of the project, that’s great! But sometimes people submit a merge request that simply won’t get merged - perhaps it’s misaligned with the project’s roadmap, or it adds a significant maintenance burden. Or, perhaps the project developers don’t know and trust the submitter enough to merge their code without a lot of review. Again, this is not a problem we’ve had in sourmash, but I know this happens with some frequency in the bigger Python projects.

You could even imagine a significant regular code contributor being extractive if they are not contributing to the maintenance of the code. If someone is working for a company, for example, and that company is asking them to implement features X, Y, and Z in a project, but not giving them time to contribute to the overall project maintenance and infrastructure as part of the core team, then they may be extracting more from the project than they are putting in. Again, on the big projects, I’m told this is a serious problem. To quote a friend, “sometimes pull requests are more effort than they are worth.”

I don’t know what the number or cost of extractive contributors is on big projects, but at least by legend they are a significant part of the software sustainability problem. Part of the problem is on the side of the core maintainers of any project, of course, who don’t protect their own time - in the open source world, developers are taught to value all users, and will often bend over backwards to meet users’ needs. But a larger part of the problem is on the side of the extractive contributors, who are effectively sapping valuable effort from the project’s contributors.

I don’t think it’s necessarily easy to identify extractive contributors, nor do I think it’s straightforward to draw well-considered boundaries around an open project in which you indicate exactly which contributions are welcome, and how. And some extractive contributors can turn into net positive contributors with a little bit of mentoring and effort; we could think of such an effort as incurring contributor debt that could be recouped if more "effort" is brought into the project than is lost, over the long term.

Looking at things through this lens, some features of the Python core dev group come into sharp focus. Python has a ‘python-ideas’ list where potentially crackpot ideas can be floated and killed without much effort if they are misaligned with the project. If an idea passes some threshold of critical review there, it can potentially move into a formal suggestion for python implementation via a Python Enhancement Proposal, which must follow certain formatting and content guidelines before it can even be considered. These two mechanisms seem to me to be progressive gating mechanisms that serve to block extractive users from sapping effort from the project: before a major change request will be taken seriously, first the low threshold of a successful python-ideas discussion has to be met, and then the significant burden of writing a PEP needs to be undertaken.

A few (many?) years ago, I seem to recall Martin von Löwis offering to review one externally contributed patch for every ten other patches reviewed by the submitter. (I can’t find the link, sorry!) This imposes work requirements on would-be contributors that obligate them to contribute substantively to the project maintenance, before their pet feature gets implemented.

Projects can also decrease the cost of extractive contributors by lowering the cost of engagement. For example, the “pull request hack” makes it possible for anyone who has made a small "minimally viable" contribution to a project to become a committer on the project. While it probably wouldn't work for big complex projects, on smaller projects you could imagine it working well, especially for bug fixes and documentation-centric issues.

Another mechanism of blocking extractive contributors is to gate contributions on tests: in sourmash and khmer, as in many other open source projects, we don’t even consider reviewing pull requests until they pass the continuous integration tests. We do help people who are having trouble with them, in general, but I almost never ask Luiz to review my own PRs until they pass tests. When applied to potential contributors, this imposes a minimum level of engagement and effort on the part of that contributor before they consume the time and energy of the central project.

I suspect there are actually a bunch of techniques that are used in this way, even if they serve purposes beyond gating contributors (we also care if our tests pass!). I’d be really interested in hearing from people if they have encountered strategies that seem to be aimed at blocking or lowering the cost of extractive contributors.

How does this connect with the title, "How open is too open?" Well, this question of sustainability and "extractive" contributors seems to apply to all putatively "open" projects, but techniques aimed at blocking extractive contributors seem to be trading openness for sustainability. And I’m curious if that’s something we need to pay attention to when building open communities, and how we should measure and evaluate the tradeoffs, and what clever social hacks people have for doing this.

—titus

## June 15, 2018

### Continuum Analytics

#### Introducing Dask for Scalable Machine Learning

Although Python contains several powerful libraries for machine learning, unfortunately, they don’t always scale well to large datasets. This has forced data scientists to use tools outside of the Python ecosystem (e.g., Spark) when they need to process data that can’t fit on a single machine. But thanks to Dask, data scientists can now use …

The post Introducing Dask for Scalable Machine Learning appeared first on Anaconda.

## June 14, 2018

### Matthew Rocklin

This work is supported by Anaconda Inc.

I’m pleased to announce the release of Dask version 0.18.0. This is a major release with breaking changes and new features. The last release was 0.17.5 on May 4th. This blogpost outlines notable changes since the last release blogpost for 0.17.2 on March 21st.

You can conda install Dask:

conda install dask


or pip install from PyPI:

pip install dask[complete] --upgrade


Full changelogs are available here:

We list some breaking changes below, followed up by changes that are less important, but still fun.

## Context

The Dask core library is nearing a 1.0 release. Before that happens, we need to do some housecleaning. This release starts that process, replaces some existing interfaces, and builds up some needed infrastructure. Almost all of the changes in this release include clean deprecation warnings, but future releases will remove the old functionality, so now would be a good time to check in.

As happens with any release that starts breaking things, many other smaller breaks get added on as well. I’m personally very happy with this release because many aspects of using Dask now feel a lot cleaner; however, heavy users of Dask will likely experience mild friction. Hopefully this post helps explain some of the larger changes.

## Notable Breaking changes

### Centralized configuration

Taking full advantage of Dask sometimes requires user configuration, especially in a distributed setting. This might be to control logging verbosity, specify cluster configuration, provide credentials for security, or any of several other options that arise in production.

We’ve found that different computing cultures like to specify configuration in several different ways:

1. Configuration files
2. Environment variables
3. Directly within Python code

Now we centralize configuration in the dask.config module, which collects configuration from config files, environment variables, and runtime code, and makes it centrally available to all Dask subprojects. A number of Dask subprojects (dask.distributed, dask-kubernetes, and dask-jobqueue), are being co-released at the same time to take advantage of this.
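To make the idea of layered configuration concrete, here is a minimal standard-library sketch of merging three sources into one nested dictionary. It is not the dask.config implementation: the `MYAPP_` prefix, the double-underscore nesting convention, the helper names, and the precedence order (files, then environment variables, then runtime values) are all assumptions made for illustration.

```python
import copy


def merge(base, override):
    """Recursively merge `override` into a copy of `base`; later values win."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


def from_environment(environ, prefix="MYAPP_"):
    """Turn MYAPP_A__B=value into {'a': {'b': 'value'}} (hypothetical naming scheme)."""
    config = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # ignore unrelated environment variables
        keys = name[len(prefix):].lower().split("__")
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return config


def collect(file_config, environ, runtime):
    """Combine the three sources, from lowest to highest priority."""
    return merge(merge(file_config, from_environment(environ)), runtime)


if __name__ == "__main__":
    file_config = {
        "scheduler": {"allowed-failures": 3, "work-stealing": True},
        "logging": {"level": "info"},
    }
    environ = {"MYAPP_LOGGING__LEVEL": "debug", "HOME": "/home/me"}
    runtime = {"scheduler": {"work-stealing": False}}
    print(collect(file_config, environ, runtime))
```

The point of the sketch is that every subproject reads from one merged structure, regardless of which mechanism a user prefers for setting values.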

If you were actively using Dask.distributed’s configuration files some things have changed:

1. The configuration is now namespaced and more heavily nested. Here is an example from the dask.distributed default config file today:

distributed:
  version: 2
  scheduler:
    allowed-failures: 3     # number of retries before a task is considered bad
    work-stealing: True     # workers should steal tasks from each other
    worker-ttl: null        # like '60s'. Workers must heartbeat faster than this

  worker:
    multiprocessing-method: forkserver
    use-file-locking: True

2. The default configuration location has moved from ~/.dask/config.yaml to ~/.config/dask/distributed.yaml, where it will live alongside several other files like kubernetes.yaml, jobqueue.yaml, and so on.

However, your old configuration files will still be found and their values will be used appropriately. We don’t make any attempt to migrate your old config values to the new location though. You may want to delete the auto-generated ~/.dask/config.yaml file at some point, if you feel like being particularly clean.

### Replaced the common get= keyword with scheduler=

Dask can execute code with a variety of scheduler backends based on threads, processes, single-threaded execution, or distributed clusters.

Previously, users selected between these backends using the somewhat generically named get= keyword:

x.compute(get=dask.threaded.get)


We’ve replaced this with a newer, and hopefully more clear, scheduler= keyword:

x.compute(scheduler='threads')
x.compute(scheduler='processes')


The get= keyword has been deprecated and will raise a warning. It will be removed entirely in the next major release.

Related to the configuration changes, we now include runtime state in the configuration. Previously people used to set runtime state with the dask.set_options context manager. Now we recommend using dask.config.set:

with dask.set_options(scheduler='threads'):  # Before
    ...

with dask.config.set(scheduler='threads'):  # After
    ...


The dask.set_options function is now an alias to dask.config.set.

This was unadvertised and saw very little use. All functionality (and much more) is now available in Dask-ML.

### Other

• We’ve removed the token= keyword from map_blocks and moved the functionality to the name= keyword.
• The dask.distributed.worker_client automatically rejoins the threadpool when you close the context manager.
• The Dask.distributed protocol now interprets msgpack arrays as tuples rather than lists.

## Fun new features

### Arrays

#### Generalized Universal Functions

Dask.array now supports Numpy-style Generalized Universal Functions (gufuncs) transparently. This means that you can apply normal Numpy GUFuncs, like eig in the example below, directly onto Dask arrays:

import dask.array as da
import numpy as np

# Apply a Numpy GUFunc, eig, directly onto a Dask array
x = da.random.normal(size=(10, 10, 10), chunks=(2, 10, 10))
w, v = np.linalg._umath_linalg.eig(x, output_dtypes=(float, float))
# w and v are dask arrays with eig applied along the latter two axes


Numpy has gufunc versions of many of its internal functions, but its developers haven’t yet decided to expose them in the public API. Additionally, we can define GUFuncs with other projects, like Numba:

import numba
from numba import float64

@numba.vectorize([float64(float64, float64)])
def f(x, y):
    return x + y

z = f(x, y)  # if x and y are dask arrays, then z will be too


What I like about this is that Dask and Numba developers didn’t coordinate at all on this feature; both simply support the Numpy GUFunc protocol, so you get interactions like this for free.

#### New “auto” value for rechunking

Dask arrays now accept a value, “auto”, wherever a chunk value would previously be accepted. This asks Dask to rechunk those dimensions to achieve a good default chunk size.

x = x.rechunk({
    0: x.shape[0],  # single chunk in this dimension
    # 1: 100e6 / x.dtype.itemsize / x.shape[0],  # before we had to calculate manually
    1: 'auto'       # Now we allow this dimension to respond to get ideal chunk size
})

# or
x = da.from_array(img, chunks='auto')


This also checks the array.chunk-size config value for optimal chunk sizes:

>>> dask.config.get('array.chunk-size')
'128MiB'


To be clear, this doesn’t support “automatic chunking”, which is a very hard problem in general. Users still need to be aware of their computations and how they want to chunk; this just makes it marginally easier to make good decisions.
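
For intuition, the manual calculation that “auto” replaces can be sketched in plain Python: hold one axis in a single chunk and pick the length of the other so that a chunk stays under a byte budget. This is only the back-of-the-envelope arithmetic from the commented-out line above, not Dask's actual heuristic; `parse_bytes` and `chunk_length_for_axis1` are hypothetical helpers written for this example.

```python
def parse_bytes(text):
    """Parse a size such as '128MiB' into bytes (binary suffixes only)."""
    units = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)

def chunk_length_for_axis1(shape, itemsize, target_bytes):
    """Length along axis 1 so that a (rows, length) chunk fits the budget."""
    rows, cols = shape
    length = target_bytes // (itemsize * rows)
    # Never return less than one column or more than the array has.
    return max(1, min(cols, length))

# A 10,000 x 1,000,000 array of 8-byte floats with a 128 MiB budget.
length = chunk_length_for_axis1((10_000, 1_000_000), 8, parse_bytes("128MiB"))
chunk_bytes = 10_000 * length * 8
```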

#### Algorithmic improvements

Dask.array gained a full einsum implementation thanks to Simon Perkins.

Also, Dask.array’s QR decompositions have become nicer in two ways:

1. They support short-and-fat arrays
2. The tall-and-skinny variant now operates more robustly in less memory.

This work is greatly appreciated and was done by Jeremy Chen.

Native support for the Zarr format for chunked n-dimensional arrays landed thanks to Martin Durant and John A Kirkham. Zarr has been especially useful due to its speed, simple spec, support of the full NetCDF style conventions, and amenability to cloud storage.

### Dataframes and Pandas 0.23

As usual, Dask Dataframes had many small improvements. Of note is continued compatibility with the just-released Pandas 0.23, and some new data ingestion formats.

Dask.dataframe is consistent with changes in the recent Pandas 0.23 release thanks to Tom Augspurger.

#### Orc support

Orc is a format for tabular data storage that is common in the Hadoop ecosystem. The new dd.read_orc function parallelizes around similarly new ORC functionality within PyArrow. Thanks to Jim Crist for the work on the Arrow side and Martin Durant for parallelizing it with Dask.

The dd.read_json function matches most of the pandas.read_json API.

This came about shortly after a recent PyCon 2018 talk comparing Spark and Dask dataframe where Irina Truong mentioned that it was missing. Thanks to Martin Durant and Irina Truong for this contribution.

### Joblib

The Joblib library for parallel computing within Scikit-Learn has had a Dask backend for a while now. While it has always been pretty easy to use, it’s now becoming much easier to use well without much expertise. After using this in practice for a while together with the Scikit-Learn developers, we’ve identified and smoothed over a number of usability issues. These changes will only be fully available after the next Scikit-Learn release (hopefully soon) at which point we’ll probably release a new blogpost dedicated to the topic.

This release is timed with the following packages:

2. distributed

There is also a new repository for deploying applications on YARN (a job scheduler common in Hadoop environments) called skein. Early adopters welcome.

## Acknowledgements

Since March 21st, the following people have contributed to the following repositories:

The core Dask repository for parallel algorithms:

• Andrethrill
• Beomi
• Brendan Martin
• Christopher Ren
• Guido Imperiale
• Diane Trout
• fjetter
• Frederick
• Henry Doupe
• James Bourbeau
• Jeremy Chen
• Jim Crist
• John A Kirkham
• Jon Mease
• Jörg Dietrich
• Ksenia Bobrova
• Larsr
• Marc Pfister
• Markus Gonser
• Martin Durant
• Matt Lee
• Matthew Rocklin
• Pierre-Bartet
• Scott Sievert
• Simon Perkins
• Stefan van der Walt
• Stephan Hoyer
• Tom Augspurger
• Uwe L. Korn
• Yu Feng

The dask/distributed repository for distributed computing:

• Bmaisonn
• Grant Jenks
• Henry Doupe
• Irene Rodriguez
• Irina Truong
• John A Kirkham
• Joseph Atkins-Turkish
• Kenneth Koski
• Loïc Estève
• Marius van Niekerk
• Martin Durant
• Matthew Rocklin
• Olivier Grisel
• Russ Bubley
• Tom Augspurger
• Tony Lorenzo

• Brendan Martin
• J Gerard
• Matthew Rocklin
• Olivier Grisel
• Yuvi Panda

• Guillaume Eynard-Bontemps
• jgerardsimcock
• Joe Hamman
• Joseph Hamman
• Loïc Estève
• Matthew Rocklin
• Ray Bell
• Rich Signell
• Shawn Taylor
• Spencer Clark

The dask-ml repository for scalable machine learning:

• Christopher Ren
• Jeremy Chen
• Matthew Rocklin
• Scott Sievert
• Tom Augspurger

### Acknowledgements

Thanks to Scott Sievert and James Bourbeau for their help editing this article.

## June 13, 2018

### Continuum Analytics

#### 2018 Anaconda State of Data Science Report Released

We at Anaconda greatly value our data science community and are always striving to learn more about how you are using our products and how we can improve your overall experience. With this goal in mind, we recently launched our first Anaconda State of Data Science Survey to gain a better understanding of what users …

The post 2018 Anaconda State of Data Science Report Released appeared first on Anaconda.

## June 12, 2018

### Paul Ivanov

#### Get in it

Two weeks ago, Project Jupyter had our only planned team meeting for 2018. There was too much stuff going on for me to write a poem during the event as I had in previous years (2016, and 2017), so I ended up reading one of the pieces I wrote during my evening introvert breaks in Cleveland at PyCon a few weeks earlier.

Once again, Fernando and Matthias had their gadgets ready to record (thank you both!). The video below was taken by Fernando.

# Get in it

Time suspended
Gellatinous reality - the haze
submerged in murky drops summed
in swamp pond of life

believe and strive, expand the mind
A state sublime, when in your prime you came to
me and we were free to flow and fling our
cares, our dreams, our in-betweens, our
rêves perdues, our residue -- the lime of light
the black of sight -- all these converge and
merge the forks of friction filled with fright
and more -- the float of logs that plunges deep
beyond the fray, beyond the keep -- a leap of faith
the lore of rite, with passage clear, let
fear subside, the wealth of confidence will
rise and iron out wrinkles of doubt

Commit to change and stash your pride
then push your luck, and make amends.
Branch out your thoughts, reset assumptions
then checkout.

The force of pulls t'wards master class
Remote of possibilities. Rehash the past
Patch up the present -- what's the diff?

There's nothing left -- except to glide -- and
soar beyond your frame of mind.  try not to pry
cry, freedom, cry.


## June 09, 2018

### Titus Brown

#### How long does it take to produce scientific software?

Over here at UC Davis, the Lab for Data Intensive Biology has been on extended walkabout developing software for, well, doing data intensive biology.

Over the past two to three years or so, various lab members have been working on the following new pieces of software -

I should say that all of these except for kevlar have been explicitly supported by my Moore Foundation funding from the Data Driven Discovery Initiative.

With the possible exception of dammit, every single one of these pieces of software was developed entirely since the move to UC Davis (so, since 2015 or later). And almost all of them are now approaching some reasonable level of maturity, defined as "yeah, not only does this work, but it might be something that other people can use." (Both dammit and sourmash are being used by other people already; kevlar, spacegraphcats, and boink are being written up now.)

All of these coming together at the same time seems like quite a coincidence to me, and I would like to make the following proposition:

It takes a minimum of two to three years for a piece of scientific software to become mature enough to publicize.

This fits with my previous experiences with khmer and the FamilyRelations/Cartwheel set of software as well - each took about two years to get to the point where anyone outside the lab could use them.

I can think of quite a few reasons why some level of aging could be necessary -

• often in science one has no real idea of what you're doing at the beginning of a project, and that just takes time to figure out;

• code just takes time to get reasonably robust when interfacing with real world data;

• there are lots of details that need to be worked out for installation and distribution of code, and that also just takes time;

but I'm somewhat mystified by the 2-3 year arc. It could be tied to the funding timeline (the Moore grant ends in about a year) or career horizons (the grad students want to graduate, the postdocs want to move on).

My best guess, tho, is that there is some complex tradeoff between scope and effort that breaks the overall software development work into multiple stages - something like,

1. figure out the problem
2. implement a partial solution
3. make an actual solution
4. expand solution cautiously to apply to some other nearby problems.

I'm curious as to whether or not this pattern fits with other people's experiences!

I do expect these projects to continue maturing as time and opportunity permits, much like khmer. boink, spacegraphcats, and sourmash should all result in multiple papers from my lab; kevlar will probably move with Daniel to his next job, but may be something we also extend in our lab; etc.

Another very real question in my mind is: which software do we choose to maintain and extend? It's clearly dependent on funding, but also on the existence of interesting problems that the software can still address, and on who I have in my lab... right now a lot of our planning is pretty helter skelter, but it would be good to articulate a list of guiding considerations for when I do see pots of money on the horizon.

Finally: I think this 2-3 year timeline has some interesting implications for the question of whether or not we should require people to release usable software. I think it's a major drain on people to expect them to not only come up with some cool new idea and implement it in software they can use, but then also make software that is more generally usable. Both sides of this take special skills - some people are good at methods & algorithms development, some people are good at software development, but very few people are good at both. And we should value both, but not require that people be good at both.

--titus

## June 02, 2018

### Titus Brown

#### Detecting microbial contamination in long-read assemblies (from known microbes)

A week ago, Erich Schwarz e-mailed our lab list asking,

I would like to be able to download a set of between 1,000 and 10,000 bacterial genome assembly sequences that are reasonably representative of known bacteria. RefSeq's bacterial genome set is easy to download, but absolutely freaking huge (the aggregate FASTA file for its genome sequences is 410 GB).

After digging in a bit, Erich gave us his actual goal: to search for potential microbial contaminants, like so:

Do MegaBlastN on new genome assemblies from PacBio data. With PacBio one gets very few large contigs, so bacterial contamination is really easy to filter out with a simple MegaBlastN. However, I did my last big download of 3,000 microbial genomes from EBI in 2013. There's a lot more of them now!

My response:

I think sourmash gather on each contig would probably do the right thing for you, actually; https://sourmash.readthedocs.io/en/latest/tutorials.html if you have a "true positive" contaminated scaffold to share, I can test that fairly quickly.

Also - I assume the contigs are never chimeric, so if you find contamination in one it's ok to discard the whole thing?

Also - kraken should do a fine job of this albeit in a more memory intensive way. MegaBlastN isn't much more sensitive than k-mer based approaches, I think.

This would let Erich search all 100k+ bacterial genomes without downloading the complete genomes. My recommendation was to do this to identify candidate genomes for contaminants, and then use something like mashmap to do a more detailed alignment and contaminant removal.

Erich responded with some useful links.

In fact, I have what should be both positive and negative controls for microbial contamination:

http://woldlab.caltech.edu/~schwarz/caeno_pacbio.previous/nigoni_mhap.decont_2015.11.11.fa.gz
http://woldlab.caltech.edu/~schwarz/caeno_pacbio.previous/nigoni_mhap.CONTAM_2015.11.11.fa.gz


which you are very welcome to try sourmashing!

After some other back and forth, I wrote a script to do the work; here's a rough run protocol:

curl -O -L http://woldlab.caltech.edu/~schwarz/caeno_pacbio.previous/nigoni_mhap.CONTAM_2015.11.11.fa.gz
./gather-by-contig.py nigoni_mhap.CONTAM_2015.11.11.fa.gz genbank-k31.sbt.json --output-match foo.txt --output-nomatch foo2.txt --csv summary.csv


which should take a minute or two to run on a modern SSD laptop, and requires less than 1 GB of RAM (and about 18 GB of disk space for the genbank index).

A few comments before I go through the script in detail:

• this uses MinHash downsampling as implemented in sourmash, so you have to feed long contigs in. This is appropriate for PacBio and Nanopore assemblies, but not for raw reads of any kind, and probably not for Illumina assemblies.
• sourmash will happily do contaminant estimation of an entire data set (genomes, reads, etc.) - the goal here was to go line by line through the contigs and split them into "match" and "no match".
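
To make the first point concrete, here is a rough pure-Python sketch of the idea behind "scaled" MinHash downsampling: hash every k-mer, keep only the hashes below a cutoff, and compare the surviving hashes between a query and a reference. It is an illustration under stated assumptions, not sourmash's implementation: the hash function (BLAKE2b from the standard library) is a stand-in, reverse-complement canonicalization is omitted, and `scaled_sketch` and `containment` are hypothetical names.

```python
import hashlib

MAX_HASH = 2**64

def kmers(sequence, ksize):
    """Yield every overlapping k-mer in the sequence."""
    for i in range(len(sequence) - ksize + 1):
        yield sequence[i:i + ksize]

def hash_kmer(kmer):
    # Stand-in 64-bit hash (an assumption; sourmash uses a different one).
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def scaled_sketch(sequence, ksize=31, scaled=1000):
    """Keep only hashes below MAX_HASH / scaled (about 1 in `scaled` k-mers)."""
    cutoff = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(sequence, ksize)) if h < cutoff}

def containment(query, reference):
    """Fraction of the query sketch's hashes also present in the reference."""
    if not query:
        return 0.0
    return len(query & reference) / len(query)
```

Because only about one k-mer in `scaled` survives, a short sequence can easily end up with an empty sketch, which is why this approach wants long contigs rather than raw reads.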

Last, but not least: this kind of ad hoc scripting functionality is what we aspire to enable with all our software. A command line program can't address all needs, but a default set of functionality provided via the command line, wrapping a more general purpose library, can!

## An annotated version of the script

First, import the necessary things:

#! /usr/bin/env python
import argparse
import screed
import sourmash
from sourmash import sourmash_args, search
from sourmash.sbtmh import SearchMinHashesFindBestIgnoreMaxHash
import csv


In the main function, set up some arguments:

def main():
    p = argparse.ArgumentParser()
    args = p.parse_args()
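
The argument definitions themselves are not shown in the snippet above. Judging from how `args` is used in the rest of the script and from the command line in the run protocol, the setup might look roughly like the sketch below. The help strings, the `build_parser` name, and in particular the default threshold are assumptions, not taken from the original script.

```python
import argparse

def build_parser():
    """Arguments inferred from how `args` is used later in the script."""
    p = argparse.ArgumentParser(description="split contigs by SBT match")
    p.add_argument("input_seqs", help="FASTA file of contigs")
    p.add_argument("sbt_database", help="SBT index, e.g. genbank-k31.sbt.json")
    # The default threshold is an assumption made for this sketch.
    p.add_argument("--threshold", type=float, default=0.05)
    # FileType gives open file objects, which have the .name used in messages.
    p.add_argument("--output-match", type=argparse.FileType("wt"))
    p.add_argument("--output-nomatch", type=argparse.FileType("wt"))
    p.add_argument("--csv", type=argparse.FileType("wt"))
    return p

example_args = build_parser().parse_args(
    ["contigs.fa", "db.sbt.json", "--threshold", "0.1"]
)
```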


Then, find the SBT database to load:

    tree = sourmash.load_sbt_index(args.sbt_database)
    print(f'found SBT database {args.sbt_database}')


Next, figure out the MinHash parameters used to construct this database, so we can use them to construct MinHashes for each sequence in the input file:

    leaf = next(iter(tree.leaves()))
    mh = leaf.data.minhash.copy_and_clear()

    print(f'using ksize={mh.ksize}, scaled={mh.scaled}')


Give some basic info:

    print(f'loading sequences from {args.input_seqs}')
    if args.output_match:
        print(f'saving match sequences to {args.output_match.name}')
    if args.output_nomatch:
        print(f'saving nomatch sequences to {args.output_nomatch.name}')
    if args.csv:
        print(f'outputting CSV summary to {args.csv.name}')


In the main loop, we'll need to track found items (for CSV summary output), and other basic stats:

    found_list = []
    total = 0
    matches = 0


Now, for each sequence in the input file of contigs:

    for record in screed.open(args.input_seqs):
        total += 1
        found = False


Set up a search function that finds the best match, and construct a new MinHash for each query sequence:

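        search_fn = SearchMinHashesFindBestIgnoreMaxHash().search

        query_mh = mh.copy_and_clear()
        # add this contig's k-mers to the sketch, skipping invalid ones
        query_mh.add_sequence(record.sequence, force=True)
        query = sourmash.SourmashSignature(query_mh)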


If the sequence is too small to produce any hashes, skip it:

        # too small a sequence/not enough hashes? notify
        if not query_mh.get_mins():
            print(f'note: skipping {record.name[:20]}, no hashes in sketch')
            continue


Now do the search, and pull off the first match:

        for leaf in tree.find(search_fn, query, args.threshold):
            found = True
            matches += 1
            similarity = query.similarity(leaf.data)
            found_list.append((record.name, leaf.data.name(), similarity))
            break


Nothing found? That's ok, just indicate empty.

        if not found:
            found_list.append((record.name, '', 0.0))


Output sequences appropriately:

        if found and args.output_match:
            args.output_match.write(f'>{record.name}\n{record.sequence}\n')
        if not found and args.output_nomatch:
            args.output_nomatch.write(f'>{record.name}\n{record.sequence}\n')


and update the user:

        print(f'searched {total}, found {matches}', end='\r')


At the end, print a newline (so the final progress line stays on screen), and output the CSV summary:

    print('')

    if args.csv:
        w = csv.DictWriter(args.csv, fieldnames=['query', 'match', 'score'])
        for (query, match, score) in found_list:
            w.writerow(dict(query=query, match=match, score=score))


Finally, ...call the main function if this is run as a script:

if __name__ == '__main__':
    main()


Comments and questions welcome, as always!

best, --titus

## May 31, 2018

### Continuum Analytics

#### Anaconda Distribution 5.2 Released

We’re excited to announce the release of Anaconda Distribution 5.2! With over 6 million users, Anaconda Distribution is the world’s most popular and easiest way to do Python data science and machine learning. Download and install Anaconda Distribution 5.2 now, or update your current Anaconda Distribution installation to version 5.2 by using conda update conda …

The post Anaconda Distribution 5.2 Released appeared first on Anaconda.

## May 30, 2018

### Titus Brown

#### Communicating outside of big consortia is tough! (but important!)

I've often been disparaging of the community efforts of big academic collaborations, because it seems like they rarely communicate with the outside world well - this is particularly true of interim (not-yet-publishable) results and software. Over the years I've evolved a theory that big consortia are so busy communicating within that they have no energy for communicating without. This robs the larger scientific community of insight and scientific results in a way that I feel like smaller collaborations do not - you could probably come up with "communication per ", or something, as a metric, and I bet large consortia would show poorer numbers.

I particularly admire open source communities here, because the communication is often so good (compared, at least, with consortia, or really academics of any kind) and rather fine grained. Since many open source communities are both distributed and asynchronous, they really seem to excel at information sharing in useful ways.

(See Max Ogden's excellent doc about how to run an async team if you're interested in some of the lowdown here.)

I am hoping to use my coordination position within the #CommonsPilot to facilitate better communication, and we've even hired some people to do that. So imagine my frustration to be in exactly that "silent" situation with the #CommonsPilot! I can now partly confirm my initial theory, and elaborate upon it, with the benefit of about 6 months of experience.

## The top N reasons why I think big consortia are unusually silent.

1. We're too busy talking to each other by e-mail!

By the time I finish reading and responding to e-mails (and, ahem, sending new ones) from the #CommonsPilot each day, I'm out of time and energy.

2. We're too busy talking to each other on teleconferences!

Most information is passed via in-person teleconferences, from which very little information actually escapes. This is exacerbated by people's interest in ONLY communicating this way, because it leads to more focused and thoughtful engagement by busy academics. It's high bandwidth, sure, but it's also isolating - only the people who have the time and energy to show up for all the teleconferences are in the know.

One takeaway that I got from this excellent blog post, aturon.log: listening and trust, part 1, about the Rust community, is that "all major project decisions must go through the RFC process" - which must involve written communication that clearly recapitulates anything discussed on a phone call. We were already instituting this in the Data Commons before this blog post, but now I have extra reasons to do so :)

3. Consortium wide decisions require multiple rounds of discussion before consensus is reached and can be communicated externally.

I really don't want to post things that people disagree with, but it takes a lot of time to figure out what that is (and isn't).

4. Rules for communicating with the outside world aren't clear.

Funding bodies and senior PIs are often risk averse, and figuring out what is and isn't a risk is tough. We've finally gotten some blogging and Twitter guidelines approved and we'll post them when we can.

5. Hierarchies interfere.

Typically the people most familiar with social media are junior in collaborations, and (for better or for worse) are worried about irritating those senior to them by speaking out of turn.

Here I have an edge, since I'm both a PI and a coordinator on this project, and my proposal focused on outreach (and this proposal was accepted by the NIH). So I have a mandate.

6. Communicating externally takes time, energy, and willpower.

Usually, there's no one whose job it is to communicate with the community. To which I say...

...welcome, Dr. Rayna Harris :).

## Wait, why should we be communicating anyway?

I started with the implicit assumption that consortia should be communicating with the outside world. Why??

I think there are many reasons. It's not just about communicating science more effectively, although that's part of it; it's also about:

• communicating about what big, expensive consortia are doing that's worthwhile; think "accountability to taxpayers and stakeholders".

• gaining buy-in for consortium decisions from the wider community. This is particularly important for efforts like the #CommonsPilot, where we are hoping to identify and implement good standards and build a community of practice.

• getting feedback (negative and positive) on consortium decisions. If we're picking tech that is out of date or old or bad, we should know - and we don't always!

Perhaps the best reason, though, is that external communication can help people internal to the Consortium understand what's going on. I've often found that there are relatively few people truly "in the loop" in any given situation, and a commitment to external communication of internal decisions can actually help communicate those same internal decisions internally.

Or, to put it another way, if you're not communicating in one venue, you're probably not communicating well in any venue, and this is probably harming your consortium and limiting the contributions of people - especially junior people.

## So, what's the status, anyway?

More soon, I hope :)

--titus

p.s. Thanks to VM Brasseur for her comments and suggestions on this post!

## May 28, 2018

### Titus Brown

#### Open-source style community engagement for the Data Commons Pilot Phase Consortium

Note: this is a guest post by Dr. Rayna M. Harris.

In November 2017, the National Institutes of Health (NIH) announced the formation of a Data Commons Pilot Phase Consortium (DCPPC) to accelerate biomedical discovery by making big biomedical data more findable and usable.

It's called a consortium because the awardees are all working together in concert and collaboration to achieve the larger goal. Those awardees (big cats who run academic research labs or companies) have each brought on numerous students, postdocs, and staff, so the size of the consortium has already grown to over 300 people! That's a lot of cats to herd.

So, how are we keeping everyone in the community coordinated and engaged? Here's a little insight into our approach, which was first outlined by Titus in this blog post.

## DCPPC Key Capabilities and teams

The overall structure of the DCPPC is a little complex, especially to the uninitiated. Members of the consortium organized themselves into "Key Capabilities" or focus groups that correspond to elements of the funding call and the major objectives of the Data Commons. Key Capabilities (KC) 1-9 are described in more detail here.

On top of the KC lingo, the awardees all adopted team names from the elements of the periodic table, so you'll hear things like "KC1 has a meeting on Wednesday" or "Team Copper is meeting on Tuesday". I made the infographic below to help myself see the connections between the DCPPC objectives, Key Capabilities, and teams.

I am a member of Team Copper, which consists of members or affiliates of the Data Intensive Biology Lab at UC Davis (C. Titus Brown, Phillip Brooks, Rebecca Calisi Rodriguez, Amanda Charbonneau, Rayna Harris, Luiz Irber, Tamer Mansour, Charles Reid, Daniel Standage and Karen Word), the Biomedical data analysis company Curoverse (Alexander (Sasha) Wait Zaranek, VM (Vicky) Brasseur, Sarah Edrie, Meredith Gamble and Sarah Wait Zaranek), and the Harvard Chan Bioinformatics Core (Brad Chapman, Radhika Khetani and Mary Piper).

## GitHub for project management of 522 milestones and 50 deliverables

Very early on, it was decided that GitHub would be our authoritative and canonical source for all DCPPC milestones and deliverables. What are milestones and deliverables? Milestones are team-defined tasks that must be completed in order to achieve the long-term objective of the DCPPC. Deliverables are the currency by which we evaluate whether or not a milestone has been reached. Deliverables can be in either the form of a demo (activities or documentation that demonstrate completion of goals of the Commons) or products (resources such as standards and conventions, APIs, data resources, websites, repositories, documentation, and training or outreach materials). The DCPPC has defined 522 milestones and 50 deliverables that are due in the first 180 days (between April 1 and September 28, 2018).

_Why GitHub?_ We chose GitHub because it makes cross-project linking and commenting easy, and many people are familiar with it.

How did we get all the information about 500 milestones into GitHub issues? We automated it! One of the first accomplishments of Team Copper was developing a collection of scripts (collectively referred to as the "DCPPC bot") that take a CSV file of all the milestones and deliverables and open GitHub issues with a brief description, a due date, and a label corresponding to the relevant Team. We also interlink the milestones corresponding to each deliverable.
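
As a rough illustration of the CSV-to-issues step, the sketch below reads milestone rows and builds one dictionary per row with a title, body, and label, which is the general shape of data an issue tracker needs. It is not the DCPPC bot itself: the column names (`id`, `team`, `description`, `due_date`), the function name, and the two example rows are all made up for this example, and the authenticated call that would actually open an issue on GitHub is left out.

```python
import csv
import io

def issues_from_csv(csv_text):
    """Turn milestone rows into dicts shaped like issue payloads."""
    issues = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        issues.append({
            "title": f"{row['id']}: {row['description']}",
            "body": f"Due: {row['due_date']}\nTeam: {row['team']}",
            "labels": [row["team"]],
        })
    return issues

# Hypothetical rows, for illustration only.
example = """id,team,description,due_date
M1.1,Team Copper,Draft onboarding checklist,2018-05-01
D1,Team Copper,Onboarding documentation demo,2018-06-01
"""
payloads = issues_from_csv(example)
```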

Right now, the DCPPC bot only deals with DCPPC milestones and deliverables, but you could imagine how these tools could be modified and adapted to many other large-scale community projects.

## On-boarding existing and new members

To get everyone on the same page, we put in place some loose guidelines for communication (we'll be using this platform for e-mail, that project for documents, etc.). We defined a community code of conduct and have adopted open and transparent workflows to the best of our ability.

We wrote some simple onboarding documents and checklists to connect people to those guidelines, communication channels, and useful resources. New members fill out a Google form providing basic contact information and their affiliation to the DCPPC. Then Team Copper gives them access to all the various communication channels. Finally, we send a follow-up email pointing new members to all the relevant resources and documentation. We haven't perfected our onboarding process, but this thank you note is evidence that we are on the right track!

"Thank you so much for this information! I just started with [the DCPPC] 3 weeks ago and the learning curve has been steep. These docs have been the best crash course. Thank you!" - Anonymous DCPPC member

It is important to note that we are paying attention to what communication avenues are actually being used or working well and are fine-tuning accordingly. For instance, we started using Google Calendars, but it wasn't working, so we switched to the Groups.io calendar. Our goal is to layer on more structure only when the need becomes apparent (but without doing so too early or often) to preserve flexibility and adaptation to suit the needs of the community.

The best thing (in my opinion) about using Groups.io, GitHub, and Slack for communication is that new members have access to all the conversation that has taken place since the beginning. This provides a wealth of information that would be lost if all communication took place via personal email or face to face communication.

Another excellent feature of the tools we are using is the availability of APIs for automating processes and reconciling access lists. We configured our groups.io calendars to automatically post upcoming meeting notifications to the appropriate Slack channel, so that's cool! We also built a tool that calls the Slack, GitHub, and Groups.io APIs and returns a list of everyone with access. This is really useful for checking that everyone who needs access has it (and that no one who shouldn't have access does).
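
Once the member lists have been fetched, the reconciliation itself is just set arithmetic. The sketch below assumes the lists have already been retrieved from each service (the API calls are omitted) and compares them against an expected roster. The function name, the service keys, and the member names are hypothetical.

```python
def reconcile_access(roster, access_lists):
    """Compare an expected roster against per-service access lists."""
    report = {}
    for service, members in access_lists.items():
        report[service] = {
            # On the roster but lacking access to this service.
            "missing": sorted(roster - members),
            # Has access to this service but is not on the roster.
            "unexpected": sorted(members - roster),
        }
    return report

# Hypothetical data, for illustration only.
roster = {"alice", "bob", "carol"}
access_lists = {
    "slack": {"alice", "bob", "carol"},
    "github": {"alice", "carol", "mallory"},
    "groups.io": {"alice", "bob"},
}
report = reconcile_access(roster, access_lists)
```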

## Monthly, unconference style meetings and hackathons

Virtual tools like Slack, GitHub, Twitter, and Zoom make synchronous and asynchronous communication possible from nearly anywhere in the world, but face-to-face (f2f) communication is undeniably a powerful way to boost collaboration and creativity. As a testament to the Consortium’s commitment to community engagement, a significant part of our budget is being used to cover all the associated travel, lodging, and food costs.

Team Copper (see the list of members above) has taken on the role of organizing or facilitating these f2f meetings. We are adopting an "unconference style" format where the attendees determine the topics of discussion or direction of a hackathon.

The goal of the first f2f meeting in December 2017 was to determine what the DCPPC actually needed to do during the first 180 days of this effort (aka Pilot Phase I). This meeting was attended by NIH staff, awardees, cloud service providers, and data stewards. You can read more about the outcomes of this meeting in a blog post written by C. Titus Brown. The second f2f meeting took place in April 2018. The goal of the April meeting was to showcase our progress to the NIH.

Moving forward, we are planning an f2f meeting every month at various sites around the US. The goals of the DCPPC May workshop are to build community, to facilitate planned and serendipitous collaboration across teams, and to surface hidden issues around technical and conceptual interoperability. A major focus of the June meeting will be a multi-team, multi-KC hackathon. The goals and topics for our July - October meetings have yet to be determined but will likely correspond to relevant milestones and deliverables that are due in those months or the near future.

There's a lot that I didn't cover, so stay tuned for more in-depth blog posts about building an open-source style community around the Data Commons. In the meantime, get regular updates by following the #CommonsPilot hashtag or the @nih_dcppc and @NIH_CommonFund accounts on Twitter.

## Executive Summary

In recent years Python’s array computing ecosystem has grown organically to support GPUs, sparse, and distributed arrays. This is wonderful and a great example of the growth that can occur in decentralized open source development.

However, to solidify this growth and apply it across the ecosystem we now need to do some central planning to move from a pair-wise model where packages need to know about each other to an ecosystem model where packages can negotiate by developing and adhering to community-standard protocols.

With moderate effort we can define a subset of the Numpy API that works well across all of these array types, allowing the ecosystem to more smoothly transition between hardware. This post describes the opportunities and challenges to accomplish this.

We start by discussing two kinds of libraries:

1. Libraries that implement the Numpy API
2. Libraries that consume the Numpy API and build new functionality on top of it

## Libraries that Implement the Numpy API

The Numpy array is one of the foundations of the numeric Python ecosystem, and serves as the standard model for similar libraries in other languages. Today it is used to analyze satellite and biomedical imagery, financial models, genomes, oceans and the atmosphere, super-computer simulations, and data from thousands of other domains.

However, Numpy was designed several years ago, and its implementation is no longer optimal for some modern hardware, particularly multi-core workstations, many-core GPUs, and distributed clusters.

Fortunately other libraries implement the Numpy array API on these other architectures:

• CuPy: implements the Numpy API on GPUs with CUDA
• Sparse: implements the Numpy API for sparse arrays that are mostly zeros
• Dask array: implements the Numpy API in parallel for multi-core workstations or distributed clusters

So even when the Numpy implementation is no longer ideal, the Numpy API lives on in successor projects.

Note: the Numpy implementation remains ideal most of the time. Dense in-memory arrays are still the common case. This blogpost is about the minority of cases where Numpy is not ideal.

So today we can write similar code across all of Numpy, GPU, sparse, and parallel arrays:

import numpy as np
x = np.random.random(...)  # Runs on a single CPU
y = x.T.dot(np.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5])

import cupy as cp
x = cp.random.random(...)  # Runs on a GPU
y = x.T.dot(cp.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5].get())

import dask.array as da
x = da.random.random(...)  # Runs on many CPUs
y = x.T.dot(da.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5].compute())

...
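
One way to make the pattern above concrete is to pass the array module in as an argument. This is only a sketch: the helper name `centered_product` and the `xp` parameter are assumptions made for illustration, not part of any of these libraries. It is run here with Numpy alone; `cupy` or `dask.array` could be passed instead, assuming they are installed, followed by `.get()` or `.compute()` to bring the result back.

```python
import numpy as np

def centered_product(xp, x):
    """Run the same array expression with whichever Numpy-like module `xp` is."""
    y = x.T.dot(xp.log(x) + 1)
    return y - y.mean(axis=0)

rng = np.random.RandomState(0)
x = rng.random_sample((1000, 5)) + 0.1   # strictly positive so that log is finite
z = centered_product(np, x)
print(z.shape)
```

This only works when the caller knows which module to pass, which is the bookkeeping that the rest of this post tries to remove.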


Additionally, each of the deep learning frameworks (TensorFlow, PyTorch, MXNet) has a Numpy-like thing that is similar-ish to Numpy’s API, but definitely not trying to be an exact match.

## Libraries that consume and extend the Numpy API

At the same time as the development of Numpy APIs for different hardware, many libraries today build algorithmic functionality on top of the Numpy API:

1. XArray for labeled and indexed collections of arrays
2. Autograd and Tangent for automatic differentiation
3. TensorLy for higher order array factorizations
4. Dask array which coordinates many Numpy-like arrays into a logical parallel array (Dask array both consumes and implements the Numpy API)
5. Opt Einsum for more efficient Einstein summation operations

These projects and more enhance array computing in Python, building new features beyond what Numpy itself provides.

There are also projects like Pandas, Scikit-Learn, and SciPy, that use Numpy’s in-memory internal representation. We’re going to ignore these libraries for this blogpost and focus on those libraries that only use the high-level Numpy API and not the low-level representation.

## Opportunities and Challenges

Given the two groups of projects:

1. New libraries that implement the Numpy API (CuPy, Sparse, Dask array)
2. New libraries that consume and extend the Numpy API (XArray, Autograd/Tangent, TensorLy, Opt Einsum)

We want to use them together, applying Autograd to CuPy, TensorLy to Sparse, and so on, including all future implementations that might follow. This is challenging.

Unfortunately, while all of the array implementations' APIs are very similar to Numpy’s API, they use different functions.

>>> numpy.sin is cupy.sin
False


This creates problems for the consumer libraries, because now they need to switch out which functions they use depending on which array-like objects they’ve been given.

def f(x):
    if isinstance(x, numpy.ndarray):
        return numpy.sin(x)
    elif isinstance(x, cupy.ndarray):
        return cupy.sin(x)
    elif ...



Today each array project implements a custom plugin system that they use to switch between some of the array options. Links to these plugin mechanisms are below if you’re interested:

For example, XArray can use either Numpy arrays or Dask arrays. This has been hugely beneficial to users of that project, who today seamlessly transition from small in-memory datasets on their laptops to 100TB datasets on clusters, all using the same programming model. However, when considering adding sparse or GPU arrays to XArray’s plugin system, it quickly became clear that this would be expensive today.

Building, maintaining, and extending these plugin mechanisms is costly. The plugin systems in each project are not alike, so any new array implementation has to go to each library and build the same code several times. Similarly, any new algorithmic library must build plugins to every ndarray implementation. Each library has to explicitly import and understand each other library, and has to adapt as those libraries change over time. This coverage is not complete, and so users lack confidence that their applications are portable between hardware.

Pair-wise plugin mechanisms make sense for a single project, but are not an efficient choice for the full ecosystem.

## Solutions

I see two solutions today:

1. Build a new library that holds dispatch-able versions of all of the relevant Numpy functions and convince everyone to use it instead of Numpy internally

2. Build this dispatch mechanism into Numpy itself

Each has challenges.

### Build a new centralized plugin library

We can build a new library, here called arrayish, that holds dispatch-able versions of all of the relevant Numpy functions. We then convince everyone to use it instead of Numpy internally.

So in each array-like library’s codebase we write code like the following:

# inside numpy's codebase
import arrayish
import numpy
arrayish.sin.register(numpy.ndarray, numpy.sin)
arrayish.cos.register(numpy.ndarray, numpy.cos)
arrayish.dot.register(numpy.ndarray, numpy.ndarray, numpy.dot)
...

# inside cupy's codebase
import arrayish
import cupy
arrayish.sin.register(cupy.ndarray, cupy.sin)
arrayish.cos.register(cupy.ndarray, cupy.cos)
arrayish.dot.register(cupy.ndarray, cupy.ndarray, cupy.dot)
...


and so on for Dask, Sparse, and any other Numpy-like libraries.

In all of the algorithm libraries (like XArray, autograd, TensorLy, …) we use arrayish instead of Numpy

# inside XArray's codebase
# import numpy
import arrayish as numpy


This is the same plugin solution as before, but now we build a community standard plugin system that hopefully all of the projects can agree to use.

This reduces the big n by m cost of maintaining several plugin systems to a more manageable n plus m cost of using a single plugin system in each library. This centralized project would also, perhaps, be better maintained than any individual project's plugin system is likely to be on its own.
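
To make the registration idea concrete, here is a minimal sketch that uses the standard library's `functools.singledispatch` as a stand-in for the dispatch layer. It is not the arrayish prototype: `ToyArray` and `consumer` are hypothetical names invented for this illustration, and `singledispatch` only looks at the first argument, so a two-argument function like `dot` would need a multiple-dispatch mechanism instead.

```python
from functools import singledispatch

import numpy as np


class ToyArray:
    """Hypothetical stand-in for a non-Numpy array type such as a GPU array."""

    def __init__(self, data):
        self.data = np.asarray(data)


@singledispatch
def sin(x):
    # Fallback used when no implementation is registered for type(x)
    raise TypeError("no sin registered for {}".format(type(x).__name__))


# Each array library registers its own implementation once
sin.register(np.ndarray, np.sin)
sin.register(ToyArray, lambda x: ToyArray(np.sin(x.data)))


def consumer(x):
    """Algorithmic code written once, with no isinstance checks."""
    return sin(x)


a = consumer(np.array([0.0, np.pi / 2]))
b = consumer(ToyArray([0.0, np.pi / 2]))
```

Here `a` is a plain Numpy array and `b` is a `ToyArray`; the consumer never needed to import or inspect either array type.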

However this has costs:

1. Getting many different projects to agree on a new standard is hard
2. Algorithmic projects will need to start using arrayish internally, adding new imports like the following:

import arrayish as numpy


And this will certainly cause some complications internally

3. Someone needs to build and maintain the central infrastructure

Hameer Abbasi put together a rudimentary prototype for arrayish here: github.com/hameerabbasi/arrayish. There has been some discussion about this topic, using XArray+Sparse as an example, in pydata/sparse #1.

### Dispatch from within Numpy

Alternatively, the central dispatching mechanism could live within Numpy itself.

Numpy functions could learn to hand control over to their arguments, allowing the array implementations to take over when possible. This would allow existing Numpy code to work on externally developed array implementations.

There is precedent for this. The __array_ufunc__ protocol allows any class that defines the __array_ufunc__ method to take control of any Numpy ufunc like np.sin or np.exp. Numpy reductions like np.sum already look for .sum methods on their arguments and defer to them if possible.

Some array projects, like Dask and Sparse, already implement the __array_ufunc__ protocol. There is also an open PR for CuPy. Here is an example showing Numpy functions working cleanly on Dask arrays.

>>> import numpy as np
>>> import dask.array as da

>>> x = da.ones(10, chunks=(5,))  # A Dask array

>>> np.sum(np.exp(x))             # Apply Numpy function to a Dask array


I recommend that all Numpy-API compatible array projects implement the __array_ufunc__ protocol.
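
For readers who have not seen the protocol, the following is a minimal sketch of a class that opts in. `WrappedArray` is a hypothetical toy type invented for this illustration; real implementations such as Dask's handle many more cases (the `out` argument, reductions, mixed types).

```python
import numpy as np


class WrappedArray:
    """Toy array type (hypothetical) that takes control of Numpy ufuncs."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__":
            # ufunc.reduce, ufunc.accumulate, etc. are not handled in this sketch
            return NotImplemented
        # Unwrap our own type, apply the ufunc to the raw data, and re-wrap
        raw = [x.data if isinstance(x, WrappedArray) else x for x in inputs]
        return WrappedArray(ufunc(*raw, **kwargs))


x = WrappedArray([0.0, 1.0, 2.0])
y = np.exp(x)      # Numpy hands control to WrappedArray.__array_ufunc__
z = np.add(y, 1)   # mixed WrappedArray / scalar inputs also go through it
```

Both `y` and `z` come back as `WrappedArray` objects, even though the calling code only ever used functions from the `np` namespace.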

This works for many functions, but not all. Other operations like tensordot, concatenate, and stack occur frequently in algorithmic code but are not covered here.

This solution avoids the community challenges of the arrayish solution above. Everyone is accustomed to aligning themselves to Numpy’s decisions, and relatively little code would need to be rewritten.

The challenge with this approach is that historically Numpy has moved more slowly than the rest of the ecosystem. For example, the __array_ufunc__ protocol mentioned above was discussed for several years before it was merged. Fortunately Numpy has recently received funding to help it make changes like this more rapidly. The full-time developers hired under this funding have just started though, and it’s not clear how much of a priority this work is for them at first.

For what it’s worth I’d prefer to see this Numpy protocol solution take hold.

## Final Thoughts

In recent years Python’s array computing ecosystem has grown organically to support GPUs, sparse, and distributed arrays. This is wonderful and a great example of the growth that can occur in decentralized open source development.

However, to solidify this growth and apply it across the ecosystem we now need to do some central planning to move from a pair-wise model, where packages need to know about each other, to an ecosystem model, where packages can negotiate by developing and adhering to community-standard protocols.

The community has made this kind of transition before (Numeric + Numarray -> Numpy, the Scikit-Learn fit/predict API, etc.), usually with surprisingly positive results.

The open questions I have today are the following:

1. How quickly can Numpy adapt to this demand for protocols while still remaining stable for its existing role as foundation of the ecosystem?
2. What algorithmic domains can be written in a cross-hardware way that depends only on the high-level Numpy API, and doesn’t require specialization at the data structure level. Clearly some domains exist (XArray, automatic differentiation), but how common are these?
3. Once a standard protocol is in place, what other array-like implementations might arise? In-memory compression? Probabilistic? Symbolic?

## Update

After discussing this topic at the May NumPy Developer Sprint at BIDS a few of us have drafted a Numpy Enhancement Proposal (NEP) available here.

## May 16, 2018

### Continuum Analytics

#### Generate Custom Parcels for Cloudera CDH with Anaconda Enterprise 5

As part of our partnership with Cloudera, we offer a freely available Anaconda Python parcel for Cloudera CDH based on the Anaconda Distribution. The Anaconda parcel has been very well-received by both Anaconda and Cloudera users by making it easier for data scientists and analysts to use libraries from Anaconda that they know and love …

The post Generate Custom Parcels for Cloudera CDH with Anaconda Enterprise 5 appeared first on Anaconda.

### Continuum Analytics

#### CyberPandas: Extending Pandas with Richer Types

By Tom Augspurger, Data Scientist at Anaconda. Over the past couple months, Anaconda has supported a major internal refactoring of pandas. The outcome is a new extension array interface that will enable an ecosystem of rich array types, that meet the needs of pandas’ diverse user base. Using the new interface, we’ve built a library …

The post CyberPandas: Extending Pandas with Richer Types appeared first on Anaconda.

## May 10, 2018

### Travis Oliphant

#### Reflections on Anaconda as I start a new chapter with Quansight

Leaving the company you founded is always a tough decision and a tough process that involves many people. It requires a series of potentially emotional "crucial-conversations."  It is actually not that uncommon in venture-backed companies for one or more of the original founders to leave at some point.  There is a decent article on the topic here:  https://hbswk.hbs.edu/item/the-founding-ceos-dilemma-stay-or-go.

Still it is extremely difficult to let go. You live and breathe the company you start.  Years of working to connect as many people as possible to the dream gives you a feeling of "ownership" and connection that no stock certificate can replace. Starting a company is a lot of work.  It takes a lot of effort. There are many decisions to make and many voices to incorporate. Hiring, firing, raising money, engaging customers, engaging employees, planning projects, organizing events, and aligning a pastiche of personalities while staying relevant in a rapidly evolving technology jungle is difficult.

As a founder over 40 with modest means, I had a family of 6 children who relied on me.  That family had teenage children who needed my attention and pre-school and elementary-school children that I could not simply leave only in the hands of my wife. I look back and sometimes wonder how we pulled it off. The truth probably lies in the time we borrowed: time from exercise, time from sleep, time from vacations, and time from family. I'd like to say that this dissonance against "work-life-harmony" was always a bad choice, but honestly, I don't see how I could have made too many different choices and still have created Anaconda.

Several things drove me. I could not let the people associated with the company down. I would not lose the money for those that invested in us. I could not let down the people who worked their tails off to build, manage, document, market, and sell the technology and products that we produced. Furthermore, I would not let the community of customers and users down that had enabled us to continue to thrive.

The only way you succeed as a founder is through your customers being served by the efforts of those who surround you. It is only the efforts of the talented people who joined us in our journey that have allowed Anaconda to succeed so far. It is critical to stay focused on what is in the best interests of those people.

Permit me to use the name Continuum to describe the angel-funded and bootstrapped early-stage company that Peter and I founded in 2012 and Anaconda to describe the venture-backed company that Continuum became. (This is the company we called Continuum 2.0 internally, which really got started in the summer of 2015 after we raised the first tranche of $22 million from VCs.)

Back in 2012, Peter and I knew a few things: 1) we had to connect Python to the Big Data movement; 2) we needed to help the scientific programmer or data-scientist developer build visualization-based applications quickly on the web; and 3) we needed to scale the stack of code around the PyData community to bigger hardware and multiple machines. We had big visions of an interconnected data-web, distributed schedulers, and data-structures that traversed the internet which could be analyzed across the cloud with simple Python scripts. We talked and talked about these things and grew misty-eyed in our enthusiasm for the potential of what was possible if we just built the right technology and sold just the right product to fund it.

We knew that we wanted to build a product-company -- though we didn't know exactly what those products would be at the outset. We had some ideas, only portions of which actually worked out. I knew how to run a consulting and training company around Python and open-source. Because of this, I felt comfortable raising money from family members. While consulting companies are not "high-growth" they can make real returns for investors. I was pretty confident that I would not lose their money.

We raised $2.25 million from a few dozen investors consisting of Peter's family, my family, and a host of third-parties from our mutual networks. Peter's family was critical to this early stage because they basically "led the early round" and ensured that we could get off the ground. After they put their money in the bank, we could finish raising the rest of the seed round which took about 6 months to finish.

It is interesting (and somewhat embarrassing and so not detailed here) to go back and look at what products we thought we would be making. Some of the technologies we ended up building (like Excel integration, Numba, Bokeh, and Dask) were reflected in those early product dreams.  However, the real products and commercial success that Anaconda has had so far bear only a vague resemblance to what we thought we would do.

Building a Python distribution was the last thing on our minds. I had been building Python distributions since I released SciPy in 2001.  As I have often repeated, SciPy was actually the first Python distribution masquerading as a library. The single biggest effort in releasing SciPy was building the binary installers and making sure everything compiled well.  With Fortran compilers still more scarce than they should be, it can still be difficult to compile and build SciPy.

Fortunately, with conda, conda-forge, and Anaconda, along with the emergence of wheels, almost nobody needs to build SciPy anymore.  It is so easy today to get started with a data-science project and get all the software you need to do amazing work fast. You still have to work to maintain your dependencies and keep that workflow reproducible.  But, I'm so happy that Anaconda makes that relatively straightforward today.

This was only possible because General Catalyst and BuildGroup joined us in the journey in the spring of 2015 to really grow the Anaconda story.  Their investment allowed us to 1) convert to a serious product-company from a bootstrapped consulting company with a few small products and 2) continue to invest heavily in conda, conda-forge, and Anaconda.

There is nothing like real-world experience as a teacher, and the challenge of converting to a serious product company was a tremendous experience that taught me a great deal. I'm grateful to all the people who brought their best to the company and taught me every day.  It was a privilege and an honor to be a part of their success.  I am grateful for their patience with me as my "learning experiences" often led to real struggles for them.

There are many lasting lessons that I look forward to applying in future endeavors. The one that deserves mention in this post, however, is that building enterprise software that helps open-source communities should be done by selling a complementary product to the open-source.  The "open-core" model does not work as well.  I'm a firm believer that there will always be software to sell, but infrastructure should be and will be open-source --- sustained vibrantly by the companies that depend on it.  Joel Spolsky has written about complementary products before. You should read his exposition.

Early on at Anaconda, Peter and I decided to be a board-led company. This board, which includes Peter and me, has the final say in company leadership and made the important decision to transition Anaconda from being founder-led to being led by a more experienced CEO.  After this transition and through multiple conversations over many months we all concluded that the best course of action that would maximize my energy and passion while also allowing Anaconda to focus on its next chapter would be for me to spin-out of Anaconda and start a new services and open-source company where I could pursue a broader mission.

This new company is Quansight (short for Quantitative Insight). Our place-holder homepage is at http://www.quansight.com and we are @quansightai on Twitter. I'm excited to tell you more about the company in future blog-posts and announcements.  A few paragraphs will suffice for now.

Our overall mission is to develop people, build technology, and discover products to empower people with knowledge and data to solve the world’s most challenging problems.  We are doing that currently by connecting organizations sustainably with open source communities to solve their hardest problems by enabling teams to transparently apply science to their data.

One of the things we are doing is to help companies get started with AI and ML by applying the entire PyData stack to the fundamental data organization, data visualization, and model management problem that is required for practical success with ML and AI in business.  We also help companies generally improve their data-science practice by leveraging all the power of the Python, PyData, and related ecosystems.

We are also hard at work on the sustainability problem by continuing the tradition we started at Continuum Analytics of building successful and sustainable open-source "practices" that synchronize company needs with open-source technology development.   We have some innovative business approaches to this that we will be announcing in the coming weeks and months.

I'm excited that we have several devs working hard to help bring JupyterLab to 1.0 this year along with a vibrant community. There are many exciting extensions to this remarkable platform that remain to be written.

We also expect to continue to contribute to the PyViz activities that continue to explode in the Python ecosystem as visualization is a critical first step to understanding and using any data you care about.

Finally, Stefan Krah has joined us at Quansight.  Stefan is an award-winning Python core developer who has been steadily working over the past 18 months on a small but powerful collection of projects collectively called Plures.  These will be more broadly available in the next few months and published under the xnd brand.  Xnd is a generic container concept in C with a Python binding that together with its siblings ndtypes and gumath allows building flexible array-computing pipelines over many kinds of data-types.

This technology will serve to underlie any array-computing framework and be a glue between machine-learning and data-science frameworks of all kinds.  Our plan is to use this tool to help reduce the data and computational silos that currently exist across the open-source ecosystem.

There is still much to work on and many more technologies to emerge.  It's an exciting time to work in machine learning, data-science, and scientific computing.  I'm thrilled that I continue to get the opportunity to be part of it.  Let me know if you'd like to be a part of our journey.

## May 08, 2018

### Matthieu Brucher

#### Address Sanitizer: alternative to valgrind

Recently, at work, I encountered a strange bug with GCC 7.2 and clang 6 (I didn’t test it with Visual Studio 2017 for different reasons). The bug was not visible on “old” compilers like gcc 4, Visual Studio 2013 or even Intel Compiler 2017. In debug mode, everything was fine, but in release mode, the application crashed. But not always at the same location.

#### Tools to debug

As we run valgrind all the time, I knew that the error could not be found with valgrind. When debugging the error, there was nothing that was wrong. All the variables were defined properly, were local or passed by value (for shared pointers), so nothing popped up.

But I had a feeling I would be able to find it with Address Sanitizer. So I ran it with the option ASAN_OPTIONS=detect_stack_use_after_return=1. And then I found it: a stack use after return. From where ASAN found the error, I could figure out that we kept a reference to a stack variable that had been destroyed.

#### What Address Sanitizer found and how to understand the reports

The following piece of code is a simplification of what was written. It may well be that the code was correct the first time it was written because Foo was supposed to be used locally. But in this context, it is not correct.

#include <iostream>

struct Foo
{
    Foo(const int& bar)
    : bar(bar)
    {}

    const int& bar;
};

Foo generate()
{
    int i = 99;
    return Foo(i);
}

int main()
{
    Foo foo = generate();

    std::cout << foo.bar << std::endl;
}

As you can see, Foo keeps a reference to an int, and in this case, that integer was allocated on the stack and has already been destroyed when we access the reference. In debug mode, you would get 99. In optimized mode, you get anything. Literally.

To compile it, just do

clang++ test.cpp -fsanitize=address

OK, so what does ASAN return?


=================================================================
==24406==ERROR: AddressSanitizer: stack-use-after-return on address 0x7f2db2c00040 at pc 0x0000005172e0 bp 0x7ffe043cf770 sp 0x7ffe043cf768
#0 0x5172df in main (/home/mbrucher/local/temp/a.out+0x5172df)
#1 0x7f2db60ffc04 in __libc_start_main (/lib64/libc.so.6+0x21c04)
#2 0x41a757 in _start (/home/mbrucher/local/temp/a.out+0x41a757)

Address 0x7f2db2c00040 is located in stack of thread T0 at offset 64 in frame
#0 0x516fcf in generate() (/home/mbrucher/local/temp/a.out+0x516fcf)

This frame has 2 object(s):
[32, 40) 'retval'
[64, 68) 'i' == Memory access at offset 64 is inside this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism or swapcontext
(longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-use-after-return (/home/mbrucher/local/temp/a.out+0x5172df) in main
0x0fe636577fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0fe636577fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0fe636577fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0fe636577fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0fe636577ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0fe636578000: f5 f5 f5 f5 f5 f5 f5 f5[f5]f5 f5 f5 f5 f5 f5 f5
0x0fe636578010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0fe636578020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0fe636578030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0fe636578040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0fe636578050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone:       fa
Freed heap region:       fd
Stack left redzone:      f1
Stack mid redzone:       f2
Stack right redzone:     f3
Stack after return:      f5
Stack use after scope:   f8
Global redzone:          f9
Global init order:       f6
Poisoned by user:        f7
Container overflow:      fc
Intra object redzone:    bb
ASan internal:           fe
Left alloca redzone:     ca
Right alloca redzone:    cb
==24406==ABORTING


The report can be confusing. The trick is to compile with -g to have proper stack information. Here, I get both where the bad memory access occurs and where I stored the wrong reference. Then, the color system lets you check what happens in memory. Here, it’s only f5, which the legend lists as stack after return (other bugs would show up as out-of-bounds accesses, deallocated memory…). So we can look for a stack variable that was used after its function returned, hence the reference that is the culprit.

#### Conclusion

Address Sanitizer is great. Of course, by default, it checks for memory leaks, out-of-bounds accesses… But it can do far more than just these. It is better than valgrind in several respects, like speed (as it’s not emulation based) but also in what it can check. It saved me lots of time already despite having used it only for a few months, so consider adopting it. It’s a puppy that doesn’t require much time.

## May 03, 2018

### Thomas Wiecki

#### An intuitive, visual guide to copulas

(c) 2018 by Thomas Wiecki

People seemed to enjoy my intuitive and visual explanation of Markov chain Monte Carlo so I thought it would be fun to do another one, this time focused on copulas.

If you ask a statistician what a copula is they might say "a copula is a multivariate distribution $C(U_1, U_2, ...., U_n)$ such that marginalizing gives $U_i \sim \operatorname{\sf Uniform}(0, 1)$". OK... wait, what? I personally really dislike these math-only explanations that make many concepts appear way more difficult to understand than they actually are and copulas are a great example of that. The name alone always seemed pretty daunting to me. However, they are actually quite simple so we're going to try and demystify them a bit. At the end, we will see what role copulas played in the 2007-2008 Financial Crisis.

## Example problem case

Let's start with an example problem case. Say we measure two variables that are non-normally distributed and correlated. For example, we look at various rivers and for every river we look at the maximum level of that river over a certain time-period. In addition, we also count how many months each river caused flooding. For the probability distribution of the maximum level of the river we can look to Extreme Value Theory which tells us that maximums are Gumbel distributed. How many times flooding occurred will be modeled according to a Beta distribution which just tells us the probability of flooding to occur as a function of how many times flooding vs non-flooding occurred.

It's pretty reasonable to assume that the maximum level and number of floodings are going to be correlated. However, here we run into a problem: how should we model that probability distribution? Above we only specified the distributions for the individual variables, irrespective of the other one (i.e. the marginals). In reality we are dealing with a joint distribution of both of these together.

Copulas to the rescue.

## What are copulas in English?

Copulas allow us to decompose a joint probability distribution into its marginals (which by definition have no correlation) and a function which couples (hence the name) them together and thus allows us to specify the correlation separately. The copula is that coupling function.
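
For readers who do want the one-line math version, this decomposition is usually stated as Sklar's theorem. The symbols here are introduced only for this illustration: $F$ is the joint CDF of two variables, $F_1$ and $F_2$ are their marginal CDFs, and $C$ is the copula.

$$F(x_1, x_2) = C\big(F_1(x_1),\, F_2(x_2)\big)$$

When the marginals are continuous, each $F_i(X_i)$ is uniform on $(0, 1)$, so $C$ only ever sees uniform inputs and carries nothing but the dependence between them.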

Before we dive into them, we must first learn how we can transform arbitrary random variables to uniform and back. All we will need is the excellent scipy.stats module and seaborn for plotting.

In [1]:
%matplotlib inline

import seaborn as sns
from scipy import stats


## Transforming random variables

Let's start by sampling uniformly distributed values between 0 and 1:

In [2]:
x = stats.uniform(0, 1).rvs(10000)
sns.distplot(x, kde=False, norm_hist=True);


Next, we want to transform these samples so that instead of uniform they are now normally distributed. The transform that does this is the inverse of the cumulative distribution function (CDF) of the normal distribution (which we can get in scipy.stats with ppf):

In [3]:
norm = stats.distributions.norm()
x_trans = norm.ppf(x)
sns.distplot(x_trans);


If we plot both of them together we can get an intuition for what the inverse CDF looks like and how it works:

In [4]:
h = sns.jointplot(x, x_trans, stat_func=None)
h.set_axis_labels('original', 'transformed', fontsize=16);


As you can see, the inverse CDF stretches the outer regions of the uniform to yield a normal.

We can do this for arbitrary (univariate) probability distributions, like the Beta:

In [5]:
beta = stats.distributions.beta(a=10, b=3)
x_trans = beta.ppf(x)
h = sns.jointplot(x, x_trans, stat_func=None)
h.set_axis_labels('original', 'transformed', fontsize=16);


Or a Gumbel:

In [6]:
gumbel = stats.distributions.gumbel_l()
x_trans = gumbel.ppf(x)
h = sns.jointplot(x, x_trans, stat_func=None)
h.set_axis_labels('original', 'transformed', fontsize=16);


In order to do the opposite transformation from an arbitrary distribution to the uniform(0, 1) we just apply the inverse of the inverse CDF -- the CDF:

In [7]:
x_trans_trans = gumbel.cdf(x_trans)
h = sns.jointplot(x_trans, x_trans_trans, stat_func=None)
h.set_axis_labels('original', 'transformed', fontsize=16);


OK, so we know how to transform from any distribution to uniform and back. In math-speak this is called the probability integral transform.
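
The round trip described above can be checked numerically. This is a minimal sketch that swaps in the standard library's `statistics.NormalDist` (whose `inv_cdf` plays the role of `ppf`) for `scipy.stats`, so the names `u`, `x`, `u_back` and `max_error` are illustrative and not from the original notebook:

```python
import random
from statistics import NormalDist

random.seed(0)
norm = NormalDist(mu=0.0, sigma=1.0)

# Uniform samples strictly inside (0, 1); inv_cdf is undefined at 0 and 1.
u = [random.uniform(1e-12, 1 - 1e-12) for _ in range(1000)]

# Uniform -> normal via the inverse CDF (scipy.stats calls this ppf).
x = [norm.inv_cdf(ui) for ui in u]

# Normal -> uniform via the CDF, undoing the previous step.
u_back = [norm.cdf(xi) for xi in x]

# Largest discrepancy between the original and the recovered uniforms.
max_error = max(abs(a - b) for a, b in zip(u, u_back))
print(max_error)
```

The recovered values should agree with the originals up to floating-point error, which is all the "inverse of the inverse CDF" statement claims.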

## Adding correlation with Gaussian copulas

How does this help us with our problem of creating a custom joint probability distribution? We're actually almost done already. We know how to convert anything uniformly distributed to an arbitrary probability distribution. So that means we need to generate uniformly distributed data with the correlations we want. How do we do that? We simulate from a multivariate Gaussian with the specific correlation structure, transform so that the marginals are uniform, and then transform the uniform marginals to whatever we like.

Create samples from a correlated multivariate normal:

In [8]:
mvnorm = stats.multivariate_normal(mean=[0, 0], cov=[[1., 0.5],
                                                     [0.5, 1.]])
# Generate random samples from multivariate normal with correlation .5
x = mvnorm.rvs(100000)

In [9]:
h = sns.jointplot(x[:, 0], x[:, 1], kind='kde', stat_func=None);
h.set_axis_labels('X1', 'X2', fontsize=16);


Now use what we learned above to "uniformify" the marginals:

In [10]:
norm = stats.norm()
x_unif = norm.cdf(x)
h = sns.jointplot(x_unif[:, 0], x_unif[:, 1], kind='hex', stat_func=None)
h.set_axis_labels('Y1', 'Y2', fontsize=16);


This joint plot above is usually how copulas are visualized.

Now we just transform the marginals again to what we want (Gumbel and Beta):

In [11]:
m1 = stats.gumbel_l()
m2 = stats.beta(a=10, b=2)

x1_trans = m1.ppf(x_unif[:, 0])
x2_trans = m2.ppf(x_unif[:, 1])

h = sns.jointplot(x1_trans, x2_trans, kind='kde', xlim=(-6, 2), ylim=(.6, 1.0), stat_func=None);
h.set_axis_labels('Maximum river level', 'Probability of flooding', fontsize=16);


Contrast that with the joint distribution without correlations:

In [12]:
x1 = m1.rvs(10000)
x2 = m2.rvs(10000)

h = sns.jointplot(x1, x2, kind='kde', xlim=(-6, 2), ylim=(.6, 1.0), stat_func=None);
h.set_axis_labels('Maximum river level', 'Probability of flooding', fontsize=16);


So there we go, by using the uniform distribution as our lingua franca we can easily induce correlations and flexibly construct complex probability distributions. This all directly extends to higher dimensional distributions as well.
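
The whole recipe fits in a few lines for the two-dimensional case. The sketch below is a standard-library stand-in for the scipy code above, not the author's implementation: `gaussian_copula_pairs` and `pearson` are hypothetical helper names, the correlated normals are built by hand from two independent draws, and the exponential marginal is just an example target.

```python
import math
import random
from statistics import NormalDist

def gaussian_copula_pairs(rho, n, seed=0):
    """Draw n pairs (u1, u2) with uniform marginals coupled by a Gaussian copula."""
    rng = random.Random(seed)
    std_norm = NormalDist()
    pairs = []
    for _ in range(n):
        # Two independent standard normals ...
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # ... mixed so that corr(x1, x2) = rho
        # (Cholesky factor of [[1, rho], [rho, 1]]).
        x1 = z1
        x2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
        # "Uniformify" each marginal with the normal CDF.
        pairs.append((std_norm.cdf(x1), std_norm.cdf(x2)))
    return pairs

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

pairs = gaussian_copula_pairs(rho=0.5, n=20000)
u1 = [p[0] for p in pairs]
u2 = [p[1] for p in pairs]

# Push the uniforms through an inverse CDF to get any marginal we like,
# here an exponential with rate 1: F^{-1}(u) = -ln(1 - u).
y2 = [-math.log(1.0 - u) for u in u2]

print(sum(u1) / len(u1))   # close to 0.5 for a uniform marginal
print(pearson(u1, u2))     # positive: the dependence survives the transform
```

Note that the correlation measured on the uniforms is somewhat lower than the 0.5 set on the normals, because the CDF is a nonlinear transform; only rank-based measures are preserved exactly.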

## More complex correlation structures and the Financial Crisis

Above we used a multivariate normal which gave rise to the Gaussian copula. However, we can use other, more complex copulas as well. For example, we might want to assume the correlation is non-symmetric which is useful in quant finance where correlations become very strong during market crashes and returns are very negative.

In fact, Gaussian copulas are said to have played a key role in the 2007-2008 Financial Crisis as tail-correlations were severely underestimated. If you've seen The Big Short, the default rates of individual mortgages (among other things) inside CDOs (see this scene from the movie as a refresher) are correlated -- if one mortgage fails, the likelihood of another failing is increased. In the early 2000s, the banks only knew how to model the marginals of the default rates. This infamous paper by Li then suggested using copulas to model the correlations between those marginals. Rating agencies relied on this model heavily, severely underestimating risk and giving false ratings. The rest, as they say, is history.

Read this paper for an excellent description of Gaussian copulas and the Financial Crisis which argues that different copula choices would not have made a difference but instead the assumed correlation was way too low.

## Getting back to the math

Maybe now the statement "a copula is a multivariate distribution $C(U_1, U_2, \dots, U_n)$ such that marginalizing gives $U_i \sim \operatorname{\sf Uniform}(0, 1)$" makes a bit more sense. It really is just a function with that property of uniform marginals. It's really only useful, though, when combined with another transform to get the marginals we want.

We can also better understand the mathematical description of the Gaussian copula (taken from Wikipedia):

For a given $R\in[-1, 1]^{d\times d}$, the Gaussian copula with parameter matrix R can be written as $C_R^{\text{Gauss}}(u) = \Phi_R\left(\Phi^{-1}(u_1),\dots, \Phi^{-1}(u_d) \right)$ where $\Phi^{-1}$ is the inverse cumulative distribution function of a standard normal and $\Phi_R$ is the joint cumulative distribution function of a multivariate normal distribution with mean vector zero and covariance matrix equal to the correlation matrix R.

Just note that in the code above we went the opposite way to create samples from that distribution. The Gaussian copula as expressed here takes uniform(0, 1) inputs, transforms them to be Gaussian, then applies the correlation and transforms them back to uniform.
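
Written out in the notation of the definition above, the sampling direction used in the code is roughly the following (a sketch; $F_i^{-1}$ is a symbol introduced here for the inverse CDF of the $i$-th target marginal and does not appear in the original text):

$$
x \sim \mathcal{N}(0, R), \qquad u_i = \Phi(x_i), \qquad y_i = F_i^{-1}(u_i), \qquad i = 1, \dots, d
$$

Because $R$ is a correlation matrix, each $x_i$ is standard normal, so each $u_i$ is uniform on $(0, 1)$ and

$$
P(U_1 \le u_1, \dots, U_d \le u_d) = \Phi_R\left(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_d)\right) = C_R^{\text{Gauss}}(u).
$$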

## Support me on Patreon

Finally, if you enjoyed this blog post, consider supporting me on Patreon which allows me to devote more time to writing new blog posts.

This post is intentionally light on math. You can find that elsewhere and will hopefully be less confused as you have a strong mental model to integrate things into. I found these links helpful:

We also haven't addressed how we would actually fit a copula model. I leave that, as well as the PyMC3 implementation, as an exercise to the motivated reader ;).

## Acknowledgements

Thanks to Adrian Seyboldt, Jon Sedar, Colin Carroll, and Osvaldo Martin for comments on an earlier draft. Special thanks to Jonathan Ng for being a Patreon supporter.

## May 02, 2018

### Continuum Analytics

#### Anaconda’s Damian Avila on the 2017 ACM Software System Award for Jupyter

I am very happy to inform you that Project Jupyter has been awarded the 2017 ACM Software System Award! As part of the Jupyter Steering Council, I am one of the official recipients of the award, but I wanted to highlight that I am just one member of a large group of people (contributors and …

The post Anaconda’s Damian Avila on the 2017 ACM Software System Award for Jupyter appeared first on Anaconda.

## May 01, 2018

### Titus Brown

#### Increasing transparency in postdoc hiring and on-boarding

This is a guest post from Dr. Rayna Harris.

This week I started a post-doc working in Titus Brown's Data Intensive Biology lab. If there is such a thing as a dream job, this is it. I've interacted with Titus and his lab members many times through BEACON, the Marine Biological Laboratory, Software Carpentry, and Data Carpentry.

One of the things I appreciate so much about Titus's style is his transparency. Here are a few of my thoughts about why the interview process and the on-boarding have gone so well.

## The quick turn around

Titus posted the job announcement on the Software Carpentry discuss list and on hackmd on March 15. During April, I had an interview, received an offer, accepted the offer, and set a May 1 start date. I wish all things in academia could move so fast.

## The interview questions

The really cool part about the interview process was that Titus posted the interview questions ahead of time here, and I submitted my responses here. This meant I could be more relaxed during the interview because the questions weren't out of the blue. Titus and his lab mates were able to ask me to delve into the details a little more or say, "cool, let's move on to the next topic".

Even before Titus informed me that I would be given the interview questions ahead of time, I knew this was coming because, well, he wrote a blog post with the interview questions used for a postdoc position building pipelines. I hope to see more of this transparency and sharing of interview questions in the future.

## The salary

The salary was posted with the job announcement, so I was never in doubt about what my salary would be. Additionally, in May 2016, Titus posted a blog about increasing postdoc pay. In this post, he makes it clear that he pays all his postdocs the same and that he doesn't negotiate salary. So, when he made me an offer, I didn't have to waste my energy negotiating salary, which freed up time to talk about other things that were valuable to me.

## Code of Conduct

I've always liked that Titus has a Code of Conduct on his lab's website. Back in January of 2017, I was recruiting some undergrads, and I wondered if I should put a Code of Conduct on my personal website (many PIs that are affiliated with The Carpentries have one on their website, but my grad advisor did not). So, I reached out to Titus and asked him what he thought. He gave me some really good advice about how the purpose of the CoC was to convey that "these are my expectations for our behavior, and these are the paths to resolution".

What I've come to realize over the past year is that even though I and many of my colleagues point to CoCs at conferences and workshops, very few of us feel equipped to respond to Code of Conduct incidents. So, I'm excited that this Friday I'll be participating in a workshop on Training for Code of Conduct Incident Response with some Carpentry colleagues. I think this is an important step toward increasing diversity in our community.

## Communication

Right now, we use Slack and GitHub for most of our communication. This means that progress on all projects is visible to the rest of the lab, and most of it is under version control. I really like both of these technologies because they let us work synchronously or asynchronously, collaborate on projects, and easily keep track of what's working well and what isn't.

## Summary

In the last two years, I applied for 13 different postdoc positions or jobs and had 10 interviews, but this one is my dream job. I'm super excited about being in an environment where our goal is to increase transparency both in our scientific methodology and in the social aspects of science. Stay tuned for more updates about our progress!

--Rayna

### Matthieu Brucher

#### Analog modelling: A prototype generic modeller in Python

A few months ago, mystran published on KVR a small SPICE simulator for real-time processing. I liked the idea, the drawback being that the code is generic and not tailored like a static version of the optimizer. So I wondered if it was doable. But for this, I have to start from the basics and build from there. So let’s go.

#### Why Python?

I’ve decided to do the prototype in Python. The reason is simple: it’s easy to create something very fast in Python, write all the basic tests there and figure out what functionality is required.

First, the objective is to have a statically generated model in the long-term, so I need to differentiate between static voltage pins, input voltage pins and output voltage pins. The latter ones could be any pins for which the voltage will be computed by the modeller.

Then we need components, like resistors, diodes, transistors… Capacitors and coils are also required, but let’s use the model I presented in a previous blog post. This will simplify writing the equations.

So what is the basic equation in an electronic modeller? Some people use MNA, but I want something small and easy to understand, so I’ll use Kirchhoff’s current law $\sum{i_{component}} = 0$. It is simple enough and (almost) all the models I use describe current from a pin as a function of voltages.

There is one thing in the prototype that is not perfect and that I haven’t figured out yet. A circuit is in a steady state when we start feeding it an audio signal. To compute it, we need to treat capacitors as open circuits (easy) and coils as short circuits (not easy). The issue with short circuits also happens when you have a variable resistor that you turn off entirely. In a dynamic model, we can easily collapse pins together when required, but in a static model, when you want to optimize the shape of the matrices, this is less than ideal. So to avoid this, I use a very small resistor. Not perfect, but it seems to work. For now.

#### Description of the different methods

Let’s start with the basic Modeler class (I wrote the prototype with American English and the C++ version in British English, seems like I can’t decide which one I should use…). The constructor takes a number of dynamic pins (the ones we will compute the voltage for), static pins (fixed input voltage) and input pins (variable input voltage). I will keep a list of the components of the model, then a structure for accessing the pins (useful for the dynamic pins so that we can get the sum of the currents) and the same for the state voltages. The distinction between dynamic, static and input voltages will be made through the characters ‘D’, ‘S’ and ‘I’ (in Python; in C++, I’ll use an enum). We actually don’t need to store anything more than the dynamic pins, but for the sake of this prototype, let’s store all of them.

import numpy as np

class Modeler(object):
    """
    Modeling class
    """
    def __init__(self, nb_dynamic_pins, nb_static_pins, nb_inputs=0):
        self.components = []
        # For each pin, the list of (component, pin index) attached to it
        self.dynamic_pins = [[] for i in range(nb_dynamic_pins)]
        self.static_pins = [[] for i in range(nb_static_pins)]
        self.input_pins = [[] for i in range(nb_inputs)]
        self.pins = {
            'D': self.dynamic_pins,
            'S': self.static_pins,
            'I': self.input_pins,
        }

        # Current voltage of each pin
        self.dynamic_state = np.zeros(nb_dynamic_pins, dtype=np.float64)
        self.static_state = np.zeros(nb_static_pins, dtype=np.float64)
        self.input_state = np.zeros(nb_inputs, dtype=np.float64)
        self.state = {
            'D': self.dynamic_state,
            'S': self.static_state,
            'I': self.input_state,
        }
        self.initialized = False

To add a new component, we need to keep the component and store the pins inside the component (because the component needs to track the pins for the current computation) and for each pin, we store the component attached to it and the index of that pin for the component. The latter will let us get the proper sign of the current for the pin when we compute our equations.

    def add_component(self, component, pins):
        """
        :param pins: list of tuples indicating how the component is connected
        """
        self.components.append(component)
        component.pins = pins
        for (i, pin) in enumerate(pins):
            t, pos = pin
            self.pins[t][pos].append((component, i))
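
To make the bookkeeping concrete, here is a small standalone sketch of the same idea for a voltage divider. It does not use the Modeler class itself: `Resistor` is a hypothetical stand-in with only the attributes needed here, and the pin table is a plain dict.

```python
class Resistor:
    """Hypothetical stand-in component; only the attributes used below exist."""
    def __init__(self, name):
        self.name = name
        self.pins = []

# One dynamic pin (the midpoint) and two static pins (supply and ground).
pins = {"D": [[]], "S": [[], []], "I": []}

def add_component(component, component_pins):
    """Same bookkeeping as Modeler.add_component, on a plain dict."""
    component.pins = component_pins
    for i, (kind, pos) in enumerate(component_pins):
        pins[kind][pos].append((component, i))

r1 = Resistor("R1")
r2 = Resistor("R2")
add_component(r1, [("S", 0), ("D", 0)])  # supply -> midpoint
add_component(r2, [("D", 0), ("S", 1)])  # midpoint -> ground

# Every component touching the dynamic pin, with its local pin index.
midpoint = [(c.name, i) for c, i in pins["D"][0]]
print(midpoint)  # [('R1', 1), ('R2', 0)]
```

The local index (1 for R1, 0 for R2) is what later tells each component which of its ends the current is being computed for.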

As I’ve said, we need to compute a steady state. When we do so, we need to start by initializing components (coils and capacitors), solve for a steady state and then update the same components to update their internal state.

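    def setup(self):
        """
        Initializes the internal state
        """
        for component in self.components:
            # Loop body lost from the original listing: per the text above,
            # coils and capacitors are initialized here.
            pass

        self.solve(True)

        for component in self.components:
            # Loop body lost from the original listing: per the text above,
            # the same components update their internal state here.
            pass

        self.initialized = True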

To update the state based on an input, we do the following, reusing our solve() method:

    def __call__(self, input):
        """
        Works out the value for the new input vector
        :param input: vector of input values
        """
        if not self.initialized:
            self.setup()

        self.input_state[:] = input

        self.solve(False)

        for component in self.components:
            component.update_state(self.state)

        return self.dynamic_state

Now we can write the solver part. We iterate several times, and for each component, we tell it to precompute its state (for costly components like diodes, transistors or valves), and then for each pin, we write an equation (remember that each pin is a state we have to solve, so with as many equations as we have pins, we should be able to get to the next state) and get the Jacobian for that equation. If the Kirchhoff equations are already satisfied (i.e. close to 0), we stop. If the delta we compute is small enough, we also stop.

    def solve(self, steady_state):
        """
        Actually solve the equation system
        :param steady_state: if set to True, computes for a steady state
        """
        iteration = 0
        while iteration < MAX_ITER and not self.iterate(steady_state):
            iteration = iteration + 1

    def iterate(self, steady_state):
        """
        Do one iteration
        :param steady_state: if set to True, computes for a steady state
        """
        for component in self.components:
            # Loop body lost from the original listing: per the text above,
            # each component precomputes its state here.
            pass
        eqs = []
        jacobian = []
        for i, pin in enumerate(self.dynamic_pins):
            eq, jac = self.compute_current(pin, steady_state)
            eqs.append(eq)
            jacobian.append(jac)

        eqs = np.array(eqs)
        jacobian = np.array(jacobian)

        # Kirchhoff equations already satisfied
        if np.all(np.abs(eqs) < EPS):
            return True

        # Newton step; stop if it no longer changes the state
        delta = np.linalg.solve(jacobian, eqs)
        if np.all(np.abs(delta) < EPS):
            return True

        self.dynamic_state -= delta

        return False
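
The same loop can be tried on the smallest nonlinear circuit I can think of: a supply, a resistor and a diode to ground, with a single unknown node voltage. This is a minimal sketch and not the prototype's code: the values of `MAX_ITER` and `EPS` are assumptions (the post does not give them), the component values are arbitrary, and the one-dimensional division stands in for `np.linalg.solve`.

```python
import math

MAX_ITER = 100   # assumed values; the post does not give its constants
EPS = 1e-8

V_SUPPLY = 1.0   # volts
R = 1e3          # ohms
I_S = 1e-14      # diode saturation current, amperes
V_T = 26e-3      # thermal voltage, volts

def kirchhoff(v):
    """Sum of currents into the node, and its derivative with respect to v."""
    diode = I_S * math.exp(v / V_T)
    eq = (V_SUPPLY - v) / R - (diode - I_S)
    jac = -1.0 / R - diode / V_T
    return eq, jac

def iterate(state):
    """One Newton step on a one-element state list; True means converged."""
    eq, jac = kirchhoff(state[0])
    if abs(eq) < EPS:
        return True
    delta = eq / jac            # scalar version of np.linalg.solve(jacobian, eqs)
    if abs(delta) < EPS:
        return True
    state[0] -= delta
    return False

state = [0.0]
iteration = 0
while iteration < MAX_ITER and not iterate(state):
    iteration += 1

print(state[0], iteration)
```

Starting from 0 V, the first step overshoots to roughly the supply voltage and the iteration then walks back down by about one thermal voltage per step, which is why diode-heavy circuits tend to need a generous iteration limit or a better starting point.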

Now the last missing part will be the actual equation and Jacobian line building from the components:

    def compute_current(self, pin, steady_state):
        """
        Compute Kirchhoff's law for the non-static pin
        Compute also the Jacobian for all the connected pins
        :param pin: tuple indicating which pin we compute the current for
        :param steady_state: if set to True, computes for a steady state
        """
        eq = sum([component.get_current(i, self.state, steady_state) for (component, i) in pin])
        jac = [0] * len(self.dynamic_state)
        for (component, j) in pin:
            for (i, component_pin) in enumerate(component.pins):
                if component_pin[0] == "D":
                    # Statement lost from the original listing: the component's
                    # derivative of the current at pin j with respect to this
                    # dynamic pin's voltage is accumulated into jac here.
                    pass
        return eq, jac

def retrieve_voltage(state, pin):
    """
    Helper function to get the voltage for a given pin
    """
    return state[pin[0]][pin[1]]

Thanks to the way the modeller is built, we can pass the entire state and keep ‘D’, ‘S’ and ‘I’ to check for which voltage we need to compute the Jacobian.
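
As an illustration of how a component could plug into this, here is a sketch of a resistor and of one Kirchhoff equation with its Jacobian row for a voltage divider. Only `get_current` and `retrieve_voltage` come from the listings above; the `Resistor` class, the sign convention (current flowing into the pin) and the `get_gradient` method name are assumptions made for this example.

```python
def retrieve_voltage(state, pin):
    """Helper from the post: look up the voltage for a (kind, index) pin."""
    return state[pin[0]][pin[1]]

class Resistor:
    """Hypothetical two-pin component; get_gradient is an assumed method name."""
    def __init__(self, resistance):
        self.G = 1.0 / resistance
        self.pins = []

    def get_current(self, pin_index, state, steady_state):
        # Current flowing into pin `pin_index`, positive when the other pin is higher.
        this_pin = self.pins[pin_index]
        other_pin = self.pins[1 - pin_index]
        return (retrieve_voltage(state, other_pin)
                - retrieve_voltage(state, this_pin)) * self.G

    def get_gradient(self, pin_index, wrt_index, state, steady_state):
        # d(current into pin_index) / d(voltage at pin wrt_index)
        return -self.G if pin_index == wrt_index else self.G

# 5 V supply, ground, and a midpoint that starts at a wrong guess of 2 V.
state = {"D": [2.0], "S": [5.0, 0.0], "I": []}
r1, r2 = Resistor(1e3), Resistor(1e3)
r1.pins = [("S", 0), ("D", 0)]
r2.pins = [("D", 0), ("S", 1)]
node = [(r1, 1), (r2, 0)]   # what the pin table would hold for dynamic pin 0

# Kirchhoff equation for the node and its Jacobian row.
eq = sum(c.get_current(i, state, False) for c, i in node)
jac = [0.0] * len(state["D"])
for component, j in node:
    for i, component_pin in enumerate(component.pins):
        if component_pin[0] == "D":   # only dynamic pins are unknowns
            jac[component_pin[1]] += component.get_gradient(j, i, state, False)

# One Newton step; the circuit is linear, so this lands on the solution.
new_voltage = state["D"][0] - eq / jac[0]
print(eq, jac, new_voltage)
```

With two equal resistors, the step moves the midpoint to half the supply voltage, which is the expected divider result.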

#### Conclusion

As this blog post is already long, I’ll pause here for now before tackling the different components and what we need to change in the modeller for some ideal components like op-amps.

The code is available on GitHub.

## April 24, 2018

### Matthieu Brucher

#### Announcement: ATKSideChainCompressor 3.0.0

I’m happy to announce the update of ATK Side-Chain Compressor based on the Audio Toolkit and JUCE. It is available on Windows (AVX compatible processors) and OS X (min. 10.9, SSE4.2) in different formats.

This update changes storage format and allows linked channels to be steered by a mix of power coming from each channel, each passing through its own attack-release filter. It enables more creative workflows with makeup gain specific to each channel. The rest of the plugin works as before, with an optional Middle/Side processing as well as side-chain working either on each channel separately or in middle/side.

This plugin requires the universal runtime on Windows, which is automatically deployed with Windows update (see this discussion on the JUCE forum). If you don’t have it installed, please check the Microsoft website.

ATK Side-Chain Compressor 3.0.0

The supported formats are:

• VST2 (32bits/64bits on Windows, 32/64bits on OS X)
• VST3 (32bits/64bits on Windows, 32/64bits on OS X)
• Audio Unit (32/64bits, OS X)

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

## April 17, 2018

### Continuum Analytics

#### Machines Learning about Humans Learning about Machines Learning

I had the great honor and pleasure of presenting the first tutorial at AnacondaCon 2018, on machine learning with scikit-learn. I spoke to a full room of about 120 enthusiastic data scientists and aspiring data scientists. I would like to thank my colleagues at Anaconda, Inc. who did such a wonderful job of organizing this …

The post Machines Learning about Humans Learning about Machines Learning appeared first on Anaconda.

### Matthieu Brucher

#### Book review: C++17 Quick Syntax Reference: A Pocket Guide to the Language, APIs and Library

I work on a day-to-day basis on a big project that has many developers with different C++ levels. Scott Meyers wrote a wonderful book on modern C++ (that I still need to review one day, especially since there is a new Effective Modern C++), but it is not for beginners. So I’m looking for that rare book with modern C++ and an explanation of good practices.

#### Discussion

Let’s cut to the chase right away. It’s not this book. This book is bad. Very bad. So at the core, it’s supposed to be about the syntax, but even if it was about the syntax, you can still teach the good approach, can’t you?

A few examples. Templates are tackled in one of the last chapters, and so are classes. Then, almost from the beginning, the book tells people to use using namespace std. Is there anything more to add?

Yes, there is. New and delete are tackled, then the array version comes much later, and I’m not even talking about smart pointers. They are addressed, but so late that people will think it is still fine to start without them. Yes, talk about new/delete, but RIGHT AWAY, say that readers should use std::unique_ptr, std::shared_ptr and the make_* versions. It’s supposed to be about C++17, and in C++17, we avoid new/delete. OK, it is mentioned, but in 2 lines after several chapters of bad practices.

Range-based for loops are introduced badly as well: for(auto& i: l) std::cout << i << std::endl; Why? Why the &? Why can’t you explain the purpose of this instead of waiting additional chapters and not even talking about when you use pass by value, pass by ref or pass by const ref?

I’m still trying to figure out why it is supposed to be a syntax book, but still the author tackles smart pointers. And tuples. Why not the rest of the standard library?

A good C++ book should start by presenting templates as soon as possible, the standard library and the good practices. Yes, it’s tough, but that’s why not everyone should write a C++ book.

#### Conclusion

The book is supposed to be about the syntax. But it lacks the good practices, with no reference to the C++ core guidelines. In the end, you still need to read another book to learn Modern C++ (hint…).

## April 16, 2018

### Continuum Analytics

#### AnacondaCON 2018 Recap: An Exploration of Modern Data Science

Last year’s inaugural AnacondaCON was a major milestone for our company. Our goal was to create a conference that highlights all the different ways people are using data science and predictive analytics, and reflects the passionate and eclectic nature of our growing Python community. When over 400 people descended upon Austin to connect with peers …

The post AnacondaCON 2018 Recap: An Exploration of Modern Data Science appeared first on Anaconda.

## April 12, 2018

### Continuum Analytics

#### What You Missed on Day Three of AnacondaCON 2018

And that’s a wrap! Yesterday was the third and final day of AnacondaCON 2018, and what a ride it’s been. Read some highlights from what you missed, and stay tuned for our comprehensive AnacondaCON 2018 recap, coming soon! Improving Your Anaconda Distribution User Experience Anaconda Product Manager Crystal Soja presented a roadmap of upcoming plans …

The post What You Missed on Day Three of AnacondaCON 2018 appeared first on Anaconda.

### Randy Olson

#### Traveling salesman portrait in Python

Last week, Antonio S. Chinchón made an interesting post showing how to create a traveling salesman portrait in R. Essentially, the idea is to sample a bunch of dark pixels in an image, solve the well-known traveling salesman problem for

## April 11, 2018

### Continuum Analytics

#### What You Missed on Day Two of AnacondaCON 2018

What a day! On Tuesday we got started bright and early, then partied our way into the night. Here are some highlights from Day Two of AnacondaCON 2018. Opening Keynote: John Kim John Kim, President of HomeAway, kicked things off for us with a personal, touching keynote on Love in the Age of Machine Learning. …

The post What You Missed on Day Two of AnacondaCON 2018 appeared first on Anaconda.

## April 10, 2018

### Matthieu Brucher

#### Book review: LLVM Cookbook

After the book on LLVM core libraries, I want to have a look at the cookbook.

#### Discussion

The idea was that once I had a broad view of LLVM, I could try to apply some recipes for what I wanted to do. Let’s just say that I was deeply mistaken.

First, the two authors have a very different way of writing code. One of them is… rubbish. I don’t think there is another way of saying this, but this is C++, and the guy writes C++ code as if it was C code, no class, with static states, without the override keyword. If such a guy is a professional developer, I’m sorry but I’m very scared about anything he would write professionally.

The second guy is better (he uses override, for instance, so it’s very disturbing to see both styles in the same book), it’s just too bad that the code he writes seems to be just showing things existing in LLVM, but no real recipes (OK, I’m exaggerating, there are a few such examples, but the majority is “execute that command to see how LLVM does this”, and just doing “this” doesn’t have any relevance in the big picture).

I suppose the only relevant and interesting parts are the first few recipes that are focused on reusing LLVM parts for a custom language. The rest is basically explanations of the later stages in a compiler. Basically what you would get from my previous review, without the explanations…

#### Conclusion

Have you ever read a recipe book that will explain how to prepare your kitchen for cooking instead of actually cooking recipes? This book is like that. You might learn how to use LLVM commands, but not LLVM libraries. Avoid.

## April 09, 2018

### Continuum Analytics

#### What You Missed on Day One of AnacondaCON 2018

And we’re off! Day One of AnacondaCON 2018 is officially in the books, y’all. For those of you who couldn’t make the trek to Texas, here are some highlights from what you missed today. “Why are they shooting at us?” “They’re the IT team!” The festivities kicked off this morning with a movie trailer for deep learning …

The post What You Missed on Day One of AnacondaCON 2018 appeared first on Anaconda.

#### Introducing the Anaconda Data Science Certification Program

There is strong demand today for data science skills across all sectors of the economy. Organizations worldwide are actively looking to recruit qualified data scientists and improve the skills of their existing teams. Individuals are looking to stand out from the competition and differentiate themselves in a growing marketplace. As the creators of the world’s …

The post Introducing the Anaconda Data Science Certification Program appeared first on Anaconda.

#### Anaconda Debuts Data Science Certification Program

Certification to Standardize Data Science Skill Set among Employers and Professionals AnacondaCON, Austin, TX—April 9, 2018 — Anaconda, the most popular Python data science platform provider, today introduced the Anaconda Data Science Certification, giving data scientists a way to verify their proficiency and organizations an independent standard for qualifying current and prospective data science experts. “The …

The post Anaconda Debuts Data Science Certification Program appeared first on Anaconda.

## April 03, 2018

### Matthieu Brucher

#### Book review: Getting Started with LLVM Core Libraries

LLVM has always intrigued me. Actually, I always thought about one day writing a compiler. But it was more a challenge than a requirement for any of my work, private or professional, so I never dived into it. The design of LLVM was also very well thought out, and probably close to something I would have liked to create.

So now the easiest is just to use LLVM for the different goals I want to achieve. I recently had to write clang-tidy rules, and I also want to perhaps create a JIT for Audio Toolkit and the modeling libraries. So lots of reasons to look at LLVM.

#### Discussion

The book more or less goes from C/C++ parsing to code generation.

Of course, the first chapters are about setting everything up. The book mainly uses Makefiles, which are not an option anymore in current LLVM versions. But it does provide the equivalent CMake version, so it is fine. Also, the structure of the projects has not changed, so everything still works. Of course, lots of projects have also matured over time (lld, libcxx…), so when you read that something is not yet production ready, check online (if you can find the information; I have to say that LLVM communication is very bad, just look at the release notes to get an idea!).

The third chapter tackles LLVM design. That’s what I liked with LLVM, the modular design, but it can also be scary because you can build more or less anything, and the APIs do evolve with time. But the chapter does reassure me, and helps in understanding the philosophy.

Then, in the fourth chapter, we start working through the clang pipeline, beginning with all the steps between the C/C++ code and the LLVM Intermediate Representation. The AST and interaction with it are very well presented, along with the different stages required to generate the IR. The missing bit may be an explanation of why the AST is so important to have, and why the LLVM people had to create a new intermediate representation for this front-end.

The fifth chapter is about everything we can do on the IR. I left the chapter still hungry for more. OK, the IR phases can evolve the graph, but it feels like not enough here. How does the matching actually work? This is where you can see that the book is for beginners and not for intermediate or advanced users. Also it made me realize that there is no way I can generate IR directly for my projects, I would go from a C++ AST to IR to the JIT…

After working on the IR, of course, we get to code generation and the different tools in LLVM to generate either byte code or machine code and everything in-between. Lots of time is devoted to explaining that this phase is very costly, as we go from something quite generic to something definitely not generic, and this part was very instructive.

The seventh chapter was strange. It spent lots of time talking about a part of LLVM that was about to be removed, the “old” JIT framework. I suppose at the time the new one was too new and some people still had to understand the old one. I still felt it was a bit of a waste of space.

Cross-compilation is tackled after that, and more precisely the fact that you may not need to do anything. This is also where one can see the limit of LLVM. To get the proper backends, you need to get the gcc toolchain. I think this is still something people do today. Even for clang 6, I actually compiled it against a gcc 7 set so that I don’t have to rebuild all the C++ third-party libraries. Also the ARM backend seemed to be broken for a long time, so that’s also not very great for trust!

The last two chapters tackle tools made with clang. The first one is the static analyzer, and I have to say that I didn’t even know it existed. There are tools with it that generate HTML reports, and I liked that. But when I tried to use them with CMake, they just broke (scan_build). There is a chapter about libclang and clang-tidy, which is probably my reference now. Something that wasn’t done in 2014 is that the static analyzer rules are now integrated inside clang-tidy, it’s just that it can’t build HTML reports out of them. Is it really mandatory? It gives a better view of static code issues (whereas the other rules are geared towards sugar-coating).

The book ends very quickly in a small paragraph at the end of the libclang chapter. Very disturbing.

#### Conclusion

Despite the age of the book and the changes that went into LLVM (clang-modernize is now part of clang-tidy, DragonEgg is… I don’t know where it went), the book seems to stay very much current (clang is still the main front-end). I would have liked more examples on clang AST matchers, but I suppose it requires a full cookbook, and the audience may not be that big. Still, I’m looking forward to using the different bits to write a JIT and C++ output for electronic modeling/SPICE.

## March 30, 2018

### Continuum Analytics

#### Improved Security & Performance in Anaconda Distribution 5

We announced the release of Anaconda Distribution 5 back in October 2017, but we’re only now catching up with a blog post on the security and performance implications of that release.  Improving security and enabling new language features were our primary goals, but we also reaped some performance improvements along the way. This blog post …

The post Improved Security & Performance in Anaconda Distribution 5 appeared first on Anaconda.

## March 28, 2018

### Continuum Analytics

#### Anaconda Community Survey

If you’re an Anaconda user, we’d love to hear from you! Please complete our short survey below, or by clicking on this link. As an extra incentive, when you complete the survey you can enter a drawing to win a Sonos One Smart Speaker with Amazon Alexa.

The post Anaconda Community Survey appeared first on Anaconda.