September 19, 2017

Matthieu Brucher

Book review: Audio Effects: Theory, Implementation and Application

I love reading books on signal processing, especially on audio signal processing. From books on using effects to so-called references, I still enjoy reading them, even if they are frustrating. The only one that wasn't frustrating is DAFX: Digital Audio Effects, but I haven't written a review of it!

This book came after I read a book on JUCE and started using it myself. As this book uses JUCE for its plugins, I thought it would be a good one.

Discussion

The book starts with a great introduction to signal processing. It actually gives all the theory for linear time-invariant filters. Once this is set out, I'm not even sure you need more DSP theory!

The second chapter dials back and tackles delays, but with a theoretical approach. IMHO it would actually have been better to give the theory bit by bit, as we don't need to know anything about the z-transform for any delay. I still think that Zölzer's approach to chorus (low-frequency random noise instead of an LFO) is the best…

After that, there are two chapters on general LTI effects: first filter design, with all the equations to switch between the different types of filters, and then their application to EQ, wah-wah and others. It was interesting, but there is a lot of talk and the code quality is… bad. Really, you don't call modulo in the DSP loop.

The next chapter is yet another mystery: amplitude modulation. The concepts are so basic that it should have been before delays… Not even talking about it.

OK, the chapter on dynamics processing does introduce compression properly, but the authors forgot that compressors and expanders are more or less the same thing, and they conflated type I compressors (compressing above the threshold) and type II expanders (expanding below the threshold) with compressors and expanders in general. The only one you can't build is a type II compressor; type I expanders do exist.

The chapter on overdrive is “cute”. I mean, waveshaping and oversampling, that’s great, but a proper book should tackle analog modeling as well in this kind of chapter…

The following chapter almost tackles vocoders. I say almost because while it starts with the theory and the short-term FFT, it completely fails at the theory of the FFT. That's quite astonishing compared with the first chapter, but seriously, the processing the authors present assumes a periodic signal. And even then, the result differs depending on the FFT window. Why didn't they warn that this kind of effect is not causal? And then people go and design effects directly in the frequency domain and completely mess them up. Seriously!

After that chapter, it was all downhill: spatial audio effects (pan, more or less, with a mention of HRTF), the Doppler effect (a rehash of the delay chapter with some more information), reverberation (which should have come right after delays), a chapter on audio production (the signal path in a console or a DAW) and then a short chapter on JUCE (so short that I'm not really sure it matters). I suppose that from my tone, you understand that the second half of the book was a big disappointment.

Conclusion

So there is still no better DSP book than DAFX, and definitely no proper book on DSP programming… Avoid this book; it is too expensive for what it brings to the table.

by Matt at September 19, 2017 07:05 AM

September 18, 2017

numfocus

The Econ-ARK joins NumFOCUS Sponsored Projects

​NumFOCUS is pleased to announce the addition of the Econ-ARK to our fiscally sponsored projects. As a complement to the thriving QuantEcon project (also a NumFOCUS sponsored project), the Econ-ARK is creating an open-source resource containing the tools needed to understand how diversity across economic agents (in preferences, circumstances, knowledge, etc) leads to richer and […]

by Gina Helfrich at September 18, 2017 05:06 PM

Matthew Rocklin

Dask on HPC - Initial Work

This work is supported by Anaconda Inc. and the NSF EarthCube program.

We recently announced a collaboration between the National Center for Atmospheric Research (NCAR), Columbia University, and Anaconda Inc to accelerate the analysis of atmospheric and oceanographic data on high performance computers (HPC) with XArray and Dask. The full text of the proposed work is available here. We are very grateful to the NSF EarthCube program for funding this work, which feels particularly relevant today in the wake (and continued threat) of the major storms Harvey, Irma, and Jose.

This is a collaboration of academic scientists (Columbia), infrastructure stewards (NCAR), and software developers (Anaconda and Columbia and NCAR) to scale current workflows with XArray and Jupyter onto big-iron HPC systems and peta-scale datasets. In the first week after the grant closed a few of us focused on the quickest path to get science groups up and running with XArray, Dask, and Jupyter on these HPC systems. This blogpost details what we achieved and some of the new challenges that we’ve found in that first week. We hope to follow this blogpost with many more to come in the future. Today we cover the following topics:

  1. Deploying Dask with MPI
  2. Interactive deployments on a batch job scheduler, in this case PBS
  3. The virtues of JupyterLab in a remote system
  4. Network performance and 3GB/s infiniband
  5. Modernizing XArray’s interactions with Dask’s distributed scheduler

A video walkthrough of deploying Dask with XArray on an HPC system is available on YouTube, and instructions for atmospheric scientists with access to the Cheyenne Supercomputer are available here.

Now let's start with the technical issues:

Deploying Dask with MPI

HPC systems use job schedulers like SGE, SLURM, PBS, LSF, and others. Dask has been deployed on all of these systems before either by academic groups or financial companies. However every time we do this it’s a little different and generally tailored to a particular cluster.

We wanted to make something more general. This started out as a GitHub issue on PBS scripts that tried to make a simple common template that people could copy-and-modify. Unfortunately, there were significant challenges with this. HPC systems and their job schedulers seem to focus on, and easily support, only two common use cases:

  1. Embarrassingly parallel “run this script 1000 times” jobs. This is too simple for what we have to do.
  2. MPI jobs. This seemed like overkill, but is the approach that we ended up taking.

Deploying dask is somewhere between these two. It falls into the master-slave pattern (or perhaps more appropriately coordinator-workers). We ended up building an MPI4Py program that launches Dask. MPI is well supported, and more importantly consistently supported, by all HPC job schedulers so depending on MPI provides a level of stability across machines. Now dask.distributed ships with a new dask-mpi executable:

mpirun --np 4 dask-mpi

To be clear, Dask isn’t using MPI for inter-process communication. It’s still using TCP. We’re just using MPI to launch a scheduler and several workers and hook them all together. In pseudocode the dask-mpi executable looks something like this:

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # MPI rank 0 becomes the Dask scheduler ...
    start_dask_scheduler()
else:
    # ... and every other rank becomes a worker
    start_dask_worker()

Socially this is useful because every cluster management team knows how to support MPI, so anyone with access to such a cluster has someone they can ask for help. We’ve successfully translated the question “How do I start Dask?” to the question “How do I run this MPI program?” which is a question that the technical staff at supercomputer facilities are generally much better equipped to handle.

Working Interactively on a Batch Scheduler

Our collaboration is focused on interactive analysis of big datasets. This means that people expect to open up Jupyter notebooks, connect to clusters of many machines, and compute on those machines while they sit at their computer.

Unfortunately most job schedulers were designed for batch scheduling. They will try to run your job quickly, but don't mind waiting a few hours for a nice set of machines on the supercomputer to open up. As you ask for more time and more machines, waiting times can increase drastically. For most MPI jobs this is fine because people aren't expecting to get a result right away and they're certainly not interacting with the program, but in our case we really do want some results right away, even if they're only part of what we asked for.

Handling this problem long term will require both technical work and policy decisions. In the short term we take advantage of two facts:

  1. Many small jobs can start more quickly than a few large ones. These take advantage of holes in the schedule that are too small to be used by larger jobs.
  2. Dask doesn’t need to be started all at once. Workers can come and go.

And so I find that if I ask for several single machine jobs I can easily cobble together a sizable cluster that starts very quickly. In practice this looks like the following:

$ qsub start-dask.sh      # only ask for one machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine

Our main job has a wall time of about an hour. The workers have shorter wall times. They can come and go as needed throughout the computation as our computational needs change.
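
As a side note, here is a minimal sketch of how one might watch that happen from Python, using the scheduler file that dask-mpi writes out; the 'scheduler.json' path and the polling loop are just assumptions for illustration:

from dask.distributed import Client
import time

# connect to the scheduler started by the qsub start-dask.sh job
client = Client(scheduler_file='scheduler.json')

# poll the scheduler and report how many workers are currently connected;
# the count grows as add-one-worker.sh jobs clear the queue and shrinks
# again as their wall times expire
for _ in range(20):
    n_workers = len(client.scheduler_info()['workers'])
    print("workers currently connected:", n_workers)
    time.sleep(30)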

Jupyter Lab and Web Frontends

Our scientific collaborators enjoy building Jupyter notebooks of their work. This allows them to manage their code, scientific thoughts, and visual outputs all at once and for them serves as an artifact that they can share with their scientific teams and collaborators. To help them with this we start a Jupyter server on the same machine in their allocation that is running the Dask scheduler. We then provide them with SSH-tunneling lines that they can copy-and-paste to get access to the Jupyter server from their personal computer.

We’ve been using the new Jupyter Lab rather than the classic notebook. This is especially convenient for us because it provides much of the interactive experience that they lost by not working on their local machine. They get a file browser, terminals, easy visualization of textfiles and so on without having to repeatedly SSH into the HPC system. We get all of this functionality on a single connection and with an intuitive Jupyter interface.

For now we give them a script to set all of this up. It starts Jupyter Lab using Dask and then prints out the SSH-tunneling line.

from dask.distributed import Client

# connect to the scheduler that dask-mpi wrote out to scheduler.json
client = Client(scheduler_file='scheduler.json')

import socket

# find out which machine the scheduler is running on
host = client.run_on_scheduler(socket.gethostname)

def start_jlab(dask_scheduler):
    # launch Jupyter Lab next to the scheduler and keep a handle to the process
    import subprocess
    proc = subprocess.Popen(['jupyter', 'lab', '--ip', host, '--no-browser'])
    dask_scheduler.jlab_proc = proc

client.run_on_scheduler(start_jlab)

# print the SSH-tunneling line to copy-and-paste (dashboard on 8787, Jupyter on 8888)
print("ssh -N -L 8787:%s:8787 -L 8888:%s:8888 -L 8789:%s:8789 cheyenne.ucar.edu" % (host, host, host))

Long term we would like to switch to an entirely point-and-click interface (perhaps something like JupyterHub) but this will require additional thinking about deploying distributed resources along with the Jupyter server instance.

Network Performance on Infiniband

The intended computations move several terabytes across the cluster. On this cluster Dask gets about 1GB/s simultaneous read/write network bandwidth per machine using the high-speed Infiniband network. For any commodity or cloud-based system this is very fast (about 10x faster than what I observe on Amazon). However for a super-computer this is only about 30% of what’s possible (see hardware specs).

I suspect that this is due to byte-handling in Tornado, the networking library that Dask uses under the hood. The following image shows the diagnostic dashboard for one worker after a communication-heavy workload. We see 1GB/s for both read and write. We also see 100% CPU usage.

Network performance is a big question for HPC users looking at Dask. If we can get near MPI bandwidth then that may help to reduce concerns for this performance-oriented community.

How do I use Infiniband network with Dask?

XArray and Dask.distributed

XArray was the first major project to use Dask internally. This early integration was critical to prove out Dask’s internals with user feedback. However it also means that some parts of XArray were designed well before some of the newer parts of Dask, notably the asynchronous distributed scheduling features.

XArray can still use Dask on a distributed cluster, but only with the subset of features that are also available with the single machine scheduler. This means that persisting data in distributed RAM, parallel debugging, publishing shared datasets, and so on all require significantly more work today with XArray than they should.
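
To make the "distributed RAM" point concrete, here is a minimal sketch of what persisting data on the cluster looks like with a plain dask array (deliberately not XArray, given the limitation just described); the array shape, chunking, and scheduler file path are arbitrary assumptions.

import dask.array as da
from dask.distributed import Client, wait

client = Client(scheduler_file='scheduler.json')

# build a lazy random array, chunked into pieces the workers can hold
x = da.random.random((100000, 50000), chunks=(10000, 5000))

# start computing and keep the resulting chunks in distributed memory
x = client.persist(x)
wait(x)

# later reductions now read from RAM instead of recomputing from scratch
print(x.mean().compute())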

To address this we plan to update XArray to follow a newly proposed Dask interface. This is complex enough to handle all Dask scheduling features, but light weight enough not to actually require any dependence on the Dask library itself. (Work by Jim Crist.)

We will also eventually need to look at reducing overhead for inspecting several NetCDF files, but we haven’t yet run into this, so I plan to wait.

Future Work

We think we’re at a decent point for scientific users to start playing with the system. We have a Getting Started with Dask on Cheyenne wiki page that our first set of guinea pig users have successfully run through without much trouble. We’ve also identified a number of issues that the software developers can work on while the scientific teams spin up.

  1. Zero copy Tornado writes to improve network bandwidth
  2. Enable Dask.distributed features in XArray by formalizing dask’s expected interface
  3. Dynamic deployments on batch job schedulers

We would love to engage other collaborators throughout this process. If you or your group work on related problems we would love to hear from you. This grant isn’t just about serving the scientific needs of researchers at Columbia and NCAR, but about building long-term systems that can benefit the entire atmospheric and oceanographic community. Please engage on the Pangeo GitHub issue tracker.

September 18, 2017 12:00 AM

September 15, 2017

numfocus

2017 GSoC Students Complete Work on NumFOCUS Projects

The summer has come to an end and so have the internships of our GSoC students. Ten students have finished their projects successfully and now continue to contribute to our open source community! FEniCS: Ivan Yashchuk has implemented quadrilateral and hexahedral meshes for finite element methods. Michal Habera has added support for the XDMF format.   […]

by Gina Helfrich at September 15, 2017 09:12 PM

September 14, 2017

Enthought

Webinar: Python for MATLAB Users: What You Need To Know

What:  A guided walkthrough and Q&A about how to migrate from MATLAB® to Python with Enthought Lead Instructor, Dr. Alexandre Chabot-Leclerc.

Who Should Watch: MATLAB® users who are considering migrating to Python, either partially or completely.

REGISTER


Python has a lot of momentum. Many high-profile projects use it and more are migrating to it all the time. Why? One reason is that Python is free, but more importantly, it is because Python has a thriving ecosystem of packages that allow developers to work faster and more efficiently. They can go from prototyping to production to scale on hardware ranging from a Raspberry Pi (or maybe a microcontroller) to a cluster, all using the same language. A large part of Python's growth is driven by its excellent support for work in the fields of science, engineering, machine learning, and data science.

You and your organization might be thinking about migrating from MATLAB to Python to get access to the ecosystem and increase your productivity, but you might also have some outstanding questions and concerns, such as: How do I get started? Will any of my knowledge transfer? How different are Python and MATLAB? How long will it take me to become proficient? Is it too big of a shift? Can I transition gradually or do I have to do it all at once? These are all excellent questions.

We know people put a lot of thought into the tools they select and that changing platforms is a big deal. We created this webinar to help you make the right choice.

In this webinar, we’ll give you the key information and insight you need to quickly evaluate whether Python is the right choice for you, your team, and your organization, including:

  • How to get started
  • What you need in order to replicate the MATLAB experience
  • Important conceptual differences between MATLAB and Python
  • Important similarities between MATLAB and Python: What MATLAB knowledge will transfer
  • Strategies for converting existing MATLAB code to Python
  • How to accelerate your transition

REGISTER


Presenter: Dr. Alexandre Chabot-Leclerc, Enthought Lead Instructor

Ph.D, Electrical Engineering, Technical University of Denmark

 


Python for Scientists & Engineers Training: The Quick Start Approach to Turbocharging Your Work

If you are tired of running repeatable processes manually and want to (semi-) automate them to increase your throughput and decrease pilot error, or you want to spend less time debugging code and more time writing clean code in the first place, or you are simply tired of using a multitude of tools and languages for different parts of a task and want to replace them with one comprehensive language, then Enthought’s Python for Scientists and Engineers is definitely for you!

This class has been particularly appealing to people who have been using other tools like MATLAB or even Excel for their computational work and want to start applying their skills using the Python toolset.  And it’s no wonder — Python has been identified as the most popular coding language for five years in a row for good reason.

One reason for its broad popularity is its efficiency and ease-of-use. Many people consider Python more fun to work in than other languages (and we agree!). Another reason for its popularity among scientists, engineers, and analysts in particular is Python’s support for rapid application development and extensive (and growing) open source library of powerful tools for preparing, visualizing, analyzing, and modeling data as well as simulation.

Python is also an extraordinarily comprehensive toolset – it supports everything from interactive analysis to automation to software engineering to web app development within a single language and plays very well with other languages like C/C++ or FORTRAN so you can continue leveraging your existing code libraries written in those other languages.

Many organizations are moving to Python so they can consolidate all of their technical work streams under a single comprehensive toolset. In the first part of this class we’ll give you the fundamentals you need to switch from another language to Python and then we cover the core tools that will enable you to do in Python what you were doing with other tools, only faster and better!

Additional Resources

Upcoming Open Python for Scientists & Engineers Sessions:

Washington, DC, Sept 25-29
Los Alamos, NM, Oct 2-6, 2017
Cambridge, UK, Oct 16-20, 2017
San Diego, CA, Oct 30-Nov 3, 2017
Albuquerque, NM, Nov 13-17, 2017
Los Alamos, NM, Dec 4-8, 2017
Austin, TX, Dec 11-15, 2017

Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.

Learn More

Download Enthought’s MATLAB to Python White Paper

Additional Webinars in the Training Series:

Python for Scientists & Engineers: A Tour of Enthought’s Professional Technical Training Course

Python for Data Science: A Tour of Enthought’s Professional Technical Training Course

Python for Professionals: The Complete Guide to Enthought’s Technical Training Courses

An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop

Download Enthought’s Machine Learning with Python’s Scikit-Learn Cheat Sheets

The post Webinar: Python for MATLAB Users: What You Need To Know appeared first on Enthought Blog.

by admin at September 14, 2017 06:00 PM

September 08, 2017

Titus Brown

Classifying genome bins using a custom reference database, part I (Saturday Morning Bioinformatics)

Last week, Trina McMahon sent up a Twitter flare, asking for a tool that would taxonomically classify genome bins using a custom reference database. Based on having watched her excellent JGI talk, my understanding was that Trina was interested in taking single-cell and metagenome assembled genomes (SAGs and MAGs, in the parlance of the day - I'll just call them "bins" below) and identifying those using private databases. One particular use case of interest was using in-house databases of SAGs and MAGs that had already been classified.

I don't know how connected Trina's comment was to my previous blogging, but last Monday I put together a short blog post on taxonomic exploration of two sets of genome bins extracted from the Tara ocean data. That blog post focused rather narrowly on identifying genome bins that might be "confused" in the sense that they contained k-mers with disagreeing taxonomic assignments (see my blog post on k-mers and taxonomy for info here). Using the code in that blog post, it was in theory quite easy to do taxonomic assignment of unassigned genome bins.

Separately, Tom Delmont (the lead author on one of the genome binning studies I'd used for the blog post) implied that they had not compared their genome bins to GenBank genomes. That was basically the same problem that Trina wanted solved.

So, I put together a script to do something simple but hopefully useful, and (bonus!!) I added some pretty viz using metacodeR. The end result of applying taxonomic classification to a directory (with plotting) looks something like this, for the Tully et al. study:

A pretty metacodeR picture


When reading the stuff below, please bear in mind that I am doing nucleotide-level classification, which is only going to work at the ~species level. If you are interested in an approach that is more sensitive for more distant evolutionary relationships, see Meren's blog on their classification approach using CheckM.

All of the below was run on my laptop in under ~2 GB of RAM and less than 10 minutes of compute.


Starting at the end

If you want to use the ~100,000 Genbank genomes to classify your genome bins, and you already have a directory of genome signatures computed with sourmash, then you can run the classify-genome-sigs.py script like so:

# download genbank DB
curl -L https://osf.io/zfmbd/download?version=1 -o genbank-lca-2017.08.26.tar.gz

# unpack LCA db
mkdir db
cd db
tar xzf ../genbank-lca-2017.08.26.tar.gz
cd ../

# Run classifier on the whole set of genomes
./classify-genome-sigs.py tully-genome-sigs --lca db/genbank.lca.json

# Plot!
Rscript plot-metacoder.R tully-genome-sigs.taxonomy.csv tully.metacoder.pdf

and you should get a much sparser version of the figure above.

The reason you're getting only a sparse set of classifications is that you're using genbank, and most of the MAGs in the Tully and Delmont studies are unknown. What you really want to use is a custom database - in this case, we'll use custom LCA databases built from the delmont and tully data sets, but you can use any LCA databases you've built.

To classify using custom databases, you've got to download and unpack the databases, and then add them to the end of the command above:

# grab the Tully et al. classifications
curl -L https://osf.io/3pc7j/download -o tully-lca-2017.09.04.tar.gz 
curl -L https://osf.io/bw2dj/download -o delmont-lca-2017.09.04.tar.gz

# unpack 'em
tar xzf tully-lca-2017.09.04.tar.gz
tar xzf delmont-lca-2017.09.04.tar.gz

# Run classifier on the whole set of genomes with all three LCA dbs
./classify-genome-sigs.py tully-genome-sigs \
    --lca db/genbank.lca.json db/tully.lca.json db/delmont.lca.json

# Plot!
Rscript plot-metacoder.R tully-genome-sigs.taxonomy.csv tully.metacoder.pdf

After a bunch of diagnostic output, you should see some summary output:

classified sourmash signatures in directory: 'tully-genome-sigs'
LCA databases used: db/genbank.lca.json, db/tully.lca.json, db/delmont.lca.json

total signatures found: 2631
no classification information: 624

could classify 2007
of those, 73 disagree at some rank.

number classified unambiguously, by lowest classification rank:
        superkingdom: 35
        phylum: 285
        class: 527
        family: 656
        genus: 339
        species: 91

disagreements by rank:
        superkingdom: 1
        phylum: 11
        order: 4
        class: 8
        family: 10
        genus: 10
        species: 29

classification output as CSV, here: tully-genome-sigs.taxonomy.csv

This gives you an output file tully-genome-sigs.taxonomy.csv containing the various assignments, and you should also have a tully.metacoder.pdf that looks like the image above.

Interpreting the CSV file

The CSV file has five columns: name, taxid, status, rank_info, and lineage. name is the name in the signature file; taxid and lineage are the NCBI taxonomic ID assigned to that name, and the taxonomic lineage of that ID.

The status and rank_info require a bit more explanation.

There are three possible status values at present: nomatch, found, and disagree.

nomatch is hopefully obvious - no (or insufficient) taxonomic information was found for k-mers in that signature, so you get a taxid of 0, and an empty rank and lineage:

TARA_IOS_MAG_00005,0,nomatch,,

found is hopefully also pretty obvious: it says that a simple straightforward lineage was found across all of the databases, and gives you that lineage down to the most detailed rank found:

TARA_PSW_MAG_00129,28108,found,species,Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Alteromonas;Alteromonas macleodii

disagree is the least clear. Here, the assigned rank is the rank immediately above where there is a taxonomic disagreement, and the taxid & lineage refer to the name at that rank (the least-common-ancestor at which an assignment can be made).

For example, look at this line in the CSV file:

TARA_ASW_MAG_00029,1224,disagree,phylum,Bacteria;Proteobacteria

TARA_ASW_MAG_00029 has k-mers that are shared between different orders: 'Pseudomonadales' and 'Rhodobacterales'. Therefore, the classifier status is disagree, and the classified taxid is at rank phylum - just above order.
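
For a quick sanity check on the output, here is a minimal Python sketch that tallies the status column and prints the disagreements; it assumes the header row carries the five column names described above.

import csv
import collections

counts = collections.Counter()
with open('tully-genome-sigs.taxonomy.csv', newline='') as fp:
    for row in csv.DictReader(fp):
        counts[row['status']] += 1
        if row['status'] == 'disagree':
            # rank just above the disagreement, plus the LCA lineage
            print(row['name'], row['rank_info'], row['lineage'])

for status, n in counts.most_common():
    print(status, n)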

Plotting with metacodeR

I like pretty pictures as much as the next person, but I rarely spend the time to make them - I'm just not good at plotting!

However, when I was last visiting Oregon State (to give a talk at the Oregon State Microbiome Initiative) I met Zachary Foster, a student of Nik Grunwald, and saw their metacodeR poster. metacodeR is an R package for making pretty taxonomic figures, and since I was deep in the throes of sourmash taxonomic foo I asked him for some help, which he promptly gave.

Fast forward to today, and I thought, "hey, Zach said that I should be able to use metacodeR with this kind of output - why don't I give it a try?"

At this point I should confess that I am not a very good R coder, so this filled me with trepidation. But leveling up is what Saturday Morning Bioinformatics is for! So I opened up a new window for all the R syntax google searches I was going to be doing, started up RStudio, installed metacodeR, and ... after about an hour of struggling with R, got a script working!

You can see the script in all its glory here, or I've pasted it below - mainly because I need some help understanding the line that says names(csvnames) <- csvnames, which is Magic that is needed for extract_taxonomy but that I do not understand the need for.

# load the output of 'classify-genome-sigs.py'
library('metacoder')
library('dplyr')

load_classify_csv <- function(filename) {
  csvdata = read.csv(filename, header=TRUE)

  # eliminate empty lineages
  csvdata <- filter(csvdata, lineage != "")

  # create a mashup of name and lineage to satisfy extract_taxonomy
  csvnames <- paste(csvdata$name, csvdata$lineage, sep='\t')

  # heck if I know why I have to do this :), but set list names to list values
  names(csvnames) <- csvnames

  taxdata <- extract_taxonomy(csvnames, regex = "^(.*)\\\t(.*)",
                              key = c(id = "obs_info", "class"), class_sep=';')

  taxdata
}

Suggestions welcome on cleaning this up :)

Part 2 is yet to come

OK, I'm officially out of time for the day, so I'll stop here. In part 2, I'll discuss how to create your own directory of genome signatures, show you how to build your own LCA databases, and give a fully repeatable workflow for everything. (If you're super enthusiastic to try the stuff above, you can probably ad lib it from this blog post.)

In the meantime, comments welcome!

--titus

by C. Titus Brown at September 08, 2017 10:00 PM

Continuum Analytics

Securely Connecting Anaconda Enterprise to a Remote Spark Cluster

With the release of Anaconda Enterprise 5, we are introducing a more robust method of connecting JupyterLab, the interactive data science notebook environment, to an Apache Spark cluster.

by Sheyna Webster at September 08, 2017 01:00 PM

September 07, 2017

numfocus

NumFOCUS is Now Hiring: Events Coordinator

NumFOCUS seeks applicants for the position of Events Coordinator. Join our small but mighty team and help support data science communities all over the world! This is a junior-level position with the opportunity to make a big impact — you’ll be working on community management of our global open source scientific computing community! Why Work […]

by Gina Helfrich at September 07, 2017 03:56 PM

September 06, 2017

Pierre de Buyl

Testing the compact Hilbert index

In 2015, I worked on implementing the so-called "compact" Hilbert index. You can read about it here. After some feedback on the code, I decided that I wanted to investigate how good the compact Hilbert index actually is. In this post, I compare my code to Chris Hamilton's original code (using a program by Peter Colberg) and verify the basic promise of the compact Hilbert index (keeping the ordering of the full Hilbert index).

What is already out there

I am aware of the following software for computing the compact Hilbert index:

  1. Chris Hamilton published libhilbert. His academic webpage has been removed but I had a copy of the library. I provide a copy of his "0.2.1" version on my GitHub account. The code is LGPL v2.1 so it can be redistributed.
  2. Peter Colberg has an implementation in his simulation software nano-dimer. This is what motivated me to use the compact Hilbert index in the first place. Peter Colberg has a test against Chris Hamilton's implementation. This code is MIT and implemented in OpenCL C.
  3. I have implemented the code in Python for my own understanding here and in Fortran for my simulation code RMPCDMD.

Are jumps "allowed" in the compact Hilbert curve?

By construction of the original Hilbert curve, consecutive points along the curve are at a distance of 1 from each other.

One question that has come up several times (I can still count them on one hand, though) is about the existence of jumps in the curve. The only promise in the compact Hilbert index paper is that it preserves the ordering of the Hilbert curve HR08. By graphical observation, some non-cubic curves can be seen as a "repeat" of the cubic curve. But this simple situation does not hold for arbitrary shapes.

For "M = [2 2 3]", I show below an example where the dimension of first traversal of the curve is 0. One can see that around the middle of the curve, there is a jump in the z direction that is larger than one. This is reflected in the step size of the lower-right panel.

A compact Hilbert curve

In the next illustration, I have only changed the parameter vd to 1 and now there is no jump anymore.

A compact Hilbert curve

Actual testing

To perform the testing, I compiled the library by Chris Hamilton and used Peter Colberg's hilbert test program, which dumps the coordinates of a Hilbert curve into an HDF5 file.

Then, I dumped my code for the Hilbert curve into a Python file. The code relies on knowing the spatial dimension up front (it is N in the code) and was not well suited to becoming a proper library. The dimensions of the curve are given as a set of integers ("2 2 3" in the example above) and there is the option --vd to specify the dimension of first traversal.

A plain cubic Hilbert curve is generated, and the relative ordering of the compact indices with respect to the original ones is checked (with success, phew) at every run of the program. So this settles the fulfillment of the ordering promise.
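
The check itself is tiny; here is a minimal Python sketch of the idea, where compact_hilbert_index stands in for my implementation (the name is hypothetical) and points is the list of grid points visited in full-Hilbert-curve order.

def check_ordering(points, compact_hilbert_index):
    """Verify the promise of the compact Hilbert index: walking the grid
    points in full-Hilbert-curve order, the compact indices must be
    strictly increasing."""
    indices = [compact_hilbert_index(p) for p in points]
    assert all(i < j for i, j in zip(indices, indices[1:]))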

When the option --compare is given, the program will load the data generated by the hilbert program and compare it against the Python data. So far, all combinations of dimensions did work.

The test program is called test_compact_hilbert.py and is available in the repository that I set up specifically for these tests. To display a compact hilbert curve, simply ask, for instance

python3 test_compact_hilbert.py 1 2 3 --plot --vd 2

To perform preset tests with the code, simply issue the command

make HDF5_HOME=/path/to/hdf5lib

(The variable HDF5_HOME is only necessary on the first invocation when the program hilbert is built.)

Summary

To gain confidence in my code, I have

  1. Made a backup of the original code by the author of the algorithm. That was just in time, as its homepage disappeared.
  2. Made a specific Python test program to test my algorithm against the reference one.
  3. Checked numerically the ordering of the curve.
  4. Adapted my program to also allow an easy graphical inspection of the compact Hilbert curve and pass as an argument the direction of traversal.

Bibliography

  • HR08 Chris H. Hamilton and Andrew Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths, Information Processing Letters 105, 155-163 (2008).

by Pierre de Buyl at September 06, 2017 12:00 PM

September 03, 2017

Titus Brown

Taxonomic examinations of genome bins from Tara Oceans (Honorary Sunday Morning Bioinformatics)

This is a follow-on to Comparing genome sets extracted from metagenomes.

After my first foray into comparing the genome bins extracted from the Tara Oceans data by Tully et al. (2017) and Delmont et al. (2017), I got curious about the taxonomic assignments of the bins and decided to investigate those on this fine Labor Day morning.

In this, I could take advantage of a simple reimplementation of the Kraken least-common-ancestor taxonomic classification algorithm described here that works on MinHash signatures - more on that in another, later blog post.

Briefly, I did the following -

  • used the posted lineages to assign NCBI taxonomic IDs to many of the tully samples and the delmont samples using really ugly scripts.
  • found the least-common-ancestor (LCA) for each hash value across all the MinHash signatures in each data set and stored them in an LCA database;
  • wrote a custom script that loaded in all of the genbank, tully, and delmont LCA assignments together with the NCBI taxonomy and classified each of the tully and delmont genome bins, looking for disagreements.

This last step is the most interesting one for the purposes of this morning's blog post, so let's go into it in a bit more detail.

Building the taxonomy databases for delmont and tully

I used an approach lifted from Kraken to build taxonomy classifiers for delmont and tully.

Kraken first links all k-mers in each genome to a list of taxonomic IDs, based on the taxonomic assignment for that genome. Then, Kraken looks for the least-common-ancestor in the taxonomic tree for each k-mer.

I did this for each k-mer in each MinHash signature from the full genbank data set (100,000 genomes), as well as the ~3000 tully and delmont data sets. (You have to use our super special opensauce if you want to do this on your own MinHash signatures.)
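
For the curious, the least-common-ancestor computation itself is simple; here is a minimal Python sketch that walks up a parent table, where the parents dict (taxid -> parent taxid, with the root mapping to itself) is an assumption standing in for the loaded NCBI taxonomy.

def lca(taxids, parents):
    """Return the least common ancestor of a collection of NCBI taxids."""
    def path_to_root(taxid):
        path = [taxid]
        while parents[taxid] != taxid:
            taxid = parents[taxid]
            path.append(taxid)
        return path

    taxids = list(taxids)
    other_ancestors = [set(path_to_root(t)) for t in taxids[1:]]

    # walk from the first taxid up towards the root; the first node that
    # appears in every other path is the least common ancestor
    for node in path_to_root(taxids[0]):
        if all(node in ancestors for ancestors in other_ancestors):
            return node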

These databases are available for download, below.

Computing taxonomic assignments for each genome bin

Now I wanted to get an estimate of taxonomic IDs for each genome bin.

So, I wrote a script that loaded in all three LCA databases (genbank, tully, delmont) and, for each genome bin, extracted all of the taxonomic IDs for every hash value in the MinHash signature. And, for each taxonomic ID, it computed the LCA across all three databases, obviating the need to build a single gigantic database of genbank+tully+delmont.

This can all be boiled down to "this script computed the most specific taxonomic ID for a randomly chosen subset of k-mers in each bin."

Finding disagreements in taxonomy in each bin

Next, I wanted to figure out where (if anywhere) there were disagreements in taxonomy for each bin.

So, in the script above, I took all taxonomic IDs for which there were five or more counts (i.e. five or more hash values with that specific assignment - randomly chosen but IMO conservative), and then found the lowest taxonomic rank at which there was a disagreement across all thresholded taxonomic IDs.
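
Concretely, the "lowest rank with a disagreement" step looks something like the sketch below, where each lineage is represented as a dict mapping rank names to taxon names; this is an illustration, not the actual script.

RANKS = ['superkingdom', 'phylum', 'class', 'order',
         'family', 'genus', 'species']

def lowest_disagreement(lineages):
    """Return the first rank, scanning from superkingdom down to species,
    at which the given lineages disagree, or None if they never do."""
    for rank in RANKS:
        names = set(lin[rank] for lin in lineages if rank in lin)
        if len(names) > 1:
            return rank
    return None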

And then I summarized all of this :)

Some output!

The full output can be seen for delmont and for tully.

For each genome bin, we either get no output or a line like this:

TOBG_MED-638 has multiple LCA at rank 'phylum': Proteobacteria, Bacteroidetes

and then we get summary information like so:

for tully-genome-sigs found 2631 signatures;
out of 2631 that could be classified, 73 disagree at some rank.
    superkingdom: 1
    phylum: 11
    order: 4
    class: 8
    family: 10
    genus: 10
    species: 29

and

for delmont-genome-sigs found 957 signatures;
out of 957 that could be classified, 4 disagree at some rank.
    class: 3
    species: 1

(Note, for the 50 E. coli signatures in the sourmash tutorial, we get no disagreements. Good!)

Checking the results

I took a closer look at a few of the disagreements.

For TOBG_ARS-21 from tully, which was classified as both Archaea and Bacteria, the hash value taxids seem to be ~evenly split between:

cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Flavobacteriia;Flavobacteriales
cellular organisms;Archaea

Interestingly, TOBG_ARS-21 has no assignment in the associated spreadsheet, as far as I can tell. Instead, the Flavobacterium assignment comes from the delmont data set, where a few overlapping genome bins (TARA_ANW_MAG_00073 and TARA_ANW_MAG_00082 ) are assigned to Flavobacterium. And the archaeal assignment comes from the tully data set, but not from TOBG_ARS-21 - rather, there is a ~10% overlap with another tully genome bin, TOBG_RS-366, which is assigned to Archaea (with no more refined assignment).

Verdict: genuinely confusing. My script works!


For TOBG_IN-33, we had a disagreement at phylum: Chlorophyta and Euryarchaeota. Here the hash value taxids are split between Euryarchaeota and genus Ostreococcus.

cellular organisms;Archaea;Euryarchaeota
cellular organisms;Eukaryota;Viridiplantae;Chlorophyta;prasinophytes;Mamiellophyceae;Mamiellales;Bathycoccaceae;Ostreococcus

Again, TOBG_IN-33 has no direct taxonomic assignment in tully. The Euryarchaeota assignment comes from a match to TARA_IOS_MAG_00045 in the delmont data set, as well as matches to the tully genome bins TOBG_EAC-675 and TOBG_RS-356 (both Euryarchaeota). So where does the assignment to Eukaryota come from?

If you dig a bit deeper, it turns out that there is ~5.8 - 7.6% similarity between TOBG_IN-33 (from tully) and two delmont genome bins, TARA_PSW_MAG_00136 and TARA_ANE_MAG_00093, both of which are assigned to Eukaryota;Chlorophyta;Mamiellophyceae;Mamiellales;Bathycoccaceae;Ostreococcus. Maybe TOBG_IN-33 is a chimeric bin?

Anyway, the verdict: genuinely confusing. My script works (again :)!

Concluding thoughts

As with most bioinformatics analyses, this gives us some directions to pursue, rather than (m)any hard answers :).

But, with the caveat that this is an analysis that I did in just about 3 hours this morning, so the code is probably broken in a myriad of ways:

The vast majority of both the tully and delmont genomes are taxonomically homogeneous under this analysis. In many cases this may be due to limited taxonomic assignment, and it also definitely reflects a lack of information about these genomes!

It seems like the delmont bins have been curated more than the tully bins, with the end effect of removing taxonomically confusing information. Whether this is good or bad will have to await a broader and deeper investigation of the genome bins - right now there are clearly some confounding issues within and between the data sets.

Appendix: Running everything yourself

(This should take about 10-15 minutes to run on a laptop with ~10 GB of disk space and ~8 GB of RAM. This includes the downloads.)

Along with a sourmash install, you'll need two github repos: https://github.com/ctb/2017-tara-binning and https://github.com/ctb/2017-sourmash-lca.

git clone https://github.com/ctb/2017-sourmash-lca
git clone https://github.com/ctb/2017-tara-binning

Then set PYTHONPATH and change into the 2017-tara-binning directory:

export PYTHONPATH=$(pwd)/2017-sourmash-lca:$PYTHONPATH
cd 2017-tara-binning

Download the LCA databases:

curl -L https://osf.io/3pc7j/download -o tully-lca-2017.09.04.tar.gz 
curl -L https://osf.io/bw2dj/download -o delmont-lca-2017.09.04.tar.gz

curl -L https://osf.io/zfmbd/download?version=1 -o genbank-lca-2017.08.26.tar.gz

Unpack them:

tar xzf delmont-lca-2017.09.04.tar.gz
tar xzf tully-lca-2017.09.04.tar.gz
cd db
tar xzf ../genbank-lca-2017.08.26.tar.gz
cd ../

Now, grab the signature collections and unpack them, too:

curl -L https://osf.io/bh2jr/download -o delmont-genome-sigs.tar.gz
curl -L https://osf.io/28r6m/download -o tully-genome-sigs.tar.gz
tar xzf delmont-genome-sigs.tar.gz 
tar xzf tully-genome-sigs.tar.gz 

And we are ready!

Now, run:

./examine-taxonomy.py delmont-genome-sigs -k 31 > delmont.taxinfo.txt
./examine-taxonomy.py tully-genome-sigs -k 31 > tully.taxinfo.txt

and you should see the same results as in the github links above.

by C. Titus Brown at September 03, 2017 10:00 PM

September 02, 2017

Titus Brown

Comparing genome sets extracted from metagenomes (Sunday Morning Bioinformatics).

(Because I'm a nerd who loves my research, I frequently do short-form bioinformatics analyses on weekends. I may start posting more of them. But an important note is that these will necessarily be incomplete and speculative because I put a hard limit on the time spent doing them - e.g. the below excursion took me about 2.5 hours total, starting from having the signatures calculated. Most of the 2 hours was before the kids woke up :)


I’ve been watching the metagenome-assembled genomics community for a while with some envy. The lure of extracting single genomes from the messy complex mixture of stuff in a metagenome is seductive! But the environment with which I have the most experience, agricultural soil, doesn’t lend itself to this process (because coverage is almost always too low, and complexity is too high), so I've been relegated to the sidelines.

I've also been a bit wary of binning. First, almost all binning approaches operate post-assembly, and I generally think all metagenome assemblies are wrong (although of course some are useful!). Second, many binning approaches involve examining the clusters and applying cutoffs chosen by eyeball, which introduces a degree of variability that concerns me. And third, we just posted a paper suggesting that strain variation (which is almost certainly present to a massive degree in any sufficiently deep sequencing data) really gets in the way of metagenome assembly. This suggests that most metagenome assemblies are going to be somewhat incomplete.

But I'm always happy to take advantage of other people's work and so I've been on the lookout for good binned data sets to play with.

Conveniently, two different groups recently applied their binning approaches to the Tara ocean data set (Sunagawa et al., 2015), and I thought I'd take the opportunity to compare 'em. While a comparison won't necessarily say anything about the completeness of the binning, it might help characterize variability in the approaches.

The two approaches were posted to bioRxiv as Tully et al., 2017 and Delmont et al., 2017. Below, I will refer to them as 'Tully' and 'Delmont'.

Tara Oceans: the bacterial fraction

(These stats are mostly taken from Tully et al., 2017)

Tara Oceans generated metagenomic sequencing data from 234 samples, size-filtered to enrich for "prokaryotic" (bacterial + archaeal) organisms. This ended up being a ~terabase data set containing 102 billion paired-end reads.

The Tully data

Tully et al. produced 2631 genomes, and estimated that 1491 were more than 70% complete and 603 were more than 90% complete. (Wow!)

The process used was (roughly) as follows --

  1. Assemble each of 234 samples with MEGAHIT, yielding 562m contigs.
  2. After length filtering the contigs at 2kb, use CD-HIT to eliminate contigs that are more than 99% similar.
  3. Co-assemble contigs from each of the 61 stations (geographical locations) with MINIMUS2, yielding 7.2m contigs.
  4. Apply BinSanity to cluster these contigs into genome bins, and use CheckM to evaluate completeness.

The Delmont data

Delmont et al. produced 957 genomes from 93 samples (presumably a subset of the 234 samples above?), using this very well documented approach - briefly,

  1. Co-assemble samples from 12 geographical regions with MEGAHIT.
  2. Bin results with CONCOCT.
  3. Manually refine the CONCOCT results using anvi'o.
  4. Extract bins with > 70% completion or 2 Mbp of contigs into 1077 genome bins.
  5. Collapse redundant genomes from across the different regions into 957 bins based on average nucleotide identity (> 98% ANI).

Some initial thoughts

Quite different approaches were taken here: in particular, Tully et al. assembled each of the 234 samples individually, then co-assembled the resulting contigs for each of the 61 stations; while Delmont et al. divided the 93 samples into 12 geographical regions, and co-assembled those 12.

Both studies used MEGAHIT to assemble the primary contigs. It would be interesting to compare these initial assemblies directly, as this would place bounds on the similarity of the eventual binning outcomes.

Getting to the data

I downloaded both full sets of binned genomes, and then ran the following sourmash command to compute sourmash signatures from the data:

sourmash compute -k 21,31,51 \
                         --scaled 2000 \
                         --track-abundance

This took about an hour for the Tully collection, and less for the Delmont collection.

You can download the data sets from the Open Science Framework if you'd like to follow along --

curl -L https://osf.io/vngdz/download -o delmont-genome-sigs.tar.gz
curl -L https://osf.io/28r6m/download -o tully-genome-sigs.tar.gz

The first file is about 30 MB, the second file is about 90 MB.

I then unpacked the two collections on my laptop,

tar xzf delmont-genome-sigs.tar.gz 
tar xzf tully-genome-sigs.tar.gz 

and built search trees (SBTs) for each of them at k=31 --

sourmash index -k 31 delmont31 --traverse-directory delmont-genome-sigs
sourmash index -k 31 tully31 --traverse-directory tully-genome-sigs

Comparing binned genomes directly

I wrote a custom script that computed matches between genomes in one directory vs genomes in the other directory.

Here, 'similarity' is Jaccard similarity, 'identity' is more than 80% Jaccard similarity, and 'containment' is more than 95% of a genome's signature included in the other.
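
In terms of hash sets these are just set operations; here is a minimal sketch (not the actual compare-two-directories.py script), where a and b are the sets of hash values from two signatures:

def jaccard(a, b):
    "Jaccard similarity between two sets of MinHash hash values."
    return len(a & b) / len(a | b)

def containment(a, b):
    "Fraction of a's hash values that also appear in b."
    return len(a & b) / len(a)

def classify_pair(a, b, min_identity=0.8, min_containment=0.95):
    "Apply the thresholds described above to a pair of signatures."
    return {'identical': jaccard(a, b) >= min_identity,
            'contained': containment(a, b) >= min_containment}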

% ./compare-two-directories.py delmont-genome-sigs delmont31.sbt.json tully-genome-sigs tully31.sbt.json 

After lots of detailed output, you get the following summary:

thresholds:
min Jaccard similarity for any match: 0.05
to score as identical, similarity must be >= 0.8
to score as contained, containment must be >= 0.95
----
delmont-genome-sigs vs tully-genome-sigs: 957 signatures
identical count: 22
containment count: 66
matches: 591
no match: 366
no contain: 891
----
tully-genome-sigs vs delmont-genome-sigs: 2631 signatures
identical count: 27
containment count: 54
matches: 850
no match: 1781
no contain: 2577

Some initial thoughts here:

  • the tully genomes probably have more "redundancy" in this context because of the way Delmont et al. removed redundancy.
  • the data sets are strikingly different!

Getting overall stats for overlap/disjointness

After all this, I got to wondering about the overall similarity of the genomes. So, I wrote another custom script, compare-two-directories-hashvals.py, to compare the k-mer composition of tully and delmont directly. (For those of you familiar with MinHash, no, you cannot do this normally based on MinHash signatures; but we're using super-special opensauce in sourmash that lets us do this. Trust us.)

% ./compare-two-directories-hashvals.py delmont-genome-sigs tully-genome-sigs 
loading all signatures: delmont-genome-sigs
loading from cache: delmont-genome-sigs.pickle
...loaded 957 signatures at k=31
loading all signatures: tully-genome-sigs
loading from cache: tully-genome-sigs.pickle
...loaded 2631 signatures at k=31
total hashvals: 2837161
both: 17.2%
unique to delmont-genome-sigs: 22.1%
unique to tully-genome-sigs: 60.7%

Here we estimate that the binned genomes, in aggregate, share about 17.2% of the union, while Delmont et al. has about 22.1% of the unique content, and Tully has about 60.7% of the unique content.

This is probably largely due to the extra data that was taken into account by Tully et al. (234 samples in tully vs 93 samples in delmont.)

It is reassuring that some genomes (~20) are identical between the data sets, albeit at a low threshold of 80%; and that some genomes from one are mostly contained in bins from the other (~50-60 in each direction).

Concluding thoughts

It is not surprising that two different binning procedures run on different subsamples produce different results :).

I must admit that it is not immediately obvious to me where I go from here. What stats or distributions should I be looking at? How do people compare the output of different binning procedures when they're working on the same samples? Thoughts welcome!

If you want to play with the data yourself, you should be able to repeat my entire analysis on your own laptop in ~30 minutes, starting from a basic sourmash install and using the commands above. You'll need to clone the repo, of course.

--titus

by C. Titus Brown at September 02, 2017 10:00 PM

Matthieu Brucher

Book review: Fail U.: The False Promise of Higher Education

American universities have quite a reputation, in all kinds of ways, and the amount of student debt is something I also find baffling. So a book on the failure of US universities was obviously of interest to me.

Discussion

The book is divided into 6 different parts, and lots of them resonate with my experience of French and British universities. So it is scary.

Let's start at the beginning. The first part is just a reminder that the situation was already bad several decades ago. The author tends to delve into "I told you so" and "If only they had listened to me", but it is true that lots of people have said that the situation was deteriorating. And it is happening now in France and the United Kingdom as well, and I think the solution they are trying is the one that is failing in the US, so it will be interesting to see if there are other options. Let's move on to the other parts.

The second one addresses universities' duty to teach students. The interesting part is that researchers don't like teaching. And those who would like to are not encouraged to do so. It's so rooted in our culture that even in The Big Bang Theory, none of the protagonists actually teaches, and when they are forced to, it is an opportunity for jokes. This is an issue. It shouldn't happen. And then you hear that the really big deal in US universities is the bachelor's degree, not the master's degree (the PhD thesis may still be alright). But if the teachers are not the best in their field, how are these degrees still recognized as valuable? Well, they soon won't be. Managers are starting to notice that their new hires are not that effective anymore. It shows.

The third part tackles the expenditures of these universities. I didn't know that they didn't really balance their books. I did read in Weapons of Math Destruction that tuition was rising and that universities were spending lots of money to attract new students. This part confirms it. And I can't understand how big universities don't think about balancing books and debts. At some point, when your debt gets too big, you have a problem, and for people who are supposed to be knowledgeable, this is just baffling. And let's be clear, it is starting to happen here as well. When will people understand that this is not sustainable?

The fourth part deals with several scandals in the universities. The first one is the failure of research. This is well known: researchers are ranked according to their publications, and it doesn't work because you can publish more or less anything, even if it doesn't make sense. There are some efforts to change this with public reviews, and I'm not even talking about the research publication scams. And there are also the sexual scandals that keep coming up all the time. There shouldn't be any scandal like that, period.

OK, let's move to a part that just baffled me at first. And then I saw that it is now everywhere in our society. Universities are supposed to be a place where you can exchange and confront opinions. It seems that this is no longer the case. And it's true for life after university as well. Can you tell someone that you don't agree with their position on, say, homosexual marriage? No, there is no debate. But people have the right to have a different opinion. You may not impose being for or against on someone else, but you can discuss it. Today, though, if you are against, you are an extremist. Let's go further: if you say that you don't think democracy is a system that works, you may be in big trouble. And in some way, that may be why Trump won. Faced with the intolerance of some people, those who didn't agree with the majority may have seen an opportunity. And now, we have two groups of intolerant people fighting each other.

Final part, and some opportunities for improvement. As I've said, students are heavily in debt (and with people like Dame Glynis Breakwell, universities have to raise their tuition fees just so that some people full of themselves can be paid far too much compared to what they bring to the table), and it's going to be a big problem in the future. And it's not by injecting more money or lowering the bar that the situation will get better. Then there is a chapter on MOOCs, which I have also talked about before. I don't agree with the book's conclusion: MOOCs are targeted towards professionals, and we still rely on degrees. Perhaps this will change in the future as the bachelor's degree loses its worth.

Conclusion

There are many more examples in the book, but the tendency is worrying. We seem to be living in a society that is less and less tolerant, claiming to tolerate everything except intolerant people (and becoming intolerant in the process).

Universities around the world are generally in a bad position, and it is getting worse. More people "have" to graduate, and in the end you don't solve the problem, you diminish the quality of the degree. Not everyone can graduate. The sooner we accept that, the sooner we can cure our universities.

by Matt at September 02, 2017 09:19 PM

August 31, 2017

Titus Brown

Is your bioinformatics analysis package Big Data ready?

(Apologies for the Big Data buzzword, but it actually fits!)

I find it increasingly frustrating to use a broad swath of bioinformatics tools (and given how inherently frustrating the bioinformatics ecosystem is already, that's saying something!) My bugaboo this time is bioinformatics "prepared databases" - but not multiple installs of databases or their size. Rather, I'm frustrated by the cost of building them. Similarly, I'm frustrated at how many analysis pipelines require you to start from the very beginning if you add data to your experiment or sample design.

We are increasingly living in a world of what I call "infinite data", where, for many purposes, getting the data is probably the least of your problems - or, at least, the time and cost of generating sequence data is dwarfed by sample acquisition, data analysis, hypothesis generation/testing, and paper writing.

What this means for databases is that no sooner does someone publish a massive amount of data from a global ocean survey (Tara Oceans, Sunagawa et al. 2015) than someone else analyzes the data in a new and interesting way (Tully et al., 2017) and produces data sets that I, at least, really want to include in my own reference data sets for environmental metagenome data. (See Hug et al., 2016 and Stewart et al., 2017 for even more such data sets.)

Unfortunately, most bioinformatics tools don't let me easily update my databases or analyses with information from new references or new data. I'm going to pick on taxonomic approaches first because they're on my mind, but this applies to any number of software packages commonly used in bioinformatics.

Let's take Kraken, an awesome piece of software that founded a whole class of approaches. The Kraken database and toolset is constructed in such a way that I cannot easily do two specific things:

  • add just one reference genome to it;
  • ask Kraken to search two different reference databases and report integrated results;

Similarly, none of the assemblers I commonly use lets me add just one more sequencing data set into the assembly - I have to recalculate the entire assembly from scratch after adding the additional data set. Annotation approaches don't let me update the database with new sequences and then run an abbreviated analysis that improves annotations. And so on.

What's the problem?

There are two classes of problems underneath this - one is the pragmatic engineering issue. For example, with Kraken, it is theoretically straightforward, but tricky from an engineering perspective, to support the addition of a single genome into the database; "all" you need to do is update the least-common-ancestor assignments in the database for each k-mer in your new genome. Likewise, the Kraken approach should be able to support search across databases, where you give it two different databases and it combines them in the analysis - but it doesn't seem to. And, last but not least, Kraken could in theory support taking an existing classified set of reads and "updating" their classification from a new database, although in practice there's not much point in doing this because Kraken's already pretty fast.
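
To make that concrete, here is a minimal sketch of the "add one genome" update described above - emphatically not Kraken's code or on-disk format, just the idea, with the database as a plain Python dict mapping k-mers to taxonomy IDs and the taxonomy as a toy child-to-parent table:

def lineage(taxid, parents):
    # walk from a taxid up to the root; 'parents' maps child -> parent
    # and omits the root
    path = [taxid]
    while taxid in parents:
        taxid = parents[taxid]
        path.append(taxid)
    return path

def lca(tax1, tax2, parents):
    # least common ancestor: the first taxid in tax2's lineage that also
    # appears in tax1's lineage
    ancestors = set(lineage(tax1, parents))
    for taxid in lineage(tax2, parents):
        if taxid in ancestors:
            return taxid
    return 'root'

def add_genome(kmer_db, genome, taxid, parents, k=31):
    # update the LCA assignment of every k-mer in the new genome, in place
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if kmer in kmer_db:
            kmer_db[kmer] = lca(kmer_db[kmer], taxid, parents)
        else:
            kmer_db[kmer] = taxid

The engineering problem, of course, is doing exactly this against the real, large on-disk database rather than an in-memory dict.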

The second kind of challenge is theoretical. We don't really have the right algorithmic approaches to assembly to support fully online assembly in all its glory, although you could imagine doing things like error trimming in an incremental way (I mean, just imagine...) and saving that for later re-use. But we generally don't save the full information needed to add a single sequence and then quickly output an updated assembly.

And that last statement highlights an important question - for which approaches do we have the theory and the engineering to quickly update an analysis with a new reference database, or a new read data set? I can guess but I don't know.

Note, I'm explicitly not interested in heuristic approaches where you e.g. map the reads to the assembly to see if there's new information - I'm talking about using algorithms and data structures and software where you get identical results from doing analysis(n + m) and analysis(n) + analysis(m).
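
K-mer counting is the trivially mergeable case where that identity holds exactly; here's a small illustration in plain Python (assembly and classification are the hard cases):

from collections import Counter

def count_kmers(reads, k=21):
    # count every k-mer across a collection of reads
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

batch1 = ["ACGTACGTACGTACGTACGTAC"]
batch2 = ["GTACGTACGTACGTACGTACGT"]

# the property in question: analysis(n + m) == analysis(n) + analysis(m)
assert count_kmers(batch1 + batch2) == count_kmers(batch1) + count_kmers(batch2)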

Even if we do have this theoretical ability, it's not clear that many tools support it, or at least they don't seem to advertise that support. This also seems like an area where additional theoretical development (new data structures! better algorithms!) would be useful.

I'm curious if I'm missing something; have I just not noticed that this is straightforward with many tools? It seems so useful...

Anyway, I'm going to make a conscious effort to support this in our tools (which we sort of already were planning, but now can be more explicit about).

--titus

p.s. This post partly inspired by Meren, who asked for this with sourmash databases - thanks!

by C. Titus Brown at August 31, 2017 10:00 PM

Continuum Analytics

Introducing Anaconda Enterprise 5

Anaconda Enterprise 5 is built for your entire organization and serves the needs of Data Scientists, IT Administrators, Business Analysts & Managers, Developers and Machine Learning Engineers.

by Team Anaconda at August 31, 2017 01:57 PM

Anaconda Enterprise 5 Introduces Secure Collaboration to Amplify the Impact of Enterprise Data Scientists

Anaconda, the Python data science leader, today introduced Anaconda Enterprise 5 software to help organizations respond to customers and stakeholders faster, deliver strategic insight for rapid decision-making and take advantage of cutting edge machine learning.

by Team Anaconda at August 31, 2017 01:00 PM

August 30, 2017

Continuum Analytics

Anaconda: A Coming of Age Story

No one tells our story better than Continuum co-founders, Travis Oliphant and Peter Wang. Read on for their thoughts and feelings on what’s next for Anaconda.

by Sheyna Webster at August 30, 2017 01:00 PM

Titus Brown

A post about k-mers - this time for taxonomy!

I love k-mers. Unlike software engineering or metagenomics, this isn't a love-hate relationship - this is a pure "gosh aren't they wonderful" kind of love and admiration. They "just work" in so many situations that it helps validate the last 10 years of my research life (which, let's be honest, has been pretty k-mer focused!).

DNA k-mers underlie much of our assembly work, and we (along with many others!) have spent a lot of time thinking about how to store k-mer graphs efficiently, discard redundant data, and count them efficiently.

More recently, we've been enthused about using k-mer based similarity measures and computing and searching k-mer-based sketch search databases for all the things.

But I haven't spent much time talking about using k-mers for taxonomy, although that has become an ahem area of interest recently, if you read into our papers a bit.

In this blog post I'm going to fix this by doing a little bit of a literature review and waxing enthusiastic about other people's work. Then in a future blog post I'll talk about how we're building off of this work in fun! and interesting? ways!

A brief introduction to k-mers

(Borrowed from the ANGUS 2017 sourmash tutorial!)

K-mers are a fairly simple concept that turn out to be tremendously powerful.

A "k-mer" is a word of DNA that is k long:

ATTG - a 4-mer
ATGGAC - a 6-mer

Typically we extract k-mers from genomic assemblies or read data sets by running a k-length window across all of the reads and sequences -- e.g. given a sequence of length 16, you could extract 11 k-mers of length six from it like so:

AGGATGAGACAGATAG

becomes the following set of 6-mers:

AGGATG
 GGATGA
  GATGAG
   ATGAGA
    TGAGAC
     GAGACA
      AGACAG
       GACAGA
        ACAGAT
         CAGATA
          AGATAG
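
The sliding window above is only a couple of lines of Python - a throwaway sketch, not any particular package's API:

def kmers(seq, k):
    # slide a window of width k across seq and collect every k-mer
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("AGGATGAGACAGATAG", 6))
# ['AGGATG', 'GGATGA', 'GATGAG', 'ATGAGA', 'TGAGAC', 'GAGACA',
#  'AGACAG', 'GACAGA', 'ACAGAT', 'CAGATA', 'AGATAG']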

k-mers are most useful when they're long, because then they're specific. That is, if you have a 31-mer taken from a human genome, it's pretty unlikely that another genome has that exact 31-mer in it. (You can calculate the probability if you assume genomes are random: there are 4^31 possible 31-mers, and 4^31 = 4,611,686,018,427,387,904. So, you know, a lot.)

The important concept here is that long k-mers are species specific. We'll go into a bit more detail later.

K-mers and assembly graphs

We've already run into k-mers before, as it turns out - when we were doing genome assembly. One of the three major ways that genome assembly works is by taking reads, breaking them into k-mers, and then "walking" from one k-mer to the next to bridge between reads. To see how this works, let's take the 16-base sequence above, and add another overlapping sequence:

AGGATGAGACAGATAG
    TGAGACAGATAGGATTGC

One way to assemble these together is to break them down into k-mers -- e.g the sequences above becomes the following set of 6-mers:

AGGATG
 GGATGA
  GATGAG
   ATGAGA
    TGAGAC
     GAGACA
      AGACAG
       GACAGA
        ACAGAT
         CAGATA
          AGATAG -> off the end of the first sequence
           GATAGG <- beginning of the second sequence
            ATAGGA
             TAGGAT
              AGGATT
               GGATTG
                GATTGC

and if you walk from one 6-mer to the next based on 5-mer overlap, you get the assembled sequence:

AGGATGAGACAGATAGGATTGC

voila, an assembly!
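
You can mimic that walk in a few lines of Python - a toy greedy walk over the two sequences above, not how MEGAHIT or any real assembler works:

def kmers(seq, k=6):
    # all k-length substrings of seq
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

reads = ["AGGATGAGACAGATAG", "TGAGACAGATAGGATTGC"]
k = 6

unused = set()                  # each k-mer gets used at most once
for read in reads:
    unused.update(kmers(read, k))

contig = reads[0][:k]           # start from the first read's first k-mer
unused.discard(contig)          # (a real assembler picks starting points
                                # from the graph structure)
while True:
    suffix = contig[-(k - 1):]
    nxt = [km for km in unused if km.startswith(suffix)]
    if len(nxt) != 1:           # stop at a dead end or an ambiguous branch
        break
    contig += nxt[0][-1]
    unused.discard(nxt[0])

print(contig)                   # AGGATGAGACAGATAGGATTGC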

Graphs of many k-mers together are called De Bruijn graphs, and assemblers like MEGAHIT and SOAPdenovo are De Bruijn graph assemblers - they use k-mers underneath.

Why k-mers, though? Why not just work with the full read sequences?

Computers love k-mers because there's no ambiguity in matching them. You either have an exact match, or you don't. And computers love that sort of thing!

Basically, it's really easy for a computer to tell if two reads share a k-mer, and it's pretty easy for a computer to store all the k-mers that it sees in a pile of reads or in a genome.

So! On to some papers!

MetaPalette

About a year and a half ago, I had the pleasure of reviewing the MetaPalette paper by David Koslicki and Daniel Falush. This paper did some really cool things and I'm going to focus on two of the least cool things (the cooler things are deeper and worth exploring, but mathematical enough that I haven't fully understood them).

The first cool thing: the authors show that k-mer similarity between genomes at different k approximates various degrees of taxonomic similarity. Roughly, high similarity at k=21 defines genus-level similarity, k=31 defines species-level, and k=51 defines strain-level. (This is an ad-lib from what is much more precisely and carefully stated within the paper.) This relationship is something that we already sorta knew from Kraken and similar k-mer-based taxonomic efforts, but the MetaPalette paper was the first paper I remember reading where the specifics really stuck with me.

Figure 2 from Koslicki and Falush, 2016

This figure, taken from the MetaPalette paper, shows clearly how overall k-mer similarity between pairs of genomes matches the overall structure of genera.


The second cool thing: the authors show that you can look at what I think of as the "k-mer size response curve" to infer the presence of strain/species variants of known species in a metagenomic data set. Briefly, in a metagenomic data set you would expect to see stronger similarity to a known genome at k=21 than at k=31, and k=31 than at k=51, if a variant of that known species is present. (You'd see more or less even similarity across the k-mer sizes if the exact genome was present.)

k-mer response curve

In this figure (generated by code from our metagenome assembly preprint), you can see that this specific Geobacter genome's inclusion in the metagenome stays the same as we increase the k-mer size, while Burkholderia's decreases. This suggests that we know precisely which Geobacter is present, while only a strain variant of that Burkholderia is present.
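
A crude way to reproduce that kind of response curve - a sketch with simulated sequences, where a real workflow would use sourmash or similar on actual reads - is to measure the containment of a genome's k-mers in a metagenome at several k sizes:

import random

def kmer_set(seq, k):
    # all k-mers in a sequence, as a set
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def containment(genome, metagenome, k):
    # fraction of the genome's k-mers that also appear in the metagenome
    gk = kmer_set(genome, k)
    return len(gk & kmer_set(metagenome, k)) / len(gk)

random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(2000))

# simulate a "strain variant": the genome with a substitution every 200 bases,
# buried in some unrelated sequence
variant = list(genome)
for pos in range(0, len(variant), 200):
    variant[pos] = "A" if variant[pos] != "A" else "C"
metagenome = "".join(variant) + "".join(random.choice("ACGT") for _ in range(2000))

for k in (21, 31, 51):
    print(k, round(containment(genome, metagenome, k), 3))
# containment drops as k grows, because each substitution disrupts more long
# k-mers than short ones - the signature of a strain variant being present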

I'm not sure whether this idea has been so clearly explicated before MetaPalette - it may not be novel because it sort of falls out of De Bruijn/assembly graph concepts, for example - but again it was really nicely explained in the paper, and the paper did an excellent job of then using these two observations to build a strain-aware taxonomic classification approach for metagenomic data sets.

The two major drawbacks of MetaPalette were that the implementation (though competitive with other implementations) was quite heavyweight and couldn't run on e.g. my laptop; and the math was complicated enough that reimplementing it myself was not straightforward.

MinHash and mash

(This next paper is less immediately relevant to taxonomy, but we'll get there, promise!)

At around the same time as I reviewed the MetaPalette paper, I also reviewed the mash paper by Ondov et al., from Adam Phillipy's lab. (Actually, what really happened was two people from my lab reviewed it and I blessed their review & submitted it, and then when the revisions came through I read the paper more thoroughly & got into the details to the point where I reimplemented it for fun. My favorite kind of review!)

The mash paper showed how a simple technique for fast comparison of sets, called MinHash - published in 1998 by Broder - could be applied to sets of k-mers. The specific uses were things like comparing two genomes, or two metagenomes. The mash authors also linked the Jaccard set similarity measure to the better known (in genomics) measure, the Average Nucleotide Identity between two samples.

The core idea of MinHash is that you subsample any set of k-mers in the same way, by first establishing a common permutation of k-mer space (using a hash function) and then subselecting from that permutation in a systematic way, e.g. by taking the lowest N hashes. This lets you prepare a subsample from a collection of k-mers once, and then forever more use that same subsampled collection of k-mers to compare with other subsampled collections. (In sourmash, we call these subsampled sets of k-mers “signatures”.)
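
Here's the trick in a few lines of Python. This is only a sketch of the idea - sourmash and mash use MurmurHash and considerably more careful bookkeeping - but "hash every k-mer the same way, keep the N smallest" is the whole of it:

import hashlib

def hash_kmer(kmer):
    # a fixed hash function acts as a shared permutation of k-mer space
    return int(hashlib.sha1(kmer.encode()).hexdigest(), 16)

def signature(kmers, num_hashes=500):
    # keep only the num_hashes smallest hash values (a "bottom sketch")
    return set(sorted(hash_kmer(km) for km in set(kmers))[:num_hashes])

def estimate_jaccard(sig1, sig2, num_hashes=500):
    # look at the smallest num_hashes hashes of the union and ask what
    # fraction of them occur in both signatures
    union_bottom = sorted(sig1 | sig2)[:num_hashes]
    shared = sum(1 for h in union_bottom if h in sig1 and h in sig2)
    return shared / len(union_bottom)

# made-up toy k-mer sets, just to show the call pattern
genome_a = {"ATTGAC", "TTGACA", "TGACAG", "GACAGT"}
genome_b = {"ATTGAC", "TTGACA", "TGACAG", "GACAGA"}
print(estimate_jaccard(signature(genome_a), signature(genome_b)))  # 0.6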

What really blew my mind was how fast and lightweight MinHash was, and how broadly applicable it was to various problems we were working on. The mash paper started what is now an 18 month odyssey that has led in some pretty interesting directions. (Read here and here for some of them.)

The one big drawback of mash/MinHash for me stems from its strength: MinHash has limited resolution when comparing two sets of very different sizes, e.g. a genome and a metagenome. The standard MinHash approach assumes that the sets to be compared are of similar sizes, and that you are calculating Jaccard similarity across the entire set; in return for these assumptions, you get a fixed-size sketch with guarantees on memory usage and execution time for both preparation and comparisons. However, these assumptions prevent something I want to do very much, which is to compare complex environmental metagenomes with potential constituent genomes.

A blast from the recent past: Kraken

The third paper I wanted to talk about is Kraken, a metagenome taxonomic classifier by Wood and Salzberg (2014) - see manual, and paper. This is actually a somewhat older paper, and the reason I dug back into the literature is that there has been somewhat of a renaissance in k-mer classification systems as the CS crowd has discovered that storing and searching large collections of k-mers is an interesting challenge. More specifically, I've recently reviewed several papers that implement variations on the Kraken approach (most recently MetaOthello, by Liu et al (2017)). My lab has also been getting more and more office hour requests from people who want to use Kraken and Kaiju for metagenomic classification.

Briefly, Kraken uses k-mers to apply the least-common-ancestor approach to taxonomic classification. The Kraken database is built by linking every known k-mer to a taxonomic ID. In cases where a k-mer belongs to only a single known genome, the assignment is easy: that k-mer gets the taxonomic ID of its parent genome. In the fairly common case that a k-mer belongs to multiple genomes, you assign the taxonomic ID of the least-common-ancestor taxonomic level to that k-mer.

Then, once you have built the database, you classify sequences by looking at the taxonomic IDs of their constituent k-mers. Conveniently this step is blazingly fast!
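
In toy form, classification is just a tally over a k-mer-to-taxid table (glossing over the fact that Kraken proper scores root-to-leaf paths in the taxonomy tree rather than taking a simple majority):

from collections import Counter

def classify_read(read, kmer_to_taxid, k=31):
    # look up every k-mer in the read and tally the taxonomy IDs that come back
    hits = Counter()
    for i in range(len(read) - k + 1):
        taxid = kmer_to_taxid.get(read[i:i + k])
        if taxid is not None:
            hits[taxid] += 1
    if not hits:
        return None             # nothing in the database matched this read
    return hits.most_common(1)[0][0]

# made-up miniature database, with k=5 so the example fits on a line
db = {"AGGAT": "Escherichia coli", "GGATG": "Escherichia coli",
      "GATGA": "Enterobacteriaceae"}
print(classify_read("AGGATGA", db, k=5))   # Escherichia coli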

Figure from Wood and Salzberg, 2014

This figure (from Wood and Salzberg, 2014) shows the process that Kraken uses to do taxonomic identification. It's fast because (once you've computed the k-mer-to-LCA mapping) k-mer lookups are super fast, albeit rather memory intensive.

What you end up with (when using Kraken) is a taxonomic ID for each read, which can then be post-processed with something like Bracken (Lu et al., 2017) to give you an estimated species abundance.

The drawback to Kraken (and many similar approaches) is shared by MetaPalette: the databases are large, and building the databases is time/CPU-intensive. In our experience we're seeing what the manual says: you need 30+ GB of RAM to run the software, and much more to build the databases.

(At the end of the day, it may be that many people actually want to use something like Centrifuge, another tool from the inimitable Salzberg group (see Kim et al., 2016). I'm only slowly developing a deeper understanding of the opportunities, limitations, and challenges of the various approaches, and I may blog about that later, but for now let's just say that I have some reasons to prefer the Kraken-style approach.)

More general thoughts

There are a number of challenges that are poorly addressed by current k-mer based classification schemes. One is scalability of classification: I really want to be able to run this stuff on my laptop! Another is scalability of the database build step: I'm OK with running that on bigger hardware than my laptop, but I want to be able to update, recombine, and customize the databases. Here, large RAM requirements are a big problem, and the tooling for database building is rather frustrating as well - more on that below.

I also very much want a library implementation of these things - specifically, a library in Python. Basically, you lose a lot when you communicate between programs through files. (See this blog post for the more general argument.) This would let us intermix k-mer classification with other neat techniques.

More generally, in the current era of "sequence all the things" and the coming era of "ohmigod we have sequenced so many things now what" we are going to be in need of a rich, flexible ecosystem of tools and libraries. This ecosystem will (IMO) need to be:

  • decentralized and locally installable & usable, because many labs will have large internal private data sets that they want to explore;
  • scalable in memory and speed, because the sheer volume of the data is so ...voluminous;
  • customizable and programmable (see above) so that we can try out cool new ideas more easily;
  • making use of databases that can be incrementally (and routinely) updated, so that we can quickly add new reference information without rebuilding the whole database;

and probably some other things I'm not thinking of. The ecosystem aspect here is increasingly important and something I've been increasingly focusing on: approaches that don't work together well are simply not that useful.

Another goal we are going to need to address is classification and characterization of unknowns in metagenomic studies. We are making decent progress in certain areas (metagenome-resolved genomics!!) but there are disturbing hints that we are largely acting like drunks looking for their keys under the streetlight. I believe that we remain in need of systematic, scalable, comprehensive approaches for characterizing environmental metagenome data sets.

This means that we will need to be thinking more and more about reference independent analyses. Of the three above papers, only mash is reference independent; MetaPalette and Kraken both rely on reference databases. Of course, those two tools address the flip side of the coin, which is to properly make use of the reference databases we do have for pre-screening and cross-validation.

--titus

by C. Titus Brown at August 30, 2017 09:25 AM

Matthew Rocklin

Dask Release 0.15.2

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.15.2. This release contains stability enhancements and bug fixes. This blogpost outlines notable changes since the 0.15.0 release on June 11th.

You can conda install Dask:

conda install dask

or pip install from PyPI

pip install dask[complete] --upgrade

Conda packages are available both on the defaults and conda-forge channels.

Full changelogs are available here:

Some notable changes follow.

New dask-core and dask conda packages

On conda there are now three relevant Dask packages:

  1. dask-core: Package that includes only the core Dask package. This has no dependencies other than the standard library. This is primarily intended for downstream libraries that depend on certain parts of Dask.
  2. distributed: Dask’s distributed scheduler, depends on Tornado, cloudpickle, and other libraries.
  3. dask: Metapackage that includes dask-core, distributed, and all relevant libraries like NumPy, Pandas, Bokeh, etc. This is intended for users to install.

This organization is designed to both allow downstream libraries to only depend on the parts of Dask that they need while also making the default behavior for users all-inclusive.

Downstream libraries may want to change conda dependencies from dask to dask-core. They will then need to be careful to include the necessary libraries (like numpy or cloudpickle) based on their user community.

Improved Deployment

Due to increased deployment on Docker or other systems with complex networking rules, dask-worker processes now include separate --contact-address and --listen-address keywords that can be used to specify the addresses that they advertise and the addresses on which they listen. This is especially helpful when the perspective of one’s network can shift dramatically.

dask-worker scheduler-address:8786 \
            --contact-address 192.168.0.100:9000 \
            --listen-address 172.142.0.100:9000
# contact me at 192.168.0.100:9000, but I listen on 172.142.0.100:9000

Additionally other services like the HTTP and Bokeh servers now respect the hosts provided by --listen-address or --host keywords and will not be visible outside of the specified network.

Avoid memory, file descriptor, and process leaks

There were a few occasions where Dask would leak resources in complex situations. Many of these have now been cleaned up. We’re grateful to all those who were able to provide very detailed case studies that demonstrated these issues and even more grateful to those who participated in resolving them.

There is undoubtedly more work to do here and we look forward to future collaboration.

Array and DataFrame APIs

As usual, Dask array and dataframe have a new set of functions that fill out their API relative to NumPy and Pandas.

See the full APIs for further reference:

Deprecations

Officially deprecated dask.distributed.Executor; users should use dask.distributed.Client instead. Previously Executor was kept as an alias of Client.

Removed Bag.concat, users should use Bag.flatten instead.

Removed magic tuple unpacking in Bag.map like bag.map(lambda x, y: x + y). Users should unpack manually instead.
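
For anyone updating code, the fix is to index into (or unpack) the tuple inside the function yourself:

import dask.bag as db

pairs = db.from_sequence([(1, 2), (3, 4), (5, 6)])

# previously: pairs.map(lambda x, y: x + y)   # relied on magic tuple unpacking
sums = pairs.map(lambda xy: xy[0] + xy[1])    # unpack manually instead

print(sums.compute())                         # [3, 7, 11]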

Julia

Developers from Invenia have been building Julia workers and clients that operate with the Dask.distributed scheduler. They have been helpful in raising the issues necessary to ensure cross-language support.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.0 release on June 11th:

  • Bogdan
  • Elliott Sales de Andrade
  • Bruce Merry
  • Erik Welch
  • Fabian Keller
  • James Bourbeau
  • Jeff Reback
  • Jim Crist
  • John A Kirkham
  • Luke Canavan
  • Mark Dunne
  • Martin Durant
  • Matthew Rocklin
  • Olivier Grisel
  • Søren Fuglede Jørgensen
  • Stephan Hoyer
  • Tom Augspurger
  • Yu Feng

The following people contributed to the dask/distributed repository since the 1.17.1 release on June 14th:

  • Antoine Pitrou
  • Dan Brown
  • Elliott Sales de Andrade
  • Eric Davies
  • Erik Welch
  • Evan Welch
  • John A Kirkham
  • Jim Crist
  • James Bourbeau
  • Jeremiah Lowin
  • Julius Neuffer
  • Martin Durant
  • Matthew Rocklin
  • Paul Anton Letnes
  • Peter Waller
  • Sohaib Iftikhar
  • Tom Augspurger

Additionally we’re happy to announce that John Kirkham (@jakirkham) has accepted commit rights to the Dask organization and become a core contributor. John has been active through the Dask project, and particularly active in Dask.array.

August 30, 2017 12:00 AM

August 29, 2017

Matthieu Brucher

Something I realized on scale temperament

When I started music almost thirty years ago, one of my music teachers told us that there was a difference between a flat and a sharp note. I didn’t really understand, as on a trumpet both would be the same! I forgot about it until, a few years ago, I was introduced to the concept of temperament.

It started with the fact that the fifth in a scale has a mathematical relationship to the root note, and all the other notes were built from there. At least for the 12 notes we have in the Western world. Then I read that at some point this imperfect scale was replaced by another one, where all 12 notes were evenly spaced. Of course, that means that the fifth is not quite a pure fifth anymore…
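
The mismatch is easy to check numerically (a quick illustration of the arithmetic, nothing more):

pure_fifth = 3 / 2                  # the fifth built from the harmonics
equal_fifth = 2 ** (7 / 12)         # the fifth in 12-tone equal temperament

twelve_fifths = pure_fifth ** 12    # stack twelve pure fifths...
seven_octaves = 2 ** 7              # ...and compare with seven octaves

print(twelve_fifths / seven_octaves)   # ~1.0136, the Pythagorean comma
print(pure_fifth - equal_fifth)        # ~0.0017: the tempered fifth is slightly
                                       # flat compared to the pure one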

Then I watched a French video on it, with a proper scientific explanation of the different scales, and it got me thinking again. That was what my music teacher was talking about. If you build your scale from the harmonics, you end up with a difference between flats and sharps. But even more, I thought that a melody with flats was more melancholic than one with sharps. That doesn’t make sense for an equally tempered scale: all keys feel the same, no matter which one you use. It’s only if you do as Bach did and use an imperfect temperament that you can feel a difference.

So instruments like strings, or voices, can actually work in a pure temperament. For winds, you can adjust notes on the fly and get a kind of pure temperament as well. On a piano, you can’t; you have to tune it just right (BTW, I still don’t know how piano tuners manage to tune pianos “imperfectly”! Their job is just amazing), or it will feel different from a symphonic orchestra, which can use a pure temperament in any key.

And this is something quite difficult to achieve with a digital instrument. We expect equal temperament. I wonder how Celemony handles these differences when they tune a voice track.

Anyway, maybe the pure scale is imperfect because the 13th note is not exactly the same as the original note, and maybe equal temperament makes everything sound the same, but they each bring something. We probably shouldn’t forget about these differences, and we should use them in modern music as well.

by Matt at August 29, 2017 07:13 AM

August 28, 2017

Continuum Analytics

Continuum Analytics Officially Becomes Anaconda

Today, Continuum Analytics changed its name to Anaconda. Not surprised? We agree. The name change reinforces our commitment to the Anaconda open source development community, the 4.5 million Anaconda users across the globe and customers of our Anaconda Enterprise Platform. The name Anaconda points to our long-term commitment to supporting the Anaconda open source ecosystem …
Read more →

by Scott Collison at August 28, 2017 02:35 AM

August 24, 2017

Continuum Analytics news

Continuum Analytics Welcomes Mathew Lodge as SVP Products and Marketing

Thursday, August 24, 2017

Former VMware VP Joins Anaconda’s Executive Leadership Team to Help Accelerate Data Science Adoption across the Enterprise

AUSTIN, TEXAS—August 24, 2017—Continuum Analytics, the company behind Anaconda, the leading Python data science platform, today announced Mathew Lodge as the company’s new senior vice president (SVP) of products and marketing. Lodge brings extensive experience to the B2B product space––including software and SaaS––to help the company further expand adoption of Anaconda across the enterprise.  

“With a proven history of leading product and marketing strategies for some of the biggest brands and hottest startups in technology, Mathew brings a unique perspective that will help take Continuum Analytics to the next level and extend our position as the leading Python data science platform,” said Scott Collison, CEO at Continuum Analytics. “Mathew will lead our product and marketing efforts, helping to ensure that Anaconda continues to meet and exceed the requirements of today’s enterprise, empowering organizations across the globe to build data science-driven applications that deliver measurable business impact.”
 
The Python community is estimated at more than 30 million members and, according to the most recent O’Reilly Data Science Survey, among data scientists, 72 percent prefer Python as their main tool. Anaconda has more than 4.5 million users and is the world’s most popular and trusted Python data science platform. 

“Data science is foundational to digital and AI strategies, allowing organizations to deliver new products and services faster and cheaper, and respond more quickly to customers and competitors,” said Lodge. “Python is the open source ecosystem of choice for data scientists, and Anaconda is the gold standard platform for Python data science. I look forward to working with customers and partners to realize the huge potential of Anaconda to deliver actionable, automated insight and intelligence to every organization.”

This summer, Continuum Analytics was included in Gartner’s “Cool Vendors in Data Science and Machine Learning, 2017” report and Gartner’s “Hype Cycle for Data Science and Machine Learning for 2017.”
 
Mathew Lodge comes from Weaveworks, a container and microservices start-up, where he was the chief operating officer. Previously, he was vice president in VMware’s Cloud Services group; notably he was co-founder for what became its vCloud Air IaaS service. Early in his career, Lodge built compilers and distributed systems for projects like the International Space Station, helped connect six countries to the Internet for the first time and managed a $630M product line at Cisco. Lodge holds a Master of Engineering from University of York, where he graduated with honors.
 
About Anaconda Powered by Continuum Analytics
Anaconda is the leading data science platform powered by Python, the fastest growing data science language with more than 30 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io

 

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

by swebster at August 24, 2017 03:56 PM

Continuum Analytics

Continuum Analytics Welcomes Mathew Lodge as SVP Products and Marketing

Continuum Analytics, the company behind Anaconda, the leading Python data science platform, today announced Mathew Lodge as the company’s new senior vice president (SVP) of products and marketing.

by Team Anaconda at August 24, 2017 04:47 AM

August 20, 2017

numfocus

How to Compare Photos of the Solar Eclipse using Python & SunPy

The following post was written by Steven Christe of SunPy. A Rare Opportunity to View the Solar Corona A solar eclipse presents a unique opportunity for us to view the solar corona or solar atmosphere. In visible light, the corona is about one million times fainter than the sun, or about as bright as the moon. […]

by Gina Helfrich at August 20, 2017 05:39 PM

August 17, 2017

Titus Brown

Thoughts on my Salmon review

I was one of the reviewers of the Salmon paper by Patro et al., 2017, Salmon provides fast and bias-aware quantification of transcript expression, and I posted my review in large part because of Lior Pachter's blog post levying charges of intellectual theft and dishonesty against the Salmon authors. More recently, the Salmon authors posted a strong rebuttal that is worth reading.

I posted my review verbatim, without any editing, and with the charge for the review laid out by the journal. Part of my goal was to provide some insight into why I was enthusiastic about Salmon, and why I reviewed it positively for publication.


Looking back on my review, I have a few thoughts - especially in light of Lior's hit piece and Rob et al's response.

First, it is always a pleasure reviewing software papers when you have used the tool for ages. My lab uses Salmon, trains people in running Salmon, and passes around blog posts and papers discussing Salmon results. I think we had been using Salmon for at least a year when I was asked to review the paper. So I was already pretty confident that Salmon performed well in general and I did not need to try to run it, evaluate it, etc.

Second, my review hit on a number of high points about Salmon: generally, that it was an important new tool of interest and that it was a significant improvement to the tried-and-tested Kallisto technique. More specifically, I felt that the GC bias correction was likely to be critical in the future (see esp the isoform switching observation!), and that the open-source license was an important aspect of the Salmon software. To my mind this meant that it warranted publication, a case I made succinctly but tried to make strongly.

Third, my review missed at least one very important aspect, which was the streaming nature of the Salmon implementation. The rebuttal covers this nicely...

Last, my review uses the term "pseudoalignment" and not "quasimapping." The distinction between these two terms is part of the brouhaha; but to my mind the basic idea of finding equivalence classes for read mapping is pretty similar no matter what you call it, and it was clear (both from the citations to Kallisto and the discussion in the paper) that the core concept had been published as part of the Kallisto work. I don't think it matters what it's called - it's a good idea, I'm glad that there are multiple implementations that agree well with each other, and I think it was adequately, appropriately, and generously cited in the Salmon paper.

So, anyway, there are my thoughts. Yours welcome below!

(Before you comment, please do see the code of conduct for this blog - it's linked below. In particular, I place myself under no obligation to retain posted comments, and I will delete comments that are useless, or dishonest.)

--titus

by C. Titus Brown at August 17, 2017 10:00 PM

Continuum Analytics

Continuum Analytics to Share Insights at JupyterCon 2017

Continuum Analytics, the creator and driving force behind Anaconda, the leading Python data science platform, today announced that the team will present one keynote, three talks and two tutorials at JupyterCon on August 23 and 24 in NYC, NY.

by Team Anaconda at August 17, 2017 04:41 AM

August 16, 2017

Continuum Analytics news

Continuum Analytics to Share Insights at JupyterCon 2017

Thursday, August 17, 2017

Presentation topics include Jupyter and Anaconda in the enterprise; open innovation in a data-centric world; building an Excel-Python bridge; encapsulating data science using Anaconda Project and JupyterLab; deploying Jupyter dashboards for datapoints; JupyterLab

NEW YORK, August 17, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading Python data science platform, today announced that the team will present one keynote, three talks and two tutorials at JupyterCon on August 23 and 24 in NYC, NY. The event is designed for the data science and business analyst community and offers in-depth trainings, insightful keynotes, networking events and talks exploring the Project Jupyter platform.

Peter Wang, co-founder and CTO of Continuum Analytics, will present two sessions on August 24. The first is a keynote at 9:15 am, titled “Jupyter & Anaconda: Shaking Up the Enterprise.” Peter will discuss the co-evolution of these two major players in the new open source data science ecosystem and next steps to a sustainable future. The other is a talk, “Fueling Open Innovation in a Data-Centric World,” at 11:55 am, offering Peter’s perspectives on the unique challenges of building a company that is fundamentally centered around sustainable open source innovation.

The second talk features Christine Doig, senior data scientist, product manager, and Fabio Pliger, software engineer, of Continuum Analytics, “Leveraging Jupyter to build an Excel-Python Bridge.” It will take place on August 24 at 11:05 am and Christine and Fabio will share how they created a native Microsoft Excel plug-in that provides a point-and-click interface to Python functions, enabling Excel analysts to use machine learning models, advanced interactive visualizations and distributed compute frameworks without needing to write any code. Christine will also be holding a talk on August 25 at 11:55 am on “Data Science Encapsulation and Deployment with Anaconda Project & JupyterLab.” Christine will share how Anaconda Project and JupyterLab encapsulate data science and how to deploy self-service notebooks, interactive applications, dashboards and machine learning.

James Bednar, senior solutions architect, and Philipp Rudiger, software developer, of Continuum Analytics, will give a tutorial on August 23 at 1:30 pm titled, “Deploying Interactive Jupyter Dashboards for Visualizing Hundreds of Millions of Datapoints.” This tutorial will explore an overall workflow for building interactive dashboards, visualizing billions of data points interactively in a Jupyter notebook, with graphical widgets allowing control over data selection, filtering and display options, all using only a few dozen lines of code.

The second tutorial, “JupyterLab,” will be hosted by Steven Silvester, software engineer at Continuum Analytics and Jason Grout, software developer at Bloomberg, on August 23 at 1:30 pm. They will walk through JupyterLab as a user and as an extension author, exploring its capabilities and offering a demonstration on how to create a simple extension to the environment.

Keynote:
WHO: Peter Wang, co-founder and CTO, Anaconda Powered by Continuum Analytics
WHAT: Jupyter & Anaconda: Shaking Up the Enterprise
WHEN: August 24, 9:15am-9:25am ET
WHERE: Grand Ballroom

Talk #1:
WHO: Peter Wang, co-founder and CTO, Anaconda Powered by Continuum Analytics
WHAT: Fueling Open Innovation in a Data-Centric World
WHEN: August 24, 11:55am–12:35pm ET
WHERE: Regent Parlor

Talk #2:
WHO: 

  • Christine Doig, senior data scientist, product manager, Anaconda Powered by Continuum Analytics
  • Fabio Pliger, software engineer, Anaconda Powered by Continuum Analytics

WHAT: Leveraging Jupyter to Build an Excel-Python Bridge
WHEN: August 24, 11:05am–11:45am ET
WHERE: Murray Hill

Talk #3:
WHO: Christine Doig, senior data scientist, product manager, Anaconda Powered by Continuum Analytics
WHAT: Data Science Encapsulation and Deployment with Anaconda Project & JupyterLab
WHEN: August 25, 11:55am–12:35pm ET
WHERE: Regent Parlor

Tutorial #1:
WHO: 

  • James Bednar, senior solutions architect, Anaconda Powered By Continuum Analytics 
  • Philipp Rudiger, software developer, Anaconda Powered By Continuum Analytics 

WHAT: Deploying Interactive Jupyter Dashboards for Visualizing Hundreds of Millions of Datapoints
WHEN: August 23, 1:30pm–5:00pm ET
WHERE: Concourse E

Tutorial #2:
WHO: 

  • Steven Silvester, software engineer, Anaconda Powered By Continuum Analytics 
  • Jason Grout, software developer of Bloomberg

WHAT: JupyterLab Tutorial
WHEN: August 23, 1:30pm–5:00pm ET
WHERE: Concourse A

###

About Anaconda Powered by Continuum Analytics
Anaconda is the leading data science platform powered by Python, the fastest growing data science language with more than 30 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

 

by swebster at August 16, 2017 03:12 PM

August 15, 2017

Titus Brown

My Salmon review

I was one of the reviewers of the Salmon paper by Patro et al., 2017, Salmon provides fast and bias-aware quantification of transcript expression. I was asked to review the paper on September 14, 2016, and submitted my review (or at least stopped getting reminders :) soon after October 20th.

The review request stated,

... I would be grateful for your views on both the technical merits of this work and its interest to a broad readership.

Please try the tool and report your experience. Let me know if you have issues with access.

Brief Communications are intended to report concise studies of high quality and broad interest but are expected to be less substantial than Articles. Typically, a Brief Communication will describe a significant improvement to a tried-and-tested technique, its adaptation for a new application, or an important new tool of interest to the scientific community.

and I reviewed the paper as requested.

Below is the full text of my review as I submitted it. (Note that I have not been posting my reviews lately because I've been busy, and I don't have a simple automated system for doing it. In this case the review seems like it might be of general interest.)


The authors present salmon, a new RNAseq quantification tool that uses pseudoalignment and is highly performant. The significant factors recommending salmon over other tools include speed, high sensitivity, correction for technical biases in sequencing, an open source license, and a robust implementation that is straightforward to use. The paper makes the argument that salmon is the premier tool for RNAseq quantification.

The paper is well written, if unavoidably dense due to the nonsensical space constraints of this journal. The supporting and supplemental material is well written and seems comprehensive. The comparisons with kallisto and express seem fine to me, although I note that the x axes of the various figures are zoomed in to a slightly absurd extent; nonetheless, the comparisons support the conclusions reached in the paper.

There are some interesting gems buried deep within the paper. For example, I was struck by the observation that apparent isoform switching may be caused by technical artifacts (supp fig 5). While the format and journal chosen by the authors don't permit proper exploration of this, I suspect that the technical bias correction is going to be a very important aspect of this software.

I should also say that various versions of the preprint and software have been out for well over a year, and we and others have been using it since very early on. Others have done independent benchmarks that agree with the conclusions reached in the paper.

Requested changes:

  • the authors should provide an archival location for salmon, and include a DOI to the specific version used in this paper. This can be easily accomplished with e.g. zenodo/github integration.
  • as the authors are putting this forward for general use, I would like at least a brief discussion of how they do software testing and validation. In particular, this can answer the important question of how users and developers can know that future versions of salmon will produce results at least as correct as the results in this paper.

Requested additions:

  • optional but recommended: the documentation should be expanded to include a "getting started" tutorial. Right now a naive user would have no idea where to start.

Typos/corrections:

p3, at the same false discovery rate than those => as those

probabilistic is mis-spelled on p5

p7, reference 5 has "page" in it inappropriately.

by C. Titus Brown at August 15, 2017 10:00 PM

Continuum Analytics

Keeping Anaconda Up To Date

Below is a question that gets asked so often that I decided it would be helpful to publish an answer explaining the various ways in which Anaconda can be kept up to date. The question was originally asked on StackOverflow.  I have Anaconda installed on my computer and I’d like to update it. In Navigator I can …
Read more →

by Team Anaconda at August 15, 2017 04:39 AM

Five Organizations Successfully Fueling Innovation with Data Science

Data science innovation requires availability, transparency and interoperability. But what does that mean in practice? At Anaconda, it means providing data scientists with open source tools that facilitate collaboration; moving beyond analytics to intelligence. Open source projects are the foundation of modern data science and are popping up across industries, making it more accessible, more …
Read more →

by Team Anaconda at August 15, 2017 04:28 AM

August 14, 2017

Continuum Analytics news

Five Organizations Successfully Fueling Innovation with Data Science

Tuesday, August 15, 2017
Christine Doig
Sr. Data Scientist, Product Manager

Data science innovation requires availability, transparency and interoperability. But what does that mean in practice? At Anaconda, it means providing data scientists with open source tools that facilitate collaboration; moving beyond analytics to intelligence. Open source projects are the foundation of modern data science and are popping up across industries, making it more accessible, more interactive and more effective. So, who’s leading the open source charge in the data science community? Here are five organizations to keep your eye on:

1. TaxBrain. TaxBrain is a platform that enables policy makers and the public to simulate and study the effects of tax policy reforms using open source economic models. Using the open source platform, anyone can plug in elements of the administration’s proposed tax policy to get an idea of how it would perform in the real world.

 

2. Recursion Pharmaceuticals. Recursion is a pharmaceutical company dedicated to finding remedies for rare genetic diseases. Its drug discovery assay is built on an open source software platform, combining biological science with machine learning techniques to visualize cell data and test drugs efficiently. This approach shortens the research and development process, reducing time to market for remedies to these rare genetic diseases. Their goal is to treat 100 diseases by 2026 using this method.

3. The U.S. Government. Under the previous administration, the U.S. government launched Data.gov, an open data initiative that offers more than 197K datasets for public use. This database exists, in part, thanks to the former U.S. chief data scientist, DJ Patil. He helped drive the government’s data science projects forward at the city, state and federal levels. Recently, concerns have been raised over the Data.gov portal, as certain information has started to disappear. Data scientists are keeping a sharp eye on the portal to ensure that these resources are updated and preserved for future innovative projects.

4. Comcast. Telecom and broadcast giant Comcast runs its projects on open source platforms to drive data science innovation in the industry.

For example, earlier this month, Comcast’s advertising branch announced they were creating a Blockchain Insights Platform to make the planning, targeting, execution and measurement of video ads more efficient. This data-driven, secure approach would be a game changer for the advertising industry, which eagerly awaits its launch in 2018.

5. DARPA. The Defense Advanced Research Projects Agency (DARPA) is behind the Memex project, a program dedicated to fighting human trafficking, which is a top mission for the defense department. DARPA estimates that in two years, traffickers spent $250 million posting the temporary advertisements that fuel the human trafficking trade. Using an open source platform, Memex is able to index and cross-reference interactive and social media, text, images and video across the web. This allows them to find the patterns in web data that indicate human trafficking. Memex’s data science approach is already credited with generating at least 20 active cases and nine open indictments.

These are just some of the examples of open source-fueled data science turning industries on their head, bringing important data to the public and generally making the world a better place. What will be the next open source project to put data science in the headlines? Let us know what you think in the comments below!

by swebster at August 14, 2017 06:12 PM

Continuum Analytics

Anaconda Rides its Way into Gartner’s Hype Cycle

If you’re an Anaconda user and/or frequent reader of our blog, then you know how passionate we are about empowering our community (and future community!) with all the resources needed to bring data science to life. And while it will never stop us from tooting the data science horn, it’s always nice to know we …
Read more →

by Team Anaconda at August 14, 2017 04:22 AM

August 11, 2017

numfocus

How rOpenSci uses Code Review to Promote Reproducible Science

This post was co-authored by the rOpenSci Editorial Board: Noam Ross, Scott Chamberlain, Karthik Ram, and Maëlle Salmon. — At rOpenSci, we create and curate software to help scientists with the data life cycle. These tools access, download, manage, and archive scientific data in open, reproducible ways. Early on, we realized this could only be […]

by Gina Helfrich at August 11, 2017 05:30 PM

August 08, 2017

Continuum Analytics news

Anaconda Rides its Way into Gartner’s Hype Cycle

Monday, August 14, 2017
Scott Collison
Chief Executive Officer

If you’re an Anaconda user and/or frequent reader of our blog, then you know how passionate we are about empowering our community (and future community!) with all the resources needed to bring data science to life. And while it will never stop us from tooting the data science horn, it’s always nice to know we are in good company. Continuum Analytics was recently included in Gartner’s Hype Cycle for Data Science and Machine Learning for 2017, Peter Krensky and Jim Hare, July 28, 2017.

Per the Gartner report, “The hype around data science and machine learning has increased from already high levels in the past year. Data and analytics leaders should use this Hype Cycle to understand technologies generating excitement and inflated expectations, as well as significant movements in adoption and maturity.”

We couldn’t agree more. 

This report comes on the heels of our inclusion in Gartner’s Cool Vendors in Data Science and Machine Learning, 2017 report Peter Krensky, Alexander Linden, Svetlana Sicular, and Cindi Howson, May 11, 2017. 

One important point is that this was the first time Machine Learning was included in Gartner’s Hype Cycle for Emerging Technologies, which we believe clearly validates the growing importance of data science across the enterprise. While this isn’t news to us, it is important that influencers are acknowledging the shift to a more data-driven enterprise and moving data science from an ‘emerging’ category to ‘established.’ 

We witness this shift every day as the Anaconda community swells to 4.5M active users, with more new members joining daily. Our excitement grows along with every new user, because we know the impact data science can have on so many areas––from enabling new and cutting-edge innovations to solving some of the world’s biggest challenges and uncovering answers to questions that haven’t even been asked yet.

We’re grateful to our incredible community for bringing our vision to life and truly embracing the superpowers we offer for people who change the world. 

Gartner Disclaimer
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

by swebster at August 08, 2017 05:37 PM

August 07, 2017

Continuum Analytics

Galvanize Capstone Series: Elderly Financial Fraud Detection

This post is part of our Galvanize Capstone featured projects. It was written by Sanhita Joshi and posted here with their permission. The goal of this project is to detect signs of financial fraud in expense data collected from individuals whose cognitive abilities are compromised and who have someone taking care of their finances. The data used is …
Read more →

by Team Anaconda at August 07, 2017 01:02 PM

August 05, 2017

August 04, 2017

Continuum Analytics news

Keeping Anaconda Up To Date

Tuesday, August 15, 2017
Ian Stokes-Rees
Continuum Analytics

Below is a question that gets asked so often that I decided it would be helpful to publish an answer explaining the various ways in which Anaconda can be kept up to date. The question was originally asked on StackOverflow.

I have Anaconda installed on my computer and I'd like to update it. In Navigator I can see that there are several individual packages that can be updated, but also an anaconda package that sometimes has a version number and sometimes says custom. How do I proceed?

The Answer

What 95% of People Actually Want

In most cases what you want to do when you say that you want to update Anaconda is to execute the command:

conda update --all

This will update all packages in the current environment to the latest version—with the small print being that it may use an older version of some packages in order to satisfy dependency constraints (often this won't be necessary and when it is necessary the package plan solver will do its best to minimize the impact).

This needs to be executed from the command line, and the best way to get there is from Anaconda Navigator: go to the "Environments" tab, then click on the triangle beside the root environment and select "Open Terminal".

This operation will only update the one selected environment (in this case, the root environment). If you have other environments you'd like to update you can repeat the process above, but first click on the environment. When it is selected there is a triangular marker on the right (see image above, step 3). Or, from the command line, you can provide the environment name (-n envname) or path (-p /path/to/env). For example, to update your dspyr environment from the screenshot above:

conda update -n dspyr --all

Update Individual Packages

If you are only interested in updating an individual package, then simply click on the blue arrow or blue version number in Navigator, e.g. for astroid or astropy in the screenshot above, and this will tag those packages for an upgrade. When you are done, you need to click the "Apply" button.

Or from the command line:

conda update astroid astropy

Updating Just the Packages in the Standard Anaconda Distribution

If you don't care about package versions and just want "the latest set of all packages in the standard Anaconda Distribution, so long as they work together," then you should take a look at this gist.

Why Updating the Anaconda Package is Almost Always a Bad Idea

In most cases, updating the Anaconda package in the package list will have a surprising result—you may actually downgrade many packages (in fact, this is likely if it indicates the version as custom). The gist above provides details.

Leverage conda Environments

Your root environment is probably not a good place to try and manage an exact set of packages—it is going to be a dynamic working space with new packages installed and packages randomly updated. If you need an exact set of packages, create a conda environment to hold them. Thanks to the conda package cache and the way file linking is used, doing this is typically fast and consumes very little additional disk space. For example: 

conda create -n myspecialenv -c bioconda -c conda-forge python=3.5 pandas beautifulsoup seaborn nltk

The conda documentation has more details and examples.

pip, PyPI, and setuptools?

None of this is going to help with updating packages that have been installed from PyPI via pip, or any packages installed using python setup.py install. conda list will give you some hints about the pip-based Python packages you have in an environment, but it won't do anything special to update them.

Commercial Use of Anaconda or Anaconda Enterprise

It's pretty much exactly the same story, with the exception that you may not be able to update the root environment if it was installed by someone else (say, to /opt/anaconda/latest). If you're not able to update the environments you are using, you should be able to clone and then update:

conda create -n myenv --clone root
conda update -n myenv --all

Where to Go Next

If you have more questions about Anaconda then you can refer to our online documentation, or make use of our commercial (paid) or community (free) support channels. If you are using Anaconda in an enterprise setting, then I think you'll be interested in learning about how Anaconda Enterprise can help you and your colleagues with collaboration, security, governance, and provenance around your data science workflows.

by swebster at August 04, 2017 03:18 PM

August 01, 2017

numfocus

Leveling up with Open Astronomy: Astropy affiliated packages

Matt Craig, Professor of Physics and Astronomy at Minnesota State University Moorhead, has created this list of Astropy affiliated packages to help improve your experience exploring astronomy using Python. This post was inspired by Ole Moeller-Nilsson’s recent blog post on how to get started doing astronomy with open data and Python. — One of the […]

by Gina Helfrich at August 01, 2017 05:00 PM

Matthieu Brucher

Audio ToolKit: modules for the JUCE framework

As some may know, I’ve switched from wdl-ol to JUCE 5 for my free plugins. In the past, I had to modify by hand all the projects created by the Projucer. And each time JUCE is updated, I need to add these changes to the generated project.

Not anymore.

On the develop branch, and in the next minor release, ATK 2.1.2, modules for the Projucer will be available. This will make it easier to integrate ATK in your project, as you will “just” need to add these modules to the Projucer (plus some additional include files to make ATK compliant with the JUCE hierarchy).

There are currently 11 new modules (shameless comparison: that's far more than the new DSP module, even though there are some filters that I'm missing, but feel free to propose a pull request with new features!).

See the explanation on the release branch and let me know what you think: https://github.com/mbrucher/AudioTK/tree/develop/modules/JUCE


by Matt at August 01, 2017 07:21 AM

July 31, 2017

Continuum Analytics news

What’s New with Anaconda in 2017?

Monday, July 31, 2017
Scott Collison
Chief Executive Officer

It’s hard to believe that we’re halfway through 2017. I joined Anaconda as CEO back in January and I’ve been deeply impressed by this amazing team and all that we have collectively accomplished in just six short months. From new events to new products, our teams have already squeezed quite a bit into 2017. Before the year flies by completely, I wanted to take a moment to reflect and share the exciting milestones we’ve hit on the road to 2018.

New Senior Leadership Team Members

Here at Anaconda, the first half of the year was bookended by new hires. At the beginning of the year, I joined to run the company. This move allowed Anaconda’s co-founder, Travis Oliphant, to channel his energy into open source innovation and community. At the end of June, we added another new name to the executive team when Aaron Barfoot came on-board as CFO. Aaron is a world class CFO and is leading the effort to make our finance, IT and HR departments top notch. Both Aaron and I are thrilled to be a part of the next chapter for Anaconda as our numbers continue to climb past four million active users.

New Data

One of Anaconda’s most exciting projects this year was the release of our data-backed report: Winning at Data Science: How Teamwork Leads to Victory. Working with the independent research firm Vanson Bourne, we surveyed 200 data science and analytics decision makers, digging into the adoption, challenges and value of data science in the enterprise. While some of the findings were expected, others were jaw dropping. For example, 96 percent of respondents agree that data science is critical to the success of their organization, but 22 percent are failing to fully take advantage of their data. If you love numbers as much as we do, check out the full findings here.

New Events

This February, our inaugural user conference, AnacondaCON, was a box office hit. Literally.

I’ve been at companies known for their stand-out conferences, and this is one of the best industry conferences I have ever attended, by far. 

With speakers from the Anaconda team, our customers/users and partners, it was fantastic to have the brightest minds in data science together in the same room. If you couldn’t make it to this year’s event, you can watch the 2017 keynotes and keep your eye out for updates for AnacondaCON 2018, which will be April 8-11, 2018 at the JW Marriott in Austin, TX.

New Partners

In April, we joined forces with IBM to offer Anaconda on IBM’s Cognitive Systems, the company’s deep learning platform, as well as the IBM Power AI software distribution for machine learning. In May, we made another big announcement with our partners H2O.ai and MapD Technologies with the launch of the GPU Open Analytics Initiative (GOAI). GOAI is designed to create common data frameworks using NVIDIA’s technology, accelerating data science on GPUs. We see both of these partnerships as major points of validation for data science and machine learning, as data scientists continue to stretch their legs in the enterprise. Keep an eye out for even more activity in this area later in the year.

New Products

If you’re an Anaconda user, we hope you were as thrilled as we were about the launch of Anaconda 4.4 this June. Besides delivering a comprehensive platform for Python-centric data science with a single-click installer, Anaconda 4.4 is designed to simplify working with Python 2 and Python 3 code, supporting a gradual migration and avoiding the need for any “big bang” cutover.

New Users

In 2016 alone, growth in Anaconda users and in our products and services exceeded 100 percent. As of 2017, downloads since the company’s founding in 2012 have surpassed 25 million, with more than four million active users. A recent poll from KDnuggets found that Anaconda usage has increased 37% since 2016, placing the platform in the top 10 ranking for the industry’s most popular analytics/data science tools. The poll also found that Python has overtaken R as the most popular data science language, a trend we anticipate will continue in the years ahead.

I’m super pleased with our accomplishments for the first half of the year and look forward to an exciting second half of 2017!

by swebster at July 31, 2017 02:53 PM

Continuum Analytics

Galvanize Capstone Series: Predicting Demand with RideAustin

The purpose of my project is to try to bridge that gap. If I can successfully predict what areas in Austin are likely to have high demand, I can tell drivers about those spots and increase the supply to meet that demand. Predicting where these hotspots are going to be means taking in the historical …
Read more →

by Team Anaconda at July 31, 2017 02:52 AM

July 27, 2017

Continuum Analytics news

Galvanize Capstone Series: Elderly Financial Fraud Detection

Monday, August 7, 2017
Sanhita Joshi
Guest Blogger

This post is part of our Galvanize Capstone featured projects. This post was written by Sanhita Joshi and posted here with their permission. 

The goal of this project is to detect signs of financial fraud in expense data collected from individuals whose cognitive abilities are compromised and who have someone taking care of their finances. The data used is a test set of data from the State of Minnesota, provided by Michael Curran of Guide Change. Minnesota is currently the only state that collects this information digitally. The intent of the data analysis is to learn patterns in elderly fraud and alert authorities about how these frauds happen.

Detecting Elderly Fraud Now

Previously, to detect elderly fraud, the state government used red flags such as a charitable donation over $100 or a single cash transaction over a certain amount. This is not a very clever way to handle such a big and varied dataset, and there are too many red flags. This outdated process is a waste of resources, like accountants' time and, in turn, tax-payers' money.

The other difficult part of this system is that cases where the person has a monthly income (or estate) less than $3000 are not investigated. Therefore, it is important to find outliers within low-income households as well in our analysis.

Although some of these cases might be investigated in detail, there is no information about that in the dataset. Hence, the challenging part of the project is that this is unlabeled data, and this is going to be unsupervised learning.

Improving Elderly Fraud Detection: My Algorithm

For this project, I designed an algorithm to read all the expense transactions of these individuals and find outliers. For the purposes of the project, I made two-dimensional plots for each expense category: the X-axis is the normalized expense in the category and the Y-axis is the number of transactions per day. Each point represents one individual. Because each plot shows outliers, I designed an algorithm to quantify them (a minimal sketch in code follows the list):

  1. For each category, distance between each pair of points is calculated.
  2. Number of neighbors within the median distance (from step 1) were counted.
  3. Each individual is ranked depending on the number of neighbors they have.
  4. All ranks were aggregated for quantifying outliers.
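A minimal sketch of steps 1–4 in Python (my own illustration, not the project's code; it assumes each category's data is already a two-column array of normalized expense and transactions per day):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def rank_by_isolation(points):
    """points: (n, 2) array of [normalized expense, transactions per day], one row per person."""
    dists = squareform(pdist(points))                   # step 1: pairwise distances
    median_dist = np.median(dists)
    neighbors = (dists < median_dist).sum(axis=1) - 1   # step 2: neighbors within the median distance
    return neighbors.argsort().argsort()                # step 3: rank 0 = fewest neighbors (most isolated)

# step 4: aggregate the ranks across all expense categories (category_points is assumed here)
# total_rank = sum(rank_by_isolation(points) for points in category_points)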

The figure below shows the aggregated plot from this algorithm. Half the points are blue, or 'normal,' and the other half are red, or potential outliers.

 

Reading the Plot and Why This Is a Powerful Algorithm

1. From the individual expense category plots, limits on each spending category can be determined. Currently, the red flag for each individual expense category is set arbitrarily.
2. This algorithm helps find frequent yet low-dollar-value transactions; clever fraud might be buried in these rather than in conspicuous high-dollar-value transactions.
3. The boundary between blue and red points is not very sharp; this implies that the algorithm can detect potential fraud that wouldn't be noticed with simplistic red flags like those described above.

Now that we have a regular spending pattern, it can be used:

  • to inform authorities about outliers that may point to fraud 
  • to help the elderly budget better

The idea with this project is to identify people who may have been defrauded and find patterns from their data to help the vulnerable elderly population. Going forward, this algorithm can be used to do just that. 

For more detailed information, please visit the project's GitHub.

by swebster at July 27, 2017 07:52 PM

July 26, 2017

Continuum Analytics news

Open Sourcing Anaconda Accelerate

Thursday, July 27, 2017
Stan Seibert
Director, Community Innovation

We’re very excited to announce the open sourcing and splitting of the proprietary Anaconda Accelerate library into several new projects. This change has been a long time coming, and we are looking forward to moving this functionality out into the open for the community.

A Brief History of Accelerate/NumbaPro

Continuum Analytics has always been a products, services and training company focused on open data science, especially Python. Prior to the introduction of Anaconda Enterprise, our flagship enterprise product, we created two smaller proprietary libraries: IOPro and NumbaPro. You may recall that IOPro was open sourced last year.

NumbaPro was one of the early tools that could compile Python for execution on NVIDIA GPUs. Our goal with NumbaPro was to make cutting-edge GPUs more accessible to Python users, and to improve the performance of numerical code in Python. NumbaPro proved this was possible, and we offered free licenses to academic users to help jump start early adoption.

In March 2014, we decided that the core GPU compiler in NumbaPro really needed to become open source to help advance GPU usage in Python, and it was merged into the open source Numba project. Later, in January 2016, we moved the compiler features for multithreaded CPU and GPU ufuncs (and generalized ufuncs) into the Numba project and renamed NumbaPro to Accelerate.

Our philosophy with open source is that we should open source a technology when we (1) think it should become core infrastructure in the PyData community and (2) we want to build a user/developer community around the technology. If you look at our other open source projects, we hope that spirit comes through, and it has guided us as we have transferred features from Accelerate/NumbaPro to Numba.

What is Changing?

Accelerate currently is composed of three different feature sets:

  • Python wrappers around NVIDIA GPU libraries for linear algebra, FFT, sparse matrix operations, sorting and searching.
  • Python wrappers around some of Intel’s MKL Vector Math Library functions
  • A “data profiler” tool based on cProfile and SnakeViz.

NVIDIA CUDA Libraries

Today, we are releasing two new Numba sub-projects called pyculib and pyculib_sorting, which contain the NVIDIA GPU library Python wrappers and sorting functions from Accelerate. These wrappers work with NumPy arrays and Numba GPU device arrays to provide access to accelerated functions from:

  • cuBLAS: Linear algebra
  • cuFFT: Fast Fourier Transform
  • cuSparse: Sparse matrix operations
  • cuRand: Random number generation (host functions only)
  • Sorting: Fast sorting algorithms ported from CUB and ModernGPU

Going forward, the Numba project will take stewardship of pyculib and pyculib_sorting, releasing updates as needed when new Numba releases come out. These projects are BSD-licensed, just like Numba.

MKL Accelerated NumPy Ufuncs

The second Accelerate feature was a set of wrappers around Intel’s Vector Math Libraries to compute special math functions on NumPy arrays in parallel on the CPU. Shortly after we implemented this feature, Intel released their own Python distribution based on Anaconda. The Intel Distribution for Python includes a patched version of NumPy that delegates many array math operations to either Intel’s SVML library (for small arrays) or their MKL Vector Math Library (for large arrays). We think this is a much better alternative to Accelerate for users who want accelerated NumPy functions on the CPU. Existing Anaconda users can create new conda environments with Intel’s full Python distribution, or install Intel’s version of NumPy using these instructions.
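For example, creating a full Intel environment looks roughly like this (the intel channel and intelpython3_full package names are taken from Intel's instructions and may change, so double-check the linked page):

conda create -n intelpy -c intel intelpython3_full python=3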

Note that the free Anaconda distribution of NumPy and SciPy has used MKL to accelerate linear algebra and FFT operations for several years now, and will continue to do so.

Data Profiler

The final feature in Accelerate is what we have decided to call a “data profiler." This tool arose out of our experiences doing optimization projects for customers. Every optimization task should start by profiling the application to see what functions are consuming the most compute time. However, in a lot of scientific Python applications that use NumPy, it is important to also consider the shape and data type of the arrays being passed around, as that determines what optimization strategies are viable. Operations on a very large array could be accelerated with a multi-threaded CPU or GPU implementation, whereas many operations on small arrays might require some refactoring to batch processing for higher efficiency.  

The traditional Python profiler, cProfile, doesn’t capture information about data types or array sizes, so we extended it to record this extra information along with the function signature. Any tool that works with cProfile stats files should be able to display this information. We also modified the SnakeViz tool to more easily embed its interactive graphics into a Jupyter Notebook.
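For comparison, here is plain cProfile in action; data_profiler extends this pattern to also record argument types and array shapes (see its documentation for the exact API, which isn't shown here):

import cProfile
import numpy as np

def work(a):
    # A stand-in workload: square the array element-wise and sum it
    return (a * a).sum()

cProfile.run('work(np.ones(1000000))', sort='cumulative')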

Today, we are open sourcing this tool in the data_profiler project on GitHub, also under the Numba organization. Again, like Numba, data_profiler is BSD-licensed.

Next Steps

  • If you were using the accelerate.cuda package, you can install the pyculib package today:
    • conda install -c numba pyculib
    • The documentation for pyculib shows how to map old Accelerate package names to the new version. The pyculib packages will appear in the default conda channel in a few weeks.
  • If you are interested in accelerated NumPy functions on the CPU, take a look at the Intel Python conda packages: Using Intel Distribution for Python with Anaconda.
  • If you want to try out the data_profiler, you can take a look at the documentation here.

We will continue to support current Anaconda Accelerate licensees until August 1, 2018, but we encourage you to switch over to the new projects as soon as possible.  If you have any questions, please contact support@continuum.io for more information.

by swebster at July 26, 2017 05:37 PM

Leonardo Uieda

GMT/Python update and feedback from Scipy 2017


Last week I presented the first working prototype of GMT/Python at Scipy 2017, which is my favorite conference. I got a lot of excellent feedback about the project and will need to make some major changes as a result. Sadly, I wasn't very good at managing my time during the talk and didn't get to show the internals of the library. I'll use this post to describe how things are currently implemented, what I learned from the feedback, and what changes I'm making to the code base.

Before we dive in, you can watch my talk on Youtube or just take a quick look at my slides.

Running the code

You can try out the demo notebook from the talk if you're on Linux or OSX (sorry Windows users, we're working on it). First, you'll need to install an alpha version of GMT 6.0.0 from conda-forge:

conda install gmt=6.0.0a5 -c conda-forge/label/dev

I should note that this only works because of the amazing help I got from Filipe Fernandes, Mike Hearne, and Ray Donnelly during the Scipy sprints.

Now you can install the GMT/Python version from the talk:

pip install https://github.com/GenericMappingTools/gmt-python/archive/0e2b118.zip

Current implementation

All our code is hosted on the GenericMappingTools/gmt-python repository on Github. The Python package itself is called gmt so that you can import gmt instead of import gmt-python (which looks a bit silly). The following example wasn't on the video but it shows what is currently possible with the library:

import gmt

# Start a new figure
gmt.figure()
# Plot coastlines of North America using readable aliases for the arguments
gmt.pscoast(region=[-130, -70, 24, 52], projection="B-100/35/33/45/6i",
            land='gray', frame='a', portrait=True, shorelines='thinnest',
            borders='1/thickest', area_thresh=500)
# Embed the figure in the notebook
gmt.show()

Here is what the file structure of the gmt package looks like:

gmt
├── __init__.py
├── clib/
│   ├── __init__.py
│   ├── constants.py
│   ├── core.py
├── decorators.py
├── extra_modules.py
├── ps_modules.py
├── session_management.py
├── tests/
│   ├── __init__.py
│   ├── baseline/
│   ├── data/
│   ├── test_*.py
│   └── utils.py
└── utils.py

Low-level wrappers

The subpackage gmt.clib contains all of the low-level ctypes code that interfaces with libgmt. A user will not need to see or use this package. The main function that we need from the C API is GMT_Call_Module, which is wrapped by gmt.clib.call_module. This is how we execute all of the GMT modules (pscoast, grdgradient, etc.).

Each module is implemented as a function in one of the *.py files in the gmt package. For example, the Postscript generating modules live in gmt/ps_modules.py. These "module functions" are all accessible from the top-level package namespace. This means that you can access them as gmt.pscoast instead of gmt.ps_modules.pscoast.

All of the unit and integration tests live in gmt.tests and are shipped with the package. Also included is a gmt.test() function that runs all of our tests using pytest. I'll go over how we run the tests below.

Module wrapper functions

This is what a function that wraps a GMT module looks like:

@fmt_docstring
@use_alias(R='region', J='projection', A='area_thresh', B='frame',
           D='resolution', P='portrait', I='rivers', N='borders',
           W='shorelines', G='land', S='water')
@kwargs_to_strings(R='sequence')
def pscoast(**kwargs):
    """
    Plot continents, shorelines, rivers, and borders on maps

    ...

    {gmt_module_docs}

    {aliases}

    Parameters
    ----------
    {J}
    {R}
    A : int, float, or str
        ``'min_area[/min_level/max_level][+ag|i|s|S][+r|l][+ppercent]'``
        Features with an area smaller than min_area in km^2 or of hierarchical
        level that is lower than min_level or higher than max_level will not be
        plotted.
    {B}
    C : str
        Set the shade, color, or pattern for lakes and river-lakes.
    ...

    """
    with APISession() as session:
        call_module(session, 'pscoast', build_arg_string(kwargs))

The function takes keyword arguments (**kwargs) and must have the same name as the corresponding GMT module, in this case pscoast. In the Python wrapper, the ps prefix doesn't really make much sense because we don't need to care that this module writes Postscript. We'll implement aliases for the function names later on to deal with this.

The body of the function is quite simple and is only two lines. First, we need to create a GMTAPI_CTRL C structure that is required by all GMT functions. The gmt.clib.APISession context manager takes care of creating the structure by calling GMT_Create_Session from the C API, handing us a pointer in the session variable, and destroying it on exit using GMT_Destroy_Session. Next, we use gmt.clib.call_module to execute the pscoast GMT module. This function wraps GMT_Call_Module from the C API and receives inputs as a string of command-line arguments, like '-R1/2/3/4 -JM4i ...'. Function build_arg_string from the gmt.utils module takes care of transforming the kwargs dictionary into that string.

Notice that this function doesn't pass any data to the C API. We're still working on that.

The majority of the parsing work is being done by the decorators (the things with @ symbols) that live in gmt.decorators. The first is kwargs_to_strings that takes care of converting some of the function arguments into string representations. By default, it will convert any boolean arguments (True or False) into the empty string (if True) or remove the argument from kwargs (if False). For example, P=True will be transformed into P='' so that the argument string made by build_arg_string will be -P. kwargs_to_strings also allows specifying special conversions. Here, we ask it to convert the R argument into a string if it's a sequence (list, tuple, etc). So if given R=[1, 2, 3, 4] it will replace that with R='1/2/3/4'.
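To make that concrete, here is a self-contained sketch of the kind of transformation kwargs_to_strings and build_arg_string perform together (my illustration, not the actual gmt code):

def convert_kwargs(kwargs):
    """Convert booleans and sequences into GMT-style string values."""
    converted = {}
    for key, value in kwargs.items():
        if value is True:
            converted[key] = ''                     # P=True becomes -P
        elif value is False:
            continue                                # False arguments are dropped
        elif isinstance(value, (list, tuple)):
            converted[key] = '/'.join(str(item) for item in value)  # R=[1, 2, 3, 4] -> '1/2/3/4'
        else:
            converted[key] = value
    return converted

def build_arg_string(kwargs):
    """Build a command-line style string such as '-JM4i -P -R1/2/3/4'."""
    return ' '.join('-{}{}'.format(key, value) for key, value in sorted(kwargs.items()))

print(build_arg_string(convert_kwargs({'R': [1, 2, 3, 4], 'J': 'M4i', 'P': True})))
# prints: -JM4i -P -R1/2/3/4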

The argument aliases are handled by use_alias. The format is ARG='alias' and it will replace any instance of 'alias' in kwargs with 'ARG'. This allows all other functions to only deal with the GMT version of the arguments and not have to worry about the aliases.

Finally, fmt_docstring inserts stubs into the docstring, like a list of aliases (provided by use_alias), a link to the GMT module documentation, common arguments, etc. See the decorator docstring for a full list.

Tests

The testing code is packaged with the library in gmt.tests. We use pytest-mpl to test plot generating commands. I had to hack together a class that implements a savefig method to make pytest-mpl work. This is bundled in a decorator called figure_comparison_test. A typical test looks like this:

from .utils import figure_comparison_test
from .. import figure, pscoast


@figure_comparison_test
def test_pscoast_aliases():
    "Test that all aliases work"
    figure()
    pscoast(region='-30/30/-40/40', projection='m0.1i', frame='afg',
            rivers='1/1p,black', borders='1/0.5p,-',
            shorelines='0.25p,white', land='moccasin', water='skyblue',
            resolution='i', area_thresh=1000, portrait=True)

When you run the tests, pytest-mpl will generate the figure and compare it to a baseline that we have stored in gmt/tests/baseline.

Changes after Scipy

Paul and I had the chance to talk about the project with a lot of smart and interesting people. Many thanks to Joe Kington, John Leeman, Ryan May, Filipe Fernandes, Mike Hearne, and Benjamin Root.

Code of conduct

Since a few people seemed interested in contributing to the project, I decided to add a Code of Conduct to ensure that everyone who wants to get involved knows the rules. I copied it from the Contributor Covenant template. It's so easy that there's no excuse for not having one anymore.

I also adapted the great "Impostor syndrome disclaimer" from the MetPy project. This was originally proposed by Adrienne Lowe during an awesome Pycon talk. I highly recommend that you take a few minutes to watch it.

Object oriented API

By far the most requested feature was to have an object-oriented API, like that of matplotlib. I can see the appeal and usefulness of this and was convinced to make the change. It makes sense and we're already doing something like it in an ugly way to be able to use pytest-mpl.

What I'm not willing to do is to support both the existing API with functions and one with classes at the same time. I don't like that there are two ways of doing the same thing in matplotlib. It causes unnecessary confusion, particularly if you're new to the library. And it goes against the Zen of Python:

There should be one-- and preferably only one --obvious way to do it.

So I'm going all in with the classes.

Note that this will only impact the figure generating functions. Other parts of GMT will still be wrapped by functions in the gmt package.

The new API will look something like this:

import gmt

# Start a figure
fig1 = gmt.Figure()
# Start a different figure
fig2 = gmt.Figure()

# Use the methods in the Figure class to plot things
fig1.pscoast(region=[-130, -70, 24, 52], projection="B-100/35/33/45/6i",
             land='gray', frame='a', portrait=True, shorelines='thinnest',
             borders='1/thickest', area_thresh=500)

# We can now alternate between figures when plotting
fig2.pscoast(region=[-130, -70, 24, 52], projection="B-100/35/33/45/6i",
             land='blue', frame='a', portrait=True)

# Use savefig to save to a file
fig1.savefig('north-america.pdf')
fig2.savefig('north-america-blue.png')

The Figure class can know how to plot and insert itself in the notebook using the rich display features in IPython. I can think of a few possibilities to view the figures in the Jupyter notebook:

  1. Include fig as the last line of the notebook cell. I don't like this at all because it's ugly. But it should work anyway even if we go with another option.

    fig = gmt.Figure()
    fig.pscoast(...)
    fig
    
  2. All Figure methods return the figure itself so it'll be picked up by the notebook. This eliminates the need for the ugly trailing fig. A problem with this approach is that the last line in the cell needs to be a call to fig.something.

    fig = gmt.Figure()
    fig.pscoast(...)
    
  3. Have a Figure.show() method that inserts the figure into the notebook. This is a nice explicit solution to the problem. The only drawback I can see is that you wouldn't be able to show more than one figure at a time. But I'm OK with that.

    fig = gmt.Figure()
    fig.pscoast(...)
    fig.show()
    
  4. Have a gmt.show() function that displays all currently active figures. This would require that we keep track of all figures created in a global variable in the package namespace. I don't know if I like this too much because it requires mutating global variables, which can cause a lot of hard to track down bugs.

    fig = gmt.Figure()
    fig.pscoast(...)
    gmt.show()
    

I'll probably implement some form of the show function to make it display a pop-up window in the terminal as well. This is a case where a user might want the gmt.show option to view more than one figure.

These options are not mutually exclusive. What I'm still pondering is whether or not to implement some of them (#1 will always be possible). Since I work mostly on the notebook, I don't care too much about #4. But I'm sure a lot of you have different needs and preferences.
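To illustrate how a class can hook into the notebook's rich display (the mechanism behind option 1), here is a minimal, self-contained sketch; the real gmt.Figure would render the Postscript through the C API rather than take image bytes in the constructor:

class Figure(object):
    """Toy stand-in for a plotting figure that the notebook can display."""

    def __init__(self, png_bytes):
        self._png = png_bytes

    def savefig(self, fname):
        # Write the rendered image to a file
        with open(fname, 'wb') as fout:
            fout.write(self._png)

    def _repr_png_(self):
        # Jupyter/IPython calls this automatically when the object is the
        # last expression in a cell, which is what makes option 1 work
        return self._png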

How would you prefer to display your images? What would you like to see in the API? Let me know in the comments or on Github.


Comments? Leave one below or let me know on Twitter @leouieda or in the Software Underground Slack group.

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!


July 26, 2017 12:00 PM

July 24, 2017

Continuum Analytics news

Galvanize Capstone Series: Predicting Demand with RideAustin

Monday, July 31, 2017
Will Johnson
Guest Blogger

This post is part of our Galvanize Capstone featured projects. This post was written by Will Johnson and posted here with their permission. 

The Story

On April 13th of 2017, I went to a concert. Specifically, here, in downtown Austin, Texas.

After the show was done at 10:00PM, I needed a ride home. I opened up my favorite ride-sharing app and asked for a driver to come pick me up. But, instead of getting a ride, I got a message that read "NO DRIVER AVAILABLE." I thought that was odd, since this was the middle of downtown and there should be plenty of traffic. So, I gave it a minute and, sure enough, after trying again, I got myself a ride.

I told the driver, "Boy, you guys must be busy if I can't get a ride downtown."

He looked back at me and said "Actually, I've been driving around for ages looking for a fare."

What that told me was that there was a disconnect between rider and driver. Here I am with money, there he is with a car, but the only thing keeping us from finding each other is location.

The Goal

The purpose of my project is to try to bridge that gap. If I can successfully predict what areas in Austin are likely to have high demand, I can tell drivers about those spots and increase the supply to meet that demand. Predicting where these hotspots are going to be means taking in the historical information about when and where ride requests happen and using them to predict where the ride requests are likely to happen next.

In order to do this, the guys over at RideAustin (the same ride-sharing app I used to get my ride that day) were gracious enough to give me their database in a SQL dump. RideAustin is a non-profit ride-sharing service that not only has great drivers and prices, but also allows you round-up your fare to donate the proceeds to a local charity. They boast more than 2 million rides, the details of which I will be using to train my model. Once I had the data, I immediately got my hands wet. 

Building the Model

The first thing I did was some research on what other data scientists have done before on this subject. One notable idea I read about came from Japan, where a cellular service provider was able to predict ride requests for a taxi company within 500 square feet. Their method was to break the city down into a grid and, from what I gathered, return the grid points that were the most likely to see the next ride request. What they had that I didn't was real-time geolocation of users (because they're a cell service company), so my model works a bit differently.

While the grid-mapping method may make it easier for a model like an ensemble classifier to predict the area with the highest demand, it also decorrelates areas from each other. If I have downtown blocked off into a grid of 4 areas, then a classifier would assume each class is independent of the others, when I know that all four of those areas are highly correlated with each other. Moreover, I wanted to capture a sense of the variability of these areas of demand continuously, to see how they may change in certain circumstances. So I instead decided to plot hotspots using K-Means clustering.

K-Means Clustering

K-Means is an unsupervised machine learning algorithm that separates a dataset into a certain number (k) of clusters. If I can find the centerpoint of those clusters, known as the centroid, then those points could be good hotspots to suggest to my drivers. The trick is getting the right number of clusters down. I found that 5 clusters makes the most sense for Austin's ride request data, given the neighborhoods and historical ride demand.
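For example, with scikit-learn (a sketch; the toy latitude/longitude pairs below just stand in for the predicted ride requests):

import numpy as np
from sklearn.cluster import KMeans

# Toy (lat, lon) points standing in for predicted ride requests around Austin
requests = np.array([[30.27, -97.74], [30.26, -97.75], [30.40, -97.72],
                     [30.22, -97.77], [30.20, -97.67], [30.25, -97.74]])

kmeans = KMeans(n_clusters=5, random_state=0).fit(requests)
hotspots = kmeans.cluster_centers_   # one (lat, lon) centroid per hotspot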

These hotspots are returned as pairs of latitude and longitude coordinates. The trouble with feeding these kinds of pairs into a regressor is that if I were to predict a certain latitude, how do I know which longitude makes sense to pair it with? Moreover, regressors are designed to predict values with the highest accuracy on out-of-sample data, but do I really care about what order my predictions are returned in? No. All I care about is that my predicted hotspots make sense with where the demand is actually going to be. I want to send my drivers to the right spots, not necessarily to the right spots in a particular order.

My Model

In the end, I created an algorithm that would use historical data in order to predict likely ride requests, then run a k-means clustering algorithm on those rides to find where the hotspots are going to be. Breaking it down into six steps, the model works like this:

Step 1: Find a Ride Request

Let's look at just one ride. I've got information about it, like its lat-long location, the time of day, the day of the week, the date and so on. If I go back into my dataset, looking at all the ride requests that came before this one, I can then find requests that are similar to that ride request.

Step 2: Find Similar Ride Requests

The green dots are all similar rides that happened around the same spot, same time of day, and the same day of the week. Each of these rides had at least one ride request to come after it. My thought was that those following rides would follow a historical pattern of where rides were likely to pop up next in similar conditions. So let's find the requests that came after all these green ones.

Step 3: Find the Following Ride Requests

These green dots are all rides that came after the similar ride requests to my original blue one, down in the bottom left corner. As you can see, most of them occur downtown, the bar scene south of the river, the mall, or the airport. This being 10:00PM on a Thursday night, that makes sense.

So let's now randomly sample from this distribution of requests and choose that to be our next expected ride request. This is sort of hijacking the Central Limit Theorem and the law of large numbers in statistics, which tell us that the more observations we take, the closer our observed distribution and its mean should be to the true ones. In this case, we have most of our rides downtown, but a good amount elsewhere. An independent random sample of this distribution should reflect the same variation as the ride requests that would follow in reality.

Step 4: Randomly Sample

See that new blue dot? That is our first expected request. It's highly unlikely that the next actual ride request will be in that exact spot, but that's not what we're after. We want to know what the general distribution of rides is likely to look like, so we can get a good idea of what areas might see more demand in the coming 30 minutes.
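In code, that sampling step is just a uniform random draw from the pool of follow-up requests (a sketch; follow_up_requests is a hypothetical array holding the (lat, lon) points found in step 3):

import numpy as np

# follow_up_requests: (n, 2) array of [lat, lon] for requests that followed the similar rides
follow_up_requests = np.array([[30.27, -97.74], [30.26, -97.75], [30.20, -97.67]])

# Draw one expected next request at random from the historical follow-ups
idx = np.random.choice(len(follow_up_requests))
expected_request = follow_up_requests[idx]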

So we repeat the process: finding similar ride requests and their follow-ups and plotting their distribution, we can see they again favor downtown but also spread out a little.

Step 5: Repeat

If we were to repeat this process enough times, we now have a distribution of what we expect our rides to look like for the next 30 minutes. How did we decide how many to pick? I used a regression model called a Random Forest to predict how many rides we should expect to see for the whole of Austin in the next 30 minutes. That number, we'll call n, is how many times we repeat this process. If we repeated the same amount every time, then our expected hotspot for any given time period would generally be the average hotspot for that time block. While that might be OK to do, I'm more interested in capturing the variation on any given day.
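A sketch of that count model with scikit-learn (the feature columns here, hour, day of week and month, are my guesses at reasonable inputs, not necessarily the ones used in the project):

from sklearn.ensemble import RandomForestRegressor

# X: one row per historical 30-minute window, e.g. [hour, day_of_week, month]
X = [[22, 3, 4], [22, 4, 4], [8, 0, 4], [17, 5, 4]]
# y: number of ride requests observed citywide in that window
y = [180, 210, 25, 320]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
n = int(model.predict([[22, 3, 4]])[0])   # expected requests in the next 30 minutes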

So now that we've got an idea of how we expect the demand to look in the next half hour, let's run our K-Means clustering algorithm on it.

Step 6: Cluster

Now we've got our hotspots!

These are the areas we'll suggest to our riders to pay attention to. Now, in data science, a model is no good unless we can prove that it works. So, let's test this method by calculating the distance between the predicted hotspots and the closest actual hotspots that we can generate using the data we've withheld from the model so far. We should come up with 5 distances in miles. We'll use the 'Manhattan Distance,' since that is more representative of driving distance for our case. Take the average distance of those 5, and we'll get a good idea of how well our model predicted hotspots.
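Expressed in code, the evaluation might look like this (a sketch; the miles-per-degree factors are rough conversions at Austin's latitude, which is my own simplification):

import numpy as np

MILES_PER_DEG_LAT = 69.0
MILES_PER_DEG_LON = 59.6   # roughly cos(30 degrees) * 69, reasonable near Austin

def manhattan_miles(a, b):
    """Approximate driving-style (Manhattan) distance between two (lat, lon) points, in miles."""
    return (abs(a[0] - b[0]) * MILES_PER_DEG_LAT +
            abs(a[1] - b[1]) * MILES_PER_DEG_LON)

def average_error(predicted, actual):
    """Average distance from each predicted hotspot to its closest actual hotspot."""
    return np.mean([min(manhattan_miles(p, a) for a in actual) for p in predicted])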

How Did We Do?

Let's go back to April 13th to see how we did. That bigger purple dot is a more precise look at where my model suggested drivers go. The blue dots are ride requests that happened between 10 and 10:30 that night. That blue one in the top right corner? That's me!

Running Random Tests

Now let's repeat this process of testing the distance in miles between our predicted hotspots and our actual hotspots. Below is the result of randomly testing my model at 120 different dates and seeing how well this process worked for each date.

Note: I split the dataset in half each time, only using the 'before' data to train the model so that there was no leakage of information.

Note2: There were often times when there was not enough predicted traffic to justify saying that there were any hotspots going to happen. If a Monday morning at 3AM only expects to see six ride requests, then I can't really say that any location would be a 'hot' spot to be in.

Via a hypothesis test, I can conclude that, on average, my model predicts hotspots within 2 miles of the actual hotspots that occurred afterwards. While 2 miles is good, I believe I could get closer if I added some more details about the rides.

Building a Tableau Dashboard

So how might I present this information to drivers? If you've ever looked at a weather radar, you know that in one setting it shows you the past hour's worth of the storm as it marches across your screen. If you hit 'future,' it will then show you where the model expects the storm to go.

I've organized a dashboard using Tableau to do the same. Here I have two gifs demonstrating looking at the past hour's worth of data and then presenting my expectations of where the demand is likely to be next.

Expected: 

That may work well for any city, but what if I know things specifically about areas in Austin that I want to take advantage of? Below you'll see how I provide the same previous vs. expectation format to specific areas like downtown, the airport, or the northern part of South Lamar street.

Expected: 

Notice how the northern part of South Lamar street has a high amount of demand, nearly $20 on average for fares (which is a good amount!), but only 30% of the requests are actually getting taken. What that tells me, the driver, is that I could get a good amount of money without having to wait very long, because there is demand there that is not getting met.

Double checking this, we look at the expected demand for that area, and we can see that it is expected to go up, with anywhere between 0 and 28 requests expected in the next half hour (that's a 95% confidence interval, by the way). So now I know that this is a good area to be in, and I'm going to sink my gas money into going there instead of waiting at the airport any longer for a bigger fare and fighting with all the other drivers for the next ride.

Limitations

As I noted before, there are certain time periods where the model does not work. Hard to say any spot could be 'hot' if there will only be a handful of ride requests happening in the whole of Austin.

Another downside may be that, since I let the hotspots roam wherever they want, the more precise locations can sometimes be in the river, or perhaps off-road. The general area might still be good, but I need to be careful not to tell drivers to submerge their car in the lake.

Lastly, it's important to consider herd behavior here. If I tell my drivers that one location is great, and they all flock there, I may be draining the supply of drivers in other locations. This might have an adverse effect on my ability to meet overall demand by focusing too hard on high-demand areas. Testing this for real would tell me more about where to set my threshold.

Where To Go From Here?

One bit of information I could use to improve this model is weather data, like percent chance of precipitation or temperature. Anyone who has ever tried to get a cab in New York City while it's raining knows that demand skyrockets and all the cabs are taken. Likewise, I'd also like to know more about incoming flights at the airport. If I can predict a certain number of flights coming in at the same time, I may suggest the airport is a better spot than usual.

As I mentioned earlier, a company in Japan was able to use a mobile user's real-time geolocation data to predict when and where they would ask for a ride. Some taxi drivers saw increases of 20% in business after using the model. While I don't have that kind of data, if I did, you can bet I would use it.

Something else I never got to try was weighing the unfulfilled rides more. If I care more about taking care of unfulfilled requests, my hotspots should favor those.

Final Thoughts

A huge thank you to the folks at RideAustin for allowing me to use their dataset for this project. Also, big thanks to Tableau for giving me a student license to learn how to use their software to build this awesome interactive app you see above.

If I could, I would go into greater detail in the ways I mentioned earlier with this project. Providing the drivers with this kind of information allows THEM to be the ones making decisions on where to go. They want to give people rides! What my model does is help them find where they are and where they might be next.

by swebster at July 24, 2017 04:59 PM

July 21, 2017

Continuum Analytics news

Galvanize Capstone Series: Geolocation of Twitter Users

Monday, July 24, 2017
Shawn Terryah
Guest Blogger

This post is part of our Galvanize Capstone featured projects. This post was written by Shawn Terryah and posted here with their permission. 

In June of this year, I completed the Data Science Immersive program at Galvanize in Austin, TX. The final few weeks of the program were dedicated to individual capstone projects of our choosing. I have a background in infectious disease epidemiology and, when I was in graduate school, there was a lot of interest in using things like Google search queries, Facebook posts, and tweets to try to track the spread of infectious diseases in real-time. One of the limitations to using Twitter is that only about 1% of tweets are geotagged with the tweet's location, which can make much of this work very difficult. For my capstone project, I chose to train a model using the 1% of tweets that are geotagged to predict the US city-level location of Twitter users who do not geotag their tweets. This is how I did it:

Streaming Training Tweets Using Tweepy

Tweepy is a Python wrapper for the Twitter API that allowed me to easily collect tweets in real-time and store them in MongoDB. The script below was run on an Amazon Web Services EC2 instance with 200 GiB of storage for roughly two weeks using tmux. By filtering based on location, I only received geotagged tweets with a known location to use for training the model.

import tweepy 
import json 
from pymongo import MongoClient 

class StreamListener(tweepy.StreamListener): 
    """tweepy.StreamListener is a class provided by tweepy used to access 
    the Twitter Streaming API to collect tweets in real-time. 
    """ 
    def on_connect(self): 
        """Called when the connection is made""" 
        print("You're connected to the streaming server.") 

    def on_error(self, status_code): 
        """This is called when an error occurs""" 
        print('Error: ' + repr(status_code)) 
        return False 

    def on_data(self, data): 
        """This will be called each time we receive stream data""" 
        client = MongoClient() 

        # I stored the tweet data in a database called 'training_tweets' in MongoDB, if 
        # 'training_tweets' does not exist it will be created for you. 
        db = client.training_tweets 

        # Decode JSON 
        datajson = json.loads(data) 

        # I'm only storing tweets in English. I stored the data for these tweets in a collection 
        # called 'training_tweets_collection' of the 'training_tweets' database. If 
        # 'training_tweets_collection' does not exist it will be created for you. 
        if "lang" in datajson and datajson["lang"] == "en": 
            db.training_tweets_collection.insert_one(datajson) 

if __name__ == "__main__": 
    # These are provided to you through the Twitter API after you create an account 
    consumer_key = "your_consumer_key" 
    consumer_secret = "your_consumer_secret" 
    access_token = "your_access_token" 
    access_token_secret = "your_access_token_secret" 

    auth1 = tweepy.OAuthHandler(consumer_key, consumer_secret) 
    auth1.set_access_token(access_token, access_token_secret) 

    # LOCATIONS are the longitude, latitude coordinate corners for a box that restricts the 
    # geographic area from which you will stream tweets. The first two define the southwest 
    # corner of the box and the second two define the northeast corner of the box. 
    LOCATIONS = [-124.7771694, 24.520833, -66.947028, 49.384472,     # Contiguous US 
                 -164.639405, 58.806859, -144.152365, 71.76871,      # Alaska 
                 -160.161542, 18.776344, -154.641396, 22.878623]     # Hawaii 

    stream_listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True)) 
    stream = tweepy.Stream(auth=auth1, listener=stream_listener) 
    stream.filter(locations=LOCATIONS)

Feature Selection, Feature Engineering, and Data Cleaning

Feature Selection

At the end of two weeks, I had collected data from over 21 million tweets from over 15,500 cities. In addition to the tweet itself, the API provides a number of other fields. These are the fields I used to build the model:

FIELD (TYPE): DESCRIPTION

'text' (String): The actual UTF-8 text of the tweet
'country_code' (String): Country code representing the country that the tweet was sent from
'full_name' (String): Full representation of the place the tweet was sent from. For the US, often in the form of 'City, State,' but not always.
'coordinates' (Array of Array of Array of Float): A series of longitude and latitude points that define a bounding box from where the tweet was sent
'screen_name' (String): The screen name chosen by the user
'favourites_count' (Int): The number of tweets this user has liked in the account's lifetime
'followers_count' (Int): The number of followers the user currently has
'statuses_count' (Int): The number of tweets (including retweets) issued by the user
'friends_count' (Int): The number of users the user is following (AKA their "followings")
'listed_count' (Int): The number of public lists the user is a member of
'location' (String, Nullable): The user-defined location for the account's profile, which is not necessarily a geographic location (e.g., 'the library,' 'watching a movie,' 'in my own head,' 'The University of Texas')
'created_at' (String): The UTC datetime of when the tweet was issued
'utc_offset' (Int, Nullable): The offset from GMT/UTC in seconds, based on the time zone that the user selects for their profile

To pull these fields I first exported the data from MongoDB as a json file:

$ mongoexport --db training_tweets --collection training_tweets_collection --out training_tweets.json

I then converted training_tweets.json to a csv file and pulled only the fields from the table above:

import json 
import unicodecsv as csv     # unicodecsv ensures that emojis are preserved 

def tweets_json_to_csv(file_list, csv_output_file): 
    ''' 
    INPUT: list of JSON files 
    OUTPUT: single CSV file 

    This function takes a list of JSON files containing tweet data and reads 
    each file line by line, parsing the relevant fields, and writing it to a CSV file. 
    ''' 

    count = 0 
    f = csv.writer(open(csv_output_file, "wb+")) 

    # Column names 
    f.writerow(['tweet', # relabeled: the API calls this 'text' 
                'country_code', 
                'geo_location', # relabeled: the API calls this 'full_name' 
                'bounding_box', 
                'screen_name', 
                'favourites_count', 
                'followers_count', 
                'statuses_count', 
                'friends_count', 
                'listed_count', 
                'user_described_location', # relabeled: the API calls this 'location' 
                'created_at', 
                'utc_offset']) 

    for file_ in file_list: 
        with open(file_, "r") as r: 
            for line in r: 
                try: 
                    tweet = json.loads(line) 
                except: 
                    continue 
                if tweet and tweet['place'] != None: 
                    f.writerow([tweet['text'], 
                                tweet['place']['country_code'], 
                                tweet['place']['full_name'], 
                                tweet['place']['bounding_box']['coordinates'], 
                                tweet['user']['screen_name'], 
                                tweet['user']['favourites_count'], 
                                tweet['user']['followers_count'], 
                                tweet['user']['statuses_count'], 
                                tweet['user']['friends_count'], 
                                tweet['user']['listed_count'], 
                                tweet['user']['location'], 
                                tweet['created_at'], 
                                tweet['user']['utc_offset']]) 
                    count += 1 

                    # Status update 
                    if count % 100000 == 0: 
                        print 'Just stored tweet #{}'.format(count) 

if __name__ == "__main__": 

    tweets_json_to_csv(['training_tweets.json'], 'training_tweets.csv')

From this point forward, I was able to read and manipulate the csv file as a pandas DataFrame:

import pandas as pd 

df = pd.read_csv('training_tweets.csv', encoding='utf-8') # 'utf-8' ensures that emojis are preserved

Feature Engineering

'centroid'

Instead of providing the exact latitude and longitude of the tweet, the Twitter API provides a polygonal bounding box of coordinates that encloses the place where the tweet was sent. To plot the tweets on a map and perform other functions, I found the centroid of each bounding box:

def find_centroid(row): 
    ''' 
    Helper function to return the centroid of a polygonal bounding box of longitude, latitude coordinates 
    ''' 

    try: 
        row_ = eval(row) 
        lst_of_coords = [item for sublist in row_ for item in sublist] 
        longitude = [p[0] for p in lst_of_coords] 
        latitude = [p[1] for p in lst_of_coords] 
        return (sum(latitude) / float(len(latitude)), sum(longitude) / float(len(longitude))) 
    except: 
        return None 

# Create a new column called 'centroid' 
df['centroid'] = map(lambda row: find_centroid(row), df['bounding_box'])

Using the centroids, I was able to plot the training tweets on a map using the Matplotlib Basemap Toolkit. Below is the code for generating a plot of the tweets that originated in or around the contiguous US. The same was also done for Alaska and Hawaii.

from mpl_toolkits.basemap import Basemap 
import matplotlib.pyplot as plt 

def plot_contiguous_US_tweets(lon, lat, file_path): 
    ''' 
    INPUT: List of longitudes (lon), list of latitudes (lat), file path to save the plot (file_path) 
    OUTPUT: Plot of tweets in the contiguous US. 
    ''' 

    map = Basemap(projection='merc', 
                  resolution = 'h',  
                  area_thresh = 10000, 
                  llcrnrlon=-140.25, # lower left corner longitude of contiguous US 
                  llcrnrlat=5.0, # lower left corner latitude of contiguous US 
                  urcrnrlon=-56.25, # upper right corner longitude of contiguous US 
                  urcrnrlat=54.75) # upper right corner latitude of contiguous US 

    x,y = map(lon, lat) 

    map.plot(x, y, 'bo', markersize=2, alpha=.3) 
    map.drawcoastlines() 
    map.drawstates() 
    map.drawcountries() 
    map.fillcontinents(color = '#DAF7A6', lake_color='#a7cdf2') 
    map.drawmapboundary(fill_color='#a7cdf2') 
    plt.gcf().set_size_inches(15,15) 
    plt.savefig(file_path, format='png', dpi=1000)

The resulting plots for the contiguous US, Alaska, and Hawaii were joined in Photoshop and are shown on the left. The plot on the right is from the Vulcan Project at Purdue University and shows carbon footprints in the contiguous US. As you can see, the plots are very similar, providing an indication that streaming tweets in this way provides a representative sample of the US population in terms of geographic location.

'tweet_time_secs'

The field 'created_at' is the UTC datetime of when the tweet was issued. Here is an example:

u'created_at': u'Sun Apr 30 01:23:27 +0000 2017'

I was interested in the UTC time, rather than the date, that a tweet was sent, because there are likely geographic differences in these values. I therefore parsed this information from the time stamp and reported this value in seconds.

from dateutil import parser 

def get_seconds(row): 
    ''' 
    Helper function to parse time from a datetime stamp and return the time in seconds 
    ''' 

    time_str = parser.parse(row).strftime('%H:%M:%S') 
    h, m, s = time_str.split(':') 
    return int(h) * 3600 + int(m) * 60 + int(s) 

# Create a new column called 'tweet_time_secs' 
df['tweet_time_secs'] = map(lambda row: get_seconds(row), df['created_at'])

Data Cleaning

Missing Data

Both 'user_described_location' (note: the API calls this field 'location') and 'utc_offset' are nullable fields that frequently contain missing values. When this was the case, I filled them in with indicator values:

df['user_described_location'].fillna('xxxMISSINGxxx', inplace=True) 
df['utc_offset'].fillna(999999, inplace=True)

Additionally, a small percentage of tweets contained missing values for 'country_code.' When this or other information was missing, I chose to drop the entire row:

df.dropna(axis=0, inplace=True)

Tweets Outside the US

The bounding box I used to stream the tweets included areas outside the contiguous US. Since the goal for this project was to predict the US city-level location of Twitter users, I relabeled tweets that originated from outside the US. For these tweets 'country_code' was relabeled to 'NOT_US' and 'geo_location' (note: the API calls this field 'full_name') was relabeled to 'NOT_IN_US, NONE':

def relabel_geo_locations(row): 
    ''' 
    Helper function to relabel the geo_locations from tweets outside the US 
    to 'NOT_IN_US, NONE' 
    ''' 

    if row['country_code'] == 'US': 
        return row['geo_location'] 
    else: 
        return 'NOT_IN_US, NONE' 

# Relabel 'country_code' for tweets outside the US to 'NOT_US' 
df['country_code'] = map(lambda cc: cc if cc == 'US' else 'NOT_US', df['country_code']) 

# Relabel 'geo_location' for tweets outside the US to 'NOT_IN_US, NONE' 
df['geo_location'] = df.apply(lambda row: relabel_geo_locations(row), axis=1)

Tweets Lacking a 'City, State' Location Label

Most tweets that originated in the US had a 'geo_location' in the form of 'City, State' (e.g., 'Austin, TX'). For some tweets, however, the label was less specific and in the form of 'State, Country' (e.g., 'Texas, USA') or, even worse, in the form of a totally unique value (e.g., 'Tropical Tan'). Since this data was going to be used to train the model, I wanted to have as granular of a label as possible for each tweet. Therefore, I only kept tweets that were in the form of 'City, State' and dropped all others:

def geo_state(row): 
    ''' 
    Helper function to parse the state code for 'geo_location' labels 
    in the form of 'City, State' 
    ''' 

    try: 
        return row['geo_location'].split(', ')[1] 
    except: 
        return None 

# Create a new column called 'geo_state' 
df['geo_state'] = df.apply(lambda row: geo_state(row),axis=1) 

# The 'geo_state' column will contain null values for any row where 'geo_location' was not 
# comma separated (e.g., 'Tropical Tan'). We drop those rows here: 
df.dropna(axis=0, inplace=True) 

# list of valid geo_state labels. "NONE" is the label I created for tweets outside the US 
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", "ID", 
          "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", 
          "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", 
          "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY", "NONE"] 

# Keep only rows with a valid geo_state, among others this will drop all rows that had 
# a 'geo_location' in the form of 'State, Country' (e.g., 'Texas, USA') 
df = df[df['geo_state'].isin(states)]

Aggregating the Tweets by User

During the two week collection period, many users tweeted more than once. To prevent potential leakage, I grouped the tweets by user ('screen_name'), then aggregated the remaining fields.

import numpy as np 
from collections import Counter 

# aggregation functions 
agg_funcs = {'tweet' : lambda x: ' '.join(x), 
             'geo_location' : lambda x: Counter(x).most_common(1)[0][0], 
             'geo_state' : lambda x: Counter(x).most_common(1)[0][0], 
             'user_described_location': lambda x: Counter(x).most_common(1)[0][0], 
             'utc_offset': lambda x: Counter(x).most_common(1)[0][0], 
             'country_code': lambda x: Counter(x).most_common(1)[0][0], 
             'centroid': lambda x: Counter(x).most_common(1)[0][0],  # keep the most common centroid; it is needed later 
             'tweet_time_secs' : np.median, 
             'statuses_count': np.max, 
             'friends_count' :np.mean, 
             'favourites_count' : np.mean, 
             'listed_count' : np.mean, 
             'followers_count' : np.mean} 

# Groupby 'screen_name' and then apply the aggregation functions in agg_funcs 
df = df.groupby(['screen_name']).agg(agg_funcs).reset_index()

Remapping the Training Tweets to the Closest Major City

Since the training tweets came from over 15,500 cities, and I didn't want to do a 15,500-wise classification problem, I used the centroids to remap all the training tweets to their closest major city from a list of 378 major US cities based on population (plus the single label for tweets outside the US, which used Toronto's coordinates). This left me with a 379-wise classification problem. Here is a plot of those major cities and the code to remap all the US training tweets to their closest major US city:

import numpy as np 
import pickle 

def load_US_coord_dict(): 
    ''' 
    Input: n/a 
    Output: A dictionary whose keys are the location names ('City, State') of the 
    378 US classification locations and the values are the centroids for those locations 
    (latitude, longitude) 
    ''' 

    pkl_file = open("US_coord_dict.pkl", 'rb') 
    US_coord_dict = pickle.load(pkl_file) 
    pkl_file.close() 
    return US_coord_dict 

def find_dist_between(tup1, tup2): 
    ''' 
    INPUT: Two tuples of latitude, longitude coordinates pairs for two cities 
    OUTPUT: The distance between the cities 
    ''' 

    return np.sqrt((tup1[0] - tup2[0])**2 + (tup1[1] - tup2[1])**2) 

def closest_major_city(tup): 
    ''' 
    INPUT: A tuple of the centroid coordinates for the tweet to remap to the closest major city 
    OUTPUT: String, 'City, State', of the city in the dictionary 'coord_dict' that is closest to the input city 
    ''' 

    d = {} 
    for key, value in US_coord_dict.iteritems(): 
        dist = find_dist_between(tup, value) 
        d[key] = dist 
    # Return the closest city only after every city has been compared 
    return min(d, key=d.get) 

def get_closest_major_city_for_US(row): 
    ''' Helper function to return the closest major city for US users only. For users 
    outside the US it returns 'NOT_IN_US, NONE' 
    ''' 

    if row['geo_location'] == 'NOT_IN_US, NONE': 
        return 'NOT_IN_US, NONE' 
    else: 
        return closest_major_city(row['centroid']) 

if __name__ == "__main__": 

    # Load US_coord_dict 
    US_coord_dict = load_US_coord_dict() 

    # Create a new column called 'closest_major_city' 
    df['closest_major_city'] = df.apply(lambda row: get_closest_major_city_for_US(row), axis=1)
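
The creation of US_coord_dict.pkl is not shown in the post. For reference, here is a minimal sketch, assuming a hypothetical CSV file 'major_us_cities.csv' with 'city_state', 'latitude', and 'longitude' columns for the 378 major cities, of how such a dictionary (and the 379-label coord_dict used later for scoring) could be built:

import pickle 
import pandas as pd 

# Hypothetical input file: one row per major US city with its name ('City, State') 
# and centroid coordinates 
cities = pd.read_csv('major_us_cities.csv') 

US_coord_dict = {row['city_state']: (row['latitude'], row['longitude']) 
                 for _, row in cities.iterrows()} 

with open('US_coord_dict.pkl', 'wb') as f: 
    pickle.dump(US_coord_dict, f) 

# The 379-label dictionary used for scoring adds the single non-US label, 
# which the post says used Toronto's coordinates 
coord_dict = dict(US_coord_dict) 
coord_dict['NOT_IN_US, NONE'] = (43.6532, -79.3832)  # approximate Toronto centroid 

with open('coord_dict.pkl', 'wb') as f: 
    pickle.dump(coord_dict, f) 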

 

Building the Predictive Model

The steps below were run on an Amazon Web Services r3.8xlarge EC2 instance with 244 GiB of memory. Here is a high-level overview of the final model:

High-level Overview of the Stacked Model
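
In words: two base-level linear SVCs are trained on one half of the data (one on the user-described location text, one on the tweet text), their decision-function outputs for the other half are stacked together with the numeric account features, and a meta-level random forest is trained on that combined matrix. The condensed sketch below (my summary, not the original code) shows the prediction path once the objects from Steps 1-4 exist:

import numpy as np 

def stacked_predict(location_texts, tweet_texts, numeric_features, 
                    location_vectorizer, location_SVC, 
                    tweet_vectorizer, tweet_SVC, meta_RF): 
    ''' 
    Sketch of the prediction path through the stacked model: vectorize the two 
    text inputs, take the base SVCs' decision functions, stack them with the 
    numeric features, and let the meta random forest make the final call. 
    ''' 
    loc_scores = location_SVC.decision_function(location_vectorizer.transform(location_texts)) 
    twt_scores = tweet_SVC.decision_function(tweet_vectorizer.transform(tweet_texts)) 
    meta_X = np.hstack((loc_scores, twt_scores, numeric_features)) 
    return meta_RF.predict(meta_X) 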

Step 1: Load dependencies and prepare the cleaned data for model fitting

import pandas as pd 
import numpy as np 
import nltk 
from nltk.tokenize import TweetTokenizer 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.externals import joblib 

# Tokenizer to use for text vectorization 
def tokenize(tweet): 
    tknzr = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False) 
    return tknzr.tokenize(tweet) 

# Read cleaned training tweets file into pandas and randomize it 
df = pd.read_pickle('cleaned_training_tweets.pkl') 
randomized_df = df.sample(frac=1, random_state=111) 

# Split randomized_df into two disjoint sets 
half_randomized_df = randomized_df.shape[0] / 2 
base_df = randomized_df.iloc[:half_randomized_df, :] # used to train the base classifiers 
meta_df = randomized_df.iloc[half_randomized_df:, :] # used to train the meta classifier 

# Create variables for the known geotagged locations from each set 
base_y = base_df['closest_major_city'].values 
meta_y = meta_df['closest_major_city'].values

Step 2: Train a base-level Linear SVC classifier on the user described locations

# Raw text of user described locations 
base_location_doc = base_df['user_described_location'].values 
meta_location_doc = meta_df['user_described_location'].values 

# fit_transform a tf-idf vectorizer using base_location_doc and use it to transform meta_location_doc 
location_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize, ngram_range=(1,2)) 
base_location_X = location_vectorizer.fit_transform(base_location_doc.ravel()) 
meta_location_X = location_vectorizer.transform(meta_location_doc) 

# Fit a Linear SVC Model with 'base_location_X' and 'base_y'. Note: it is important to use 
# balanced class weights otherwise the model will overwhelmingly favor the majority class. 
location_SVC = LinearSVC(class_weight='balanced') 
location_SVC.fit(base_location_X, base_y) 

# We can now pass meta_location_X into the fitted model and save the decision 
# function, which will be used in Step 4 when we train the meta random forest 
location_SVC_decsfunc = location_SVC.decision_function(meta_location_X) 

# Pickle the location vectorizer and the linear SVC model for future use 
joblib.dump(location_vectorizer, 'USER_LOCATION_VECTORIZER.pkl') 
joblib.dump(location_SVC, 'USER_LOCATION_SVC.pkl')

Step 3: Train a base-level Linear SVC classifier on the tweets

# Raw text of tweets 
base_tweet_doc = base_df['tweet'].values 
meta_tweet_doc = meta_df['tweet'].values 

# fit_transform a tf-idf vectorizer using base_tweet_doc and use it to transform meta_tweet_doc 
tweet_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize) 
base_tweet_X = tweet_vectorizer.fit_transform(base_tweet_doc.ravel()) 
meta_tweet_X = tweet_vectorizer.transform(meta_tweet_doc) 

# Fit a Linear SVC Model with 'base_tweet_X' and 'base_y'. Note: it is important to use 
# balanced class weights otherwise the model will overwhelmingly favor the majority class. 
tweet_SVC = LinearSVC(class_weight='balanced') 
tweet_SVC.fit(base_tweet_X, base_y) 

# We can now pass meta_tweet_X into the fitted model and save the decision 
# function, which will be used in Step 4 when we train the meta random forest 
tweet_SVC_decsfunc = tweet_SVC.decision_function(meta_tweet_X) 

# Pickle the tweet vectorizer and the linear SVC model for future use 
joblib.dump(tweet_vectorizer, 'TWEET_VECTORIZER.pkl') 
joblib.dump(tweet_SVC, 'TWEET_SVC.pkl')

Step 4: Train a meta-level Random Forest classifier

# additional features from meta_df to pull into the final model 
friends_count = meta_df['friends_count'].values.reshape(meta_df.shape[0], 1) 
utc_offset = meta_df['utc_offset'].values.reshape(meta_df.shape[0], 1) 
tweet_time_secs = meta_df['tweet_time_secs'].values.reshape(meta_df.shape[0], 1) 
statuses_count = meta_df['statuses_count'].values.reshape(meta_df.shape[0], 1) 
favourites_count = meta_df['favourites_count'].values.reshape(meta_df.shape[0], 1) 
followers_count = meta_df['followers_count'].values.reshape(meta_df.shape[0], 1) 
listed_count = meta_df['listed_count'].values.reshape(meta_df.shape[0], 1) 

# np.hstack these additional features together 
add_features = np.hstack((friends_count, 
                          utc_offset, 
                          tweet_time_secs, 
                          statuses_count, 
                          favourites_count, 
                          followers_count, 
                          listed_count)) 

# np.hstack the two decision function variables from steps 2 & 3 with add_features 
meta_X = np.hstack((location_SVC_decsfunc, # from Step 2 above 
                    tweet_SVC_decsfunc, # from Step 3 above 
                    add_features)) 

# Fit Random Forest with 'meta_X' and 'meta_y' 
meta_RF = RandomForestClassifier(n_estimators=60, n_jobs=-1) 
meta_RF.fit(meta_X, meta_y) 

# Pickle the meta Random Forest for future use 
joblib.dump(meta_RF, 'META_RF.pkl')

Testing the Model

Collecting and Preparing a Fresh Data Set

A week after I collected the training data set, I collected a fresh data set to use to evaluate the model. For this, I followed the same data collection and preparation procedures as above, with a few exceptions: 1) I only ran the Tweepy script for 48 hours, 2) I removed any users from the evaluation data set that were in the training data set, and 3) I went back to the Twitter API and pulled the 200 most recent tweets for each user that remained in the data set. Remember, the goal for the model is to predict the US city-level location of Twitter users, not individual tweets; therefore, by giving the model a larger corpus of tweets for each user, I hoped to increase the model's accuracy. Here is the script for pulling the 200 most recent tweets for each user:

import tweepy 
import pandas as pd 

# these are provided to you through the Twitter API after you create an account 
consumer_key = "your_consumer_key" 
consumer_secret = "your_consumer_secret" 
access_token = "your_access_token" 
access_token_secret = "your_access_token_secret" 

count = 0 

def get_200_tweets(screen_name): 
    ''' 
    Helper function to return a list of a Twitter user's 200 most recent tweets 
    ''' 

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 
    auth.set_access_token(access_token, access_token_secret) 
    api = tweepy.API(auth, 
                     wait_on_rate_limit=True, 
                     wait_on_rate_limit_notify=True) 

    # Initialize a list to hold the user's 200 most recent tweets 
    tweets_data = [] 

    global count 

    try: 
        # make request for most recent tweets (200 is the maximum allowed per distinct request) 
        recent_tweets = api.user_timeline(screen_name = screen_name, count=200) 

        # save data from most recent tweets 
        tweets_data.extend(recent_tweets) 

        count += 1 

        # Status update 
        if count % 1000 == 0: 
            print 'Just stored tweets for user #{}'.format(count) 

    except: 
        count += 1 
        pass 

    # pull only the tweet text and encode it in utf-8 to preserve emojis 
    list_of_recent_tweets = [tweet.text.encode("utf-8") for tweet in tweets_data] 

    return list_of_recent_tweets 

# Create a new column in evaluation_df called '200_tweets' 
evaluation_df['200_tweets'] = map(lambda x: get_200_tweets(x), evaluation_df['screen_name'])
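
For exception 2 above, the post does not show the de-duplication step; a minimal sketch, assuming the training tweets were pickled as 'cleaned_training_tweets.pkl' and the cleaned evaluation set as 'evaluation_df.pkl' (as in the surrounding scripts), could look like this:

import pandas as pd 

# Drop evaluation users whose screen names also appear in the training set 
training_df = pd.read_pickle('cleaned_training_tweets.pkl') 
evaluation_df = pd.read_pickle('evaluation_df.pkl') 

training_users = set(training_df['screen_name']) 
evaluation_df = evaluation_df[~evaluation_df['screen_name'].isin(training_users)] 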

Making Predictions on the Fresh Data Set

To make predictions on evaluation_df, the script below was run on the same Amazon Web Services r3.8xlarge EC2 instance that was used to build the model:

import pandas as pd 
import numpy as np 
from sklearn.externals import joblib 
import nltk 
from nltk.tokenize import TweetTokenizer 

def tokenize(tweet): 
    tknzr = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False) 
    return tknzr.tokenize(tweet) 

class UserLocationClassifier: 

    def __init__(self): 
        ''' 
        Load the stacked classifier's pickled vectorizers, base classifiers, and meta classifier 
        ''' 

        self.location_vectorizer = joblib.load('USER_LOCATION_VECTORIZER.pkl') 
        self.location_SVC = joblib.load('USER_LOCATION_SVC.pkl') 
        self.tweet_vectorizer = joblib.load('TWEET_VECTORIZER.pkl') 
        self.tweet_SVC = joblib.load('TWEET_SVC.pkl') 
        self.meta_RF = joblib.load('META_RF.pkl') 

    def predict(self, df): 
        ''' 
        INPUT: Cleaned and properly formatted dataframe to make predictions for 
        OUTPUT: Array of predictions 
        ''' 
        # Get text from 'user_described_location' column of DataFrame 
        location_doc = df['user_described_location'].values 

        # Convert the '200_tweets' column from a list of tweets to a single string, 
        # joining on a space to match how the training tweets were joined 
        df.loc[:, '200_tweets'] = map(lambda x: ' '.join(x), df['200_tweets']) 

        # Get text from '200_tweets' column of DataFrame 
        tweet_doc = df['200_tweets'].values 

        # Vectorize 'location_doc' and 'tweet_doc' 
        location_X = self.location_vectorizer.transform(location_doc.ravel()) 
        tweet_X = self.tweet_vectorizer.transform(tweet_doc.ravel()) 

        # Store decision functions for 'location_X' and 'tweet_X' 
        location_decision_function = self.location_SVC.decision_function(location_X) 
        tweet_decision_function = self.tweet_SVC.decision_function(tweet_X) 

        # Get additional features to pull into the Random Forest 
        friends_count = df['friends_count'].values.reshape(df.shape[0], 1) 
        utc_offset = df['utc_offset'].values.reshape(df.shape[0], 1) 
        tweet_time_secs = df['tweet_time_secs'].values.reshape(df.shape[0], 1) 
        statuses_count = df['statuses_count'].values.reshape(df.shape[0], 1) 
        favourites_count = df['favourites_count'].values.reshape(df.shape[0], 1) 
        followers_count = df['followers_count'].values.reshape(df.shape[0], 1) 
        listed_count = df['listed_count'].values.reshape(df.shape[0], 1) 

        # np.hstack additional features together 
        add_features = np.hstack((friends_count, 
                               utc_offset, 
                               tweet_time_secs, 
                               statuses_count, 
                               favourites_count, 
                               followers_count, 
                               listed_count)) 

        # np.hstack the two decision function variables with add_features 
        meta_X = np.hstack((location_decision_function, tweet_decision_function, add_features)) 

        # Feed meta_X into Random Forest and make predictions 
        return self.meta_RF.predict(meta_X) 

if __name__ == "__main__": 

        # Load evaluation_df into pandas DataFrame 
        evaluation_df = pd.read_pickle('evaluation_df.pkl') 

        # Load UserLocationClassifier 
        clf = UserLocationClassifier() 

        # Get predicted locations 
        predictions = clf.predict(evaluation_df) 

        # Create a new column called 'predicted_location' 
        evaluation_df.loc[:, 'predicted_location'] = predictions 

        # Pickle the resulting DataFrame with the location predictions 
        evaluation_df.to_pickle('evaluation_df_with_predictions.pkl')

Plotting the Locations of Twitter Users on a Map Using Bokeh

Here are some examples of how the model performed on a few selected cities. For each of the maps shown below, the dots indicate the user's true location, while the title of the map indicates where the model predicted them to be. As you can see, for each city there is a tight cluster in and around the correct location, with only a handful of extreme misses. Here is the code for generating these plots (note: the final plots shown here were constructed in Photoshop after first using the 'pan' and 'wheel_zoom' tools in Bokeh to capture screenshots of the contiguous US, Alaska, and Hawaii):

import pandas as pd 
import pickle 

from bokeh.plotting import figure, output_notebook, output_file, show 
from bokeh.tile_providers import STAMEN_TERRAIN 
output_notebook() 

from functools import partial 
from shapely.geometry import Point 
from shapely.ops import transform 
import pyproj 

# Web mercator bounding box for the US 
US = ((-13884029, -7453304), (2698291, 6455972)) 

x_range, y_range = US 
plot_width = int(900) 
plot_height = int(plot_width*7.0/12) 

def base_plot(tools='pan,wheel_zoom,reset',plot_width=plot_width, plot_height=plot_height, **plot_args): 
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height, 
        x_range=x_range, y_range=y_range, outline_line_color=None, 
        min_border=0, min_border_left=0, min_border_right=0, 
        min_border_top=0, min_border_bottom=0, **plot_args) 

    p.axis.visible = False 
    p.xgrid.grid_line_color = None 
    p.ygrid.grid_line_color = None 
    return p 

def plot_predictions_for_a_city(df, name_of_predictions_col, city): 
    ''' 
    INPUT: DataFrame with location predictions; name of column in DataFrame that 
    contains the predictions; city ('City, State') to plot predictions for 
    OUTPUT: Bokeh map that shows the actual location of all the users predicted to 
    be in the selected city 
    ''' 

    df_ = df[df[name_of_predictions_col] == city] 

    # Initialize two lists to hold all the latitudes and longitudes 
    all_lats = [] 
    all_longs = [] 

    # Pull all latitudes in 'centroid' column append to all_lats 
    for i in df_['centroid']: 
        all_lats.append(i[0]) 

    # Pull all longitudes in 'centroid' column append to all_longs 
    for i in df_['centroid']: 
        all_longs.append(i[1]) 

    # Initialize two lists to hold all the latitudes and longitudes 
    # converted to web mercator 
    all_x = [] 
    all_y = [] 

    # Convert latitudes and longitudes to web mercator x and y format 
    for i in xrange(len(all_lats)): 
        pnt = transform( 
            partial( 
                pyproj.transform, 
                pyproj.Proj(init='EPSG:4326'), 
                pyproj.Proj(init='EPSG:3857')), 
                Point(all_longs[i], all_lats[i])) 
        all_x.append(pnt.x) 
        all_y.append(pnt.y) 

    p = base_plot() 
    p.add_tile(STAMEN_TERRAIN) 
    p.circle(x=all_x, y=all_y, line_color=None, fill_color='#380474', size=15, alpha=.5) 
    output_file("stamen_toner_plot.html") 
    show(p) 

if __name__ == "__main__": 

    # Load pickled evaluation_df with location predictions 
    evaluation_df_with_predictions = pd.read_pickle('evaluation_df_with_predictions.pkl') 

    # Plot actual locations for users predicted to be in Eugene, OR 
    plot_predictions_for_a_city(evaluation_df_with_predictions, 'predicted_location', 'Eugene, OR')

Example 1: Eugene, OR

Example 2: Durham, NC

Example 3: Shreveport, LA

Tweet Term Importances for these Cities

To get an idea of which tweet terms were important for predicting these cities, I calculated mean tf-idf values for each city. Below are some of the more interesting terms. To generate these plots, I followed an excellent guide written by Thomas Buhrmann.
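
The plotting code itself is not included in the post; a minimal sketch of the mean tf-idf calculation, assuming the pickled tweet vectorizer from Step 3 and the evaluation DataFrame with its 'predicted_location' and '200_tweets' columns, might look like this:

import numpy as np 
import pandas as pd 
from nltk.tokenize import TweetTokenizer 
from sklearn.externals import joblib 

# The pickled vectorizer references a top-level 'tokenize' function, so it 
# must be defined before loading (same definition as in the other scripts) 
def tokenize(tweet): 
    tknzr = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False) 
    return tknzr.tokenize(tweet) 

def top_mean_tfidf_terms(df, vectorizer, city, n_terms=25): 
    ''' 
    Return the n_terms with the highest mean tf-idf among the users 
    predicted to be in the given city. 
    ''' 
    docs = df[df['predicted_location'] == city]['200_tweets'].values 
    X = vectorizer.transform(docs)                     # (n_users x n_terms) sparse matrix 
    mean_tfidf = np.asarray(X.mean(axis=0)).ravel()    # mean tf-idf per term 
    terms = np.array(vectorizer.get_feature_names()) 
    top = mean_tfidf.argsort()[::-1][:n_terms] 
    return pd.DataFrame({'term': terms[top], 'mean_tfidf': mean_tfidf[top]}) 

if __name__ == "__main__": 
    tweet_vectorizer = joblib.load('TWEET_VECTORIZER.pkl') 
    evaluation_df = pd.read_pickle('evaluation_df_with_predictions.pkl') 
    print(top_mean_tfidf_terms(evaluation_df, tweet_vectorizer, 'Eugene, OR')) 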

Emoji Skin Tone Modifications

One of the more interesting things to fall out of the model was the colored boxes shown above. These represent the skin tone modifications you can add to certain emojis. For most emojis there was not a strong geographic signal; however, for the skin tone modifications there was. As you can see in the term importances plots, Twitter users in Eugene, OR, tended to use lighter colored skin tone modifications while users in Durham, NC, and Shreveport, LA, tended to use darker skin tone modifications.

Scoring the Model

Median Error: 49.6 miles

To score the model I chose to use median error, which came out to be 49.6 miles. This was calculated by using the centroids to find the great-circle distance between the predicted city and the true location. Here is how it was calculated (note: if the user was predicted to be in the correct city, the error was scored as 0.0 miles, regardless of the actual distance between the centroids):

from math import sin, cos, sqrt, atan2, radians 
import pickle 
import pandas as pd 

def load_coord_dict(): 
    ''' 
    Input: n/a 
    Output: A dictionary whose keys are the location names ('City, State') of the 
    379 classification labels and the values are the centroids for those locations 
    (latitude, longitude) 
    ''' 

    pkl_file = open("coord_dict.pkl", 'rb') 
    coord_dict = pickle.load(pkl_file) 
    pkl_file.close() 
    return coord_dict 

def compute_error_in_miles(zipped_predictions): 
    ''' 
    INPUT: Tuple in the form of (predicted city, centroid of true location) 
    OUTPUT: Float of the great-circle error distance between the predicted city 
    and the true location. 
    ''' 

    radius = 3959 # approximate radius of earth in miles 

    predicted_city = zipped_predictions[0] 
    actual_centroid = zipped_predictions[1] 

    lat1 = radians(coord_dict[predicted_city][0]) 
    lon1 = radians(coord_dict[predicted_city][1]) 
    lat2 = radians(actual_centroid[0]) 
    lon2 = radians(actual_centroid[1]) 

    delta_lon = lon2 - lon1 
    delta_lat = lat2 - lat1 

    a = sin(delta_lat / 2)**2 + cos(lat1) * cos(lat2) * sin(delta_lon / 2)**2 
    c = 2 * atan2(sqrt(a), sqrt(1 - a)) 

    error_distance = radius * c 
    return error_distance 

def correct_outside_the_us_errors(row): 
    ''' 
    Helper function to correct the errors to 0.0 for the users that were correctly predicted. This 
    is especially important for users outside the US since they were all given the 
    same label ('NOT_IN_US, NONE') even though their centroids were all different. 
    ''' 

    if row['predicted_location'] == row['geo_location']: 
        error = 0.0 
    else: 
        error = row['error_in_miles'] 
    return error 

if __name__ == "__main__": 

    # Load coord_dict 
    coord_dict = load_coord_dict() 

    # Load the pickled evaluation_df with its location predictions 
    evaluation_df = pd.read_pickle('evaluation_df_with_predictions.pkl') 
    predictions = evaluation_df['predicted_location'].values 

    centroid_of_true_location = evaluation_df['centroid'].values 
    zipped_predictions = zip(predictions, centroid_of_true_location) 

    # Create a new column with the error value for each prediction 
    evaluation_df['error_in_miles'] = map(lambda x: compute_error_in_miles(x), zipped_predictions) 

    # Change the error of correct predictions to 0.0 miles 
    evaluation_df['error_in_miles'] = evaluation_df.apply(lambda x: 
                                                          correct_outside_the_us_errors(x), 
                                                           axis=1)

    median_error = evaluation_df['error_in_miles'].median() 
    print 'Median error: {} miles'.format(median_error) 

Histogram of Error Distances
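
Such a histogram can be drawn directly from the 'error_in_miles' column; here is a minimal matplotlib sketch, assuming the evaluation_df produced by the scoring script above:

import matplotlib.pyplot as plt 

# Histogram of per-user error distances 
plt.hist(evaluation_df['error_in_miles'].values, bins=50) 
plt.xlabel('Error distance (miles)') 
plt.ylabel('Number of users') 
plt.title('Histogram of Error Distances') 
plt.savefig('error_histogram.png', dpi=300) 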

Influence of Tweet Number on the Model's Accuracy

Recall that, for each user I wanted to make a prediction on, I went back to the API and pulled 200 of their most recent tweets. The plot below was generated using the same procedure as above with increasing numbers of tweets for each user. I originally chose 200 because this is the maximum number the API allows you to pull per distinct request. However, as you can see in the plot below, after about 100 tweets there is negligible improvement in the model's accuracy, meaning for future use it might not be necessary to pull so many tweets for each user.
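
The analysis code for this plot is not shown in the post; a minimal sketch of how it could be done, assuming the UserLocationClassifier, compute_error_in_miles, correct_outside_the_us_errors, and coord_dict objects from the scripts above, and an evaluation_df whose '200_tweets' column still holds a list of tweets per user:

# For each tweet limit n, truncate every user's tweet list, re-run the 
# stacked classifier, and recompute the median error in miles 
clf = UserLocationClassifier() 

for n in [10, 25, 50, 100, 150, 200]: 
    df_n = evaluation_df.copy() 
    df_n['200_tweets'] = df_n['200_tweets'].apply(lambda tweets: tweets[:n]) 

    predictions_n = clf.predict(df_n) 
    df_n['predicted_location'] = predictions_n 

    zipped = zip(predictions_n, df_n['centroid'].values) 
    df_n['error_in_miles'] = [compute_error_in_miles(z) for z in zipped] 
    df_n['error_in_miles'] = df_n.apply(correct_outside_the_us_errors, axis=1) 

    print 'n = {}: median error = {:.1f} miles'.format(n, df_n['error_in_miles'].median()) 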

Final Notes

While a median error of 49.6 miles is pretty good, there is still plenty of room for improvement. Running the Tweepy streaming script for a longer period of time and collecting more training data would likely give an immediate improvement. With more training data, you could also include more than 379 classification labels, which would further decrease the median error of the model. That said, given the time constraints of the project, I'm satisfied with the current model's accuracy and think it could be a valuable resource for many projects where an extremely granular estimate of a Twitter user's location is not required.

by swebster at July 21, 2017 04:45 PM

Continuum Analytics

Galvanize Capstone Series: Geolocation of Twitter Users

This post is part of our Galvanize Capstone featured projects. This post was written by Shawn Terryah and posted here with his permission. 

In June of this year, I completed the Data Science Immersive program at Galvanize in Austin, TX. The final few weeks of the program were dedicated to individual capstone projects of our choosing. I have a background in infectious disease epidemiology and, when I was in graduate school, there was a lot of interest in using things like Google search queries, Facebook posts, and tweets to try to track the spread of infectious diseases in real-time. One of the limitations to using Twitter is that only about 1% of tweets are geotagged with the tweet’s location, which can make much of this work very difficult. For my capstone project, I chose to train a model using the 1% of tweets that are geotagged to predict the US city-level location of Twitter users who do not geotag their tweets. This is how I did it:

Streaming Training Tweets Using Tweepy

Tweepy is a Python wrapper for the Twitter API that allowed me to easily collect tweets in real-time and store them in MongoDB. The script below was run on an Amazon Web Services EC2 instance with 200 GiB of storage for roughly two weeks using tmux. By filtering based on location, I only received geotagged tweets with a known location to use for training the model.

import tweepy 
import json 
from pymongo import MongoClient 

class StreamListener(tweepy.StreamListener): 
    """tweepy.StreamListener is a class provided by tweepy used to access 
    the Twitter Streaming API to collect tweets in real-time. 
    """ 
    def on_connect(self): 
        """Called when the connection is made""" 
        print("You're connected to the streaming server.") 

    def on_error(self, status_code): 
        """This is called when an error occurs""" 
        print('Error: ' + repr(status_code)) 
        return False 

    def on_data(self, data): 
        """This will be called each time we receive stream data""" 
        client = MongoClient() 

        # I stored the tweet data in a database called 'training_tweets' in MongoDB, if 
        # 'training_tweets' does not exist it will be created for you. 
        db = client.training_tweets 

        # Decode JSON 
        datajson = json.loads(data) 

        # I'm only storing tweets in English. I stored the data for these tweets in a collection 
        # called 'training_tweets_collection' of the 'training_tweets' database. If 
        # 'training_tweets_collection' does not exist it will be created for you. 
        if "lang" in datajson and datajson["lang"] == "en": 
            db.training_tweets_collection.insert_one(datajson) 

if __name__ == "__main__": 
    # These are provided to you through the Twitter API after you create an account 
    consumer_key = "your_consumer_key" 
    consumer_secret = "your_consumer_secret" 
    access_token = "your_access_token" 
    access_token_secret = "your_access_token_secret" 

    auth1 = tweepy.OAuthHandler(consumer_key, consumer_secret) 
    auth1.set_access_token(access_token, access_token_secret) 

    # LOCATIONS are the longitude, latitude coordinate corners for a box that restricts the 
    # geographic area from which you will stream tweets. The first two define the southwest 
    # corner of the box and the second two define the northeast corner of the box. 
    LOCATIONS = [-124.7771694, 24.520833, -66.947028, 49.384472,     # Contiguous US 
                 -164.639405, 58.806859, -144.152365, 71.76871,      # Alaska 
                 -160.161542, 18.776344, -154.641396, 22.878623]     # Hawaii 

    stream_listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True)) 
    stream = tweepy.Stream(auth=auth1, listener=stream_listener) 
    stream.filter(locations=LOCATIONS)
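
While the stream is running, a quick sanity check (a sketch, using the database and collection names defined above) can confirm that tweets are accumulating in MongoDB:

from pymongo import MongoClient 

client = MongoClient() 
# Count the documents stored so far in training_tweets.training_tweets_collection 
print client.training_tweets.training_tweets_collection.count() 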

Feature Selection, Feature Engineering, and Data Cleaning

Feature Selection

At the end of two weeks, I had collected data from over 21 million tweets from over 15,500 cities. In addition to the tweet itself, the API provides a number of other fields. These are the fields I used to build the model:

FIELD (TYPE): DESCRIPTION

‘text’ (String): The actual UTF-8 text of the tweet
‘country_code’ (String): Country code representing the country that the tweet was sent from
‘full_name’ (String): Full representation of the place the tweet was sent from. For the US, often in the form of ‘City, State,’ but not always.
‘coordinates’ (Array of Array of Array of Float): A series of longitude and latitude points that define a bounding box from where the tweet was sent
‘screen_name’ (String): The screen name chosen by the user
‘favourites_count’ (Int): The number of tweets this user has liked in the account’s lifetime
‘followers_count’ (Int): The number of followers the user currently has
‘statuses_count’ (Int): The number of tweets (including retweets) issued by the user
‘friends_count’ (Int): The number of users the user is following (AKA their “followings”)
‘listed_count’ (Int): The number of public lists the user is a member of
‘location’ (String): The user-defined location for the account’s profile, which is not necessarily a geographic location (e.g., ‘the library,’ ‘watching a movie,’ ‘in my own head,’ ‘The University of Texas’) (Nullable)
‘created_at’ (String): The UTC datetime of when the tweet was issued
‘utc_offset’ (Int): The offset from GMT/UTC in seconds based on the time zone that the user selects for their profile (Nullable)

To pull these fields I first exported the data from MongoDB as a json file:

$ mongoexport --db training_tweets --collection training_tweets_collection --out training_tweets.json

I then converted training_tweets.json to a csv file and pulled only the fields from the table above:

import json 
import unicodecsv as csv     # unicodecsv ensures that emojis are preserved 

def tweets_json_to_csv(file_list, csv_output_file): 
    ''' 
    INPUT: list of JSON files 
    OUTPUT: single CSV file 

    This function takes a list of JSON files containing tweet data and reads 
    each file line by line, parsing the relevant fields, and writing them to a CSV file. 
    ''' 

    count = 0 
    f = csv.writer(open(csv_output_file, "wb+")) 

    # Column names 
    f.writerow(['tweet', # relabeled: the API calls this 'text' 
                'country_code', 
                'geo_location', # relabeled: the API calls this 'full_name' 
                'bounding_box', 
                'screen_name', 
                'favourites_count', 
                'followers_count', 
                'statuses_count', 
                'friends_count', 
                'listed_count', 
                'user_described_location', # relabeled: the API calls this 'location' 
                'created_at', 
                'utc_offset']) 

    for file_ in file_list: 
        with open(file_, "r") as r: 
            for line in r: 
                try: 
                    tweet = json.loads(line) 
                except: 
                    continue 
                if tweet and tweet['place'] != None: 
                    f.writerow([tweet['text'], 
                                tweet['place']['country_code'], 
                                tweet['place']['full_name'], 
                                tweet['place']['bounding_box']['coordinates'], 
                                tweet['user']['screen_name'], 
                                tweet['user']['favourites_count'], 
                                tweet['user']['followers_count'], 
                                tweet['user']['statuses_count'], 
                                tweet['user']['friends_count'], 
                                tweet['user']['listed_count'], 
                                tweet['user']['location'], 
                                tweet['created_at'], 
                                tweet['user']['utc_offset']]) 
                    count += 1 

                    # Status update 
                    if count % 100000 == 0: 
                        print 'Just stored tweet #{}'.format(count) 

if __name__ == "__main__": 

    tweets_json_to_csv(['training_tweets.json'], 'training_tweets.csv')

From this point forward, I was able to read and manipulate the csv file as a pandas DataFrame:

import pandas as pd 

df = pd.read_csv('training_tweets.csv', encoding='utf-8') # 'utf-8' ensures that emojis are preserved

Feature Engineering

‘centroid’

Instead of providing the exact latitude and longitude of the tweet, the Twitter API provides a polygonal bounding box of coordinates that encloses the place where the tweet was sent. To plot the tweets on a map and perform other functions, I found the centroid of each bounding box:

def find_centroid(row): 
    ''' 
    Helper function to return the centroid of a polygonal bounding box of longitude, latitude coordinates 
    ''' 

    try: 
        row_ = eval(row) 
        lst_of_coords = [item for sublist in row_ for item in sublist] 
        longitude = [p[0] for p in lst_of_coords] 
        latitude = [p[1] for p in lst_of_coords] 
        return (sum(latitude) / float(len(latitude)), sum(longitude) / float(len(longitude))) 
    except: 
        return None 

# Create a new column called 'centroid' 
df['centroid'] = map(lambda row: find_centroid(row), df['bounding_box'])
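
As a quick illustration of the input format, here is a hypothetical bounding box (coordinates roughly around Austin, TX, not taken from the post) and what find_centroid returns for it:

# Hypothetical example of the nested coordinate structure the API returns 
example_bounding_box = "[[[-97.938383, 30.098659], [-97.938383, 30.516863], [-97.56842, 30.516863], [-97.56842, 30.098659]]]" 

# Returns a (latitude, longitude) tuple at the centre of the box 
print find_centroid(example_bounding_box) 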

Using the centroids, I was able to plot the training tweets on a map using the Matplotlib Basemap Toolkit. Below is the code for generating a plot of the tweets that originated in or around the contiguous US. The same was also done for Alaska and Hawaii.

from mpl_toolkits.basemap import Basemap 
import matplotlib.pyplot as plt 

def plot_contiguous_US_tweets(lon, lat, file_path): 
    ''' 
    INPUT: List of longitudes (lon), list of latitudes (lat), file path to save the plot (file_path) 
    OUTPUT: Plot of tweets in the contiguous US. 
    ''' 

    map = Basemap(projection='merc', 
                  resolution = 'h',  
                  area_thresh = 10000, 
                  llcrnrlon=-140.25, # lower left corner longitude of contiguous US 
                  llcrnrlat=5.0, # lower left corner latitude of contiguous US 
                  urcrnrlon=-56.25, # upper right corner longitude of contiguous US 
                  urcrnrlat=54.75) # upper right corner latitude of contiguous US 

    x,y = map(lon, lat) 

    map.plot(x, y, 'bo', markersize=2, alpha=.3) 
    map.drawcoastlines() 
    map.drawstates() 
    map.drawcountries() 
    map.fillcontinents(color = '#DAF7A6', lake_color='#a7cdf2') 
    map.drawmapboundary(fill_color='#a7cdf2') 
    plt.gcf().set_size_inches(15,15) 
    plt.savefig(file_path, format='png', dpi=1000)
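
A short usage sketch, assuming the 'centroid' column created above and skipping any rows where the centroid could not be parsed:

# Split the centroids into latitude and longitude lists and plot them 
lats = [c[0] for c in df['centroid'] if c is not None] 
lons = [c[1] for c in df['centroid'] if c is not None] 
plot_contiguous_US_tweets(lons, lats, 'contiguous_US_tweets.png') 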

The resulting plots for the contiguous US, Alaska, and Hawaii were joined in Photoshop and are shown on the left. The plot on the right is from the Vulcan Project at Purdue University and shows carbon footprints in the contiguous US. As you can see, the plots are very similar, providing an indication that streaming tweets in this way yields a sample that is representative of the US population in terms of geographic location.

‘tweet_time_secs’

The field ‘created_at’ is the UTC datetime of when the tweet was issued. Here is an example:

u'created_at': u'Sun Apr 30 01:23:27 +0000 2017'

I was interested in the UTC time, rather than the date, that a tweet was sent, because there are likely geographic differences in these values. I therefore parsed this information from the time stamp and reported this value in seconds.

from dateutil import parser 

def get_seconds(row): 
    ''' 
    Helper function to parse time from a datetime stamp and return the time in seconds 
    ''' 

    time_str = parser.parse(row).strftime('%H:%M:%S') 
    h, m, s = time_str.split(':') 
    return int(h) * 3600 + int(m) * 60 + int(s) 

# Create a new column called 'tweet_time_secs' 
df['tweet_time_secs'] = map(lambda row: get_seconds(row), df['created_at'])

Data Cleaning

Missing Data

Both ‘user_described_location’ (note: the API calls this field ‘location’) and ‘utc_offset’ are nullable fields that frequently contain missing values. When this was the case, I filled them in with indicator values:

df['user_described_location'].fillna('xxxMISSINGxxx', inplace=True) 
df['utc_offset'].fillna(999999, inplace=True)

Additionally, a small percentage of tweets contained missing values for ‘country_code.’ When this or other information was missing, I chose to drop the entire row:

df.dropna(axis=0, inplace=True)

Tweets Outside the US

The bounding box I used to stream the tweets included areas outside the contiguous US. Since the goal for this project was to predict the US city-level location of Twitter users, I relabeled tweets that originated from outside the US. For these tweets ‘country_code’ was relabeled to ‘NOT_US’ and ‘geo_location’ (note: the API calls this field ‘full_name’) was relabeled to ‘NOT_IN_US, NONE’:

def relabel_geo_locations(row): 
    ''' 
    Helper function to relabel the geo_locations from tweets outside the US 
    to 'NOT_IN_US, NONE' 
    ''' 

    if row['country_code'] == 'US': 
        return row['geo_location'] 
    else: 
        return 'NOT_IN_US, NONE' 

# Relabel 'country_code' for tweets outside the US to 'NOT_US' 
df['country_code'] = map(lambda cc: cc if cc == 'US' else 'NOT_US', df['country_code']) 

# Relabel 'geo_location' for tweets outside the US to 'NOT_IN_US, NONE' 
df['geo_location'] = df.apply(lambda row: relabel_geo_locations(row), axis=1)

Tweets Lacking a ‘City, State’ Location Label

Most tweets that originated in the US had a ‘geo_location’ in the form of ‘City, State’ (e.g., ‘Austin, TX’). For some tweets, however, the label was less specific and in the form of ‘State, Country’ (e.g., ‘Texas, USA’) or, even worse, in the form of a totally unique value (e.g., ‘Tropical Tan’). Since this data was going to be used to train the model, I wanted to have as granular of a label as possible for each tweet. Therefore, I only kept tweets that were in the form of ‘City, State’ and dropped all others:

def geo_state(row): 
    ''' 
    Helper function to parse the state code for 'geo_location' labels 
    in the form of 'City, State' 
    ''' 

    try: 
        return row['geo_location'].split(', ')[1] 
    except: 
        return None 

# Create a new column called 'geo_state' 
df['geo_state'] = df.apply(lambda row: geo_state(row),axis=1) 

# The 'geo_state' column will contain null values for any row where 'geo_location' was not 
# comma separated (e.g., 'Tropical Tan'). We drop those rows here: 
df.dropna(axis=0, inplace=True) 

# list of valid geo_state labels. "NONE" is the label I created for tweets outside the US 
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", "ID", 
          "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", 
          "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", 
          "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY", "NONE"] 

# Keep only rows with a valid geo_state, among others this will drop all rows that had 
# a 'geo_location' in the form of 'State, Country' (e.g., 'Texas, USA') 
df = df[df['geo_state'].isin(states)]

Aggregating the Tweets by User

During the two week collection period, many users tweeted more than once. To prevent potential leakage, I grouped the tweets by user (‘screen_name’), then aggregated the remaining fields.

import numpy as np 
from collections import Counter 

# aggregation functions 
agg_funcs = {'tweet' : lambda x: ' '.join(x), 
             'geo_location' : lambda x: Counter(x).most_common(1)[0][0], 
             'geo_state' : lambda x: Counter(x).most_common(1)[0][0], 
             'user_described_location': lambda x: Counter(x).most_common(1)[0][0], 
             'utc_offset': lambda x: Counter(x).most_common(1)[0][0], 
             'country_code': lambda x: Counter(x).most_common(1)[0][0], 
             'centroid': lambda x: Counter(x).most_common(1)[0][0],  # keep the most common centroid; it is needed later 
             'tweet_time_secs' : np.median, 
             'statuses_count': np.max, 
             'friends_count' :np.mean, 
             'favourites_count' : np.mean, 
             'listed_count' : np.mean, 
             'followers_count' : np.mean} 

# Groupby 'screen_name' and then apply the aggregation functions in agg_funcs 
df = df.groupby(['screen_name']).agg(agg_funcs).reset_index()

Remapping the Training Tweets to the Closest Major City

Since the training tweets came from over 15,500 cities, and I didn’t want to do a 15,500-wise classification problem, I used the centroids to remap all the training tweets to their closest major city from a list of 378 major US cities based on population (plus the single label for tweets outside the US, which used Toronto’s coordinates). This left me with a 379-wise classification problem. Here is a plot of those major cities and the code to remap all the US training tweets to their closest major US city:


import numpy as np 
import pickle 

def load_US_coord_dict(): 
    ''' 
    Input: n/a 
    Output: A dictionary whose keys are the location names ('City, State') of the 
    378 US classification locations and the values are the centroids for those locations 
    (latitude, longitude) 
    ''' 

    pkl_file = open("US_coord_dict.pkl", 'rb') 
    US_coord_dict = pickle.load(pkl_file) 
    pkl_file.close() 
    return US_coord_dict 

def find_dist_between(tup1, tup2): 
    ''' 
    INPUT: Two tuples of latitude, longitude coordinates pairs for two cities 
    OUTPUT: The distance between the cities 
    ''' 

    return np.sqrt((tup1[0] - tup2[0])**2 + (tup1[1] - tup2[1])**2) 

def closest_major_city(tup): 
    ''' 
    INPUT: A tuple of the centroid coordinates for the tweet to remap to the closest major city 
    OUTPUT: String, 'City, State', of the city in the dictionary 'coord_dict' that is closest to the input city 
    ''' 

    d = {} 
    for key, value in US_coord_dict.iteritems(): 
        dist = find_dist_between(tup, value) 
        d[key] = dist 
    # Return the closest city only after every city has been compared 
    return min(d, key=d.get) 

def get_closest_major_city_for_US(row): 
    ''' Helper function to return the closest major city for US users only. For users 
    outside the US it returns 'NOT_IN_US, NONE' 
    ''' 

    if row['geo_location'] == 'NOT_IN_US, NONE': 
        return 'NOT_IN_US, NONE' 
    else: 
        return closest_major_city(row['centroid']) 

if __name__ == "__main__": 

    # Load US_coord_dict 
    US_coord_dict = load_US_coord_dict() 

    # Create a new column called 'closest_major_city' 
    df['closest_major_city'] = df.apply(lambda row: get_closest_major_city_for_US(row), axis=1)

 

Building the Predictive Model

The steps below were run on an Amazon Web Services r3.8xlarge EC2 instance with 244 GiB of memory. Here is a high-level overview of the final model:

High-level Overview of the Stacked Model


Step 1: Load dependencies and prepare the cleaned data for model fitting

import pandas as pd 
import numpy as np 
import nltk 
from nltk.tokenize import TweetTokenizer 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.externals import joblib 

# Tokenizer to use for text vectorization 
def tokenize(tweet): 
    tknzr = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False) 
    return tknzr.tokenize(tweet) 

# Read cleaned training tweets file into pandas and randomize it 
df = pd.read_pickle('cleaned_training_tweets.pkl') 
randomized_df = df.sample(frac=1, random_state=111) 

# Split randomized_df into two disjoint sets 
half_randomized_df = randomized_df.shape[0] / 2 
base_df = randomized_df.iloc[:half_randomized_df, :] # used to train the base classifiers 
meta_df = randomized_df.iloc[half_randomized_df:, :] # used to train the meta classifier 

# Create variables for the known geotagged locations from each set 
base_y = base_df['closest_major_city'].values 
meta_y = meta_df['closest_major_city'].values

Step 2: Train a base-level Linear SVC classifier on the user described locations

# Raw text of user described locations 
base_location_doc = base_df['user_described_location'].values 
meta_location_doc = meta_df['user_described_location'].values 

# fit_transform a tf-idf vectorizer using base_location_doc and use it to transform meta_location_doc 
location_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize, ngram_range=(1,2)) 
base_location_X = location_vectorizer.fit_transform(base_location_doc.ravel()) 
meta_location_X = location_vectorizer.transform(meta_location_doc) 

# Fit a Linear SVC Model with 'base_location_X' and 'base_y'. Note: it is important to use 
# balanced class weights otherwise the model will overwhelmingly favor the majority class. 
location_SVC = LinearSVC(class_weight='balanced') 
location_SVC.fit(base_location_X, base_y) 

# We can now pass meta_location_X into the fitted model and save the decision 
# function, which will be used in Step 4 when we train the meta random forest 
location_SVC_decsfunc = location_SVC.decision_function(meta_location_X) 

# Pickle the location vectorizer and the linear SVC model for future use 
joblib.dump(location_vectorizer, 'USER_LOCATION_VECTORIZER.pkl') 
joblib.dump(location_SVC, 'USER_LOCATION_SVC.pkl')

Step 3: Train a base-level Linear SVC classifier on the tweets

# Raw text of tweets 
base_tweet_doc = base_df['tweet'].values 
meta_tweet_doc = meta_df['tweet'].values 

# fit_transform a tf-idf vectorizer using base_tweet_doc and use it to transform meta_tweet_doc 
tweet_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize) 
base_tweet_X = tweet_vectorizer.fit_transform(base_tweet_doc.ravel()) 
meta_tweet_X = tweet_vectorizer.transform(meta_tweet_doc) 

# Fit a Linear SVC Model with 'base_tweet_X' and 'base_y'. Note: it is important to use 
# balanced class weights otherwise the model will overwhelmingly favor the majority class. 
tweet_SVC = LinearSVC(class_weight='balanced') 
tweet_SVC.fit(base_tweet_X, base_y) 

# We can now pass meta_tweet_X into the fitted model and save the decision 
# function, which will be used in Step 4 when we train the meta random forest 
tweet_SVC_decsfunc = tweet_SVC.decision_function(meta_tweet_X) 

# Pickle the tweet vectorizer and the linear SVC model for future use 
joblib.dump(tweet_vectorizer, 'TWEET_VECTORIZER.pkl') 
joblib.dump(tweet_SVC, 'TWEET_SVC.pkl')

Step 4: Train a meta-level Random Forest classifier

# additional features from meta_df to pull into the final model 
friends_count = meta_df['friends_count'].values.reshape(meta_df.shape[0], 1) 
utc_offset = meta_df['utc_offset'].values.reshape(meta_df.shape[0], 1) 
tweet_time_secs = meta_df['tweet_time_secs'].values.reshape(meta_df.shape[0], 1) 
statuses_count = meta_df['statuses_count'].values.reshape(meta_df.shape[0], 1) 
favourites_count = meta_df['favourites_count'].values.reshape(meta_df.shape[0], 1) 
followers_count = meta_df['followers_count'].values.reshape(meta_df.shape[0], 1) 
listed_count = meta_df['listed_count'].values.reshape(meta_df.shape[0], 1) 

# np.hstack these additional features together 
add_features = np.hstack((friends_count, 
                          utc_offset, 
                          tweet_time_secs, 
                          statuses_count, 
                          favourites_count, 
                          followers_count, 
                          listed_count)) 

# np.hstack the two decision function variables from steps 2 & 3 with add_features 
meta_X = np.hstack((location_SVC_decsfunc, # from Step 2 above 
                    tweet_SVC_decsfunc, # from Step 3 above 
                    add_features)) 

# Fit Random Forest with 'meta_X' and 'meta_y' 
meta_RF = RandomForestClassifier(n_estimators=60, n_jobs=-1) 
meta_RF.fit(meta_X, meta_y) 

# Pickle the meta Random Forest for future use 
joblib.dump(meta_RF, 'META_RF.pkl')

Testing the Model

Collecting and Preparing a Fresh Data Set

A week after I collected the training data set, I collected a fresh data set to use to evaluate the model. For this, I followed the same data collection and preparation procedures as above, with a few exceptions: 1) I only ran the Tweepy script for 48 hours, 2) I removed any users from the evaluation data set that were in the training data set, and 3) I went back to the Twitter API and pulled the 200 most recent tweets for each user that remained in the data set. Remember, the goal for the model is to predict the US city-level location of Twitter users, not individual tweets; therefore, by giving the model a larger corpus of tweets for each user, I hoped to increase the model’s accuracy. Here is the script for pulling the 200 most recent tweets for each user:

import tweepy 
import pandas as pd 

# these are provided to you through the Twitter API after you create an account 
consumer_key = "your_consumer_key" 
consumer_secret = "your_consumer_secret" 
access_token = "your_access_token" 
access_token_secret = "your_access_token_secret" 

count = 0 

def get_200_tweets(screen_name): 
    ''' 
    Helper function to return a list of a Twitter user's 200 most recent tweets 
    ''' 

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 
    auth.set_access_token(access_token, access_token_secret) 
    api = tweepy.API(auth, 
                     wait_on_rate_limit=True, 
                     wait_on_rate_limit_notify=True) 

    # Initialize a list to hold the user's 200 most recent tweets 
    tweets_data = [] 

    global count 

    try: 
        # make request for most recent tweets (200 is the maximum allowed per distinct request) 
        recent_tweets = api.user_timeline(screen_name = screen_name, count=200) 

        # save data from most recent tweets 
        tweets_data.extend(recent_tweets) 

        count += 1 

        # Status update 
        if count % 1000 == 0: 
            print 'Just stored tweets for user #{}'.format(count) 

    except: 
        count += 1 
        pass 

    # pull only the tweet text and encode it in utf-8 to preserve emojis 
    list_of_recent_tweets = [tweet.text.encode("utf-8") for tweet in tweets_data] 

    return list_of_recent_tweets 

# Create a new column in evaluation_df called '200_tweets' 
evaluation_df['200_tweets'] = map(lambda x: get_200_tweets(x), evaluation_df['screen_name'])

Making Predictions on the Fresh Data Set

To make predictions on evaluation_df, the script below was run on the same Amazon Web Services r3.8xlarge EC2 instance that was used to build the model:

import pandas as pd 
import numpy as np 
from sklearn.externals import joblib 
import nltk 
from nltk.tokenize import TweetTokenizer 

def tokenize(tweet): 
    tknzr = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False) 
    return tknzr.tokenize(tweet) 

class UserLocationClassifier: 

    def __init__(self): 
        ''' 
        Load the stacked classifier's pickled vectorizers, base classifiers, and meta classifier 
        ''' 

        self.location_vectorizer = joblib.load('USER_LOCATION_VECTORIZER.pkl') 
        self.location_SVC = joblib.load('USER_LOCATION_SVC.pkl') 
        self.tweet_vectorizer = joblib.load('TWEET_VECTORIZER.pkl') 
        self.tweet_SVC = joblib.load('TWEET_SVC.pkl') 
        self.meta_RF = joblib.load('META_RF.pkl') 

    def predict(self, df): 
        ''' 
        INPUT: Cleaned and properly formatted dataframe to make predictions for 
        OUTPUT: Array of predictions 
        ''' 
        # Get text from 'user_described_location' column of DataFrame 
        location_doc = df['user_described_location'].values 

        # Convert the '200_tweets' column from a list of tweets to a single string, 
        # joining on a space to match how the training tweets were joined 
        df.loc[:, '200_tweets'] = map(lambda x: ' '.join(x), df['200_tweets']) 

        # Get text from '200_tweets' column of DataFrame 
        tweet_doc = df['200_tweets'].values 

        # Vectorize 'location_doc' and 'tweet_doc' 
        location_X = self.location_vectorizer.transform(location_doc.ravel()) 
        tweet_X = self.tweet_vectorizer.transform(tweet_doc.ravel()) 

        # Store decision functions for 'location_X' and 'tweet_X' 
        location_decision_function = self.location_SVC.decision_function(location_X) 
        tweet_decision_function = self.tweet_SVC.decision_function(tweet_X) 

        # Get additional features to pull into the Random Forest 
        friends_count = df['friends_count'].values.reshape(df.shape[0], 1) 
        utc_offset = df['utc_offset'].values.reshape(df.shape[0], 1) 
        tweet_time_secs = df['tweet_time_secs'].values.reshape(df.shape[0], 1) 
        statuses_count = df['statuses_count'].values.reshape(df.shape[0], 1) 
        favourites_count = df['favourites_count'].values.reshape(df.shape[0], 1) 
        followers_count = df['followers_count'].values.reshape(df.shape[0], 1) 
        listed_count = df['listed_count'].values.reshape(df.shape[0], 1) 

        # np.hstack additional features together 
        add_features = np.hstack((friends_count, 
                               utc_offset, 
                               tweet_time_secs, 
                               statuses_count, 
                               favourites_count, 
                               followers_count, 
                               listed_count)) 

        # np.hstack the two decision function variables with add_features 
        meta_X = np.hstack((location_decision_function, tweet_decision_function, add_features)) 

        # Feed meta_X into Random Forest and make predictions 
        return self.meta_RF.predict(meta_X) 

if __name__ == "__main__": 

        # Load evaluation_df into pandas DataFrame 
        evaluation_df = pd.read_pickle('evaluation_df.pkl') 

        # Load UserLocationClassifier 
        clf = UserLocationClassifier() 

        # Get predicted locations 
        predictions = clf.predict(evaluation_df) 

        # Create a new column called 'predicted_location' 
        evaluation_df.loc[:, 'predicted_location'] = predictions 

        # Pickle the resulting DataFrame with the location predictions 
        evaluation_df.to_pickle('evaluation_df_with_predictions.pkl')

Plotting the Locations of Twitter Users on a Map Using Bokeh

Here are some examples of how the model performed on a few selected cities. For each of the maps shown below, the dots indicate the user’s true location, while the title of the map indicates where the model predicted them to be. As you can see, for each city there is a tight cluster in and around the correct location, with only a handful of extreme misses. Here is the code for generating these plots (note: the final plots shown here were constructed in Photoshop after first using the ‘pan’ and ‘wheel_zoom’ tools in Bokeh to capture screenshots of the contiguous US, Alaska, and Hawaii):

import pandas as pd 
import pickle 

from bokeh.plotting import figure, output_notebook, output_file, show 
from bokeh.tile_providers import STAMEN_TERRAIN 
output_notebook() 

from functools import partial 
from shapely.geometry import Point 
from shapely.ops import transform 
import pyproj 

# Web mercator bounding box for the US 
US = ((-13884029, -7453304), (2698291, 6455972)) 

x_range, y_range = US 
plot_width = int(900) 
plot_height = int(plot_width*7.0/12) 

def base_plot(tools='pan,wheel_zoom,reset',plot_width=plot_width, plot_height=plot_height, **plot_args): 
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height, 
        x_range=x_range, y_range=y_range, outline_line_color=None, 
        min_border=0, min_border_left=0, min_border_right=0, 
        min_border_top=0, min_border_bottom=0, **plot_args) 

    p.axis.visible = False 
    p.xgrid.grid_line_color = None 
    p.ygrid.grid_line_color = None 
    return p 

def plot_predictions_for_a_city(df, name_of_predictions_col, city): 
    ''' 
    INPUT: DataFrame with location predictions; name of column in DataFrame that 
    contains the predictions; city ('City, State') to plot predictions for 
    OUTPUT: Bokeh map that shows the actual location of all the users predicted to 
    be in the selected city 
    ''' 

    df_ = df[df[name_of_predictions_col] == city] 

    # Initialize two lists to hold all the latitudes and longitudes 
    all_lats = [] 
    all_longs = [] 

    # Pull all latitudes from the 'centroid' column and append to all_lats
    for i in df_['centroid']:
        all_lats.append(i[0])

    # Pull all longitudes from the 'centroid' column and append to all_longs
    for i in df_['centroid']:
        all_longs.append(i[1])

    # Initialize two lists to hold all the latitudes and longitudes 
    # converted to web mercator 
    all_x = [] 
    all_y = [] 

    # Convert latitudes and longitudes to web mercator x and y format
    for i in xrange(len(all_lats)): 
        pnt = transform( 
            partial( 
                pyproj.transform, 
                pyproj.Proj(init='EPSG:4326'), 
                pyproj.Proj(init='EPSG:3857')), 
                Point(all_longs[i], all_lats[i])) 
        all_x.append(pnt.x) 
        all_y.append(pnt.y) 

    p = base_plot() 
    p.add_tile(STAMEN_TERRAIN) 
    p.circle(x=all_x, y=all_y, line_color=None, fill_color='#380474', size=15, alpha=.5) 
    output_file("stamen_toner_plot.html") 
    show(p) 

if __name__ == "__main__": 

    # Load pickled evaluation_df with location predictions 
    evaluation_df_with_predictions = pd.read_pickle('evaluation_df_with_predictions.pkl') 

    # Plot actual locations for users predicted to be in Eugene, OR 
    plot_predictions_for_a_city(evaluation_df_with_predictions, 'predicted_location', 'Eugene, OR')

Example 1: Eugene, OR

[Map: actual locations of Twitter users predicted to be in Eugene, OR]

Example 2: Durham, NC

[Map: actual locations of Twitter users predicted to be in Durham, NC]

Example 3: Shreveport, LA

[Map: actual locations of Twitter users predicted to be in Shreveport, LA]

Tweet Term Importances for these Cities

To get an idea of what tweet terms were important for predicting these cities, I went through and calculated mean tf-idf values for each of these cities. Below are some of the more interesting terms for each of these cities. To generate these plots, I followed an excellent guide written by Thomas Buhrmann.
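
The plotting code is in the linked guide, but the per-city term weights themselves can be computed along these lines (a minimal sketch, assuming clf.tweet_vectorizer is the fitted TfidfVectorizer from the classifier above and that the '200_tweets' column already holds the joined tweet strings):

import numpy as np

def mean_tfidf_for_city(df, vectorizer, city, n_terms=20):
    # Tweets of all users the model placed in the chosen city
    docs = df[df['predicted_location'] == city]['200_tweets'].values

    # Mean tf-idf weight of each vocabulary term across those users
    X = vectorizer.transform(docs)
    mean_tfidf = np.asarray(X.mean(axis=0)).ravel()

    # Return the n_terms highest-weighted terms with their mean weights
    terms = np.array(vectorizer.get_feature_names())
    top_idx = mean_tfidf.argsort()[::-1][:n_terms]
    return zip(terms[top_idx], mean_tfidf[top_idx])

For example, mean_tfidf_for_city(evaluation_df_with_predictions, clf.tweet_vectorizer, 'Eugene, OR') returns the kind of terms plotted below.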

[Plots: tweet term importances (mean tf-idf) for Eugene, OR; Durham, NC; and Shreveport, LA]

Emoji Skin Tone Modifications

[Image: swatches of the emoji skin tone modifications]

One of the more interesting things to fall out of the model was the colored boxes shown above. These represent the skin tone modifications you can add to certain emojis. For most emojis there was not a strong geographic signal; however, for the skin tone modifications there was. As you can see in the term importances plots, Twitter users in Eugene, OR, tended to use lighter colored skin tone modifications while users in Durham, NC, and Shreveport, LA, tended to use darker skin tone modifications.
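
For reference, these skin tone modifications are the Unicode Fitzpatrick modifiers U+1F3FB through U+1F3FF, so counting them per user is straightforward. Here is an illustrative sketch (not part of the original pipeline):

# Fitzpatrick skin tone modifiers, from lightest (U+1F3FB) to darkest (U+1F3FF)
SKIN_TONE_MODIFIERS = [u'\U0001F3FB', u'\U0001F3FC', u'\U0001F3FD',
                       u'\U0001F3FE', u'\U0001F3FF']

def skin_tone_counts(tweet_text):
    '''Count each skin tone modifier in one user's concatenated tweets.'''
    return {modifier: tweet_text.count(modifier)
            for modifier in SKIN_TONE_MODIFIERS}

The counts assume tweet_text is a unicode string, which is what the Twitter API returns.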

Scoring the Model

Median Error: 49.6 miles

To score the model I chose to use median error, which came out to be 49.6 miles. This was calculated by using the centroids to find the great-circle distance between the predicted city and the true location. Here is how it was calculated (note: if the user was predicted to be in the correct city, the error was scored as 0.0 miles, regardless of the actual distance between the centroids):

from math import sin, cos, sqrt, atan2, radians
import pickle
import pandas as pd

def load_coord_dict(): 
    ''' 
    Input: n/a 
    Output: A dictionary whose keys are the location names ('City, State') of the 
    379 classification labels and the values are the centroids for those locations 
    (latitude, longitude) 
    ''' 

    pkl_file = open("coord_dict.pkl", 'rb') 
    coord_dict = pickle.load(pkl_file) 
    pkl_file.close() 
    return coord_dict 

def compute_error_in_miles(zipped_predictions): 
    ''' 
    INPUT: Tuple in the form of (predicted city, centroid of true location) 
    OUTPUT: Float of the great-circle error distance between the predicted city 
    and the true location.
    ''' 

    radius = 3959 # approximate radius of earth in miles 

    predicted_city = zipped_predictions[0] 
    actual_centroid = zipped_predictions[1] 

    lat1 = radians(coord_dict[predicted_city][0]) 
    lon1 = radians(coord_dict[predicted_city][1]) 
    lat2 = radians(actual_centroid[0]) 
    lon2 = radians(actual_centroid[1]) 

    delta_lon = lon2 - lon1 
    delta_lat = lat2 - lat1 

    a = sin(delta_lat / 2)**2 + cos(lat1) * cos(lat2) * sin(delta_lon / 2)**2 
    c = 2 * atan2(sqrt(a), sqrt(1 - a)) 

    error_distance = radius * c 
    return error_distance 

def correct_outside_the_us_errors(row): 
    ''' 
    Helper function to correct the errors to 0.0 for the users that were correctly predicted. This 
    is especially important for users outside the US since they were all given the 
    same label ('NOT_IN_US, NONE') even though their centroids were all different. 
    ''' 

    if row['predicted_location'] == row['geo_location']: 
        error = 0.0 
    else: 
        error = row['error_in_miles'] 
    return error 

if __name__ == "__main__": 

    # Load coord_dict 
    coord_dict = load_coord_dict() 

    # Load the pickled DataFrame with the location predictions made earlier 
    evaluation_df = pd.read_pickle('evaluation_df_with_predictions.pkl') 
    predictions = evaluation_df['predicted_location'].values 

    centroid_of_true_location = evaluation_df['centroid'].values 
    zipped_predictions = zip(predictions, centroid_of_true_location) 

    # Create a new column with the error value for each prediction 
    evaluation_df['error_in_miles'] = map(lambda x: compute_error_in_miles(x), zipped_predictions) 

    # Change the error of correct predictions to 0.0 miles 
    evaluation_df['error_in_miles'] = evaluation_df.apply(lambda x: correct_outside_the_us_errors(x), 
                                                          axis=1) 

    median_error = evaluation_df['error_in_miles'].median()

Histogram of Error Distances

[Histogram: distribution of prediction error distances in miles]

Influence of Tweet Number on the Model’s Accuracy

Recall that, for each user I wanted to make a prediction on, I went back to the API and pulled 200 of their most recent tweets. The plot below was generated using the same procedure as above with increasing numbers of tweets for each user. I originally chose 200 because this is the maximum number the API allows you to pull per distinct request. However, as you can see in the plot below, after about 100 tweets there is negligible improvement in the model’s accuracy, meaning for future use it might not be necessary to pull so many tweets for each user.
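
The sweep itself is just the scoring procedure above repeated with the tweet lists truncated to different lengths. A minimal sketch (assuming evaluation_df is the original DataFrame whose '200_tweets' column still holds lists of tweets, and that clf, coord_dict, and compute_error_in_miles from the scripts above are in scope):

import numpy as np

def median_error_for_n_tweets(df, clf, n):
    # Keep only the first n tweets per user before predicting
    df_n = df.copy()
    df_n['200_tweets'] = df_n['200_tweets'].map(lambda tweets: tweets[:n])
    predictions = clf.predict(df_n)

    # Great-circle error for each user (correct predictions are not zeroed out here)
    errors = [compute_error_in_miles((pred, centroid))
              for pred, centroid in zip(predictions, df_n['centroid'].values)]
    return np.median(errors)

# e.g. median error as a function of how many tweets are used per user
tweet_counts = [25, 50, 100, 150, 200]
median_errors = [median_error_for_n_tweets(evaluation_df, clf, n) for n in tweet_counts]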

[Plot: median error vs. number of tweets pulled per user]

Final Notes

While a median error of 49.6 miles is pretty good, there is still plenty of room for improvement. Running the Tweepy streaming script for a longer period of time and collecting more training data would likely give an immediate improvement. Additionally, with more training data, you could also include more than 379 classification labels, which would further help to decrease the median error of the model. That said, given the time constraints of the project, I’m satisfied with the current model’s accuracy and think it could be a valuable resource for many projects where an extremely granular estimate of a Twitter user’s location is not required.

by Team Anaconda at July 21, 2017 04:45 PM

July 18, 2017

Enthought

Webinar: Python for Scientists & Engineers: A Tour of Enthought’s Professional Training Course

What:  A guided walkthrough and Q&A about Enthought’s technical training course Python for Scientists & Engineers with Enthought’s VP of Training Solutions, Dr. Michael Connell

Who Should Watch: individuals, team leaders, and learning & development coordinators who are looking to better understand the options to increase professional capabilities in Python for scientific and engineering applications

VIEW


“Writing software is not my job…I just have to do it every day.”  
-21st Century Scientist or Engineer

Many scientists, engineers, and analysts today find themselves writing a lot of software in their day-to-day work even though that’s not their primary job and they were never formally trained for it. Of course, there is a lot more to writing software for scientific and analytic computing than just knowing which keyword to use and where to put the semicolon.

Software for science, engineering, and analysis has to solve the technical problem it was created to solve, of course, but it also has to be efficient, readable, maintainable, extensible, and usable by other people — including the original author six months later!

It has to be designed to prevent bugs and — because all reasonably complex software contains bugs — it should be designed so as to make the inevitable bugs quickly apparent, easy to diagnose, and easy to fix. In addition, such software often has to interface with legacy code libraries written in other languages like C or C++, and it may benefit from a graphical user interface to substantially streamline repeatable workflows and make the tools available to colleagues and other stakeholders who may not be comfortable working directly with the code for whatever reason.

Enthought’s Python for Scientists and Engineers is designed to accelerate the development of skill and confidence in addressing these kinds of technical challenges using some of Python’s core capabilities and tools, including:

  • The standard Python language
  • Core tools for science, engineering, and analysis, including NumPy (the fast array programming package), Matplotlib (for data visualization), and Pandas (for data analysis); and
  • Tools for crafting well-organized and robust code, debugging, profiling performance, interfacing with other languages like C and C++, and adding graphical user interfaces (GUIs) to your applications.

In this webinar, we give you the key information and insight you need to evaluate whether Enthought’s Python for Scientists and Engineers course is the right solution to take your technical skills to the next level, including:

  • Who will benefit most from the course
  • A guided tour through the course topics
  • What skills you’ll take away from the course and how the instructional design supports that
  • What the experience is like, and why it is different from other training alternatives (with a sneak peek at actual course materials)
  • What previous course attendees say about the course

VIEW


Presenter: Dr. Michael Connell, VP, Enthought Training Solutions

Ed.D, Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT


Python for Scientists & Engineers Training: The Quick Start Approach to Turbocharging Your Work

If you are tired of running repeatable processes manually and want to (semi-) automate them to increase your throughput and decrease pilot error, or you want to spend less time debugging code and more time writing clean code in the first place, or you are simply tired of using a multitude of tools and languages for different parts of a task and want to replace them with one comprehensive language, then Enthought’s Python for Scientists and Engineers is definitely for you!

This class has been particularly appealing to people who have been using other tools like MATLAB or even Excel for their computational work and want to start applying their skills using the Python toolset.  And it’s no wonder — Python has been identified as the most popular coding language for five years in a row for good reason.

One reason for its broad popularity is its efficiency and ease-of-use. Many people consider Python more fun to work in than other languages (and we agree!). Another reason for its popularity among scientists, engineers, and analysts in particular is Python’s support for rapid application development and extensive (and growing) open source library of powerful tools for preparing, visualizing, analyzing, and modeling data as well as simulation.

Python is also an extraordinarily comprehensive toolset – it supports everything from interactive analysis to automation to software engineering to web app development within a single language and plays very well with other languages like C/C++ or FORTRAN so you can continue leveraging your existing code libraries written in those other languages.

Many organizations are moving to Python so they can consolidate all of their technical work streams under a single comprehensive toolset. In the first part of this class we’ll give you the fundamentals you need to switch from another language to Python and then we cover the core tools that will enable you to do in Python what you were doing with other tools, only faster and better!

Additional Resources

Upcoming Open Python for Scientists & Engineers Sessions:

Albuquerque, NM, Sept 11-15, 2017
Washington, DC, Sept 25-29, 2017
Los Alamos, NM, Oct 2-6, 2017
Cambridge, UK, Oct 16-20, 2017
San Diego, CA, Oct 30-Nov 3, 2017
Albuquerque, NM, Nov 13-17, 2017
Los Alamos, NM, Dec 4-8, 2017
Austin, TX, Dec 11-15, 2017

Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.

Learn More

Download Enthought’s Machine Learning with Python’s Scikit-Learn Cheat Sheets
Additional Webinars in the Training Series:

Python for Data Science: A Tour of Enthought’s Professional Technical Training Course

Python for Professionals: The Complete Guide to Enthought’s Technical Training Courses

An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop

Download Enthought’s Pandas Cheat Sheets

The post Webinar: Python for Scientists & Engineers: A Tour of Enthought’s Professional Training Course appeared first on Enthought Blog.

by cgodshall at July 18, 2017 10:20 PM

Pierre de Buyl

Developing a Cython library

For some time, I have used Cython to accelerate parts of Python programs. One stumbling block in going from Python/NumPy code to Cython code is the fact that one cannot access NumPy's random number generator from Cython without explicit declaration. Here, I give the steps to make a pip-installable 'cimportable' module, using the threefry random number generator as an example.

(Note: a draft of this article appeared online in June 2017 by mistake; this version is complete.)

The aim

The aim is that, starting with a Python code reading

import numpy as np

N=100
x=0
for i in range(N):
    x = x + np.random.normal()

one can end up with a very similar Cython code

cimport numpy as np

cdef int i, N
cdef double x
N = 100
x = 0
for i in range(N):
    x = x + np.random.normal()

With the obvious benefit of using the same module for the random number generator (RNG) with a simple interface.

This is impossible with the current state of NumPy, even though there is work in that direction (see ng-numpy-randomstate). This post is still relevant for other contexts where Cython is involved anyway.

The challenge

Building a c-importable module just depends on having a corresponding .pxd file available in the path. The idea behind .pxd files is that they contain C-level (or cdef level) declarations whereas the implementation goes in the .pyx file with the same basename.

A consequence of this is that plain Python (regular def) functions do not appear in the .pxd file but only in the .pyx file, and cannot be cimported from another Cython file. They can of course be Python imported.

The challenge lies in a proper organization of these different parts and of a seamless packaging and installation via pip.

Organization of the module

The module is named threefry after the corresponding Threefry RNG from random123. It contains my implementation of the RNG as a C library and a Cython wrapper.

I review below the steps, which I found via the documentation and quite a lot of trial and error.

Enable cimporting

To enable the use of the Cython wrapper from other Cython code, it is necessary to write a .pxd file; see the documentation on Sharing Declarations. .pxd files can exist on their own, but in the present situation we will use one with the same base name as the .pyx file. This way the .pxd file is automatically read by Cython when compiling the extension; it is as if its content were written in the .pyx file itself.

The .pxd file can only contain plain C, cdef or cpdef declarations; pure Python declarations must go in the .pyx file.

Note: The .pxd file must be packaged with the final module, see below.

The file threefry.pxd contains the following declarations

from libc.stdint cimport uint64_t

cdef extern from "threefry.h":
    ...

cdef class rng:
    ...

meaning that the extension type threefry.rng will be accessible via a cimport from other modules. The implementation is stored in threefry.pyx.

With the aim of hiding the implementation details, I wrote a __init__.pxd file containing the following:

from threefry.threefry cimport rng

so that the user code looks like

cimport threefry
cdef threefry.rng r = threefry.rng(seed)

and I am free to refactor the code later if I wish to do so.

Compilation information

To cimport my module, there is one more critical step: providing the needed compiler flag for the C declarations, that is, the include path for threefry.h (which must be read when compiling user code).

For this purpose, I define a utility routine get_include that can be called from the user's setup.py file as:

from setuptools import setup, Extension
from Cython.Build import cythonize
import threefry

setup(
    ext_modules=cythonize(Extension('use_threefry', ["use_threefry.pyx"], include_dirs=[threefry.get_include()]))
)

Note: the argument include_dirs is given to Extension and not to cythonize.
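
For reference, such a get_include helper can be as small as the following (a sketch of the idea only; the actual helper in the package may differ):

# threefry/__init__.py (sketch)
import os

def get_include():
    '''Return the directory containing threefry.h and threefry.pxd.'''
    return os.path.dirname(os.path.abspath(__file__))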

Packaging

The .h and .pxd files must be added via the package_data argument to setup.
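
A minimal sketch of the corresponding setup() call (package and file names assumed from the layout described above):

from setuptools import setup, Extension
from Cython.Build import cythonize

setup(
    name='threefry',
    packages=['threefry'],
    ext_modules=cythonize([Extension('threefry.threefry',
                                     ['threefry/threefry.pyx'])]),
    # ship the header and the .pxd files so that other packages can cimport the module
    package_data={'threefry': ['threefry.h', 'threefry.pxd', '__init__.pxd']},
)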

Wrapping up

In short, to make a cimport-able module

  1. Move the shared declarations to a .pxd file.
  2. The implementation goes in the .pyx file, that will be installed as a compiled module.
  3. The .pxd and .h files must be added to package_data.
  4. A convenient way to obtain the include directories must be added.

All of this can be found in my random number generator package: https://github.com/pdebuyl/threefry

The algorithm is from Salmon et al.’s paper Parallel Random Numbers: As Easy as 1, 2, 3; their code is distributed at random123. I wrote about it earlier in a blog post.

by Pierre de Buyl at July 18, 2017 09:00 AM

Matthew Rocklin

Scikit-Image and Dask Performance

This weekend at the SciPy 2017 sprints I worked alongside Scikit-image developers to investigate parallelizing scikit-image with Dask.

Here is a notebook of our work.

July 18, 2017 12:00 AM

July 17, 2017

numfocus

NumFOCUS Projects at SciPy 2017

SciPy 2017 is a wrap! As you’d expect, we had lots of participation by NumFOCUS sponsored projects. Here’s a collection of links to talks given by our projects:   Tutorials Software Carpentry Scientific Python Course Tutorial by Maxim Belkin (Part 1) (Part 2) The Jupyter Interactive Widget Ecosystem Tutorial by Matt Craig, Sylvain Corlay, & Jason […]

by Gina Helfrich at July 17, 2017 10:21 PM

July 12, 2017

Enthought

Webinar: A Tour of Enthought’s Latest Enterprise Python Solutions

When: Thursday, July 20, 2017, 11-11:45 AM CT (Live webcast)

What: A comprehensive overview and live demonstration of Enthought’s latest tools for Python for the enterprise with Enthought’s Chief Technical & Engineering Officer, Didrik Pinte

Who Should Attend: Python users (or those supporting Python users) who are looking for a universal solution set that is reliable and “just works”; scientists, engineers, and data science teams trying to answer the question “how can I more easily build and deploy my applications”; organizations looking for an alternative to MATLAB that is cost-effective, robust, and powerful

REGISTER  (if you can’t attend we’ll send all registrants a recording)


For over 15 years, Enthought has been empowering scientists, engineers, analysts, and data scientists to create amazing new technologies, to make new discoveries, and to do so faster and more effectively than they dreamed possible. Along the way, hand in hand with our customers in aerospace, biotechnology, finance, oil and gas, manufacturing, national laboratories, and more, we’ve continued to “build the science tools we wished we had,” and share them with the world.

For 2017, we’re pleased to announce the release of several major new products and tools, specifically designed to make Python more powerful and accessible for users like you who are building the future of science, engineering, artificial intelligence, and data analysis.

WHAT YOU’LL SEE IN THE WEBINAR

In this webinar, Enthought’s Chief Technical & Engineering Officer will share a comprehensive overview and live demonstration of Enthought’s latest products and how they provide the foundation for scientific computing and artificial intelligence applications with Python, including:

We’ll also walk through specific use cases so you can quickly see how Enthought’s Enterprise Python tools can impact your workflows and productivity.

REGISTER  (if you can’t attend we’ll send all registrants a recording)


Presenter: Didrik Pinte, Chief Technical & Engineering Officer, Enthought

Related Blogs:

Blog: Enthought Announces Canopy 2.1: A Major Milestone Release for the Python Analysis Environment and Package Distribution (June 2017)

Blog: Enthought Presents the Canopy Platform at the 2017 American Institute of Chemical Engineers (AIChE) Spring Meeting (April 2017)

Blog: New Year, New Enthought Products (Jan 2017)

Product pages:

The post Webinar: A Tour of Enthought’s Latest Enterprise Python Solutions appeared first on Enthought Blog.

by admin at July 12, 2017 03:27 PM

Continuum Analytics news

Continuum Analytics Named a 2017 Gartner Cool Vendor in Data Science and Machine Learning

Thursday, July 13, 2017

Data Science and AI platform, Anaconda, empowers leading businesses worldwide with solutions to transform data into intelligence 

AUSTIN, Texas—July 13, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading data science and AI platform powered by Python, today announced it has been included in the “Cool Vendors in Data Science and Machine Learning, 2017” report by Gartner, Inc. 

“We believe the addition of machine learning to Gartner’s Hype Cycle for Emerging Technologies in 2016 highlights the growing importance of data science across the enterprise,” said Scott Collison, chief executive officer of Continuum Analytics. “Data science has shifted from ‘emerging’ to ‘established’ and we’re seeing this evolution first-hand as Anaconda’s active user base of four million continues to grow. We are enabling future innovations; solving some of the world’s biggest challenges and uncovering answers to questions that haven’t even been asked yet.” 

Continuum Analytics recently released its newest version, Anaconda 4.4, featuring a comprehensive platform for Python-centric data science with a single-click installer for Windows, Mac, Linux and Power8. Anaconda 4.4 is also designed to make it easy to work with both Python 2 and Python 3 code. 

Gartner is the world's leading information technology research and advisory company. You can find the full report on Gartner’s site: https://www.gartner.com/document/3706738

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

About Anaconda Powered by Continuum Analytics

Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads and 4 million unique users to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Learn more at continuum.io.

by swebster at July 12, 2017 02:18 PM


July 10, 2017

Continuum Analytics news

Anaconda Expert to Discuss Data Science Best Practices at MIT CDOIQ Symposium

Tuesday, July 11, 2017

CAMBRIDGE, MA—July 11, 2017—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced that Solutions Architect Zach Carwile will speak at the 2017 MIT Chief Data Officer and Information Quality (MIT CDOIQ) Symposium on July 13 at 4:30pm EST. As CDOs and IQ Professionals take their place as the central players in the business of data, the MIT CDOIQ Symposium brings the brightest industry minds together to discuss and advance the big data landscape.

In his session, titled “Data Science is Just the First Step…,” Carwile will explore how organizations can empower their data science teams to build enterprise-grade data products for the analysts who drive business processes. Carwile will also discuss the benefits of containerization for data science projects and explain how establishing a robust deployment process maximizes the value and reach of data science investments.

WHO: Zach Carwile, solutions architect, Anaconda Powered By Continuum Analytics
WHAT: “Data Science is Just the First Step…”
WHEN: July 13, 4:30–5:10pm. EST
WHERE: Massachusetts Institute of Technology, Tang Building (Room E51), MIT East Campus, 70 Memorial Drive, Cambridge, MA, USA 02139
REGISTER: HERE

###

About Anaconda Powered by Continuum Analytics
Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence to solve the world’s most challenging problems. Anaconda puts superpowers into the hands of people who are changing the world. Learn more at continuum.io.

###

Media Contact:
Jill Rosenthal
InkHouse
anaconda@inkhouse.com

by swebster at July 10, 2017 06:23 PM


Paul Ivanov

SciPy 2017 tips

After missing it for a couple of years, I am happy to be back in Austin, TX for SciPy this week!

Always invigorating and exhilarating, Scientific Computing with Python (SciPy) has remained a top quality venue for getting together with fellow Pythonistas, especially the academically-bent variety.

As a graduate student eight years ago, I was fortunate enough to receive sponsorship and attended my first SciPy - SciPy 2009. This was the last time it was held at CalTech in Pasadena, CA.

The following year, in 2010, at the first SciPy held in its now usual spot in Austin, TX, each attendee got a bottle of delicious salsa!

SciPy2010 Salsa Stack

Here are some of my thoughts about attending this wonderful conference.

Conference Tips

bring a sweatshirt -- Yes, I know Austin's hot, but at the AT&T center, they don't mess around and crank the air conditioning all the way up to 11!

join the slack group -- This year, there's a Slack group for SciPy: the link to join is in a pair of emails with the titles "Getting the most out of SciPy2017" and "Getting the most out of SciPy2017-UPDATED", both from SciPy2017 Organizers. So far at the tutorials slack has served as a useful back channel for communicating repo URLs and specific commands to run, signaling questions without interrupting the speakers' flow.

engage with others during the breaks, lunch, etc -- There are lots of tool authors here and we love chatting with users (and helping you become contributors and authors yourselves). Not your first SciPy and feeling "in-your-element"? Make the effort to invite others into the conversations and lunch outings you're having with old friends - we're all here because we care about this stuff.

take introvert breaks (and be considerate of others who may be doing the same) - I'm an introvert. Though I enjoy interacting with others (one-on-one or in small groups is best for me), it takes a lot of energy and at some point, I run out of steam. That's when I go for a walk, stepping away from the commotion to just have some quiet time.

be kind to yourself (especially at the sprints) -- Between the tutorials, all of the talks, and the sprints that follow, there will be a flurry of activity. Conferences are already draining enough without trying to get any work done, just meeting a bunch of new people and taking in a lot of information. It took a lot of false starts for me to have productive work output at sprints, but the best thing I've learned about them is to just let go of trying to get a lot done. Instead, try to get something small and well defined done or just help others.

Stuff to do in Austin

The conference already has a great list of Things to do in Austin, as well as Restaurants, so I'll just mention a few of my personal favorites.

Barton Springs Pool. Take a nice dip in the cool waters, and grab a delicious bite from one of the food trucks at The Picnic food truck park.

Go see the bats. The Congress Ave bridge in Austin is home to the largest urban bat colony in the world. You can read more about this here, but the short version is that around sunset (8-9pm) a large number of bats stream out from underneath the bridge to go feed on insects. Some days, they leave in waves (this Saturday there were two waves; the first was smaller, but many people left thinking that was the entire show).

I hope you enjoy SciPy2017!

by Paul Ivanov at July 10, 2017 07:00 AM

July 07, 2017

numfocus

FEniCS Conference 2017 in Review

Jack Hale of FEniCS Project was kind enough to share his summary of the recent 2017 FEniCS conference, for which NumFOCUS provided some travel funds. Read on below! The FEniCS Conference 2017 brought together 82 participants from around the world for a conference on the FEniCS Project, a NumFOCUS fiscally sponsored project. FEniCS is a open-source computing […]

by Gina Helfrich at July 07, 2017 06:10 PM

July 05, 2017

numfocus

Belinda Weaver joins Carpentries as Community Development Lead

Belinda Weaver was recently hired by the Carpentries as their new Community Development Lead. We are delighted to welcome her to the NumFOCUS family! Here, Belinda introduces herself to the community and invites your participation and feedback on her work. I am very pleased to take up the role of Community Development Lead for Software […]

by Gina Helfrich at July 05, 2017 10:05 PM