August 23, 2016

Enthought

Webinar: Introducing the NEW Python Integration Toolkit for LabVIEW

Dates: Thursday, August 25, 2016, 1:00-1:45 PM CT or Wednesday, August 31, 2016, 9:00-9:45 AM CT / 3:00-3:45 PM BST

Register now (if you can’t attend, we’ll send you a recording)

LabVIEW is a software platform made by National Instruments, used widely in industries such as semiconductors, telecommunications, aerospace, manufacturing, electronics, and automotive for test and measurement applications. Earlier this month, Enthought released the Python Integration Toolkit for LabVIEW, which is a “bridge” between the LabVIEW and Python environments.

In this webinar, we’ll demonstrate:

  1. How the new Python Integration Toolkit for LabVIEW from Enthought seamlessly brings the power of the Python ecosystem of scientific and engineering tools to LabVIEW
  2. Examples of how you can extend LabVIEW with Python, including using Python for signal and image processing, cloud computing, web dashboards, machine learning, and more
Python Integration Toolkit for LabVIEW

Quickly and efficiently access scientific and engineering tools for signal processing, machine learning, image and array processing, web and cloud connectivity, and much more. With only minimal coding on the Python side, this extraordinarily simple interface provides access to all of Python’s capabilities.

Try it with your data, free for 30 days

Download a free 30 day trial of the Python Integration Toolkit for LabVIEW from the National Instruments LabVIEW Tools Network.

How LabVIEW users can benefit from Python:

  • High-level, general purpose programming language ideally suited to the needs of engineers, scientists, and analysts
  • Huge, international user base representing industries such as aerospace, automotive, manufacturing, military and defense, research and development, biotechnology, geoscience, electronics, and many more
  • Tens of thousands of available packages, ranging from advanced 3D visualization frameworks to nonlinear equation solvers
  • Simple, beginner-friendly syntax and fast learning curve

Register

by admin at August 23, 2016 02:44 PM

August 21, 2016

Titus Brown

Lessons on doing science from my father, Gerry Brown

(This is an invited chapter for a memorial book about my father. You can also read my remembrances from the day after he passed away.)

Dr. Gerald E. Brown was a well known nuclear physicist and astrophysicist who worked at Stony Brook University from 1968 until his death in 2013. He was internationally active in physics research from the late 1950s onwards, ran an active research group at Stony Brook until 2009, and supervised nearly a hundred PhD students during his life. He was also my father.

It's hard to write about someone who is owned, in part, by so many people. I came along late in my father's life (he was 48 when I was born), and so I didn't know him that well as an adult. However, growing up with a senior professor as a father had a huge impact on my scientific career, which I can recognize even more clearly now that I'm a professor myself.

Gerry (as I called him) didn't spend much time directly teaching his children about his work. When I was invited to write something for his memorial book, it was suggested that I write about what he had taught me about being a scientist. I found myself stymied, because to the best of my recollection we had never talked much about the practice of science. When I mentioned this to my oldest brother, Hans, we shared a good laugh -- he had exactly the same experience with our father, 20 years before me!

Most of what Gerry taught me was taught by example. Below are some of the examples that I remember most clearly, and of which I'm the most proud. While I don't know if either of my children will become scientists, if they do, I hope they take these to heart -- I can think of few better scientific legacies to pass on to them from my father.

Publishing work that is interesting (but perhaps not correct) can make for a fine career.

My father was very proud of his publishing record, but not because he was always (or even frequently) right. In fact, several people told me that he was somewhat notorious for having a 1-in-10 "hit rate" -- he would come up with many crazy ideas, of which only about 1 in 10 would be worth pursuing. However, that 1 in 10 was enough for him to have a long and successful career. That this drove some people nuts was merely an added bonus in his view.

Gerry was also fond of publishing controversial work. Several times he told me he was proudest of the papers that caused the most discussion and collected the most rebuttals. He wryly noted that these papers often gathered many citations, even when they turned out to be incorrect.

The best collaborations are both personal and professional friendships.

The last twenty-five years of Gerry's life were dominated by a collaboration with Hans Bethe on astrophysics, and they traveled to Pasadena every January until the early 2000s to work at the California Institute of Technology. During this month they lived in the same apartment, worked together closely, and met with a wide range of people on campus to explore scientific ideas; they also went on long hikes in the mountains above Pasadena (chronicled by Chris Adami in "Three Weeks with Hans Bethe"). These close interactions not only fueled his research for the remainder of the year, but emanated from a deep personal friendship. It was clear that, to Gerry, there was little distinction between personal and professional in his research life.

Science is done by people, and people need to be supported.

Gerry was incredibly proud of his mentoring record, and did his best to support his students, postdocs, and junior colleagues both professionally and personally. He devoted the weeks around Christmas each year to writing recommendation letters for junior colleagues. He spent years working to successfully nominate colleagues to the National Academy of Sciences. He supported junior faculty with significant amounts of his time and sometimes by forgoing his own salary to boost theirs. While he never stated it explicitly, he considered most ideas somewhat ephemeral, and thought that his real legacy -- and the legacy most worth having -- lay in the students and colleagues who would continue after him.

Always treat the administrative staff well.

Gerry was fond of pointing out that the secretaries and administrative staff had more practical power than most faculty, and that it was worth staying on their good side. This was less a statement of calculated intent and more an observation that many students, postdocs, and faculty treated non-scientists with less respect than they deserved. He always took the time to interact with them on a personal level, and certainly seemed to be well liked for it. I've been told by several colleagues who worked with Gerry that this was a lesson that they took to heart in their own interactions with staff, and it has also served me well.

Hard work is more important than brilliance.

One of Gerry's favorite quotes was "Success is 99% perspiration, 1% inspiration", a statement attributed to Thomas Edison. According to Gerry, he simply wasn't as smart as many of his colleagues, but he made up for it by working very hard. I have no idea how modest he was being -- he was not always known for modesty -- but he certainly worked very hard, spending 10-14 hours a day writing in his home office, thinking in the garden, or meeting with colleagues at work. While I try for more balance in my work and life myself, he demonstrated to me that sometimes grueling hard work is a necessity when tackling tricky problems: for example, my Earthshine publications came after a tremendously unpleasant summer working on some incredibly messy and very tedious analysis code, but without the resulting analysis we wouldn't have been able to advance the project (which continues today, almost two decades later).

Experiments should talk to theory, and vice versa.

Steve Koonin once explained to me that Gerry was a phenomenologist -- a theorist who worked well with experimental data -- and that this specialty was fairly rare because it required communicating effectively across subdisciplines. Gerry wasn't attracted to deep theoretical work and complex calculations, and in any case liked to talk to experimentalists too much to be a good theorist -- for example, some of our most frequent dinner guests when I was growing up were Peter Braun-Munzinger and Johanna Stachel, both experimentalists. So he chose to work at the interface of theory and experiment, where he could develop and refine his intuition based on competing world views emanating from the theorists (who sought clean mathematical solutions) and experimentalists (who had real data that needed to be reconciled with theory). I have tried to pursue a similar strategy in computational biology.

Computers and mathematical models are tools, but the real insight comes from intuition.

Apart from some early experience with punch cards at Yale in the 1950s, Gerry avoided computers and computational models completely in his own research (although his students, postdocs and collaborators used them, of course). I am told that his theoretical models were often relatively simple approximations, and he himself often said that his work with Hans Bethe proceeded by choosing the right approximation for the problem at hand -- something at which Bethe excelled. Their choice of approximation was guided by intuition about the physical nature of the problem as much as by mathematical insight, and they could often use a few lines of the right equations to reach results similar to complex computational and mathematical models. This search for simple models and the utility of physical intuition in his research characterized many of our conversations, even when I became more mathematically trained.

Teaching is largely about conveying intuition.

Once a year, Gerry would load up a backpack with mason jars full of thousands of pennies, and bring them into his Statistical Mechanics class. This was needed for one of his favorite exercises -- a hands-on demonstration of the Law of Large Numbers and the Central Limit Theorem, which lie at the heart of thermodynamics and statistical mechanics. He would have students flip 100 coins and record the average, and then do it again and again, and have the class plot the distributions of results. The feedback he got was that this was a very good way of viscerally communicating the basics of statistical mechanics to students, because it built their intuition about how averages really worked. This approach has carried through to my own teaching and training efforts, where I always try to integrate hands-on practice with more theoretical discussion.
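
As a rough illustration of that exercise (a sketch only -- the class used real pennies, not a computer), one can simulate many rounds of 100 coin flips and watch how the round averages cluster around one half:

import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=(1000, 100))  # 1000 rounds of 100 fair coin flips
averages = flips.mean(axis=1)                 # average of each round of 100 flips

print(averages.mean())  # close to 0.5 (Law of Large Numbers)
print(averages.std())   # close to sqrt(0.25 / 100) = 0.05 (Central Limit Theorem)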

Benign neglect is a good default for mentoring.

Gerry was perhaps overly fond of the concept of "benign neglect" in parenting, in that much of my early upbringing was at the hands of my mother with only occasional input from him. However, in his oft-stated experience (and now mine as well), leaving smart graduate students and postdocs to their own devices most of the time was far better than trying to actively manage (or interfere in) their research for them. I think of it this way: if I tell my students what to do and I'm wrong (which is likely, research being research), then they either do it (and I suffer for having misdirected them) or they don't do it (and then I get upset at them for ignoring me). But if I don't tell my students what to do, then they usually figure out something better for themselves, or else get stuck and then come to me to discuss it. The latter two outcomes are much better from a mentoring perspective than the former two.

Students need to figure it out for themselves.

One of the most embarrassing (in retrospect) interactions I had with my father was during a long car ride where he tried to convince me that when x was a negative number, -x was positive. At the time, I didn't agree with this at all, which was probably because I was a stubborn 7-year-old. While it took me a few more years to understand this concept, by the time I was a math major I did have it down! Regardless, in this, and many other interactions around science, he never browbeat me about it or got upset at my stupidity or stubbornness. I believe this carried through to his interactions with his students. In fact, the only time I ever heard him express exasperation was with colleagues who were acting badly.

A small nudge at the right moment is sometimes all that is needed.

A pivotal moment in my life came when Gerry introduced me to Mark Galassi, a physics graduate student who also was the systems administrator for the UNIX systems in the Institute for Theoretical Physics at Stony Brook; Mark found out I was interested in computers and gave me access to the computer system. This was one of the defining moments in my research life, as my research is entirely computational! Similarly, when I took a year off from college, my father put me in touch with Steve Koonin, who needed a systems administrator for a new project; I ended up working with the Earthshine project, which was a core part of my research for several years. And when I was trying to decide what grad schools to apply to, Gerry suggested I ask Hans Bethe and Steve Koonin what they thought was the most promising area of research for the future -- their unequivocal answer was "biology!" This drove me to apply to biology graduate schools, get a PhD in biology, and ultimately led to my current faculty position. In all these cases, I now recognize the application of a light touch at the right moment, rather than the heavy-handed guidance that he must have desperately wanted to give at times.

Conclusions

There are many more personal stories that could be told about Gerry Brown, including his (several, and hilarious) interactions with the East German secret police during the cold war, his (quite bad) jokes, his (quite good) cooking, and his (enthusiastic) ballroom dancing, but I will save those for another time. I hope that his friends and colleagues will see him in the examples above, and will remember him fondly.

Acknowledgements

I thank Chris Adami, Erich Schwarz, Tracy Teal, and my mother, Elizabeth Brown, for their comments on drafts of this article.

by C. Titus Brown at August 21, 2016 10:00 PM

August 19, 2016

Continuum Analytics news

Dask for Institutions

Posted Tuesday, August 16, 2016

Introduction

Institutions use software differently than individuals. Over the last few months I’ve had dozens of conversations about using Dask within larger organizations like universities, research labs, private companies, and non-profit learning systems. This post provides a very coarse summary of those conversations and extracts common questions. I’ll then try to answer those questions.

Note: some of this post will be necessarily vague at points. Some companies prefer privacy. All details here are either in public Dask issues or have come up with enough institutions (say at least five) that I’m comfortable listing the problem here.

Common story

Institution X, a university/research lab/company/… has many scientists/analysts/modelers who develop models and analyze data with Python, the PyData stack like NumPy/Pandas/SKLearn, and a large amount of custom code. These models/data sometimes grow to be large enough to need a moderately large amount of parallel computing.

Fortunately, Institution X has an in-house cluster acquired for exactly this purpose of accelerating modeling and analysis of large computations and datasets. Users can submit jobs to the cluster using a job scheduler like SGE/LSF/Mesos/Other.

However the cluster is still under-utilized and the users are still asking for help with parallel computing. Either users aren’t comfortable using the SGE/LSF/Mesos/Other interface, it doesn’t support sufficiently complex/dynamic workloads, or the interaction times aren’t good enough for the interactive use that users appreciate.

There was an internal effort to build a more complex/interactive/Pythonic system on top of SGE/LSF/Mesos/Other but it’s not particularly mature and definitely isn’t something that Institution X wants to pursue. It turned out to be a harder problem than expected to design/build/maintain such a system in-house. They’d love to find an open source solution that was well featured and maintained by a community.

The Dask.distributed scheduler looks like it’s 90% of the system that Institution X needs. However there are a few open questions:

  • How do we integrate dask.distributed with the SGE/LSF/Mesos/Other job scheduler?
  • How can we grow and shrink the cluster dynamically based on use?
  • How do users manage software environments on the workers?
  • How secure is the distributed scheduler?
  • Dask is resilient to worker failure, how about scheduler failure?
  • What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?
  • How do we handle multiple concurrent users and priorities?
  • How does this compare with Spark?

So for the rest of this post I’m going to answer these questions. As usual, few of the answers will be of the form “Yes, Dask can solve all of your problems.” These are open questions, not the questions that were easy to answer. We’ll get into what’s possible today and how we might solve these problems in the future.

How do we integrate dask.distributed with SGE/LSF/Mesos/Other?

It’s not difficult to deploy dask.distributed at scale within an existing cluster using a tool like SGE/LSF/Mesos/Other. In many cases there is already a researcher within the institution doing this manually by running dask-scheduler on some static node in the cluster and launching dask-worker a few hundred times with their job scheduler and a small job script.
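
For illustration, that manual pattern looks roughly like the sketch below (the scheduler address and connecting code are placeholders, not a prescribed recipe): dask-scheduler runs once on a static node, dask-worker is submitted many times through the job scheduler, and users connect from Python:

# Run once on a static node:     dask-scheduler
# Submit a few hundred times:    dask-worker scheduler-node:8786
# (both launched via SGE/LSF/Mesos/Other job scripts)

from dask.distributed import Executor

e = Executor('scheduler-node:8786')  # connect to the manually started scheduler
future = e.submit(sum, [1, 2, 3])    # ship work to the cluster
print(future.result())               # 6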

The goal now is to formalize this process for the particular version of SGE/LSF/Mesos/Other used within the institution, while also developing and maintaining a standard Pythonic interface so that all of these tools can be maintained cheaply by Dask developers into the foreseeable future. In some cases Institution X is happy to pay for the development of a convenient “start dask on my job scheduler” tool, but they are less excited about paying to maintain it forever.

We want Python users to be able to say something like the following:

from dask.distributed import Executor, SGECluster

c = SGECluster(nworkers=200, **options)
e = Executor(c)

… and have this same interface be standardized across different job schedulers.

How can we grow and shrink the cluster dynamically based on use?

Alternatively, we could have a single dask.distributed deployment running 24/7 that scales itself up and down dynamically based on current load. Again, this is entirely possible today if you want to do it manually (you can add and remove workers on the fly) but we should add some signals to the scheduler like the following:

  • “I’m under duress, please add workers”
  • “I’ve been idling for a while, please reclaim workers”

and connect these signals to a manager that talks to the job scheduler. This removes an element of control from the users and places it in the hands of a policy that IT can tune to play more nicely with their other services on the same network.
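
A hypothetical manager might look something like the sketch below. None of these helper functions exist in Dask today; they stand in for the proposed scheduler signals and for whatever commands the job scheduler exposes:

import time

def scale_loop(scheduler_load, add_workers, remove_workers,
               high=0.9, low=0.1, step=10, interval=60):
    """Hypothetical policy loop: poll a load signal from the scheduler and ask
    the job scheduler (SGE/LSF/Mesos/Other) to grow or shrink the worker pool."""
    while True:
        load = scheduler_load()    # assumed signal, e.g. pending tasks per worker
        if load > high:
            add_workers(step)      # e.g. submit more dask-worker jobs
        elif load < low:
            remove_workers(step)   # e.g. retire idle dask-worker jobs
        time.sleep(interval)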

How do users manage software environments on the workers?

Today Dask assumes that all users and workers share the exact same software environment. There are some small tools to send updated .py and .egg files to the workers but that’s it.
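
For example, Executor.upload_file is one of those small tools; it ships a local .py or .egg file out to every worker (the file name and scheduler address below are illustrative):

from dask.distributed import Executor

e = Executor('scheduler-node:8786')  # placeholder scheduler address
e.upload_file('my_module.py')        # workers can now import my_module in their tasks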

Generally Dask trusts that the full software environment will be handled by something else. This might be a network file system (NFS) mount on traditional cluster setups, or it might be handled by moving docker or conda environments around with some other tool, like knit for YARN deployments, or something more custom. For example, Continuum sells proprietary software that does this.

Getting the standard software environment set up generally isn’t such a big deal for institutions. They typically have some system in place to handle this already. Where things become interesting is when users want to use drastically different environments from the system environment, like using Python 2 vs Python 3 or installing a bleeding-edge scikit-learn version. They may also want to change the software environment many times in a single session.

The best solution I can think of here is to pass around fully downloaded conda environments using the dask.distributed network (it’s good at moving large binary blobs throughout the network) and then teaching the dask-workers to bootstrap themselves within this environment. We should be able to tear everything down and restart things within a small number of seconds. This requires some work; first to make relocatable conda binaries (which is usually fine but is not always fool-proof due to links) and then to help the dask-workers learn to bootstrap themselves.

Somewhat related, Hussain Sultan of Capital One recently contributed a dask-submit command to run scripts on the cluster: http://distributed.readthedocs.io/en/latest/submitting-applications.html

How secure is the distributed scheduler?

Dask.distributed is incredibly insecure. It allows anyone with network access to the scheduler to execute arbitrary code in an unprotected environment. Data is sent in the clear. Any malicious actor can both steal your secrets and then cripple your cluster.

This is entirely the norm however. Security is usually handled by other services that manage computational frameworks like Dask.

For example we might rely on Docker to isolate workers from destroying their surrounding environment and rely on network access controls to protect data access.

Because Dask runs on Tornado, a serious networking library and web framework, there are some things we can do easily, like enabling SSL, authentication, etc. However, I hesitate to jump into providing “just a little bit of security” without going all the way, for fear of providing a false sense of security. In short, I have no plans to work on this without a lot of encouragement. Even then I would strongly recommend that institutions couple Dask with tools intended for security. I believe that is common practice for distributed computational systems generally.

Dask is resilient to worker failure, how about scheduler failure?

Workers can come and go. Clients can come and go. The state in the scheduler is currently irreplaceable and no attempt is made to back it up. There are a few things you could imagine here:

  1. Backup state and recent events to some persistent storage so that state can be recovered in case of catastrophic loss
  2. Have a hot failover node that gets a copy of every action that the scheduler takes
  3. Have multiple peer schedulers operate simultaneously in a way that they can pick up slack from lost peers
  4. Have clients remember what they have submitted and resubmit when a scheduler comes back online

Currently option 4 is the most feasible and gets us most of the way there. However, options 2 or 3 would probably be necessary if Dask were ever to run as critical infrastructure in a giant institution. We’re not there yet.
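
As a rough sketch of option 4 (illustrative only; the scheduler address and job record are made up for this example), the client keeps its own record of what it submitted so the same work can be replayed against a restarted scheduler:

from dask.distributed import Executor

def submit_all(executor, jobs):
    """Submit a client-side record of jobs ({name: (function, args)}) so that
    the same record can be replayed if the scheduler is restarted."""
    return {name: executor.submit(func, *args) for name, (func, args) in jobs.items()}

jobs = {'total': (sum, ([1, 2, 3],))}  # the client remembers this record itself
e = Executor('scheduler-node:8786')    # placeholder scheduler address
futures = submit_all(e, jobs)

# After a scheduler restart: reconnect and replay the same record.
# e = Executor('scheduler-node:8786')
# futures = submit_all(e, jobs)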

As of recent work spurred on by Stefan van der Walt at UC Berkeley/BIDS, the scheduler can now die and come back and everyone will reconnect. The state for computations in flight is entirely lost, but the computational infrastructure remains intact so that people can resubmit jobs without significant loss of service.

Dask has a bit of a harder time with this topic because it offers a persistent stateful interface. This problem is much easier for distributed database projects that run ephemeral queries off of persistent storage, return the results, and then clear out state.

What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?

The short answer is no. Other than the number of cores and available RAM, all workers are considered equal to each other (except when the user explicitly specifies otherwise).

However this problem and problems like it have come up a lot lately. Here are a few examples of similar cases:

  1. Multiple data centers geographically distributed around the country
  2. Multiple racks within a single data center
  3. Multiple workers that have GPUs that can move data between each other easily
  4. Multiple processes on a single machine

Having some notion of hierarchical worker group membership or inter-worker preferred relationships is probably inevitable long term. As with all distributed scheduling questions the hard part isn’t deciding that this is useful, or even coming up with a sensible design, but rather figuring out how to make decisions on the sensible design that are foolproof and operate in constant time. I don’t personally see a good approach here yet but expect one to arise as more high priority use cases come in.

How do we handle multiple concurrent users and priorities?

There are several sub-questions here:

  • Can multiple users use Dask on my cluster at the same time?

Yes, either by spinning up separate scheduler/worker sets or by sharing the same set.

  • If they’re sharing the same workers then won’t they clobber each other’s data?

This is very unlikely. Dask is careful about naming tasks, so it’s very unlikely that the two users will submit conflicting computations that compute to different values but occupy the same key in memory. However if they both submit computations that overlap somewhat then the scheduler will nicely avoid recomputation. This can be very nice when you have many people doing slightly different computations on the same hardware. This works in the same way that Git works.
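
A minimal illustration of that key-sharing behavior (the scheduler address is a placeholder, and this assumes the default pure-function semantics for submitted tasks):

from dask.distributed import Executor

def inc(x):
    return x + 1

alice = Executor('scheduler-node:8786')  # two users sharing one scheduler
bob = Executor('scheduler-node:8786')

f1 = alice.submit(inc, 10)  # the key is derived from the function and its arguments
f2 = bob.submit(inc, 10)    # the same call produces the same key
print(f1.key == f2.key)     # expected: True, so the result is computed only once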

  • If they’re sharing the same workers then won’t they clobber each other’s resources?

Yes, this is definitely possible. If you’re concerned about this then you should give everyone their own scheduler/workers (which is easy and standard practice). There is not currently much user management built into Dask.

How does this compare with Spark?

At an institutional level Spark seems to primarily target ETL + Database-like computations. While Dask modules like Dask.bag and Dask.dataframe can happily play in this space, this doesn’t seem to be the focus of recent conversations.

Recent conversations are almost entirely around supporting interactive custom parallelism (lots of small tasks with complex dependencies between them) rather than the big Map->Filter->Groupby->Join abstractions you often find in a database or Spark. That’s not to say that these operations aren’t hugely important; there is a lot of selection bias here. The people I talk to are people for whom Spark/Databases are clearly not an appropriate fit. They are tackling problems that are way more complex, more heterogeneous, and with a broader variety of users.

I usually describe this situation with an analogy comparing “Big data” systems to human transportation mechanisms in a city. Here we go:

  • A Database is like a train: it goes between a set of well defined points with great efficiency, speed, and predictability. These are popular and profitable routes that many people travel between (e.g. business analytics). You do have to get from home to the train station on your own (ETL), but once you’re in the database/train you’re quite comfortable.
  • Spark is like an automobile: it takes you door-to-door from your home to your destination with a single tool. While this may not be as fast as the train for the long-distance portion, it can be extremely convenient to do ETL, Database work, and some machine learning all from the comfort of a single system.
  • Dask is like an all-terrain-vehicle: it takes you out of town on rough ground that hasn’t been properly explored before. This is a good match for the Python community, which typically does a lot of exploration into new approaches. You can also drive your ATV around town and you’ll be just fine, but if you want to do thousands of SQL queries then you should probably invest in a proper database or in Spark.

Again, there is a lot of selection bias here: if what you want is a database, then you should probably get a database. Dask is not a database.

This is also wildly over-simplifying things. Databases like Oracle have lots of ETL and analytics tools, Spark is known to go off road, etc. I obviously have a bias towards Dask. You really should never trust an author of a project to give a fair and unbiased view of the capabilities of the tools in the surrounding landscape.

Conclusion

That’s a rough sketch of current conversations and open problems for “How Dask might evolve to support institutional use cases.” It’s really quite surprising just how prevalent this story is among the full spectrum from universities to hedge funds.

The problems listed above are by no means halting adoption. I’m not listing the 100 or so questions that are answered with “yes, that’s already supported quite well”. Right now I’m seeing Dask being adopted by individuals and small groups within various institutions. Those individuals and small groups are pushing that interest up the stack. It’s still several months before any 1000+ person organization adopts Dask as infrastructure, but the speed at which momentum is building is quite encouraging.

I’d also like to thank the several nameless people who exercise Dask on various infrastructures at various scales on interesting problems and have reported serious bugs. These people don’t show up on the GitHub issue tracker but their utility in flushing out bugs is invaluable.

As interest in Dask grows it’s interesting to see how it will evolve. Culturally Dask has managed to simultaneously cater to both the open science crowd as well as the private-sector crowd. The project gets both financial support and open source contributions from each side. So far there hasn’t been any conflict of interest (everyone is pushing in roughly the same direction) which has been a really fruitful experience for all involved I think.

This post was originally published by Matt Rocklin on his website, matthewrocklin.com 

by ryanwh at August 19, 2016 04:14 PM

Mining Data Science Treasures with Open Source

Posted Wednesday, August 17, 2016
Travis Oliphant
Chief Executive Officer & Co-Founder


Data Science is a goldmine of potential insights for any organization, but unearthing those insights can be resource-intensive, requiring systems and teams to work seamlessly and effectively as a unit.

Integrating resources isn’t easy. Traditionally, businesses chose vendors with all-in-one solutions to cover the task. This approach may seem convenient, but what freedoms must be sacrificed in order to achieve it?

This vendor relationship resembles the troubled history of coal mining towns in the Old West, where one company would own everything for sale in the town. Workers were paid with vouchers that could only be redeemed at company-owned shops.

The old folk tune "Sixteen Tons" stated it best: "I owe my soul to the company store."

With any monopoly, these vendors have no incentive to optimize products and services. There's only one option available — take it or leave it. But, just as some miners would leave these towns and make their way across the Wild West, many companies have chosen to forge their own trails with the freedom of Open Data Science — and they've never looked back.

Open Data Science: Providing Options

Innovation and flexibility are vital to the evolving field of Data Science, so any alternative to the locked-in vendor approach is attractive. Fortunately, Open Data Science provides the perfect ecosystem of options for true innovation.

Sometimes vendors provide innovation, such as with the infrastructure surrounding linear programming. This doesn’t mean they’re able to provide an out-of-the-box solution for all teams — adapting products to different businesses and industries requires work.

Most of the real innovation is emerging from the open source world. The tremendous popularity of Python and R, for example, bolsters innovation on all kinds of analytics approaches.

Given the wish to avoid a “mining town scenario” and the burgeoning innovation in Open Data Science, why are so many companies still reluctant to adopt it?

Companies Should Not Hesitate to Embrace Open Source

There are several reasons companies balk at Open Data Science solutions:

  • Licensing. Open source has many licenses: Apache, BSD, GPL, MIT, etc. This wide array of choices can produce analysis paralysis. In some cases, such as GPL, there is a requirement to make source code available for redistribution.
  • Diffuse contact. Unlike with vendor products, open source doesn’t provide a single point of contact. It’s a non-hierarchical effort. Companies have to manage keeping software current, and this can feel overwhelming without a steady guide they can rely on.
  • Education. With rapid change, companies find it difficult to stay on top of the many acronyms, project names, and new techniques required with each piece of open source software.

Fortunately, these concerns are completely surmountable. Most licenses are appropriate for commercial applications, and many companies are finding open source organizations to act as a contact point within the Open Data Science world — with the added benefit of a built-in guide to the ever-changing landscape of open source, thereby also solving the problem of education.

The Best Approach for Starting an Open Data Science Initiative

There are several tactics organizations can use to effectively adopt Open Data Science.

For instance, education and establishing a serious training program is crucial. One focus here has to be on reproducibility. Team members should know how the latest graph was produced and how to generate the next iteration of it.

Much of this requires understanding the architecture of the applications one is using, so dependency management is important. Anything that makes the process transparent to the team will promote understanding and engagement.

Flexible governance models are also valuable, allowing intelligent assessment of open source pros and cons. For example, it shouldn’t be difficult to create an effective policy on what sort of open source licenses work best.

Finally, committing resources to successful adaptation and change management should be central to any Open Data Science arrangement. This will almost always require coding to integrate solutions into one’s workflow. But this effort also supports retention: companies that shun open source risk losing developers who seek the cutting edge.

Tackle the Future Before You're Left Behind

Unlike the old mining monopolies, competition in Data Science is a rapidly-changing world of many participants. Companies that do not commit resources to education, understanding, governance and change management risk getting left behind as newer companies commit fully to open source.

Well-managed open source analytics environments, like Anaconda, provide a compatible and updated suite of modern analytics programs. Combine this with a steady guide to traverse the changing landscape of Open Data Science, and the Data Science gold mine becomes ripe for the taking.

by ryanwh at August 19, 2016 04:12 PM

August 17, 2016

Matthieu Brucher

Building Boost.Python with a custom Python3

I’ve started working on porting some Python libraries to Python 3, but I needed to use an old Visual Studio (2012) for which there is no official Python 3 build. In the end, I followed this tutorial. The issue with the tutorial is that it has you download the externals by hand; it is actually simpler to call get_externals.bat from the PCBuild folder.

Be aware that the solution is a little bit flawed: pylauncher is built in win32 mode in Release instead of x64, which has an impact on deployment.

Once this was done, I had to deploy the build to a proper location so that it is self-contained. I drew heavily on another tutorial by the same author, only adding 64-bit support in this gist.

Once this was done, it was time to build Boost.Python! To start, just compile bjam the usual way; don’t add Python options on the command line, as this will utterly fail in Boost.Build. Then add the following line to user-config.jam (with the proper folders):

using python : 3.4 : D:/Tools/Python-3.4.5/_INSTALL/python.exe : D:/Tools/Python-3.4.5/_INSTALL/include : D:/Tools/Python-3.4.5/_INSTALL/libs ;

This should build the release and debug modes with these two commands:

.\b2 --with-python --layout=versioned toolset=msvc-11.0 link=shared stage address-model=64
.\b2 --with-python --layout=versioned toolset=msvc-11.0 link=shared stage address-model=64 python-debugging=on

by Matt at August 17, 2016 07:56 AM

August 16, 2016

Titus Brown

What I did on my summer vacation: Microbial Diversity 2016

I just left Woods Hole, MA, where I spent the last 6 and a half weeks taking the Microbial Diversity course as a student. It was fun, exhausting, stimulating, and life changing!

The course had three components: a lecture series, in which world-class microbiologists gave 2-3 hours of talks each morning, Monday through Saturday; a lab practical component, in which we learned to do environmental enrichments and isolations, use microscopes and mass specs, and everything else; and a miniproject, a self-directed project that took place during the last 2-3 weeks of the course, and for which we could make use of all of the equipment. The days typically went from 9am to 10pm, with a break every Sunday.

I think it's fair to say I've been microbiologized pretty thoroughly :).

The course directors were Dianne Newman and Jared Leadbetter, two Caltech professors that I knew of old (from my own days at Caltech). The main topic of the course was environmental microbiology and biogeochemistry, although we did cover some biomedical topics along with bioinformatics and genomics. The lectures were excellent, and there's only two or three for which I didn't take copious notes (or live-tweet under my student account, @tituslearnz). The other 18 students were amazing - most were grad students, with one postdoc and two other faculty. The course TAs were grad students, postdocs, and staff from the directors' labs as well as other labs. My just-former employee Jessica Mizzi had signed on as course coordinator, too, and it was great to see her basically running the course on a daily basis. Everyone was fantastic, and all in all, it was an incredibly intense and in-depth experience.

Why Microbial Diversity?

I applied to take the course for several reasons. The main one is this: even before I got involved in metagenomics at Michigan State, I'd been moonlighting on microbial genomics for quite a while. I have a paper on E. coli genomics, another paper on syntrophic interactions between methane oxidizing and sulfur reducing bacteria, a third paper on Shewanella oneidensis (with Dianne), and a fourth paper on bone-eating worm symbionts. While I do have a molecular biology background from my grad work in molecular developmental biology, these papers made me painfully aware that I knew essentially no microbial physiology. Yet I now work with metagenomics data regularly, and I'm starting to get into oceanography and marine biogeochemical cycling... I needed to know more about microbes!

On top of wanting to know more, I was also in a good position to take 6 weeks "off" to learn some new skills. I received tenure with my move to UC Davis, and with the Moore Foundation DDD Investigator Award, I have some breathing room on funding. Moreover, I have four new postdocs coming on board (I'll be up to 6 by September!) but they're not here yet, and most of the people in my lab were gone for some or all of the summer. And, finally, my wife had been invited to Woods Hole to lecture in both Microbial Diversity and the STAMPS course, so she would be there for three weeks.

I also expected it to be fun! I took the Embryology course back in 2005, and it was one of the best scientific experiences of my life, and it's one of the reasons I stuck with academia. The combination of intense intellectualism, the general Woods Hole gemisch of awesome scientists hanging out for the summer, and the beautiful Woods Hole weather and beaches makes it a summer mecca for science -- so why not go back?

What were the highlights?

The whole thing was awesome, frankly, although by week three I was pretty exhausted (and then it went on for another three weeks :).

One part of the experience that amazed me was how easy it was to find new microbes. We split into groups of 4-5 students for the first three weeks, and each group did about 15 enrichments and isolations from the environment, using different media and growth conditions. (Some of these also formed the basis for some miniprojects.) As soon as we got critters into pure culture, we would sequence their 16S regions and evaluate whether or not they were already known. From this, I personally isolated about 8 new species; some other students generated dozens of isolates.

My miniproject was super fun, too. (You can read my final report here and see my powerpoint presentation here.) I set out to enrich for sulfur oxidizing autotrophs but ended up finding some likely heterotrophs that grew anaerobically on acetate and nitrate. It was great fun to research media design, perform and "debug" enrichments, and throw machines and equipment at problems (I learned to use the anaerobic chamber, the gas chromatograph, HPLC, and ion chromatography, along with reacquainting myself with autoclaves and molecular biology). I ended up finding a microbe that probably does something relatively pedestrian (oxidizing acetate with nitrate as an electron acceptor), with the twist that I identified the microbe as a Shewanella species, and apparently Shewanella do not generally metabolize acetate anaerobically (although it has been seen before). I sent Jeff Gralnick some of my isolates and he'll look into it, and maybe we'll sequence and analyze the genomes as a follow-on.

I also got to play with Winogradsky columns! Kirsten G. and I put together some Winogradsky columns from environmental silt from Trunk River, a brackish marine ecosystem. We spent a lot of time poking them with microsensors and we're hoping to follow up on things with some metagenomics.

More importantly for my lab's research, I got even more thoroughly confused about the role and value of meta-omics than I had been before. There is an interesting gulf between how we know microbes work in the real world (in complex and rich ecosystems) and how we study them (frequently in isolates in the lab). I've been edging ever closer to working on ecosystem and molecular modeling along with data integration of many 'omic data types, and this course really helped me get a handle on what types of data people are generating and why. I also got a lot of insight into the different types of questions people want to answer, from questions of biogeochemical cycling to resilience and environmental adaptation to phylogeny. I am now better prepared to read papers across all of these areas and I'm hoping to delve deeper into these research areas in the next decade or so.

Also, Lisa Cohen from my lab visited the course with our MinION sequencer and we finally generated a bunch of real genomic data from two isolates! The genomes each resolved into two contigs, and we hope to publish them soon. You can read Lisa's extremely thorough blog post on the Nanopore sequencer here, and also see the tutorial she put together for dealing with nanopore data here.

A personal highlight: at the end of the course, the TAs gave me the Tenure Deadbeat award, for "the most disturbing degeneration from faculty to student and hopefully back." It's gonna go on my office wall.

Surely something wasn't great? Anything?

Well,

  • six weeks of cafeteria food was rough.

  • six weeks of dorm life was somewhat rough (ameliorated by an awesome roommate, Mike B. from Thea Whitman's lab).

  • there was no air conditioning (really, no airflow at all) in my dorm room. And it gets very hot and humid in Woods Hole at times. Doooooom.

    Luckily I was exhausted most of the time because otherwise getting to sleep would have been impossible instead of merely very difficult.

  • my family was in town for three weeks, but I couldn't spend much time with them, and I missed them when they weren't in town!

  • my e-mail inbox is now a complete and utter disaster.

...but all of these were more or less expected.

Interestingly, I ignored Twitter and much of my usual "net monitoring" for the summer and kind of enjoyed being in my little bubble. I doubt I'll be able to keep it up, though.

In sum

  • Awesome experience.
  • Learned much microbiology.
  • Need to learn more chemistry!

by C. Titus Brown at August 16, 2016 10:00 PM

Continuum Analytics news

Dask for Institutions

Posted Tuesday, August 16, 2016

Introduction

Institutions use software differently than individuals. Over the last few months I’ve had dozens of conversations about using Dask within larger organizations like universities, research labs, private companies, and non-profit learning systems. This post provides a very coarse summary of those conversations and extracts common questions. I’ll then try to answer those questions.

Note: some of this post will be necessarily vague at points. Some companies prefer privacy. All details here are either in public Dask issues or have come up with enough institutions (say at least five) that I’m comfortable listing the problem here.

Common story

Institution X, a university/research lab/company/… has many scientists/analysts/modelers who develop models and analyze data with Python, the PyData stack like NumPy/Pandas/SKLearn, and a large amount of custom code. These models/data sometimes grow to be large enough to need a moderately large amount of parallel computing.

Fortunately, Institution X has an in-house cluster acquired for exactly this purpose of accelerating modeling and analysis of large computations and datasets. Users can submit jobs to the cluster using a job scheduler like SGE/LSF/Mesos/Other.

However the cluster is still under-utilized and the users are still asking for help with parallel computing. Either users aren’t comfortable using the SGE/LSF/Mesos/Other interface, it doesn’t support sufficiently complex/dynamic workloads, or the interaction times aren’t good enough for the interactive use that users appreciate.

There was an internal effort to build a more complex/interactive/Pythonic system on top of SGE/LSF/Mesos/Other but it’s not particularly mature and definitely isn’t something that Institution X wants to pursue. It turned out to be a harder problem than expected to design/build/maintain such a system in-house. They’d love to find an open source solution that was well featured and maintained by a community.

The Dask.distributed scheduler looks like it’s 90% of the system that Institution X needs. However there are a few open questions:

  • How do we integrate dask.distributed with the SGE/LSF/Mesos/Other job scheduler?
  • How can we grow and shrink the cluster dynamically based on use?
  • How do users manage software environments on the workers?
  • How secure is the distributed scheduler?
  • Dask is resilient to worker failure, how about scheduler failure?
  • What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?
  • How do we handle multiple concurrent users and priorities?
  • How does this compare with Spark?

So for the rest of this post I’m going to answer these questions. As usual, few of answers will be of the form “Yes Dask can solve all of your problems.” These are open questions, not the questions that were easy to answer. We’ll get into what’s possible today and how we might solve these problems in the future.

How do we integrate dask.distributed with SGE/LSF/Mesos/Other?

It’s not difficult to deploy dask.distributed at scale within an existing cluster using a tool like SGE/LSF/Mesos/Other. In many cases there is already a researcher within the institution doing this manually by running dask-scheduler on some static node in the cluster and launching dask-worker a few hundred times with their job scheduler and a small job script.

The goal now is how to formalize this process for the individual version of SGE/LSF/Mesos/Other used within the institution while also developing and maintaining a standard Pythonic interface so that all of these tools can be maintained cheaply by Dask developers into the foreseeable future. In some cases Institution X is happy to pay for the development of a convenient “start dask on my job scheduler” tool, but they are less excited about paying to maintain it forever.

We want Python users to be able to say something like the following:

from dask.distributed import Executor, SGECluster

c = SGECluster(nworkers=200, **options)
e = Executor(c)

… and have this same interface be standardized across different job schedulers.

How can we grow and shrink the cluster dynamically based on use?

Alternatively, we could have a single dask.distributed deployment running 24/7 that scales itself up and down dynamically based on current load. Again, this is entirely possible today if you want to do it manually (you can add and remove workers on the fly) but we should add some signals to the scheduler like the following:

  • “I’m under duress, please add workers”
  • “I’ve been idling for a while, please reclaim workers”

and connect these signals to a manager that talks to the job scheduler. This removes an element of control from the users and places it in the hands of a policy that IT can tune to play more nicely with their other services on the same network.

How do users manage software environments on the workers?

Today Dask assumes that all users and workers share the exact same software environment. There are some small tools to send updated .py and .egg files to the workers but that’s it.

Generally Dask trusts that the full software environment will be handled by something else. This might be a network file system (NFS) mount on traditional cluster setups, or it might be handled by moving docker or conda environments around by some other tool like knit for YARN deployments or something more custom. For example Continuum sells proprietary software that does this.

Getting the standard software environment setup generally isn’t such a big deal for institutions. They typically have some system in place to handle this already. Where things become interesting is when users want to use drastically different environments from the system environment, like using Python 2 vs Python 3 or installing a bleeding-edge scikit-learn version. They may also want to change the software environment many times in a single session.

The best solution I can think of here is to pass around fully downloaded conda environments using the dask.distributed network (it’s good at moving large binary blobs throughout the network) and then teaching the dask-workers to bootstrap themselves within this environment. We should be able to tear everything down and restart things within a small number of seconds. This requires some work; first to make relocatable conda binaries (which is usually fine but is not always fool-proof due to links) and then to help the dask-workers learn to bootstrap themselves.

Somewhat related, Hussain Sultan of Capital One recently contributed a dask-submit command to run scripts on the cluster: http://distributed.readthedocs.io/en/latest/submitting-applications.html

How secure is the distributed scheduler?

Dask.distributed is incredibly insecure. It allows anyone with network access to the scheduler to execute arbitrary code in an unprotected environment. Data is sent in the clear. Any malicious actor can both steal your secrets and then cripple your cluster.

This is entirely the norm however. Security is usually handled by other services that manage computational frameworks like Dask.

For example we might rely on Docker to isolate workers from destroying their surrounding environment and rely on network access controls to protect data access.

Because Dask runs on Tornado, a serious networking library and web framework, there are some things we can do easily, like enabling SSL, authentication, etc. However, I hesitate to jump into providing “just a little bit of security” without going all the way, for fear of providing a false sense of security. In short, I have no plans to work on this without a lot of encouragement. Even then I would strongly recommend that institutions couple Dask with tools intended for security. I believe that is common practice for distributed computational systems generally.

Dask is resilient to worker failure, how about scheduler failure?

Workers can come and go. Clients can come and go. The state in the scheduler is currently irreplaceable and no attempt is made to back it up. There are a few things you could imagine here:

  1. Backup state and recent events to some persistent storage so that state can be recovered in case of catastrophic loss
  2. Have a hot failover node that gets a copy of every action that the scheduler takes
  3. Have multiple peer schedulers operate simultaneously in a way that they can pick up slack from lost peers
  4. Have clients remember what they have submitted and resubmit when a scheduler comes back online

Option 4 is currently the most feasible and gets us most of the way there. However, options 2 or 3 would probably be necessary if Dask were to ever run as critical infrastructure in a giant institution. We’re not there yet.

As of recent work spurred on by Stefan van der Walt at UC Berkeley/BIDS the scheduler can now die and come back and everyone will reconnect. The state for computations in flight is entirely lost but the computational infrastructure remains intact so that people can resubmit jobs without significant loss of service.

Dask has a bit of a harder time with this topic because it offers a persistent stateful interface. This problem is much easier for distributed database projects that run ephemeral queries off of persistent storage, return the results, and then clear out state.

What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?

The short answer is no. Other than number of cores and available RAM all workers are considered equal to each other (except when the user explicitly specifies otherwise).
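
One way a user can already special-case particular workers is with per-task restrictions; a minimal sketch, where the scheduler address, worker names and function are all made up:

from dask.distributed import Executor

e = Executor('scheduler-address:8786')   # placeholder address

def process(chunk):
    return sum(chunk)

# Restrict this task to a particular subset of workers
future = e.submit(process, [1, 2, 3], workers=['gpu-node-1', 'gpu-node-2'])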

However this problem and problems like it have come up a lot lately. Here are a few examples of similar cases:

  1. Multiple data centers geographically distributed around the country
  2. Multiple racks within a single data center
  3. Multiple workers that have GPUs that can move data between each other easily
  4. Multiple processes on a single machine

Having some notion of hierarchical worker group membership or inter-worker preferred relationships is probably inevitable long term. As with all distributed scheduling questions the hard part isn’t deciding that this is useful, or even coming up with a sensible design, but rather figuring out how to make decisions on the sensible design that are foolproof and operate in constant time. I don’t personally see a good approach here yet but expect one to arise as more high priority use cases come in.

How do we handle multiple concurrent users and priorities?

There are several sub-questions here:

  • Can multiple users use Dask on my cluster at the same time?

Yes, either by spinning up separate scheduler/worker sets or by sharing the same set.

  • If they’re sharing the same workers then won’t they clobber each other’s data?

This is very unlikely. Dask is careful about naming tasks, so it’s very unlikely that the two users will submit conflicting computations that compute to different values but occupy the same key in memory. However if they both submit computations that overlap somewhat then the scheduler will nicely avoid recomputation. This can be very nice when you have many people doing slightly different computations on the same hardware. This works in the same way that Git works.
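
A rough illustration of that key-sharing behaviour, assuming both clients connect to the same scheduler (the address is a placeholder):

from dask.distributed import Executor

e1 = Executor('scheduler-address:8786')
e2 = Executor('scheduler-address:8786')

f1 = e1.submit(sum, [1, 2, 3])
f2 = e2.submit(sum, [1, 2, 3])

# Identical pure computations hash to the same task key, so the scheduler
# runs the work once and both clients share the result.
assert f1.key == f2.key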

  • If they’re sharing the same workers then won’t they clobber each other’s resources?

Yes, this is definitely possible. If you’re concerned about this then you should give everyone their own scheduler/workers (which is easy and standard practice). There is not currently much user management built into Dask.

How does this compare with Spark?

At an institutional level Spark seems to primarily target ETL + Database-like computations. While Dask modules like Dask.bag and Dask.dataframe can happily play in this space this doesn’t seem to be the focus of recent conversations.

Recent conversations are almost entirely around supporting interactive custom parallelism (lots of small tasks with complex dependencies between them) rather than the big Map->Filter->Groupby->Join abstractions you often find in a database or Spark. That’s not to say that these operations aren’t hugely important; there is a lot of selection bias here. The people I talk to are people for whom Spark/Databases are clearly not an appropriate fit. They are tackling problems that are way more complex, more heterogeneous, and with a broader variety of users.

I usually describe this situation with an analogy comparing “Big data” systems to human transportation mechanisms in a city. Here we go:

  • A Database is like a train: it goes between a set of well defined points with great efficiency, speed, and predictability. These are popular and profitable routes that many people travel between (e.g. business analytics). You do have to get from home to the train station on your own (ETL), but once you’re in the database/train you’re quite comfortable.
  • Spark is like an automobile: it takes you door-to-door from your home to your destination with a single tool. While this may not be as fast as the train for the long-distance portion, it can be extremely convenient to do ETL, Database work, and some machine learning all from the comfort of a single system.
  • Dask is like an all-terrain-vehicle: it takes you out of town on rough ground that hasn’t been properly explored before. This is a good match for the Python community, which typically does a lot of exploration into new approaches. You can also drive your ATV around town and you’ll be just fine, but if you want to do thousands of SQL queries then you should probably invest in a proper database or in Spark.

Again, there is a lot of selection bias here: if what you want is a database, then you should probably get a database. Dask is not a database.

This is also wildly over-simplifying things. Databases like Oracle have lots of ETL and analytics tools, Spark is known to go off-road, etc. I obviously have a bias towards Dask. You really should never trust an author of a project to give a fair and unbiased view of the capabilities of the tools in the surrounding landscape.

Conclusion

That’s a rough sketch of current conversations and open problems for “How Dask might evolve to support institutional use cases.” It’s really quite surprising just how prevalent this story is among the full spectrum from universities to hedge funds.

The problems listed above are by no means halting adoption. I’m not listing the 100 or so questions that are answered with “yes, that’s already supported quite well”. Right now I’m seeing Dask being adopted by individuals and small groups within various institutions. Those individuals and small groups are pushing that interest up the stack. It’s still several months before any 1000+ person organization adopts Dask as infrastructure, but the speed at which momentum is building is quite encouraging.

I’d also like to thank the several nameless people who exercise Dask on various infrastructures at various scales on interesting problems and have reported serious bugs. These people don’t show up on the GitHub issue tracker but their utility in flushing out bugs is invaluable.

As interest in Dask grows it’s interesting to see how it will evolve. Culturally Dask has managed to simultaneously cater to both the open science crowd as well as the private-sector crowd. The project gets both financial support and open source contributions from each side. So far there hasn’t been any conflict of interest (everyone is pushing in roughly the same direction) which has been a really fruitful experience for all involved I think.

This post was originally published by Matt Rocklin on his website, matthewrocklin.com 

by swebster at August 16, 2016 02:51 PM


August 12, 2016

William Stein

Jupyter: "take the domain name down immediately"

The Jupyter notebook is an open source BSD-licensed browser-based code execution environment, inspired by my early work on the Sage Notebook (which we launched in 2007), which was in turn inspired heavily by Mathematica notebooks and Google docs. Jupyter used to be called IPython.

SageMathCloud is an open source web-based environment for using Sage worksheets, terminals, LaTeX documents, course management, and Jupyter notebooks. I've put much hard work into making it so that multiple people can simultaneously edit Jupyter notebooks in SageMathCloud, and the history of all changes are recorded and browsable via a slider.

Many people have written to me asking for a modified version of SageMathCloud that is oriented around Jupyter notebooks instead of Sage worksheets: the default file type would be Jupyter notebooks, the default kernel wouldn't involve the extra heft of Sage, and the domain name would involve Jupyter instead of "sagemath". Some people are dissuaded from using SageMathCloud for Jupyter notebooks because of the "SageMath" name.

Dozens of web applications (including SageMathCloud) use the word "Jupyter" in various places. However, I was unsure about using "jupyter" in a domain name. I found this github issue and requested clarification 6 weeks ago. We've had some back and forth, but they recently made it clear that it would be at least a month until any decision would be considered, since they are too busy with other things. In the meantime, I rented jupytercloud.com, which has a nice ring to it, as the planet Jupiter has clouds. Yesterday, I made jupytercloud.com point to cloud.sagemath.com to see what it would "feel like" and Tim Clemans started experimenting with customizing the page based on the domain name that the client sees. I did not mention jupytercloud.com publicly anywhere, and there were no links to it.

Today I received this message:

    William,

    I'm writing this representing the Jupyter project leadership
    and steering council. It has recently come to the Jupyter
    Steering Council's attention that the domain jupytercloud.com
    points to SageMathCloud. Do you own that domain? If so,
    we ask that you take the domain name down immediately, as
    it uses the Jupyter name.

I of course immediately complied. It is well within their rights to dictate how their name is used, and I am obsessive about scrupulously doing everything I can to respect people's intellectual property; with Sage we have put huge amounts of effort into honoring both the letter and spirit of copyright statements on open source software.

I'm writing this because it's unclear to me what people really want, and I have no idea what to do here.

1. Do you want something built on the same technology as SageMathCloud, but much more focused on Jupyter notebooks?

2. Does the name of the site matter to you?

3. What model should the Jupyter project use for their trademark? Something like Python? Like Git? Like Linux? Like Firefox? Like the email program PINE? Something else entirely?

4. Should I be worried about using Jupyter at all anywhere? E.g., in this blog post? As the default notebook for the SageMath project?

I appreciate any feedback.

Hacker News Discussion

UPDATE (Aug 12, 2016): The official decision is that I cannot use the domain jupytercloud.com.   They did say I can use jupyter.sagemath.com or sagemath.com/jupyter.   Needless to say, I'm disappointed, but I fully respect their (very foolish, IMHO) decision.


by William Stein (noreply@blogger.com) at August 12, 2016 10:46 AM

August 10, 2016

Continuum Analytics news

The Citizen Data Scientist Strikes Back

Posted Monday, August 8, 2016
Michele Chambers
EVP Anaconda Business Unit & CMO

How Open Data Science is Giving Jedi Superpowers to the World

As the Data Science galaxy starts to form, novel concepts will inevitably emerge—ones that we cannot predict from today’s limited vantage point.
 
One of the latest of these is the “citizen data scientist.” This person leverages the latest data science technologies without a formal background in math or computer science. The growing community of citizen data scientists––engineers, analysts, scientists, economists and many others––far outnumbers the mythical unicorn data scientists and, if equipped with the right technology and solutions, will contribute new discoveries, new revenue, new innovations, new policies and more, driving value in the marketplace through productivity and efficiency.
 
Unlike previous technology periods, where only specialized gurus could engage with innovation, Open Data Science empowers anyone through a wealth of highly accessible and innovative open source technology. As a result, a new generation of Jedis is being armed with greater intelligence to change the world from the near-infinite applications of data science today.

Open Data Science Puts the Power of Intelligence in the Citizens’ Hands

In the past, data and analytics were sequestered into a back corner for highly specialized teams to mine and unearth nuggets that led to epiphanies from mounds of data. Not only is data liberated today, but data science itself is also being liberated. The lynchpin of data science liberation is making data science accessible via familiar and consumable applications. Now, any curious analyst, engineer or other user of Open Data Science has the power to make informed decisions with real evidence and, most importantly, intelligent power that leverages not just predictive analytics, but machine learning and, increasingly, deep learning. Armed with this saber of intelligence, citizen data scientists will change everyday tasks in business and life.
 
Take, for example, the TaxBrain project, a striking example of the power of Open Data Science. Using open source resources like Anaconda, the Open Source Policy Center (OSPC) launched a publicly available web application that allows anyone to explore how changes to tax policies affect government revenues and taxpayers. This browser-based software allows journalists, consumers, academics, economists––all citizen data scientists––to understand the implications of tax policy changes.
 
With Open Data Science underlying the creation of TaxBrain, anyone can run highly-sophisticated mathematical models of the complex U.S. tax code without having to know the details or code underlying it.
 
That’s truly empowering—and the range of potential data science applications doesn’t end there.

Making Data Intuitive with Modern Interactive Visualizations

Open Data Science is ushering in a golden age of consumability through interactive visualizations that engage and activate citizen data scientists. Our minds are not optimized for holding more than around seven numbers, let alone the tens of millions that are part and parcel to displaying Big Data. However, the human visual system is much better at wrangling complexity. The more these visualization methods progress, the more intuition—from anyone, not just experts—can be brought to bear on emerging challenges with visualizing Big Data. 
 
High-tech and brain-optimized visual interactions were foreshadowed in the 2002 film "Minority Report" where the hero, portrayed by Tom Cruise, uses a holographic display to rapidly manipulate data by flicking through the images with his hands, physically interacting with and manipulating the data as he experiences it. This interface allowed him to quickly analyze complex data as intuitive, beautiful visualizations.
 
Similarly, Open Data Science currently offers beautiful visualizations that allow citizen data scientists to engage with massive data sets. Datashader and Bokeh - delivered with Anaconda, the leading Open Data Science platform - provide intuitive visualizations using novel approaches to represent data—all based on the functionality of the human visual system. Analysts can see the rich texture of complex data without being overwhelmed by what they see, which makes data science more accessible. 

Open Data Science Allows Us to Focus on Essentials

We are approaching a future where complexity and uncertainty are finally embedded into intelligent applications that everyone can use without having to understand the underlying math and computer science. By arming citizen data scientists with these powerful intelligent applications, we give everyone a virtual supercomputer at their fingertips that closes the gap between status quo predictive analytics and the power of human cognition and “intuition.” As we gain confidence in these intelligent apps, we will embed the intelligence into devices and applications that automatically adapt to ever-changing conditions to determine the best course, or set of courses, of action.

Take the budgeting process of a company, for example. Various projects, hurdle rates, costs, staffing and more can be analyzed to recommend the best use of corporate resources––cash, people, time––eliminating political jockeying while balancing conflicting objectives such as maximizing ROI and minimizing costs. The actual execution results are fed into future budgeting so that new budgets adapt based on the execution capabilities of the organization. Streamlining decisional overhead allows citizen data scientists to focus on what matters.

Through the interoperability of Open Data Science, separate analysis components will work together like perfectly interlocking Lego pieces. That’s a multiplier of value, limited only by the imagination of visionary citizen data scientists.

Building the Empire with Open Data Science

Open Data Science is ushering in the era of the citizen data scientist. It’s an exhilarating time for organizations, ripe with innovative opportunities. But, managing the transition can seem daunting.
 
How can organizations embrace Open Data Science solutions, while avoiding a quagmire of technical, process and legal issues? These points are addressed in a recent webinar, The Journey to Open Data Science, with content freely available.
 
Exciting changes are coming, brought to us by the citizen data scientists of the future.

 

by ryanwh at August 10, 2016 02:42 PM

August 09, 2016

Continuum Analytics news

Anaconda Build Migration to Conda-Forge

Posted Wednesday, August 10, 2016

Summary

In a prior blog post, we shared with the community that we’re deprecating support for the Anaconda Build system on Anaconda Cloud. To help navigate the transition, we’re providing some additional guidance on suggested next steps.

What is Continuum’s Recommended Approach? 

The Anaconda Cloud team recommends using the community-based conda-forge for open source projects. Conda-forge leverages Github to create high-quality recipes, and Circle-CI, Travis-CI, and Appveyor to produce conda packages on Linux, MacOS and Windows with a strong focus on automation and reproducibility.

Why conda-forge?

Here are some of the benefits we see from using conda-forge: 

  1. Free: Conda-forge is a solidly supported community effort to offer free conda package builds. 
  2. Easier for your users: By building your package with conda-forge, you will be distributing your package in the conda-forge channel. This makes it easier for your users to access your package via the next-most-common place to get packages after the Anaconda defaults channel.
  3. Easier for you: You no longer have to set up extra CI or servers just for building conda packages. Instead, you can take advantage of an existing set of build machinery through the simplicity of creating a Pull Request in Github.
  4. Cross-Platform: If you were using Anaconda Build’s free public queue, only Linux builds were available. With conda-forge, you’ll get support for building packages not just on Linux, but also Windows and MacOS, opening up huge new audiences to your package.
  5. Better feedback: With many more people looking at your recipe, you’ll likely get additional feedback to help improve your package.  
  6. Evolve with best practices: As best practices around package building evolve, these standards will automatically be applied to your recipe(s), so you will be growing and improving in step with the community. 

To read more about conda-forge, see this guest blog post.

What is the High-Level Migration Approach? 

If you were using Anaconda Build’s public Queue (Linux worker) and you had a working recipe previously, the migration process should be straightforward for you:

  1. Grab a copy of conda-forge's staged recipes repo
  2. Copy over your existing recipe into the staged-recipes/recipes folder
  3. Take a look at the example recipe in the staged-recipes/recipes folder, and adapt your recipe to the suggested style there.
  4. Submit a PR for the new recipe against conda-forge’s staged-recipes repo.
  5. After review of your PR, a conda-forge moderator will merge in the PR.  
  6. Automated scripts will create a new feedstock repo and set up permissions for you on that repo.

Migration Tasks

This section walks through in more detail what the migration tasks are. It is a slightly expanded version of what you’ll find in the conda-forge Staged Recipes Repo readme.

Previously with Anaconda Build, you had…

  1. conda.recipe/: containing at least meta.yaml, and possibly build/test scripts and resources (patches)
  2. .binstar.yml: no longer used going forward

To get started with conda-forge...

  1. Fork https://github.com/conda-forge/staged-recipes
    • git clone https://github.com/<USERNAME>/staged-recipes
    • cd staged-recipes && git checkout -b <PKG-NAME>
    • cp -r <old repo>/conda.recipe recipes/<PKG-NAME>
    • Edit your recipe, based on the example recipe in recipes/
    • git add recipes/<PKG-NAME>
    • git commit -m "Add <PKG-NAME> recipe"
    • git push origin <PKG-NAME>
  2. Create a Pull Request to conda-forge/staged-recipes from your new branch
    • The conda-forge linter bot will post a message about any potential issues or style suggestions it has on your recipe.
    • CI scripts will automatically build against Linux, MacOS, and Windows.
      • You can skip platforms by adding lines with selectors in your meta.yaml, such as:

            build:
              skip: True  # [win]
  3. Reviewers will suggest improvements to your recipe.
  4. When all builds succeed, it’s ready to be reviewed by a community moderator for merge.
  5. The moderator will review and merge your new staged-recipes.  From here, automated scripts create a new repo (called a feedstock) just for your package, with the github id’s from the recipe-maintainers section added as maintainers. 
  6. As mentioned above, as best practices evolve, automated processes will periodically submit PR’s to your feedstock.  It is up to you and any other administrators to review and merge these PRs.

Example Packages

The following list describes a few prototypical scenarios you may find helpful as a reference:

  • You can turn off a specific platform with build: skip: True  # [<platform>], e.g. # [win]
  • Use a PyPI release upstream with a published sha256 checksum
  • If using post/pre-link scripts, make sure they don’t produce lots of output; write to .messages.txt instead
  • If you are building packages with compiled components (pure C libraries, Python packages with compiled extensions, etc.), use the toolchain package
  • If you need custom Linux packages that are not available as conda packages, specify them with yum_requirements.txt

What to be Aware of 

We’re really excited about conda-forge and supportive of its efforts. However, these are some things to keep in mind as you explore if it’s a fit for your project: 

  • If you were using Anaconda Build as your continuous integration service, you probably do have to set up a new workflow; the astropy ci-helpers are very helpful. Conda-forge is not your project’s CI!
  • The upshot is that you should only be building in conda-forge when you are ready to do a release. If you do try to use conda-forge as CI, you will likely get frustrated, because your builds will take longer than you want and it won’t feel like a CI system.
  • When you submit a new package, you (and possibly the people you recruit) are signing up to be a maintainer of this package. The community expectation is that you will continue to update your package with respect to the upstream dependencies. The implied message is that conda-forge is not a great place to add hobby projects that you have no intention of maintaining.
  • If your recipe doesn’t meet the community quality guidelines, you are likely to get community feedback with the expectation that you’ll make changes.

Alternatives to be Aware of

Creating reproducible, automated, cross-platform builds is easier today than in the past thanks to virtualization and containerization, but it is still challenging due to proprietary licenses and compiler toolchains. We’ve got some examples of conda build environments that can make it easier to build your own packages virtually with Vagrant:

...and Docker…

If you are thinking about how to tackle this problem at scale for private/commercial projects, or internally behind the firewall, please contact us to find out how we can help. 

Getting Started and Getting Involved

  1. General info: Read more about conda-forge: https://conda-forge.github.io/ 
  2. Create your recipe: start with the readme here: https://github.com/conda-forge/staged-recipes
  3. For general Getting Started issues, try the FAQ on the staged-recipes wiki.
  4. Getting help: Non-package-specific issues (e.g. best practices, guidance) are best raised on the conda-forge.github.io issue tracker. There is also a gitter chat channel: https://gitter.im/conda-forge/conda-forge.github.io.
  5. News: To find out how the community is evolving, stay tuned to the meeting minutes on hackpad, and feel free to join meetings there. Advance notice is important, though - meetings are often hosted on Google Hangouts, which has a 10-person cap. We have alternate avenues, but need to set them up.

What’s Your Experience?

If you’ve been using conda-forge or completed a migration, let us know what your experience has been in the comments section below, or on Twitter @ContinuumIO

by swebster at August 09, 2016 07:29 PM

August 05, 2016

Spyder

Qt Charts (PyQt5.QtChart) as (Py)Qwt replacement

I've just discovered an interesting alternative to PyQwt for plotting curves efficiently: Qt Charts, a Qt module which has existed at least since 2012 according to the PyQtChart changelog (PyQtChart is a set of Python bindings for the Qt Charts library).

Here is a simple example which is as efficient as PyQwt (and much more efficient than PythonQwt):
(the associated Python code is here)
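
The linked code isn't reproduced in this feed, but a minimal, self-contained sketch along the same lines (assuming PyQt5 and PyQtChart are installed; this is not the author's exact example) would look roughly like:

import sys
import numpy as np
from PyQt5.QtWidgets import QApplication, QMainWindow
from PyQt5.QtChart import QChart, QChartView, QLineSeries
from PyQt5.QtGui import QPainter

app = QApplication(sys.argv)

# Build a line series from NumPy data
series = QLineSeries()
for xi in np.linspace(0, 10, 1000):
    series.append(float(xi), float(np.sin(xi)))

chart = QChart()
chart.addSeries(series)
chart.createDefaultAxes()
chart.setTitle("Curve plotted with Qt Charts")

view = QChartView(chart)
view.setRenderHint(QPainter.Antialiasing)

win = QMainWindow()
win.setCentralWidget(view)
win.resize(800, 500)
win.show()
sys.exit(app.exec_())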



by Pierre (noreply@blogger.com) at August 05, 2016 01:59 AM

August 01, 2016

Continuum Analytics news

Powering Down the Anaconda Build CI System on Anaconda Cloud

Posted Monday, August 1, 2016

**UPDATES [Aug 10, 2016]: 

  • For our on-premises customers, the need may exist for in-house build/packaging and CI capabilities, and we will prioritize these capabilities in our roadmap in accordance with customer demand.
  • We’ve added a blog post here to help people assess if conda-forge is a good alternative and migrate to it if so.

**end updates

We are reaching out to our committed base of Anaconda Cloud users to let you know we’ve made the decision to deprecate the Anaconda Build service within Anaconda Cloud as of September 1, 2016. Please note that Anaconda Build is not to be confused with conda-build, which will continue to be supported and maintained as a vital community project. The remaining functionality within Anaconda Cloud will continue to be available to the Anaconda user community to host, share and collaborate via conda packages, environments and notebooks.

Anaconda Build is a continuous integration service that is part of the Anaconda platform (specifically, Anaconda Cloud and Anaconda Repository). Anaconda Build helps conda users and data scientists automatically build and upload cross-platform conda packages as part of their project’s development and release workflow.

Anaconda Build was initially launched [in effect as a beta product] to support internal needs of the Anaconda platform teams and to explore the combination of a continuous integration service with the flexibility of conda packages and conda-build. While we were developing and supporting Anaconda Build, it was gratifying to see significant community effort around conda-forge to solve issues around cross-platform package-build workflows with conda and the formation of a community-driven collection of conda recipes to serve the open data science community. The resulting conda package builds from conda-forge are hosted on Anaconda Cloud.

We feel that the Anaconda community efforts around conda-forge (along with Travis CI, CircleCI, and Appveyor) are in a better position to deliver reliable package building services compared to our internal efforts with Anaconda Build. For both Anaconda community users and Anaconda platform subscribers, alternative solutions for automated conda package builds can be developed using the previously mentioned services.

The documentation for Anaconda Build will continue to be available in the Anaconda Cloud documentation for a short period of time after Anaconda Build is deprecated but will be removed in the near future.

If you have any questions about the deprecation of Anaconda Build or how to transition your conda package build workflow to conda-forge or other continuous integration services, please post on the Anaconda Cloud issue tracker on Github.

-The Anaconda Platform Product Team

by swebster at August 01, 2016 07:51 PM

Analyzing and Visualizing Big Data Interactively on Your Laptop: Datashading the 2010 US Census

Posted Tuesday, July 26, 2016
Jim Bednar
Continuum Analytics

The 2010 Census collected a variety of demographic information for all the more than 300 million people in the USA. A subset of this has been pre-processed by the Cooper Center at the University of Virginia, who produced an online map of the population density and the racial/ethnic makeup of the USA. Each dot in this map corresponds to a specific person counted in the census, located approximately at their residence. (To protect privacy, the precise locations have been randomized at the block level, so that the racial category can only be determined to within a rough geographic precision.)

Using Datashader on Big Data

The Cooper Center website delivers pre-rendered image tiles to your browser, which is fast, but limited to the plotting choices they made. What if you want to look at the data a different way - filter it, combine it with other data or manipulate it further? You could certainly re-do the steps they did, using their Python source code, but that would be a big project. Just running the code takes "dozens of hours" and adapting it for new uses requires significant programming and domain expertise. However, the new Python datashader library from Continuum Analytics makes it fast and fully interactive to do these kinds of analyses dynamically, using simple code that is easy to adapt to new uses. The steps below show that using datashader makes it quite practical to ask and answer questions about big data interactively, even on your laptop.

Load Data and Set Up

First, let's load the 2010 Census data into a pandas dataframe:

import pandas as pd
%%time  
df = pd.read_hdf('data/census.h5', 'census') 
df.race = df.race.astype('category')
    CPU times: user 13.9 s, sys: 35.7 s, total: 49.6 s
    Wall time: 1min 7s
df.tail()

 

            meterswest  metersnorth  race
306674999   -8922890.0    2958501.2     h
306675000   -8922863.0    2958476.2     h
306675001   -8922887.0    2958355.5     h
306675002   -8922890.0    2958316.0     h
306675003   -8922939.0    2958243.8     h

Loading the data from the HDF5-format file takes a minute, as you can see, which is by far the most time-consuming step. The output of .tail() shows that there are more than 300 million datapoints (one per person), each with a location in Web Mercator coordinates, and that the race/ethnicity for each datapoint has been encoded as a single character (where 'w' is white, 'b' is black, 'a' is Asian, 'h' is Hispanic and 'o' is other (typically Native American).
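
As a quick sanity check on that encoding (not shown in the original post), the per-category totals can be inspected directly:

df.race.value_counts()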

Let's define some geographic ranges to look at later and a default plot size.

USA = ((-13884029, -7453304), (2698291, 6455972))
LakeMichigan = ((-10206131, -9348029), (4975642, 5477059))
Chicago = (( -9828281, -9717659), (5096658, 5161298))
Chinatown = (( -9759210, -9754583), (5137122, 5139825))
NewYorkCity = (( -8280656, -8175066), (4940514, 4998954))
LosAngeles = ((-13195052, -13114944), (3979242, 4023720))
Houston = ((-10692703, -10539441), (3432521, 3517616))
Austin = ((-10898752, -10855820), (3525750, 3550837))
NewOrleans = ((-10059963, -10006348), (3480787, 3510555))
Atlanta = (( -9448349, -9354773), (3955797, 4007753))

x_range,y_range = USA

plot_width = int(900)
plot_height = int(plot_width*7.0/12)

 

Population Density

 

For our first examples, let's ignore the race data, focusing on population density alone.

Datashader works by aggregating an arbitrarily large set of data points (millions, for a pandas dataframe, or billions+ for a dask dataframe) into a fixed-size buffer that's the shape of your final image. Each of the datapoints is assigned to one bin in this buffer, and each of these bins will later become one pixel. In this case, we'll aggregate all the datapoints from people in the continental USA into a grid containing the population density per pixel:

import datashader as ds 
import datashader.transfer_functions as tf
from datashader.colors import Greys9, Hot, colormap_select as cm 
def bg(img): return tf.set_background(img,"black")

%%time 
cvs = ds.Canvas(plot_width, plot_height, *USA)
agg = cvs.points(df, 'meterswest', 'metersnorth')
    CPU times: user 3.97 s, sys: 12.2 ms, total: 3.98 s
    Wall time: 3.98 s

 

Computing this aggregate grid will take some CPU power (4-8 seconds on this MacBook Pro), because datashader has to iterate through the hundreds of millions of points in this dataset, one by one. But once the agg array has been computed, subsequent processing will now be nearly instantaneous, because there are far fewer pixels on a screen than points in the original database.

 

The aggregate grid now contains a count of the number of people in each location. We can visualize this data by mapping these counts into a grayscale value, ranging from black (a count of zero) to white (maximum count for this dataset). If we do this colormapping linearly, we can very quickly and clearly see...

%%time 
bg(tf.interpolate(agg, cmap = cm(Greys9), how='linear'))
    CPU times: user 25.6 ms, sys: 4.77 ms, total: 30.4 ms
    Wall time: 29.8 ms

 

...almost nothing. The amount of detail visible is highly dependent on your monitor and its display settings, but it is unlikely that you will be able to make much out of this plot on most displays. If you know what to look for, you can see hotspots (high population densities) in New York City, Los Angeles, Chicago and a few other places. For feeding 300 million points in, we're getting almost nothing back in terms of visualization.

The first thing we can do is prevent "undersampling." In the plot above, there is no way to distinguish between pixels that are part of the background and those that have low but nonzero counts; both are mapped to black or nearly black on a linear scale. Instead, let's map all values that are not background to a dimly visible gray, leaving the highest-density values at white - let's discard the first 25% of the gray colormap and linearly interpolate the population densities over the remaining range:

bg(tf.interpolate(agg, cmap = cm(Greys9,0.25), how='linear'))

The above plot at least reveals that data has been measured only within the political boundaries of the continental United States and that many areas in the mountainous West are so poorly populated that many pixels contained not even a single person (in datashader images, the background color is shown for pixels that have no data at all, using the alpha channel of a PNG image, while the specified colormap is shown for pixels that do have data). Some additional population centers are now visible, at least on some monitors. But, mainly what the above plot indicates is that population in the USA is extremely non-uniformly distributed, with hotspots in a few regions, and nearly all other pixels having much, much lower (but nonzero) values. Again, that's not much information to be getting out out of 300 million datapoints!

The problem is that of the available intensity scale in this gray colormap, nearly all pixels are colored the same low-end gray value, with only a few urban areas using any other colors. Thus, both of the above plots convey very little information. Because the data are clearly distributed so non-uniformly, let's instead try a nonlinear mapping from population counts into the colormap. A logarithmic mapping is often a good choice for real-world data that spans multiple orders of magnitude:

bg(tf.interpolate(agg, cmap = cm(Greys9,0.2), how='log'))

Suddenly, we can see an amazing amount of structure! There are clearly meaningful patterns at nearly every location, ranging from the geographic variations in the mountainous West, to the densely spaced urban centers in New England and the many towns stretched out along roadsides in the midwest (especially those leading to Denver, the hot spot towards the right of the Rocky Mountains).

Clearly, we can now see much more of what's going on in this dataset, thanks to the logarithmic mapping. Yet, the choice of 'log' was purely arbitrary, and one could easily imagine that other nonlinear functions would show other interesting patterns. Instead of blindly searching through the space of all such functions, we can step back and notice that the main effect of the log transform has been to reveal local patterns at all population densities -- small towns show up clearly even if they are just slightly more dense than their immediate, rural neighbors, yet large cities with high population density also show up well against the surrounding suburban regions, even if those regions are more dense than the small towns on an absolute scale.

With this idea of showing relative differences across a large range of data values in mind, let's try the image-processing technique called histogram equalization. Given a set of raw counts, we can map these into a range for display such that every available color on the screen represents about the same number of samples in the original dataset. The result is similar to that from the log transform, but is now non-parametric -- it will equalize any linearly or nonlinearly distributed data, regardless of the distribution:

bg(tf.interpolate(agg, cmap = cm(Greys9,0.2), how='eq_hist'))

Effectively, this transformation converts the data from raw magnitudes, which can easily span a much greater range than the dynamic range visible to the eye, to a rank-order or percentile representation, which reveals density differences at all ranges but obscures the absolute magnitudes involved. In this representation, you can clearly see the effects of geography (rivers, coastlines and mountains) on the population density, as well as history (denser near the longest-populated areas) and even infrastructure (with many small towns located at crossroads).

Given the very different results from the different types of plot, a good practice when visualizing any dataset with datashader is to look at both the linear and the histogram-equalized versions of the data; the linear version preserves the magnitudes but obscures the distribution, while the histogram-equalized version reveals the distribution while preserving only the order of the magnitudes, not their actual values. If both plots are similar, then the data is distributed nearly uniformly across the interval. But, much more commonly, the distribution will be highly nonlinear, and the linear plot will reveal only the envelope of the data - the lowest and the highest values. In such cases, the histogram-equalized plot will reveal much more of the structure of the data, because it maps the local patterns in the data into perceptible color differences on the screen, which is why eq_hist is the default colormapping.
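
Following that practice with the aggregate we already computed (reusing the bg, cm and Greys9 helpers defined above):

linear_img  = bg(tf.interpolate(agg, cmap = cm(Greys9,0.25), how='linear'))
eq_hist_img = bg(tf.interpolate(agg, cmap = cm(Greys9,0.25), how='eq_hist'))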

Because we are only plotting a single dimension, we can use the colors of the display to effectively reach a higher dynamic range, mapping ranges of data values into different color ranges. Here, we'll use the colormap with the colors interpolated between the named colors shown:

print(cm(Hot,0.2))
bg(tf.interpolate(agg, cmap = cm(Hot,0.2)))

     ['darkred', 'red', 'orangered', 'darkorange', 'orange', 'gold', 'yellow', 'white']

Such a representation can provide additional detail in each range, while still accurately conveying the overall distribution.

Because we can control the colormap, we can use it to address very specific questions about the data itself. For instance, after histogram equalization, data should be uniformly distributed across the visible colormap. Thus, if we want to highlight, for example, the top 1% of pixels (by population density), we can use a colormap divided into 100 ranges and simply change the top one to a different color:

import numpy as np
grays2 = cm([(i,i,i) for i in np.linspace(0,255,99)]) + ["red"]
bg(tf.interpolate(agg, cmap = grays2))

The above plot now conveys nearly all the information available in the original linear plot - that only a few pixels have the very highest population densities - while also conveying the structure of the data at all population density ranges via histogram equalization.

Categorical Data (Race)

Since we've got the racial/ethnic category for every pixel, we can use color to indicate the category value, instead of just extending dynamic range or highlighting percentiles, as shown above. To do this, we first need to set up a color key for each category label:

color_key = {'w':'aqua', 'b':'lime', 'a':'red', 'h':'fuchsia', 'o':'yellow' }

We can now aggregate the counts per race into grids, using ds.count_cat, instead of just a single grid with the total counts (which is what happens with the default aggregate reducer ds.count). We then generate an image by colorizing each pixel using the aggregate information from each category for that pixel's location:

def create_image(x_range, y_range, w=plot_width, h=plot_height, spread=0):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
    img = tf.colorize(agg, color_key, how='eq_hist')
    if spread: img = tf.spread(img,px=spread)
    return tf.set_background(img,"black")

The result shows that the USA is overwhelmingly white, apart from some predominantly Hispanic regions along the Southern border, some regions with high densities of blacks in the Southeast and a few isolated areas of category "other" in the West (primarily Native American reservation areas).

create_image(*USA)

Interestingly, the racial makeup has some sharp boundaries around urban centers, as we can see if we zoom in:

create_image(*LakeMichigan)

With sufficient zoom, it becomes clear that Chicago (like most large US cities) has both a wide diversity of racial groups, and profound geographic segregation:

create_image(*Chicago)

Eventually, we can zoom in far enough to see individual datapoints. Here we can see that the Chinatown region of Chicago has, as expected, very high numbers of Asian residents, and that other nearby regions (separated by features like roads and highways) have other races, varying in how uniformly segregated they are:

create_image(*Chinatown,spread=plot_width//400)

Note that we've used the tf.spread function here to enlarge each point to cover multiple pixels so that each point is clearly visible.

Other Cities, for Comparison

Different cities have very different racial makeup, but they all appear highly segregated:

create_image(*NewYorkCity)

create_image(*LosAngeles)

create_image(*Houston)

create_image(*Atlanta)

create_image(*NewOrleans)

create_image(*Austin)

Analyzing Racial Data Through Visualization

The racial data and population densities are visible in the original Cooper Center map tiles, but because we aren't just working with static images here, we can look at any aspect of the data we like, with results coming back in a few seconds, rather than days. For instance, if we switch back to the full USA and then select only the black population, we can see that blacks predominantly reside in urban areas, except in the South and the East Coast:

cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height)
agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))

bg(tf.interpolate(agg.sel(race='b'), cmap=cm(Greys9,0.25), how='eq_hist'))

(Compare to the all-race eq_hist plot at the start of this post.)

Or we can show only those pixels where there is at least one resident from each of the racial categories - white, black, Asian and Hispanic - which mainly highlights urban areas (compare to the first racial map shown for the USA above):

agg2 = agg.where((agg.sel(race=['w', 'b', 'a', 'h']) > 0).all(dim='race')).fillna(0) 
bg(tf.colorize(agg2, color_key, how='eq_hist'))

In the above plot, the colors still show the racial makeup of each pixel, but the pixels have been filtered so that only those with at least one datapoint from every race are shown.

We can also look at all pixels where there are more black than white datapoints, which highlights predominantly black neighborhoods of large urban areas across most of the USA, but also some rural areas and small towns in the South:

bg(tf.colorize(agg.where(agg.sel(race='w') < agg.sel(race='b')).fillna(0), color_key, how='eq_hist'))

Here the colors still show the predominant race in that pixel, which is black for many of these, but in Southern California it looks like there are several large neighborhoods where blacks outnumber whites, but both are outnumbered by Hispanics.
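
As a quick check of that observation, here is one more query of the same kind, a sketch reusing the per-race agg array from above (the variable name agg3 is just for illustration): pixels where Hispanic datapoints outnumber both white and black datapoints, colorized as before:

agg3 = agg.where((agg.sel(race='h') > agg.sel(race='w')) &
                 (agg.sel(race='h') > agg.sel(race='b'))).fillna(0)
bg(tf.colorize(agg3, color_key, how='eq_hist'))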

Notice how each of these queries takes only a line or so of code, thanks to the xarray multidimensional array library that makes it simple to do operations on the aggregated data. Anything that can be derived from the aggregates is visible in milliseconds, not the days of computing time that would have been required using previous approaches. Even calculations that require reaggregating the data only take seconds to run, thanks to the optimized Numba and dask libraries used by datashader.

Using datashader, it is now practical to try out your own hypotheses and questions, whether for the USA or for your own region. You can try posing questions that are independent of the number of datapoints in each pixel, since that varies so much geographically, by normalizing the aggregated data in various ways. Now that the data has been aggregated but not yet rendered to the screen, there is an infinite range of queries you can pose!
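
As a minimal sketch of one such normalization (again reusing the per-race agg from above), we can plot the fraction of each pixel's population that is Hispanic, which is independent of how many people live in that pixel; empty pixels become NaN and should simply be left as background:

hispanic_fraction = agg.sel(race='h') / agg.sum(dim='race')
bg(tf.interpolate(hispanic_fraction, cmap=cm(Greys9, 0.25), how='eq_hist'))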

Interactive Bokeh Plots Overlaid with Map Data

The above plots all show static images on their own. datashader can also be combined with plotting libraries, in order to add axes and legends, to support zooming and panning (crucial for a large dataset like this one!), and/or to combine datashader output with other data sources, such as map tiles. To start, we can define a Bokeh plot that shows satellite imagery from ArcGIS:

import bokeh.plotting as bp
from bokeh.models.tiles import WMTSTileSource

bp.output_notebook()

def base_plot(tools='pan,wheel_zoom,reset',webgl=False):
     p = bp.figure(tools=tools,
         plot_width=int(900), plot_height=int(500),
         x_range=x_range, y_range=y_range, outline_line_color=None,
         min_border=0, min_border_left=0, min_border_right=0,
         min_border_top=0, min_border_bottom=0, webgl=webgl)

     p.axis.visible = False
     p.xgrid.grid_line_color = None
     p.ygrid.grid_line_color = None

     return p

p = base_plot()

url="http://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.png" 
tile_renderer = p.add_tile(WMTSTileSource(url=url))
tile_renderer.alpha=1.0

We can then add an interactive plot that uses a callback to a datashader pipeline. In this pipeline, we'll use the tf.dynspread function to automatically increase the plotted size of each datapoint, once you've zoomed in so far that datapoints no longer have nearby neighbors:

from datashader.bokeh_ext import InteractiveImage

def image_callback(x_range, y_range, w, h):
     cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
     agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
     img = tf.colorize(agg, color_key, 'log')
     return tf.dynspread(img,threshold=0.75, max_px=8)

InteractiveImage(p, image_callback)

The above image will just be a static screenshot, but in a running Jupyter notebook you will be able to zoom and pan interactively, selecting any region of the map to study in more detail. Each time you zoom or pan, the entire datashader pipeline will be re-run, which will take a few seconds. At present, datashader does not use caching, tiling, partitioning or any of the other optimization techniques that would provide even more responsive plots, and, as the library matures, you can expect to see further improvements over time. But, the library is already fast enough to provide interactive plotting of all but the very largest of datasets, allowing you to change any aspect of your plot "on-the-fly" as you interact with the data.

To learn more about datashader, check out our extensive set of tutorial notebooks, then install it using conda install -c bokeh datashader and start trying out the Jupyter notebooks from github yourself! You can also watch my datashader talk from SciPy 2016 on YouTube. 

by ryanwh at August 01, 2016 03:06 PM

July 28, 2016

Continuum Analytics news

Dask and scikit-learn: a 3-Part Tutorial

Posted Thursday, July 28, 2016

Dask core contributor Jim Crist has put together a series of posts discussing some recent experiments combining Dask and scikit-learn on his blog, Marginally Stable. From these experiments, a small library has been built up, and can be found here.

The tutorial spans three posts, which cover model parallelism, data parallelism, and combining the two on a real-life dataset.

Part I: Dask & scikit-learn: Model Parallelism

In this post we'll look instead at model-parallelism (use same data across different models), and dive into a daskified implementation of GridSearchCV.
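
As a rough, self-contained illustration of the model-parallel idea itself (this is not Jim's dklearn implementation, and the toy dataset and parameter grid are made up), the "same data, many models" pattern can be expressed directly with dask.delayed:

import dask
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

@dask.delayed
def fit_and_score(C):
    # Each parameter setting becomes an independent task in the dask graph,
    # so the fits can be scheduled in parallel.
    model = LogisticRegression(C=C)
    model.fit(X, y)
    return C, model.score(X, y)

results = dask.compute(*[fit_and_score(C) for C in (0.01, 0.1, 1.0, 10.0)])
print(max(results, key=lambda r: r[1]))  # best (C, training accuracy) pair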

Part II: Dask & scikit-learn: Data Parallelism

In the last post we discussed model-parallelism — fitting several models across the same data. In this post we'll look into simple patterns for data-parallelism, which will allow fitting a single model on larger datasets.
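
Again as a rough sketch of the underlying pattern rather than the dklearn code itself (the block sizes and toy data are made up): a single model can be fit on more data than fits in memory by streaming blocks through an estimator that supports partial_fit, which is exactly the kind of block-wise work dask is designed to schedule:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])

for _ in range(10):  # stand-in for iterating over the blocks of a dask array or dataframe
    X_block = np.random.randn(10000, 20)
    y_block = (X_block[:, 0] > 0).astype(int)
    clf.partial_fit(X_block, y_block, classes=classes)  # incremental update on each block

X_test = np.random.randn(1000, 20)
print(clf.score(X_test, (X_test[:, 0] > 0).astype(int)))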

Part III: Dask & scikit-learn: Putting it All Together

In this post we'll combine the above concepts together to do distributed learning and grid search on a real dataset; namely the airline dataset. This contains information on every flight in the USA between 1987 and 2008.

Keep up with Jim and his blog by following him on Twitter, @jiminy_crist.

by swebster at July 28, 2016 05:34 PM

July 22, 2016

William Stein

DataDog's pricing: don't make the same mistake I made

I stupidly made a mistake recently by choosing to use DataDog for monitoring the infrastructure for my startup (SageMathCloud).

I got bit by their pricing UI design that looks similar to many other sites, but is different in a way that caused me to spend far more money than I expected.

I'm writing this post so that you won't make the same mistake I did.  As a product, DataDog is of course a lot of hard work to create, and they can try to charge whatever they want. However, my problem is that what they are going to charge was confusing and misleading to me.

I wanted to see some nice web-based data about my new autoscaled Kubernetes cluster, so I looked around at options. DataDog looked like a new and awesomely-priced service for seeing live logging. And when I looked (not carefully enough) at the pricing, it looked like only $15/month to monitor a bunch of machines. I'm naive about the cost of cloud monitoring -- I've been using Stackdriver on Google cloud platform for years, which is completely free (for now, though that will change), and I've also used self hosted open solutions, and some quite nice solutions I've written myself. So my expectations were way out of whack.

Ever busy, I signed up for the "$15/month plan":


One of the people on my team spent a little time and installed datadog on all the VM's in our cluster, and also made DataDog automatically start running on any nodes in our Kubernetes cluster. That's a lot of machines.

Today I got the first monthly bill, which is for the month that just happened. The cost was $639.19 USD charged to my credit card. I was really confused for a while, wondering if I had bought a year subscription.



After a while I realized that the cost is per host! When I looked at the pricing page the first time, I had just seen in big letters "$15", and "$18 month-to-month" and "up to 500 hosts". I completely missed the "Per Host" line, because I was so naive that I didn't think the price could possibly be that high.

I tried immediately to delete my credit card and cancel my plan, but the "Remove Card" button is greyed out, and it says you can "modify your subscription by contacting us at success@datadoghq.com":



So I wrote to success@datadoghq.com:

Dear Datadog,

Everybody on my team was completely mislead by your
horrible pricing description.

Please cancel the subscription for wstein immediately
and remove my credit card from your system.

This is the first time I've wasted this much money
by being misled by a website in my life.

I'm also very unhappy that I can't delete my credit
card or cancel my subscription via your website. It's
like one more stripe API call to remove the credit card
(I know -- I implemented this same feature for my site).


And they responded:

Thanks for reaching out. If you'd like to cancel your
Datadog subscription, you're able to do so by going into
the platform under 'Plan and Usage' and choose the option
downgrade to 'Lite', that will insure your credit card
will not be charged in the future. Please be sure to
reduce your host count down to the (5) allowed under
the 'Lite' plan - those are the maximum allowed for
the free plan.

Also, please note you'll be charged for the hosts
monitored through this month. Please take a look at
our billing FAQ.


They were right -- I was able to uninstall the daemons, downgrade to Lite, remove my card, etc. all through the website without manual intervention.

When people have been confused by billing on my site, I have apologized, immediately refunded their money, and opened a ticket to make the UI clearer. DataDog didn't do any of that.

I wish DataDog would at least clearly state that when you use their service you are potentially on the hook for an arbitrarily large charge for any month. Yes, if they had made that clear, they wouldn't have had me as a customer, so they are not incentivized to do so.

A fool and their money are soon parted. I hope this post reduces the chances that you'll be a fool like me. If you choose to use DataDog (and their monitoring tools are very impressive), I hope you'll be aware of the cost.


ADDED:

On Hacker News somebody asked: "How could their pricing page be clearer? It says per host in fairly large letters underneath it. I'm asking because I will be designing a similar page soon (that's also billed per host) and I'd like to avoid the same mistakes."  My answer:

[EDIT: This pricing page by the top poster in this thread is way better than I suggest below -- https://www.serverdensity.com/pricing/]

1. VERY clearly state that when you sign up for the service, then you are on the hook for up to $18*500 = $9000 + tax in charges for any month. Even Google compute engine (and Amazon) don't create such a trap, and have a clear explicit quota increase process.
2. Instead of "HUGE $15" newline "(small light) per host", put "HUGE $18 per host" all on the same line. It would easily fit. I don't even know how the $15/host datadog discount could ever really work, given that the number of hosts might constantly change and there is no prepayment.
3. Inform users clearly in the UI at any time how much they are going to owe for that month (so far), rather than surprising them at the end. Again, Google Cloud Platform has a very clear running total in their billing section, and any time you create a new VM it gives the exact amount that VM will cost per month.
4. If one works with a team, 3 is especially important. The reason that I had monitors on 50+ machines is that another person working on the project, who never looked at pricing or anything, just thought, hey, I'll just set this up everywhere. He had no idea there was a per-machine fee.

by William Stein (noreply@blogger.com) at July 22, 2016 02:17 PM

July 15, 2016

Continuum Analytics news

The Gordon and Betty Moore Foundation Grant for Numba and Dask

Posted Thursday, July 14, 2016
Travis Oliphant
Chief Executive Officer & Co-Founder

I am thrilled to announce that the Gordon and Betty Moore Foundation has provided a significant grant in order to help move Numba and Dask to version 1.0 and graduate them into robust community-supported projects. 

Numba and Dask are two projects that have grown out of our intense foundational desire at Continuum to improve the state of large-scale data analytics, quantitative computing, advanced analytics and machine learning. Our fundamental purpose at Continuum is to empower people to solve the world’s greatest challenges. We are on a mission to help people discover, analyze and collaborate by connecting their curiosity and experience with any data.    

One part of helping great people do even more with their computing power is to ensure that modern hardware is completely accessible and utilizable to those with deep knowledge in other areas besides programming. For many years, Python has been simplifying the connection between computers and the minds of those with deep knowledge in areas such as statistics, science, business, medicine, mathematics and engineering. Numba and Dask strengthen this connection even further so that modern hardware with multiple parallel computing units can be fully utilized with Python code. 

Numba enables scaling up on modern hardware, including computers with GPUs and extreme multi-core CPUs, by compiling a subset of Python syntax to machine code that can run in parallel. Dask enables Python code to take full advantage of both multi-core CPUs and data that does not fit in memory by defining a directed graph of tasks that work on blocks of data and using the wealth of libraries in the PyData stack. Dask also now works well on a cluster of machines with data stored in a distributed file-system, such as Hadoop’s HDFS. Together, Numba and Dask can be used to more easily build solutions that take full advantage of modern hardware, such as machine-learning algorithms, image-processing on clusters of GPUs or automatic visualization of billions of data-points with datashader

Peter Wang and I started Continuum with a desire to bring next-generation array-computing to PyData. We have broadened that initial desire to empowering entire data science teams with the Anaconda platform, while providing full application solutions to data-centric companies and institutions. It is extremely rewarding to see that Numba and Dask are now delivering on our initial dream to bring next-generation array-computing to the Python ecosystem in a way that takes full advantage of modern hardware.    

This award from the Moore Foundation will make it even easier for Numba and Dask to allow Python to be used for large scale computing. With Numba and Dask, users will be able to build high performance applications with large data sets. The grant will also enable our Community Innovation team at Continuum to ensure that these technologies can be used by other open source projects in the PyData ecosystem. This will help scientists, engineers and others interested in improving the world achieve their goals even faster.

Continuum has been an active contributor to the Python data science ecosystem since Peter and I founded the company in early 2012. Anaconda, the leading Open Data Science platform, is now the most popular Python distribution available. Continuum has also conceived and developed several new additions to this ecosystem, making them freely available to the open­ source community, while continuing to support the foundational projects that have made the ecosystem possible.  

The Gordon and Betty Moore Foundation fosters pathbreaking scientific discovery,  environmental conservation, patient care improvements and preservation of the special character of the Bay Area. The Numba and Dask projects are funded by the Gordon and Betty Moore Foundation through Grant GBMF5423 to Continuum Analytics (Grant Agreement #5423).    

We are honored to receive this grant and look forward to working with The Moore Foundation. 

To hear more about Numba and Dask, check out our related SciPy sessions in Austin, TX this week:

  • Thursday, July 14th at 10:30am: “Dask: Parallel and Distributed Computing” by Matthew Rocklin & Jim Crist of Continuum Analytics
  • Friday, July 15th at 11:00am: “Scaling Up and Out:  Programming GPU Clusters with Numba and Dask” by Stan Seibert & Siu Kwan Lam of Continuum Analytics
  • Friday, July 15th at 2:30pm: “Datashader: Revealing the Structure of Genuinely Big Data” by James Bednar & Jim Crist of Continuum Analytics.

by ryanwh at July 15, 2016 02:55 PM

Automate your README: conda kapsel Beta 1

Posted Wednesday, July 13, 2016
Havoc Pennington
Continuum Analytics

TL;DR: New beta conda feature allows data scientists and others to describe project runtime requirements in a single file called kapsel.yml. Using kapsel.yml, conda will automatically reproduce prerequisites on any machine and then run the project. 

Data scientists working with Python often create a project directory containing related analysis, notebook files, data-cleaning scripts, Bokeh visualizations, and so on. For a colleague who wants to replicate your project, or even for the original creator a few months later, it can be tricky to run all this code exactly as it was run the first time. 

Most code relies on some specific setup before it’s run -- such as installing certain versions of packages, downloading data files, starting up database servers, configuring passwords, or configuring parameters to a model. 

You can write a long README file to manually record all these steps and hope that you got it right. Or, you could use conda kapsel. This new beta conda feature allows data scientists to list their setup and runtime requirements in a single file called kapsel.yml. Conda reads this file and performs all these steps automatically. With conda kapsel, your project just works for anyone you share it with.

Sharing your project with others

When you’ve shared your project directory (including a kapsel.yml) and a colleague types conda kapsel run in that directory, conda automatically creates a dedicated environment, puts the correct packages in it, downloads any needed data files, starts needed services, prompts the user for missing configuration values, and runs the right command from your project.

As with all things conda, there’s an emphasis on ease-of-use. It would be clunky to first manually set up a project, and then separately configure project requirements for automated setup. 

With the conda kapsel command, you set up and configure the project at the same time. For example, if you type conda kapsel add-packages bokeh=0.12, you’ll get Bokeh 0.12 in your project's environment, and automatically record a requirement for Bokeh 0.12 in your kapsel.yml. This means there’s no extra work to make your project reproducible. Conda keeps track of your project setup for you, automatically making any project directory into a runnable, reproducible “conda kapsel.”

There’s nothing data-science-specific about conda kapsel; it’s a general-purpose feature, just like conda’s core package management features. But we believe conda kapsel’s simple approach to reproducibility will appeal to data scientists.

Try out conda kapsel

To understand conda kapsel, we recommend going through the tutorial. It’s a quick way to see what it is and learn how to use it. The tutorial includes installation instructions.

Where to send feedback

If you want to talk interactively about conda kapsel, give us some quick feedback, or run into any questions, join our chat room on Gitter. We would love to hear from you!

If you find a bug or have a suggestion, filing a GitHub issue is another great way to let us know.

If you want to have a look at the code, conda kapsel is on GitHub.

Next steps for conda kapsel

This is a first beta, so we expect conda kapsel to continue to evolve. Future directions will depend on the feedback you give us, but some of the ideas we have in mind:

  • Support for automating additional setup steps: What’s in your README that could be automated? Let us know!
  • Extensibility: We’d like to support both third-party plugins, and custom setup scripts embedded in projects.
  • UX refinement: We believe the tool can be even more intuitive and we’re currently exploring some changes to eliminate points of confusion early users have encountered. (We’d love to hear your experiences with the tutorial, especially if you found anything clunky or confusing.)

For the time being, the conda kapsel API and command line syntax are subject to change in future releases. A project created with the current “beta” version of conda kapsel may always need to be run with that version of conda kapsel and not conda kapsel 1.0. When we think things are solid, we’ll switch from “beta” to “1.0” and you’ll be able to rely on long-term interface stability.

We hope you find conda kapsel useful!

by ryanwh at July 15, 2016 02:54 PM

July 12, 2016

Matthieu Brucher

Book review: Team Geek

Sometimes I forget that I have to work with teams, whether they are virtual teams or physical teams. And although I started working on understanding the culture map, I still have to understand how to efficiently work in a team. Enters the book.

Content and opinions

Divided in 6 chapters, the book tries to move from a centric point of view to the most general one, with team members and users, around a principle summed it by HRT. First of all, the book spends a chapter on geniuses. Actually, it’s not really geniuses and not people who think they are geniuses and spend times in their cave working on something and 10 weeks later get out and share their wonderful (crappy) code. Here, the focus is visibility and communication: we all make mistakes (let’s move on, as would say Leonard Hofstadter), so we need to face ourselves with the rest of the team as early as possible.

To achieve this, you need a team culture, a place where people can communicate. There are several levels in this, different ways to achieve this and probably a good balance between all elements, as explained in the second chapter of the book. And with this, you need a good team leader (chapter 3) that will nurture the team culture. Strangely, the book seems to advocate technical people to become team leaders, which is something I find difficult. Actually the book help me understand the good aspects of this, and from the different teams I saw around me, it seems that a pattern that has merits and with a good help to learn delegation, trust… it could be an interesting future for technical people (instead of having bad technical people taking management positions and fighting them because let’s face it, they don’t understand a thing :p ).

Fourth chapter is about dealing with poisonous people. One of the poison is… exactly what I shown in the last paragraph: resent and bitterness! A team is a team with his captain, we are all in the same boat. We can’t badmouth the people we work with (as hard it is!). Fifth chapter is more about the maze above you (fourth was more about dealing with the maze below), how to work with a good manager, and how to deal with a bad one. Sometimes it’s just about communication, sometimes, it’s not, so what should you do?

Finally, the other member of the team is the end-user. As the customer pays the bill in the end, he has to be on board and feel respected, trusted (as much as the team is, it’s a balance!). There are not that many chapters about users in software engineering books, it’s a hard topic. This final chapter gives good advice on the subject.

Conclusion

There are lots of things that are obvious. There are things that are explained in other books as well, but the fact that all relevant topics for computer scientists is in this book makes it an excellent introduction to people starting to work in a team.

by Matt at July 12, 2016 07:33 AM

July 06, 2016

Continuum Analytics news

The Journey to Open Data Science Is Not as Hard as You Think

Posted Wednesday, July 6, 2016

Businesses are in a constant struggle to stay relevant in the market and change is rarely easy — especially when it involves technological overhaul.

Think about the world’s switch from the horse and buggy to automobiles: It revolutionized the world, but it was hardly a smooth transition. At the turn of the 20th century, North America was a loose web of muddy dirt roads trampled by 24 million horses. It took a half-century of slow progress before tires and brakes replaced hooves and reins.

Just as driverless cars hint at a new era of automobiles, the writing’s on the wall for modern analytics: Companies will need to embrace the world’s inevitable slide toward Open Data Science. Fortunately, just as headlights now illuminate our highways, there is a light to guide companies through the transformation.

The Muddy Road to New Technologies

No matter the company or the technological shift, transitions can be challenging for multiple reasons.

One reason is the inevitable skills gap in the labor market when new technology comes along. Particularly in highly specialized fields like data science, finding skilled employees with enterprise experience is difficult. The right hires can mean the difference between success and failure.

Another issue stems from company leaders’ insufficient understanding of existing technologies — both what they can and cannot do. Applications that use machine and deep learning require new software, but companies often mistakenly believe their existing systems are capable of handling the load. This issue is compounded by fragile, cryptic legacy code that can be a nightmare to repurpose.

Finally, these two problems combine to form a third: a lack of understanding about how to train people to implement and deploy new technology. Ultimately, this culminates in floundering and wasted resources across an entire organization.

Luckily, it does not have to be this way.

Open Data Science Paves a New Path

Fortunately, Open Data Science is the guiding light to help companies switch to modern data science easily. Here’s how such an initiative breaks down transitional barriers:

  • No skills gap: Open Data Science is founded on Python and R — both hot languages in universities and in the marketplace. This opens up a massive pool of available talent and a worldwide base of excited programmers and users.
  • No tech stagnation: Open Data Science applications connect via APIs to nearly any data source. In terms of programming, there’s an open source version of any proprietary software on the market. Open Data Science applications such as Anaconda allow for easy interoperability between systems, which is central to the movement.
  • No floundering: Open Data Science bridges old and new technologies to make training and deployment a breeze. One such example is Anaconda Fusion, which offers business analysts command of powerful Python Open Data Science libraries through a familiar Excel interface.

A Guided Pathway to Open Data Science

Of course, just knowing that Open Data Science speeds the transition isn’t enough. A well-trained guide is equally vital for leading companies down the best path to adoption.

The first step is a change management assessment. How will executive, operational and data science teams quickly get up to speed on why Open Data Science is critical to their business? What are the first steps? This process can seem daunting when attempted alone. But this is where consultants from the Open Data Science community can provide the focus and knowledge necessary to quickly convert a muddy back road into the Autobahn.

No matter the business and its existing technologies, any change management plan should include a few key points. First, there should be a method for integration of legacy code (which Anaconda makes easier with packages for melding Python with C or Fortran code). Intelligent migration of data is also important, as is training team members on new systems or recruiting new talent.

While the business world leaves little room for fumbles, like delayed adoption of new technologies or poor change management, Open Data Science can prevent your company’s data science initiatives from meeting a dead end. Open Data Science software provides more than low-cost applications, interoperability, transparency and access — it also brings community know-how to guide change management and facilitate an analytics overhaul in any company at any scale.

Thanks to Open Data Science, the road to data science superiority is now paved with gold.

 

by swebster at July 06, 2016 02:10 PM

July 05, 2016

Thomas Wiecki

Bayesian Deep Learning Part II: Bridging PyMC3 and Lasagne to build a Hierarchical Neural Network

(c) 2016 by Thomas Wiecki

Recently, I blogged about Bayesian Deep Learning with PyMC3 where I built a simple hand-coded Bayesian Neural Network and fit it on a toy data set. Today, we will build a more interesting model using Lasagne, a flexible Theano library for constructing various types of Neural Networks. As you may know, PyMC3 also uses Theano, so it should be possible to build the Artificial Neural Network (ANN) in Lasagne, place Bayesian priors on our parameters, and then use variational inference (ADVI) in PyMC3 to estimate the model. To my delight, it is not only possible but also very straightforward.

Below, I will first show how to bridge PyMC3 and Lasagne to build a dense 2-layer ANN. We'll then use mini-batch ADVI to fit the model on the MNIST handwritten digit data set. Then, we will follow up on another idea expressed in my last blog post -- hierarchical ANNs. Finally, due to the power of Lasagne, we can just as easily build a Hierarchical Bayesian Convolution ANN with max-pooling layers to achieve 98% accuracy on MNIST.

Most of the code used here is borrowed from the Lasagne tutorial.

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('white')
sns.set_context('talk')

import pymc3 as pm
import theano.tensor as T
import theano

from scipy.stats import mode, chisquare

from sklearn.metrics import confusion_matrix, accuracy_score

import lasagne

Data set: MNIST

We will be using the classic MNIST data set of handwritten digits. In contrast to my previous blog post, which was limited to a toy data set, MNIST is an actually challenging ML task (of course not quite as challenging as e.g. ImageNet) with a reasonable number of dimensions and data points.

In [2]:
import sys, os

def load_dataset():
    # We first define a download function, supporting both Python 2 and 3.
    if sys.version_info[0] == 2:
        from urllib import urlretrieve
    else:
        from urllib.request import urlretrieve

    def download(filename, source='http://yann.lecun.com/exdb/mnist/'):
        print("Downloading %s" % filename)
        urlretrieve(source + filename, filename)

    # We then define functions for loading MNIST images and labels.
    # For convenience, they also download the requested files if needed.
    import gzip

    def load_mnist_images(filename):
        if not os.path.exists(filename):
            download(filename)
        # Read the inputs in Yann LeCun's binary format.
        with gzip.open(filename, 'rb') as f:
            data = np.frombuffer(f.read(), np.uint8, offset=16)
        # The inputs are vectors now, we reshape them to monochrome 2D images,
        # following the shape convention: (examples, channels, rows, columns)
        data = data.reshape(-1, 1, 28, 28)
        # The inputs come as bytes, we convert them to float32 in range [0,1].
        # (Actually to range [0, 255/256], for compatibility to the version
        # provided at http://deeplearning.net/data/mnist/mnist.pkl.gz.)
        return data / np.float32(256)

    def load_mnist_labels(filename):
        if not os.path.exists(filename):
            download(filename)
        # Read the labels in Yann LeCun's binary format.
        with gzip.open(filename, 'rb') as f:
            data = np.frombuffer(f.read(), np.uint8, offset=8)
        # The labels are vectors of integers now, that's exactly what we want.
        return data

    # We can now download and read the training and test set images and labels.
    X_train = load_mnist_images('train-images-idx3-ubyte.gz')
    y_train = load_mnist_labels('train-labels-idx1-ubyte.gz')
    X_test = load_mnist_images('t10k-images-idx3-ubyte.gz')
    y_test = load_mnist_labels('t10k-labels-idx1-ubyte.gz')

    # We reserve the last 10000 training examples for validation.
    X_train, X_val = X_train[:-10000], X_train[-10000:]
    y_train, y_val = y_train[:-10000], y_train[-10000:]

    # We just return all the arrays in order, as expected in main().
    # (It doesn't matter how we do this as long as we can read them again.)
    return X_train, y_train, X_val, y_val, X_test, y_test

print("Loading data...")
X_train, y_train, X_val, y_val, X_test, y_test = load_dataset()
Loading data...
In [3]:
# Building a theano.shared variable with a subset of the data to make construction of the model faster.
# We will later switch that out, this is just a placeholder to get the dimensionality right.
input_var = theano.shared(X_train[:500, ...].astype(np.float64))
target_var = theano.shared(y_train[:500, ...].astype(np.float64))

Model specification

I imagined that it should be possible to bridge Lasagne and PyMC3 just because they both rely on Theano. However, it was unclear how difficult it was really going to be. Fortunately, a first experiment worked out very well, but there were some potential ways in which this could be made even easier. I opened a GitHub issue on Lasagne's repo and a few days later, PR695 was merged, which allowed for an even nicer integration of the two, as I show below. Long live OSS.

First, here is the Lasagne function to create an ANN with 2 fully connected hidden layers of 800 neurons each; this is pure Lasagne code taken almost directly from the tutorial. The trick comes in when creating the layers with lasagne.layers.DenseLayer, where we can pass in a function init that has to return a Theano expression to be used as the weight and bias matrices. This is where we will pass in our PyMC3-created priors, which are also just Theano expressions:

In [4]:
def build_ann(init):
    l_in = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                     input_var=input_var)

    # Add a fully-connected layer of 800 units, using the linear rectifier, and
    # initializing weights with Glorot's scheme (which is the default anyway):
    n_hid1 = 800
    l_hid1 = lasagne.layers.DenseLayer(
        l_in, num_units=n_hid1,
        nonlinearity=lasagne.nonlinearities.tanh,
        b=init,
        W=init
    )

    n_hid2 = 800
    # Another 800-unit layer:
    l_hid2 = lasagne.layers.DenseLayer(
        l_hid1, num_units=n_hid2,
        nonlinearity=lasagne.nonlinearities.tanh,
        b=init,
        W=init
    )

    # Finally, we'll add the fully-connected output layer, of 10 softmax units:
    l_out = lasagne.layers.DenseLayer(
        l_hid2, num_units=10,
        nonlinearity=lasagne.nonlinearities.softmax,
        b=init,
        W=init
    )
    
    prediction = lasagne.layers.get_output(l_out)
    
    # 10 discrete output classes -> pymc3 categorical distribution
    out = pm.Categorical('out', 
                         prediction,
                         observed=target_var)
    
    return out

Next, the function which creates the weights for the ANN. Because PyMC3 requires every random variable to have a different name, we're creating a class instead which produces uniquely named priors.

The priors act as regularizers here to try and keep the weights of the ANN small. It's mathematically equivalent to putting an L2 loss term that penalizes large weights into the objective function, as is commonly done.
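
To make the equivalence concrete, here is the usual textbook argument (a quick sketch, nothing PyMC3-specific): with a Normal(0, \sigma^2) prior on the weights w, maximizing the log posterior is the same as minimizing the negative log likelihood plus an L2 penalty whose strength is set by the prior's standard deviation,

-\log p(w \mid D) = -\log p(D \mid w) + \frac{1}{2\sigma^2} \|w\|^2 + \mathrm{const},

so a small sd (like 0.1 here) corresponds to strong regularization and a large sd to weak regularization.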

In [5]:
class GaussWeights(object):
    def __init__(self):
        self.count = 0
    def __call__(self, shape):
        self.count += 1
        return pm.Normal('w%d' % self.count, mu=0, sd=.1, 
                         testval=np.random.normal(size=shape).astype(np.float64),
                         shape=shape)

If you compare what we have done so far to the previous blog post, it's apparent that using Lasagne is much more comfortable. We don't have to manually keep track of the shapes of the individual matrices, nor do we have to handle the underlying matrix math to make it all fit together.

Next are some functions to set up mini-batch ADVI; you can find more information in the prior blog post.

In [6]:
# Tensors and RV that will be using mini-batches
minibatch_tensors = [input_var, target_var]

# Generator that returns mini-batches in each iteration
def create_minibatch(data, batchsize=500):
    
    rng = np.random.RandomState(0)
    start_idx = 0
    while True:
        # Return random data samples of set size batchsize each iteration
        ixs = rng.randint(data.shape[0], size=batchsize)
        yield data[ixs]

minibatches = zip(
    create_minibatch(X_train, 500),
    create_minibatch(y_train, 500),
)

total_size = len(y_train)

def run_advi(likelihood, advi_iters=50000):
    # Train on train data
    input_var.set_value(X_train[:500, ...])
    target_var.set_value(y_train[:500, ...])
    
    v_params = pm.variational.advi_minibatch(
        n=advi_iters, minibatch_tensors=minibatch_tensors, 
        minibatch_RVs=[likelihood], minibatches=minibatches, 
        total_size=total_size, learning_rate=1e-2, epsilon=1.0
    )
    trace = pm.variational.sample_vp(v_params, draws=500)
    
    # Predict on test data
    input_var.set_value(X_test)
    target_var.set_value(y_test)
    
    ppc = pm.sample_ppc(trace, samples=100)
    y_pred = mode(ppc['out'], axis=0).mode[0, :]
    
    return v_params, trace, ppc, y_pred

Putting it all together

Let's run our ANN with mini-batch ADVI:

In [14]:
with pm.Model() as neural_network:
    likelihood = build_ann(GaussWeights())
    v_params, trace, ppc, y_pred = run_advi(likelihood)
Iteration 0 [0%]: ELBO = -126739832.76
Iteration 5000 [10%]: Average ELBO = -17180177.41
Iteration 10000 [20%]: Average ELBO = -304464.44
Iteration 15000 [30%]: Average ELBO = -146289.0
Iteration 20000 [40%]: Average ELBO = -121571.36
Iteration 25000 [50%]: Average ELBO = -112382.38
Iteration 30000 [60%]: Average ELBO = -108283.73
Iteration 35000 [70%]: Average ELBO = -106113.66
Iteration 40000 [80%]: Average ELBO = -104810.85
Iteration 45000 [90%]: Average ELBO = -104743.76
Finished [100%]: Average ELBO = -104222.88

Make sure everything converged:

In [17]:
plt.plot(v_params.elbo_vals[10000:])
sns.despine()
In [18]:
sns.heatmap(confusion_matrix(y_test, y_pred))
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21764cf910>
In [20]:
print('Accuracy on test data = {}%'.format(accuracy_score(y_test, y_pred) * 100))
Accuracy on test data = 91.83%

The performance is not incredibly high but hey, it seems to actually work.

Hierarchical Neural Network: Learning Regularization from data

The connection between the standard deviation of the weight prior and the strength of the L2 penalization term leads to an interesting idea. Above we just fixed sd=0.1 for all layers, but maybe the first layer should have a different value than the second. And maybe 0.1 is too small or too large to begin with. In Bayesian modeling it is quite common to place hyperpriors in cases like this and learn the optimal regularization to apply from the data. This saves us from tuning that parameter in a costly hyperparameter optimization. For more information on hierarchical modeling, see my other blog post.

In [20]:
class GaussWeightsHierarchicalRegularization(object):
    def __init__(self):
        self.count = 0
    def __call__(self, shape):
        self.count += 1
        
        regularization = pm.HalfNormal('reg_hyper%d' % self.count, sd=1)
        
        return pm.Normal('w%d' % self.count, mu=0, sd=regularization, 
                         testval=np.random.normal(size=shape),
                         shape=shape)
In [ ]:
with pm.Model() as neural_network_hier:
    likelihood = build_ann(GaussWeightsHierarchicalRegularization())
    v_params, trace, ppc, y_pred = run_advi(likelihood)
In [22]:
print('Accuracy on test data = {}%'.format(accuracy_score(y_test, y_pred) * 100))
Accuracy on test data = 92.13%

We get a small but nice boost in accuracy. Let's look at the posteriors of our hyperparameters:

In [23]:
pm.traceplot(trace, varnames=['reg_hyper1', 'reg_hyper2', 'reg_hyper3', 'reg_hyper4', 'reg_hyper5', 'reg_hyper6']);

Interestingly, they are all pretty different, suggesting that it makes sense to vary the amount of regularization that gets applied at each layer of the network.

Convolutional Neural Network

This is pretty nice, but everything so far would have also been pretty simple to implement directly in PyMC3, as I have shown in my previous post. Where things get really interesting is that we can now build far more complex ANNs, like Convolutional Neural Nets:

In [9]:
def build_ann_conv(init):
    network = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                        input_var=input_var)

    network = lasagne.layers.Conv2DLayer(
            network, num_filters=32, filter_size=(5, 5),
            nonlinearity=lasagne.nonlinearities.tanh,
            W=init)

    # Max-pooling layer of factor 2 in both dimensions:
    network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))

    # Another convolution with 32 5x5 kernels, and another 2x2 pooling:
    network = lasagne.layers.Conv2DLayer(
        network, num_filters=32, filter_size=(5, 5),
        nonlinearity=lasagne.nonlinearities.tanh,
        W=init)
    
    network = lasagne.layers.MaxPool2DLayer(network, 
                                            pool_size=(2, 2))
    
    n_hid2 = 256
    network = lasagne.layers.DenseLayer(
        network, num_units=n_hid2,
        nonlinearity=lasagne.nonlinearities.tanh,
        b=init,
        W=init
    )

    # Finally, we'll add the fully-connected output layer, of 10 softmax units:
    network = lasagne.layers.DenseLayer(
        network, num_units=10,
        nonlinearity=lasagne.nonlinearities.softmax,
        b=init,
        W=init
    )
    
    prediction = lasagne.layers.get_output(network)
    
    return pm.Categorical('out', 
                   prediction,
                   observed=target_var)
In [10]:
with pm.Model() as neural_network_conv:
    likelihood = build_ann_conv(GaussWeights())
    v_params, trace, ppc, y_pred = run_advi(likelihood, advi_iters=50000)
Iteration 0 [0%]: ELBO = -17290585.29
Iteration 5000 [10%]: Average ELBO = -3750399.99
Iteration 10000 [20%]: Average ELBO = -40713.52
Iteration 15000 [30%]: Average ELBO = -22157.01
Iteration 20000 [40%]: Average ELBO = -21183.64
Iteration 25000 [50%]: Average ELBO = -20868.2
Iteration 30000 [60%]: Average ELBO = -20693.18
Iteration 35000 [70%]: Average ELBO = -20483.22
Iteration 40000 [80%]: Average ELBO = -20366.34
Iteration 45000 [90%]: Average ELBO = -20290.1
Finished [100%]: Average ELBO = -20334.15
In [13]:
print('Accuracy on test data = {}%'.format(accuracy_score(y_test, y_pred) * 100))
Accuracy on test data = 98.21%

Much higher accuracy -- nice. I also tried this with the hierarchical model but it achieved lower accuracy (95%), I assume due to overfitting.

Let's make more use of the fact that we're in a Bayesian framework and explore uncertainty in our predictions. As our predictions are categories, we can't simply compute the posterior predictive standard deviation. Instead, we compute the chi-square statistic, which tells us how uniform a sample is. The more uniform, the higher our uncertainty. I'm not quite sure if this is the best way to do this, so leave a comment if there's a more established method that I don't know about.
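
For reference, this is what scipy.stats.chisquare computes when no expected frequencies are given (a sketch of my reading of the approach, not an established uncertainty measure):

\chi^2 = \sum_i \frac{(f_{obs,i} - \bar{f}_{obs})^2}{\bar{f}_{obs}}

Applied below to the 100 sampled class labels per test image, a larger value roughly means the posterior predictive samples disagree more with each other, which is the notion of uncertainty used here.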

In [14]:
miss_class = np.where(y_test != y_pred)[0]
corr_class = np.where(y_test == y_pred)[0]
In [15]:
preds = pd.DataFrame(ppc['out']).T
In [16]:
chis = preds.apply(lambda x: chisquare(x).statistic, axis='columns')
In [18]:
sns.distplot(chis.loc[miss_class].dropna(), label='Error')
sns.distplot(chis.loc[corr_class].dropna(), label='Correct')
plt.legend()
sns.despine()
plt.xlabel('Chi-Square statistic');

As we can see, when the model makes an error, it is much more uncertain in its answer (i.e. the answers provided are more uniform). You might argue that you get the same effect with a multinomial prediction from a regular ANN; however, this is not so.

Conclusions

By bridging Lasagne and PyMC3 and by using mini-batch ADVI to train a Bayesian Neural Network on a decently sized and complex data set (MNIST) we took a big step towards practical Bayesian Deep Learning on real-world problems.

Kudos to the Lasagne developers for designing their API in a way that makes it trivial to integrate with for this unforeseen application. They were also very helpful and forthcoming in getting this to work.

Finally, I also think this shows the benefits of PyMC3. By relying on a commonly used language (Python) and abstracting the computational backend (Theano) we were able to quite easily leverage the power of that ecosystem and use PyMC3 in a manner that was never thought about when creating it. I look forward to extending it to new domains.

This blog post was written in a Jupyter Notebook. You can access the notebook here and follow me on Twitter to stay up to date.

by Thomas Wiecki at July 05, 2016 02:00 PM

Matthieu Brucher

Announcement: ATKTransientSplitter 1.0.0

I'm happy to announce the release of a mono transient splitter based on the Audio Toolkit. It is available on Windows and OS X (min. 10.11) in different formats.

ATK Transient Splitter

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKTransientSplitter.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.


by Matt at July 05, 2016 07:18 AM

July 01, 2016

Continuum Analytics news

Verifying conda Recipes and Packages with anaconda-verify

Posted Friday, July 1, 2016

anaconda-verify is a tool for (passively) verifying conda recipes and conda packages. All Anaconda recipes, as well as the Anaconda packages, need to pass this tool before they are made publicly available. The purpose of this verification process is to ensure that recipes don't contain obvious bugs and that the conda packages we distribute to millions of users meet our high quality standards.

Historically, the conda packages which make up the Anaconda distribution were not created using conda-build, but an internal build system. In fact, conda-build started as a public fork of this internal system 3 years ago. At that point, the Anaconda distribution had already been around for almost a year, and the only way to create conda packages was by using the internal system. While conda-build has made a lot of progress, the internal system basically stayed unchanged, because the needs of a system for building a distribution are quite different and are not driven by the community using conda-build for continuous integration and support for other languages (e.g. Perl, Lua). On the other hand, the internal system has been developed to support Anaconda-distribution-specific needs, such as MKL featured packages, source and license reference meta-data, and interoperability between collections of packages.

In an effort to bridge the gap between our internal system and conda-build, we started using conda-build to create conda packages for the Anaconda distribution itself about a year ago. By now, more than 85% of the conda packages in the Anaconda distribution are created using conda-build. However, because of the different requirements mentioned above, we only allow certain features that conda-build offers. This helps to keep the Anaconda recipes simple, maintainable and functional with the rest of the internal system, which reads meta-data from the recipes. This is why we require conda recipes to be valid according to this tool.

Verifying recipes is easy. After installing the tool, using:

conda install anaconda-verify

you run:

anaconda-verify <path to conda recipe>

Another aspect of anaconda-verify is the ability to verify conda packages. Below are the most important checks anaconda-verify performs and, more importantly, why these checks are necessary or useful:

  • Ensure the content of info/files corresponds to the actual archived files in the tarball (except the ones in info/, obviously). This is important, because the files listed in info/files determine which files are linked into the conda environment. Any mismatch here would indicate either (i) the tarball contains files which are not getting linked anywhere or (ii) files which do not exist are attempting to get linked (which would result in an error). A minimal sketch of this check is shown after the list.
  • Check for files in disallowed directories in the tarball. A conda package should not contain files in the following directories: conda-meta/, conda-bld/, pkgs/, pkgs32/ and envs/, because this would, for example, allow a conda package to modify another existing environment.
  • Make sure the name, version and build values exist in info/index.json and that they correspond to the actual filename.
  • Ensure there are no files with both .bat and .exe extension. For example, if you had Scripts/foo.bat and Scripts/foo.exe one would shadow the other, and this would become confusing as to which one is actually executed when the user types foo. Although this check is always done, it is only relevant on Windows.
  • Ensure no easy-install.pth file exists. These files would cause problems, as they would overlap (two or more conda packages could contain an easy-install.pth file, which would overwrite each other when installing the packages).
  • Ensure no "easy install scripts" exist. These are entry point scripts which setuptools creates and which are extremely brittle; they should be replaced (overwritten) by the simple entry point scripts conda-build offers (use build/entry_points in your meta.yaml).
  • Ensure no .pyd or .so files have a .py file next to them. This is confusing, as it is not obvious which one the Python interpreter will import. Under certain circumstances, setuptools creates .py files next to shared object files for obscure reasons.
  • For packages (other than python), ensure that .pyc files are not in Python's standard library directory. This would happen when a .pyc file is missing from the standard library and then created during the build process of another package.
  • Check for missing .pyc files. Missing .pyc files cause two types of problems: (i) When building new packages, they might get included in the new package. For example, when building scipy and numpy is missing .pyc files, then these (numpy .pyc files) get included in the scipy package. (ii) There was a (buggy) Python release which would crash when .pyc files could not be written (due to file permissions).
  • Ensure Windows conda packages only contain object files which have the correct architecture. There was a bug in conda-build which would create 64-bit entry point executables when building 32-bit packages on a 64-bit system.
  • Ensure that site-packages does not contain certain directories when building packages. For example, when you build pandas, you don't want a numpy, scipy or setuptools directory to be contained in the pandas package. This would happen when the pandas build dependencies have missing .pyc files.
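
As a rough illustration of the first check above (a minimal sketch under my own assumptions, not the actual anaconda-verify implementation), comparing info/files against the archived files of a conda package could look like this:

```python
import tarfile

def check_files_match(pkg_path):
    """Compare the file list in info/files with the regular files actually
    present in a .tar.bz2 conda package (ignoring the info/ directory itself)."""
    with tarfile.open(pkg_path, "r:bz2") as t:
        archived = {m.name for m in t.getmembers() if m.isfile()}
        listed = set(t.extractfile("info/files").read().decode().splitlines())
    # files in the tarball that info/files does not mention
    unlisted = {f for f in archived - listed if not f.startswith("info/")}
    # files that info/files mentions but that are not in the tarball
    missing = listed - archived
    return unlisted, missing

# Hypothetical usage:
# unlisted, missing = check_files_match("bitarray-0.8.1-py35_0.tar.bz2")
```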

Here is an example of running the tool on conda packages:

$ anaconda-verify bitarray-0.8.1-py35_0.tar.bz2
==> /Users/ilan/aroot/tars64/bitarray-0.8.1-py35_0.tar.bz2 <==
    bitarray

In this case all is fine, and we see that only the bitarray directory is created in site-packages.

If you have questions about anaconda-verify, please feel free to reach out to our team.

by swebster at July 01, 2016 02:58 PM

June 28, 2016

Matthieu Brucher

Analog modeling: Triode circuit

When I started reviewing the diode clippers, the goal was to end up modeling a simple triode preamp. Thanks to Ivan Cohen from musical entropy, I've finally managed to derive the proper equation system to model this specific type of preamp.

Schematics

Let’s have a look at the circuit:

Triode simple model

There are several things to notice:

  • We need to have equations for the triode based on the voltages at its terminals
  • There is a non-zero steady state, meaning there is current in the circuit when there is no input

For the equations, I've once again used Ivan Cohen's work from his papers (the modified Koren equations), which is available in Audio Toolkit.

Steady state

So, for the second point, we need to compute the steady state of the circuit. This can be achieved by setting the input to the ground voltage and removing the capacitors. Once this is done, we get the final equations of the system in y (the voltages of the plate, the grid, and finally the cathode):

F = \begin{matrix} y(0) - V_{Bias} + I_p * R_p \\ I_g * R_g + y(1) \\ y(2) - (I_g + I_p) * R_k \end{matrix}

The Jacobian:

J = \begin{matrix} 1 + R_p * \frac{dI_p}{V_{pk}} && R_p * \frac{dI_p}{V_{gk}} && -R_p * (\frac{dI_p}{V_{pk}} + \frac{dI_p}{V_{gk}}) \\ \frac{dI_g}{V_{pk}} * R_g && 1 + R_g * \frac{dI_g}{V_{gk}} && -R_g * (\frac{dI_g}{V_{gk}} + \frac{dI_g}{V_{pk}}) \\ -(\frac{dI_p}{V_{pk}} + \frac{dI_g}{V_{pk}}) * R_k && -(\frac{dI_p}{V_{gk}} + \frac{dI_g}{V_{gk}}) * R_k && 1 + (\frac{dI_p}{V_{gk}} + \frac{dI_p}{V_{pk}} + \frac{dI_g}{V_{gk}} + \frac{dI_g}{V_{pk}}) * R_k \end{matrix}

With this system, we can run a Newton-Raphson optimizer to find the proper stable state of the system. It may require lots of iterations, but this is not a problem: it's done once at the beginning, and then we will use the next system for computing the new state when we input a signal.
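
As a rough illustration of that step, here is a minimal, generic Newton-Raphson sketch in Python/numpy (my own illustration, not the Audio Toolkit code; F and J are assumed to implement the residual and Jacobian above, with the triode currents provided by the modified Koren equations):

```python
import numpy as np

def solve_steady_state(F, J, y0, max_iter=200, tol=1e-8):
    """Newton-Raphson on F(y) = 0: F returns the residual vector,
    J the Jacobian matrix, y0 the initial guess (plate, grid, cathode)."""
    y = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(J(y), F(y))
        y -= step
        if np.max(np.abs(step)) < tol:
            break
    return y

# Hypothetical usage, starting the plate at the supply voltage:
# y_steady = solve_steady_state(F, J, y0=[V_Bias, 0.0, 0.0])
```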

Transient state

As in the previous analog modeling posts, I'm using the SVF/DK-method to simplify the ODE (to remove the derivative dependency, turning the ODE into a nonlinear system). So there are two systems to solve. The first one is the ODE solved with a traditional Newton-Raphson optimizer (from x, we want to compute y = \begin{matrix} V_k \\ V_{out} - V_p \\ V_p \\ V_g \end{matrix}):

F = \begin{matrix} I_p + I_g + i_{ckeq} - y(0) * (1/R_k + 2*C_k/dt) \\ i_{coeq} + (y(1) + y(2)) / R_o + y(1) * 2*C_o/dt \\ (y(2) - V_{Bias}) / R_p + (I_p + (y(1) + y(2)) / R_o) \\ (y(3) - x(i)) / R_g + I_g \end{matrix}

Which makes the Jacobian:

J = \begin{matrix} -(\frac{dI_g}{V_{gk}} + \frac{dI_p}{V_{gk}} + \frac{dI_g}{V_{pk}} + \frac{dI_p}{V_{pk}}) - (1/R_k + 2*C_k/dt) && 0 && (\frac{dI_g}{V_{pk}} + \frac{dI_p}{V_{pk}}) && (\frac{dI_g}{V_{gk}} + \frac{dI_p}{V_{gk}}) \\ 0 && 1/R_o + 2*C_o/dt && 1/R_o && 0 \\ -(\frac{dI_p}{V_{gk}} + \frac{dI_p}{V_{pk}}) && 1/R_o && 1/R_p + 1/R_o + \frac{dI_p}{V_{pk}} && \frac{dI_p}{V_{gk}} \\ -(\frac{dI_g}{V_{gk}} + \frac{dI_g}{V_{pk}}) && 0 && \frac{dI_g}{V_{pk}} && \frac{dI_g}{V_{gk}} + 1/R_g \end{matrix}

Once more, this system can be optimized with a classic NR optimizer, this time in a few iterations (3 to 5, depending on the oversampling, the input signal…)

The updates for ickeq and icoeq are:

\begin{matrix} i_{ckeq} \\ i_{coeq} \end{matrix} = \begin{matrix} 4 * C_k / dt * y(0) - i_{ckeq} \\ -4 * C_o / dt * y(1) - i_{coeq} \end{matrix}

Of course, we need start values derived from the steady state. This is quite simple: the previous state is the same as the new one, which gives the equations:

\begin{matrix} i_{ckeq} \\ i_{coeq} \end{matrix} = \begin{matrix} 2 * C_k / dt * y(0) \\ -2 * C_o / dt * y(1) \end{matrix}
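
Putting the pieces together, the per-sample processing then alternates a short Newton-Raphson solve with the trapezoidal update above. A minimal sketch of that loop (my own illustration with a hypothetical solve_system helper, not the Audio Toolkit code):

```python
def process_block(x, y, i_ckeq, i_coeq, C_k, C_o, dt, solve_system):
    """For each input sample, solve the 4-unknown nonlinear system with a few
    Newton-Raphson iterations, then update the capacitor equivalent currents."""
    out = []
    for x_n in x:
        # warm-started Newton-Raphson solve of F(y) = 0 for this sample
        y = solve_system(x_n, y, i_ckeq, i_coeq)
        # trapezoidal update of the equivalent currents
        i_ckeq = 4 * C_k / dt * y[0] - i_ckeq
        i_coeq = -4 * C_o / dt * y[1] - i_coeq
        # output voltage: V_out = (V_out - V_p) + V_p
        out.append(y[1] + y[2])
    return out, y, i_ckeq, i_coeq
```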

Result

Let’s start with the behavior for a simple sinusoid signal:

Tube behavior for a 100Hz signal (10V)

You can spot that at the beginning of the sinusoid the output signal is not stable: the average moves down with time. This is to be expected, as the tube clearly compresses the negative side of the sinusoid, while it almost chops off the positive side above 30 to 40V. This non-symmetric behavior is what gives the tube its warmth. The even harmonics are also clear with a sine sweep of the system:

Sine sweep on the triode circuit

Strangely enough, even though the signal seems quite distorted, the harmonics are not really strong (compared to SD1 or TS9). This makes me believe it is one of the reasons users love tube preamps.

Conclusion

There are a few new developments here compared to my previous posts (SD1 and TS9). The first is the fact that we have a more complex system, with several capacitors that need to be individually simulated. This leads to a vectorial implementation of the NR algorithm.

The second one is the requirement to compute a steady state that is not zero everywhere. Without this, the first iterations of the system would be more than chaotic, not realistic and hard on the CPU. Definitely not what you would want!

Based on these two developments, it is now possible to easily develop more complex circuit modeling. Even if the cost here is high (due to the complex triode equations), the conjunction of the NR algorithm with the DK method makes it doable to have real-time simulation.


by Matt at June 28, 2016 07:21 AM

June 27, 2016

Continuum Analytics news

Continuum Analytics Unveils Anaconda Mosaic to Make Enterprise Data Transformations Portable for Heterogeneous Data

Posted Tuesday, June 28, 2016

Empowers data scientists and analysts to explore, visualize, and catalog data and transformations for disparate data sources enabling data portability and faster time-to-insight

AUSTIN, TX—June 28, 2016—Continuum Analytics, the creator and driving force behind Anaconda, the leading Open Data Science platform powered by Python, today announced the availability of Anaconda Mosaic. With the ability to easily create and catalog transformations against heterogeneous data stores, Anaconda Mosaic empowers data scientists, quants and business analysts to interactively explore, visualize, and transform larger-than-memory datasets to more quickly discover new insights.   

Enterprise data architecture is becoming increasingly complex. Data stores have a relatively short half life and data is being shifted to new data stores - NoSQL, SQL, flat files - at a higher frequency. In order for organizations to find insights from the data they must first find existing transformations and rewrite the transformations for the new data store. This creates delays in getting the insights from the data. Continuum Analytics’ Anaconda Mosaic enables organizations to quickly explore, visualize, and redeploy transformations based on pandas and SQL without rewriting the transformations while maintaining governance by tracking data lineage and provenance.

“Through the course of daily operations, businesses accumulate huge amounts of data that gets locked away in legacy databases and flat file repositories. The transformations that made the data usable for analysis gets lost, buried or simply forgotten,” said Michele Chambers, Executive Vice President Anaconda Business Unit & CMO at Continuum Analytics. “Our mission is for Anaconda Mosaic to unlock the mystery of this dark data, making it accessible for businesses to quickly redeploy to new data stores without any refactoring so enterprises can reap the analytic insight and value almost instantly. By eliminating refactoring of transformations, enterprises dramatically speed up their time-to-value, without having to to spend lengthy cycles on the refactoring process.”

Some of the key features of Anaconda Mosaic include: 

  • Visually explore your data. Mosaic provides built-in visualizations for large heterogeneous datasets that makes it easy for data scientists and business analysts to accurately understand data including anomalies. 
  • Instantly get portable transformations. Create transformations with the expression builder to catalog data sources and transformations. Execute the transformation against heterogeneous data stores while tracking data lineage and provenance. When data stores change, simply deploy the transformations and quickly get the data transformed and ready for analysis. 
  • Write once, compute anywhere. For maximum efficiency, Mosaic translates transformations and orchestrates computation execution on the data backend, minimizing the costly movement of data across the network and taking full advantage of the built-in highly optimized code featured in the data backend. Users can access data in multiple data stores with the same code without rewriting queries or analytic pipelines.
  • Harvest large flat file repositories in place: Mosaic combines flat files, adds derived data and filters for performance easily. This allows users to describe the structure of their data in large flat file repositories and uses that description in data discovery, visualization, and transformations, saving the user from writing tedious ETL code.  Mosaic ensures that the data loaded is only what is necessary to compute the transformation, which can lead to significant memory and performance gains.

On June 30, Continuum Analytics is hosting a webinar that will take attendees through how to use Mosaic to simplify transformations and get to insights faster. Please register here.

Mosaic is available to current Anaconda Enterprise subscribers; to find out more about Anaconda Mosaic, get in touch.

About Continuum Analytics

Continuum Analytics is the creator and driving force behind Anaconda, the leading, Open Data Science platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 3M downloads and growing, Anaconda is trusted by the world’s leading businesses across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team – data scientists, developers, DevOps, data engineers and business analysts – to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular Open Data Science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

To learn more about Continuum Analytics, visit www.continuum.io.

###

Media Contact:
Jill Rosenthal
InkHouse
continuumanalytics@inkhouse.com

by swebster at June 27, 2016 07:39 PM

Anaconda Fusion: A Portal to Open Data Science for Excel

Posted Monday, June 27, 2016

Excel has been business analysts’ go-to program for years. It works well, and its familiarity makes it the currency of the realm for many applications.

But, in a bold new world of predictive analytics and Big Data, Excel feels cut off from the latest technologies and limited in the scope of what it can actually take on.

Fortunately for analysts across the business world, a new tool has arrived to change the game — Anaconda Fusion.

A New Dimension of Analytics

The interdimensional portal has been a staple of classic science fiction for decades. Characters step into a hole in space and emerge instantly in an entirely different setting — one with exciting new opportunities and challenges.

Now, Data Science has a portal of its own. The latest version of Anaconda Fusion, an Open Data Science (ODS) integration for Microsoft Excel, links the familiar world of spreadsheets (and the business analysts that thrive there) to the “alternate dimension” of Open Data Science that is reinventing analytics.

With Anaconda Fusion and other tools from Anaconda, business analysts and data scientists can share work — like charts, tables, formulas and insights — across Excel and ODS languages such as Python easily, erasing the partition that once divided them.

Jupyter (formerly IPython) is a popular approach to sharing across the scientific computing community, with notebooks combining  code, visualizations and comments all in one document. With Anaconda Enterprise Notebooks, this is now available under a governed environment, providing the collaborative locking, version control, notebook differencing and searching needed to operate in the enterprise. Since Anaconda Fusion, like the entire Anaconda ecosystem, integrates seamlessly with Anaconda Enterprise Notebooks, businesses can finally empower Excel gurus to collaborate effectively with the entire Data Science team.

Now, business analysts can exploit the ease and brilliance of Python libraries without having to write any code. Packages such as scikit-learn and pandas drive machine learning initiatives, enabling predictive analytics and data transformations, while plotting libraries, like Bokeh, provide rich interactive visualizations.

With Anaconda Fusion, these tools are available within the familiar Excel environment—without the need to know Python. Contextually-relevant visualizations generated from Python functions are easily embedded into spreadsheets, giving business analysts the ability to make sense of, manipulate and easily interpret data scientists’ work. 

A Meeting of Two Cultures

Anaconda Fusion is connecting two cultures from across the business spectrum, and the end result creates enormous benefits for everyone.

Business analysts can leverage the power, flexibility and transparency of Python for data science using the Excel they are already comfortable with. This enables functionality far beyond Excel, but also can teach business analysts to use Python in the most natural way: gradually, on the job, as needed and in a manner that is relevant to their context. Given that the world is moving more and more toward using Python as a lingua franca for analytics, this benefit is key.

On the other side of the spectrum, Python-using data scientists can now expose data models or interactive graphics in a well-managed way, sharing them effectively with Excel users. Previously, sharing meant sending static images or files, but with Anaconda Fusion, Excel workbooks can now include a user interface to models and interactive graphics, eliminating the clunky overhead of creating and sending files.

It’s hard to overstate how powerful this unification can be. When two cultures learn to communicate more effectively, it results in a cross-pollination of ideas. New insights are generated, and synergistic effects occur.

The Right Tools

The days of overloaded workarounds are over. With Anaconda Fusion, complex and opaque Excel macros can now be replaced with the transparent and powerful functions that Python users already know and love.

The Python programming community places a high premium on readability and clarity. Maybe that’s part of why it has emerged as the fourth most popular programming language used today. Those traits are now available within the familiar framework of Excel.

Because Python plays so well with web technologies, it’s also simple to transform pools of data into shareable interactive graphics — in fact, it's almost trivially easy. Simply email a web link to anyone, and they will have a beautiful graphics interface powered by live data. This is true even for the most computationally intense cases — Big Data, image recognition, automatic translation and other domains. This is transformative for the enterprise.

Jump Into the Portal

The glowing interdimensional portal of Anaconda Fusion has arrived, and enterprises can jump in right away. It's a great time to unite the experience and astuteness of business analysts with the power and flexibility of Python-powered analytics.

To learn more, you can watch our Anaconda Fusion webinar on-demand, or join our Anaconda Fusion Innovators Program to get early access to exclusive features -- free and open to anyone. You can also contact us with any questions about how Anaconda Fusion can help improve the way your business teams share data. 

by swebster at June 27, 2016 03:36 PM

June 23, 2016

Enthought

5 Simple Steps to Create a Real-Time Twitter Feed in Excel using Python and PyXLL

PyXLL 3.0 introduced a new, simpler, way of streaming real time data to Excel from Python.

Excel has had support for real time data (RTD) for a long time, but it requires a certain knowledge of COM to get it to work. With the new RTD features in PyXLL 3.0 it is now a lot simpler to get streaming data into Excel without having to write any COM code.

This blog will show how to build a simple real time data feed from Twitter in Python using the tweepy package, and then show how to stream that data into Excel using PyXLL.

(Note: The code from this blog is available on github https://github.com/pyxll/pyxll-examples/tree/master/twitter). 

Create a real-time Twitter feed in Excel using Python and PyXLL.

Step 1: Install tweepy and PyXLL

As we are interested in real time data, we will use tweepy's streaming API to connect to Twitter from Python. Details on this are available in the tweepy documentation. You can install tweepy and PyXLL from the Canopy package manager. You may also download PyXLL here.

Easily install Tweepy and PyXLL from Canopy's Package Manager.

Step 2: Get Twitter API keys

In order to access Twitter Streaming API you will need a Twitter API key, API secret, Access token and Access token secret. Follow the steps below to get your own access tokens.

1. Create a twitter account if you do not already have one.
2. Go to https://apps.twitter.com/ and log in with your twitter credentials.
3. Click “Create New App”.
4. Fill out the form, agree to the terms, and click “Create your Twitter application”
5. In the next page, click on “API keys” tab, and copy your “API key” and “API secret”.
6. Scroll down and click “Create my access token”, and copy your “Access token” and “Access token secret”.

Step 3: Create a Stream Listener Class to Print Tweets in Python

To start with, we can create a simple listener class that simply prints tweets as they arrive:

[sourcecode language="python"]
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import logging

_log = logging.getLogger(__name__)

# User credentials to access Twitter API
access_token = "YOUR ACCESS TOKEN"
access_token_secret = "YOUR ACCESS TOKEN SECRET"
consumer_key = "YOUR CONSUMER KEY"
consumer_secret = "YOUR CONSUMER KEY SECRET"

class TwitterListener(StreamListener):

    def __init__(self, phrases):
        auth = OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_token_secret)
        self.__stream = Stream(auth, listener=self)
        self.__stream.filter(track=phrases, async=True)

    def disconnect(self):
        self.__stream.disconnect()

    def on_data(self, data):
        print(data)

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    import time
    logging.basicConfig(level=logging.INFO)

    phrases = ["python", "excel", "pyxll"]
    listener = TwitterListener(phrases)

    # listen for 60 seconds then stop
    time.sleep(60)
    listener.disconnect()
[/sourcecode]

If we run this code, any tweets mentioning Python, Excel or PyXLL get printed:

~~~
python twitterxl.py

INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): stream.twitter.com
{“text”: “Excel keyboard shortcut – CTRL+1 to bring up Cell Formatting https://t.co/wvx634EpUy”, “is…
{“text”: “Excel Tips – What If Analysis #DMZWorld #Feature #Bond #UMI https://t.co/lxzgZnIItu #UMI”,…
{“text”: “How good are you at using #Excel? We’re looking for South Africa’s #ExcelChamp Ts & Cs…
{“text”: “The Best Data Scientists Run R and Python – insideBIGDATA https://t.co/rwty058dL2 #python …
{“text”: “How to Create a Pivot Table in Excel: A Step-by-Step Tutorial (With Video) \u2013 https://…
{“text”: “Python eats Alligator 02, Time Lapse Speed x6 https://t.co/3km8I92zJo”, “is_quote_status”:…

Process finished with exit code 0
~~~

In order to make this more suitable for getting these tweets into Excel we will now extend this TwitterListener class in the following ways:

– Broadcast updates to other *subscribers* instead of just printing tweets.
– Keep a buffer of the last few received tweets.
– Only ever create one listener for each unique set of phrases.
– Automatically disconnect listeners with no subscribers.

The updated TwitterListener class is as follows:

[sourcecode language="python"]
# additional imports needed by this class (besides those from the first snippet)
import json
import threading

class TwitterListener(StreamListener):
    """tweepy.StreamListener that notifies multiple subscribers when
    new tweets are received and keeps a buffer of the last 100 tweets
    received.
    """
    __listeners = {}  # class level cache of listeners, keyed by phrases
    __lock = threading.RLock()
    __max_size = 100

    @classmethod
    def get_listener(cls, phrases, subscriber):
        """Fetch an ExcelListener listening to a set of phrases and subscribe to it"""
        with cls.__lock:
            # get the listener from the cache or create a new one
            phrases = frozenset(map(str, phrases))
            listener = cls.__listeners.get(phrases, None)
            if listener is None:
                listener = cls(phrases)
                cls.__listeners[phrases] = listener

            # add the subscription and return
            listener.subscribe(subscriber)
            return listener

    def __init__(self, phrases):
        """Use static method 'get_listener' instead of constructing directly"""
        _log.info("Creating listener for [%s]" % ", ".join(phrases))
        self.__phrases = phrases
        self.__subscriptions = set()
        self.__tweets = [None] * self.__max_size

        # listen for tweets in a background thread using the 'async' keyword
        auth = OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_token_secret)
        self.__stream = Stream(auth, listener=self)
        self.__stream.filter(track=phrases, async=True)
        self.__connected = True

    @property
    def tweets(self):
        return list(self.__tweets)

    def subscribe(self, subscriber):
        """Add a subscriber that will be notified when new tweets are received"""
        with self.__lock:
            self.__subscriptions.add(subscriber)

    def unsubscribe(self, subscriber):
        """Remove subscriber added previously.
        When there are no more subscribers the listener is stopped.
        """
        with self.__lock:
            self.__subscriptions.remove(subscriber)
            if not self.__subscriptions:
                self.disconnect()

    def disconnect(self):
        """Disconnect from the twitter stream and remove from the cache of listeners."""
        with self.__lock:
            if self.__connected:
                _log.info("Disconnecting twitter stream for [%s]" % ", ".join(self.__phrases))
                self.__listeners.pop(self.__phrases)
                self.__stream.disconnect()
                self.__connected = False

    @classmethod
    def disconnect_all(cls):
        """Disconnect all listeners."""
        with cls.__lock:
            for listener in list(cls.__listeners.values()):
                listener.disconnect()

    def on_data(self, data):
        data = json.loads(data)
        with self.__lock:
            self.__tweets.insert(0, data)
            self.__tweets = self.__tweets[:self.__max_size]
            for subscriber in self.__subscriptions:
                try:
                    subscriber.on_data(data)
                except:
                    _log.error("Error calling subscriber", exc_info=True)
        return True

    def on_error(self, status):
        with self.__lock:
            for subscriber in self.__subscriptions:
                try:
                    subscriber.on_error(status)
                except:
                    _log.error("Error calling subscriber", exc_info=True)

if __name__ == '__main__':
    import time
    logging.basicConfig(level=logging.INFO)

    class TestSubscriber(object):
        """simple subscriber that just prints tweets as they arrive"""

        def on_error(self, status):
            print("Error: %s" % status)

        def on_data(self, data):
            print(data.get("text"))

    subscriber = TestSubscriber()
    listener = TwitterListener.get_listener(["python", "excel", "pyxll"], subscriber)

    # listen for 60 seconds then stop
    time.sleep(60)
    listener.unsubscribe(subscriber)
[/sourcecode]

When this is run it's very similar to the last case, except that now only the text part of each tweet is printed. Also note that the listener is not explicitly disconnected; that happens automatically when the last subscriber unsubscribes.

```
python twitterxl.py

INFO:__main__:Creating listener for [python, excel, pyxll]
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): stream.twitter.com
Linuxtoday Make a visual novel with Python: Linux User & Developer: Bridge the gap between books…
How to create drop down list in excel https://t.co/Ii2hKRlRBe…
RT @papisdotio: Flying dron with Python @theglamp #PAPIsConnect https://t.co/zzPNSFb66e…
RT @saaid230: The reason I work hard and try to excel at everything I do so one day I can take care …
RT @javacodegeeks: I’m reading 10 Awesome #Python Tutorials to Kick-Start my Web #Programming https:…
INFO:__main__:Disconnecting twitter stream for [python, excel, pyxll]

Process finished with exit code 0
```

Step 4: Feed the real time Twitter data into Excel using PyXLL

Now that the hard part of getting the streaming Twitter data into Python is taken care of, creating a real time data source in Excel using PyXLL is pretty straightforward.

PyXLL 3.0 has a new class, `RTD`. When a function decorated with the `xl_func` decorator returns an RTD instance, the value of the calling cell will be the `value` property of the RTD instance. If the value property of the returned RTD instance later changes, the cell value changes.

We will write a new class inheriting from RTD that acts as a subscriber to our twitter stream (in the same way as TestSubscriber in the code above). Whenever a new tweet is received it will update its value, and so the cell in Excel will update.

[sourcecode language="python"]
from pyxll import RTD

class TwitterRTD(RTD):
    """Twitter RTD class that notifies Excel whenever a new tweet is received."""

    def __init__(self, phrases):
        # call super class __init__ with an initial value
        super(TwitterRTD, self).__init__(value="Waiting for tweets...")

        # get the TwitterListener and subscribe to it
        self.__listener = TwitterListener.get_listener(phrases, self)

    def disconnect(self):
        # overridden from RTD base class. Called when Excel no longer
        # needs the RTD object (for example, when the cell formula
        # is changed).
        self.__listener.unsubscribe(self)

    def on_error(self, status):
        self.value = "#ERROR %s" % status

    def on_data(self, data):
        self.value = data.get("text")
[/sourcecode]

To expose that to Excel, all that's needed is a function that returns an instance of our new TwitterRTD class:

[sourcecode language="python"]
import itertools

from pyxll import xl_func

@xl_func("string[] phrases: rtd")
def twitter_listen(phrases):
    """Listen for tweets containing certain phrases"""
    # flatten the 2d list of lists into a single list of phrases
    phrases = [str(x) for x in itertools.chain(*phrases) if x]
    assert len(phrases) > 0, "At least one phrase is required"

    # return our TwitterRTD object that will update when a tweet is received
    return TwitterRTD(phrases)
[/sourcecode]

All that’s required now is to add the module to the pyxll.cfg file, and then the new function ‘twitter_listen’ will appear in Excel, and calling it will return a live stream of tweets.


Step 5: Enrich the feed information with Tweet metadata

So far we’ve got live tweets streaming into Excel, which is pretty cool, but only one tweet is visible at a time and we can only see the tweet text. It would be even better to see a grid of data showing the most recent tweets with some metadata as well as the tweet itself.

RTD functions always return just a single cell of data, so what we need to do is write a slightly different function that takes a couple more arguments: a key for the part of the tweet we want (e.g. 'text' or 'created_at') and an index (e.g. 0 for the latest tweet, 1 for the second most recent tweet, etc.).

As some interesting bits of metadata are in nested dictionaries in the twitter data, the ‘key’ used to select the item from the data dictionary is a ‘/’ delimited list of keys used to drill into tweet data (for example, the name of the user is in the sub-dictionary ‘user’, so to retrieve it the key ‘user/name’ would be used).

The TwitterListener class we’ve written already keeps a limited history of the tweets it’s received so this isn’t too much more than we’ve already done.

[sourcecode language="python"]
class TwitterRTD(RTD):
    """Twitter RTD class that notifies Excel whenever a new tweet is received."""

    def __init__(self, phrases, row=0, key="text"):
        super(TwitterRTD, self).__init__(value="Waiting for tweets...")
        self.__listener = TwitterListener.get_listener(phrases, self)
        self.__row = row
        self.__key = key

    def disconnect(self):
        self.__listener.unsubscribe(self)

    def on_error(self, status):
        self.value = "#ERROR %s" % status

    def on_data(self, data):
        # if there are no tweets for this row return an empty string
        tweets = self.__listener.tweets
        if len(tweets) < self.__row or not tweets[self.__row]:
            self.value = ""
            return

        # get the value from the tweets
        value = tweets[self.__row]
        for key in self.__key.split("/"):
            if not isinstance(value, dict):
                value = ""
                break
            value = value.get(key, {})

        # set the value back in Excel
        self.value = str(value)
[/sourcecode]

The worksheet function also has to be updated to take these extra arguments:

[sourcecode language="python"]
@xl_func("string[] phrases, int row, string key: rtd")
def twitter_listen(phrases, row=0, key="text"):
    """Listen for tweets containing certain phrases"""
    # flatten the 2d list of lists into a single list of phrases
    phrases = [str(x) for x in itertools.chain(*phrases) if x]
    assert len(phrases) > 0, "At least one phrase is required"

    # return our TwitterRTD object that will update when a tweet is received
    return TwitterRTD(phrases, row, key)
[/sourcecode]

After reloading the PyXLL addin, or restarting Excel, we can now call this modified function with different values for row and key to build an updating grid of live tweets.


One final step is to make sure that any active streams are disconnected when Excel closes. This stops the tweepy background thread from keeping Excel from exiting cleanly.

[sourcecode language="python"]
from pyxll import xl_on_close

@xl_on_close
def disconnect_all_listeners():
    TwitterListener.disconnect_all()
[/sourcecode]

The code from this blog is available on github https://github.com/pyxll/pyxll-examples/tree/master/twitter.

by admin at June 23, 2016 06:13 PM

June 21, 2016

Enthought

AAPG 2016 Conference Technical Presentation: Unlocking Whole Core CT Data for Advanced Description and Analysis

Microscale Imaging for Unconventional Plays Track Technical Presentation:

Unlocking Whole Core CT Data for Advanced Description and Analysis

Brendon Hall, Geoscience Applications Engineer, Enthought
American Association of Petroleum Geologists (AAPG)
2016 Annual Convention and Exposition Technical Presentation
Tuesday June 21st at 4:15 PM, Hall B, Room 2, BMO Centre, Calgary

Presented by: Brendon Hall, Geoscience Applications Engineer, Enthought, and Andrew Govert, Geologist, Cimarex Energy

PRESENTATION ABSTRACT:

It has become an industry standard for whole-core X-ray computed tomography (CT) scans to be collected over cored intervals. The resulting data is typically presented as static 2D images, video scans, and as 1D density curves.

CT scan of core pre- and post-processing

CT scans of cores before and after processing to remove artifacts and normalize features.

However, the CT volume is a rich data set of compositional and textural information that can be incorporated into core description and analysis workflows. In order to access this information, the raw CT data initially has to be processed to remove artifacts such as the aluminum tubing, wax casing and mud filtrate. CT scanning effects such as beam hardening are also accounted for. The resulting data is combined into a contiguous volume of CT intensity values which can be directly calibrated to plug bulk density.

With this processed CT data:

  • The volume can be analyzed to identify bedding structure, dip angle, and fractures.
  • Bioturbation structures can often be easily identified by contrasts in CT intensity values due to sediment reworking or mineralization.
  • CT facies can be determined by segmenting the intensity histogram distribution. This provides continuous facies curves along the core, indicating relative amounts of each material. These curves can be integrated to provide estimates of net to gross even in finely interbedded regions. Individual curves often exhibit cyclic patterns that can help put the core in the proper sequence stratigraphic framework (a short sketch of this segmentation follows this list).
  • The CT volume can be analyzed to classify the spatial relationships between the intensity values to give a measure of texture. This can be used to further discriminate between facies that have similar composition but different internal organization.
  • Finally these CT derived features can be used alongside log data and core photographs to train machine learning algorithms to assist with upscaling the core description to the entire well.
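
As a rough illustration of the histogram-segmentation idea mentioned in the list above, here is a short NumPy sketch; it is not Enthought's implementation, and the volume, thresholds and number of facies are invented for the example.

import numpy as np

# Hypothetical processed CT volume: depth slices x rows x columns of intensity values
np.random.seed(0)
ct_volume = np.random.normal(loc=2400, scale=150, size=(500, 64, 64))

# Assumed intensity thresholds separating three "CT facies"
thresholds = [2300, 2500]
labels = np.digitize(ct_volume, thresholds)  # 0, 1 or 2 for every voxel

# Fraction of each facies per depth slice -> continuous facies curves along the core
facies_curves = np.stack(
    [(labels == k).mean(axis=(1, 2)) for k in range(len(thresholds) + 1)],
    axis=1)
print(facies_curves.shape)  # (500, 3): one fractional curve per facies
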
Virtual Core Co-Visualization

Virtual Core software allows for co-visualization and macro- to microscopic investigation of core features.

by admin at June 21, 2016 03:15 PM

Matthieu Brucher

Audio Toolkit: Transient splitter

After my transient shaper, some people told me it would be nice to have a splitter: split the signal into two tracks, one with the transient, the other with the sustain. For instance, it would be interesting to apply a different distortion to each signal.

So, for instance, this is what could happen for a simple signal. The sustain signal is not completely shut off, and there can be a smooth transition between the two signals (thanks to the smoothness parameter). Of course, the final signals have to sum back to the original signal.

How a transient splitter would work
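
To make the idea concrete, here is a rough Python/NumPy sketch of one way such a split could be prototyped, using a fast and a slow envelope follower; this is not the Audio Toolkit implementation, and the time constants are arbitrary.

import numpy as np

def transient_split(x, sr, fast_ms=1.0, slow_ms=50.0, smooth_ms=10.0):
    """Split x into (transient, sustain) so that the two always sum back to x."""
    def follower(signal, ms):
        # simple one-pole envelope smoother
        alpha = np.exp(-1.0 / (sr * ms / 1000.0))
        out = np.empty_like(signal)
        acc = 0.0
        for i, v in enumerate(signal):
            acc = alpha * acc + (1.0 - alpha) * v
            out[i] = acc
        return out

    env = np.abs(x)
    fast = follower(env, fast_ms)
    slow = follower(env, slow_ms)
    # where the fast envelope exceeds the slow one, we are in a transient
    gain = np.clip((fast - slow) / (slow + 1e-12), 0.0, 1.0)
    gain = follower(gain, smooth_ms)  # the "smoothness" parameter
    transient = gain * x
    sustain = (1.0 - gain) * x  # transient + sustain == x by construction
    return transient, sustain

# Example: a decaying 220Hz burst, roughly a plucked note
sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t) * np.exp(-5 * t)
transient, sustain = transient_split(x, sr)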

I may end up doing a stereo version (with M/S capabilities) of the splitter, but maybe also another one with some distortion algorithms before everything is summed up again.

Let me know what you think about these ideas.

by Matt at June 21, 2016 07:59 AM

June 19, 2016

Filipe Saraiva

My LaKademy 2016

LaKademy 2016 group photo

At the end of May, ~20 gearheads from different countries of Latin America came together in Rio de Janeiro to work on several fronts of KDE. This is our 'multiple projects sprint', named LaKademy!

Like all previous editions of LaKademy, this year I worked hard on Cantor; unlike all previous editions, this year I also did some work on new projects to be released at some point in the future. So, here is my report on LaKademy 2016.

Cantor

LaKademy is very important to Cantor development because during the sprint I can focus and work hard on implementing great features in the software. In past editions I started the Python 2 backend development, ported Cantor to Qt5/KF5, dropped kdelibs4support, and more.

This year is the first LaKademy since I became the maintainer of Cantor and, even better, it is the first edition where I was not the only developer working on Cantor: we had a team working on different parts of the project.

My main task was a heavy bug triage in Cantor, closing old bugs and confirming others. In addition I fixed several bugs, such as the LaTeX rendering and the crash after closing the window in the Sage backend, and the plot commands in the Octave backend.

My second task was to help the other developers working on Cantor; I was very happy to work with different LaKademy attendees on the software. I helped Fernando Telles, my SoK 2015 student, fix the Sage backend for Sage versions > 7.2. Wagner Reck was working on a new backend for Root, the scientific programming framework developed by CERN. Rafael Gomes created a Docker image for Cantor in order to make environment configuration, builds, and code contributions easier for new developers. He wants to use it for other KDE software and I am really excited to see Cantor as the first software in this experiment.

Other relevant work included discussions with other developers about selecting an "official" technology for creating Cantor backends. Currently Cantor has backends developed in several ways: some of them use C/C++ APIs, others use Q/KProcess, others use DBus... you can imagine that maintaining all these backends is a job for crazy humans.

I have not selected the official technology yet. Both DBus and Q/KProcess have advantages and disadvantages (DBus is a more 'elegant' solution, but bringing Cantor to other operating systems may be easier with Q/KProcess)... well, I will wait for the new DBus-based Julia backend, in development by our GSoC 2016 student, before deciding which solution to use.

From left to right: Ronny, Fernando, Ícaro, and me ;)

New projects: Sprat and Leibniz (non-official names)

This year I also worked on some new projects to be released in the future. Their provisional names are Sprat and Leibniz.

Sprat is a text editor for writing drafts of scientific papers. Scientific text follows certain patterns of sentences and figures of communication. Think about "An approach based on a genetic algorithm was applied to the traveling salesman problem": it is easy to identify the pattern in that text. Linguistics has studied this theme, and it is possible to classify sentences based on the communication objective a sentence is meant to reach. Sprat will allow the user to navigate a set of sentences and select them to create drafts of scientific papers. I intend to release Sprat this year, so please wait for more news soon.

Leibniz is Cantor without worksheets. Sometimes you just want to run your mathematical method, your scientific script, and some related computer programs, without putting explanations, figures, or videos in the terminal. In the KDE world we have amazing technologies that allow us to develop a "Matlab-like" interface (KonsolePart, KTextEditor, QWidgets, and plugins) to all kinds of scientific programming languages like Octave, Python, Scilab, R... just by running these programs in KonsolePart we get access to syntax highlighting, tab completion... I would like to have software like this, so I started the development. I decided to develop new software rather than a new view for Cantor because I think the source code of Leibniz will be small and easier to maintain.

So, if you are excited about any of them, let me know in the comments below and wait a few months for more news! 🙂

Community-related tasks

During LaKademy we had our promo meeting, an entire morning to discuss KDE promo actions in Latin America. KDE will have a day of activities at FISL and we are excited to organize amazing KDE 20th birthday parties at the main free software events in Brazil. We also evaluated and discussed the continuation of some interesting activities like Engrenagem (our videocast series) and new projects like demo videos for KDE applications.

At that meeting we also chose the city to host LaKademy 2017: Belo Horizonte! We expect an incredible year of KDE activities in Latin America, to be evaluated at our next promo meeting.

Conclusion: “O KDE na América Latina continua lindo” (“KDE in Latin America is still beautiful”)

This edition of LaKademy saw strong and dedicated work by all attendees on several fronts of KDE, but we also had some moments to spend together and consolidate our community and friendship. Unfortunately we did not have time to explore Rio de Janeiro (it was my first time in the city), but I got a good impression of the city and its people. I intend to go back, maybe even this year.

The best part of being a member of a community like KDE is making friends for life, people you like to share beers and food with while chatting about anything. This is amazing for me and I found it in KDE. <3

Thank you KDE and see you soon in next LaKademy!

by Filipe Saraiva at June 19, 2016 10:52 PM

June 17, 2016

Continuum Analytics news

Anaconda and Docker - Better Together for Reproducible Data Science

Posted Monday, June 20, 2016

Anaconda integrates with many different providers and platforms to give you access to the data science libraries you love on the services you use, including Amazon Web Services, Microsoft Azure, and Cloudera CDH. Today we’re excited to announce our new partnership with Docker.

As part of the announcements at DockerCon this week, Anaconda images will be featured in the new Docker Store, including Anaconda and Miniconda images based on Python 2 and Python 3. These freely available Anaconda images for Docker are now verified, will be featured in the Docker Store when it launches, are being regularly scanned for security vulnerabilities and are available from the ContinuumIO organization on Docker Hub.

The Anaconda images for Docker make it easy to get started with Anaconda on any platform, and provide a flexible starting point for developing or deploying data science workflows with more than 100 of the most popular Open Data Science packages for Python and R, including data analysis, visualization, optimization, machine learning, text processing and more.

Whether you’re a developer, data scientist, or devops engineer, Anaconda and Docker can provide your entire data science team with a scalable, deployable and reproducible Open Data Science platform.

Use Cases with Anaconda and Docker

Anaconda and Docker are a great combination to empower your development, testing and deployment workflows with Open Data Science tools, including Python and R. Our users often ask whether they should be using Anaconda or Docker for data science development and deployment workflows. We suggest using both - they’re better together!

Anaconda’s sandboxed environments and Docker’s containerization complement each other to give you portable Open Data Science functionality when you need it - whether you’re working on a single machine, across a data science team or on a cluster.

Here are a few different ways that Anaconda and Docker make a great combination for data science development and deployment scenarios:

1) Quick and easy deployments with Anaconda

Anaconda and Docker can be used to quickly reproduce data science environments across different platforms. With a single command, you can quickly spin up a Docker container with Anaconda (and optionally with a Jupyter Notebook) and have access to 720+ of the most popular packages for Open Data Science, including Python and R.

2) Reproducible build and test environments with Anaconda

At Continuum, we’re using Docker to build packages and libraries for Anaconda. The build images are available from the ContinuumIO organization on Docker Hub (e.g., conda-builder-linux and centos5_gcc5_base). We also use Docker with continuous integration services, such as Travis CI, for automated testing of projects across different platforms and configurations (e.g., Dask.distributed and hdfs3).

Within the open-source Anaconda and conda community, Docker is also used for reproducible test and build environments. Conda-forge is a community-driven infrastructure for conda recipes that uses Docker with Travis CI and CircleCI to build, test and upload conda packages that include Python, R, C++ and Fortran libraries. The Docker images used in conda-forge are available from the conda-forge organization on Docker Hub.

3) Collaborative data science workflows with Anaconda

You can use Anaconda with Docker to build, containerize and share your data science applications with your team. Collaborative data science workflows with Anaconda and Docker make the transition from development to deployment as easy as sharing a Dockerfile and conda environment.

Once you’ve containerized your data science applications, you can use container clustering systems, such as Kubernetes or Docker Swarm, when you’re ready to productionize, deploy and scale out your data science applications for many users.

4) Endless combinations with Anaconda and Docker

The combined portability of Anaconda and flexibility of Docker enable a wide range of data science and analytics use cases.

A search for “Anaconda“ on Docker Hub shows many different ways that users are leveraging libraries from Anaconda with Docker, including turnkey deployments of Anaconda with Jupyter Notebooks; reproducible scientific research environments; and machine learning and deep learning applications with Anaconda, TensorFlow, Caffe and GPUs.

Using Anaconda Images with Docker

There are many ways to get started using the Anaconda images with Docker. First, choose one of the Anaconda images for Docker based on your project requirements. The Anaconda images include the default packages listed here, and the Miniconda images include a minimal installation of Python and conda.

continuumio/anaconda (based on Python 2.7)
continuumio/anaconda3 (based on Python 3.5)
continuumio/miniconda (based on Python 2.7)
continuumio/miniconda3 (based on Python 3.5)

For example, we can use the continuumio/anaconda3 image, which can be pulled from the Docker repository:

$ docker pull continuumio/anaconda3

Next, we can run the Anaconda image with Docker and start an interactive shell:

$ docker run -i -t continuumio/anaconda3 /bin/bash

Once the Docker container is running, we can start an interactive Python shell, install additional conda packages or run Python applications.

Alternatively, we can start a Jupyter Notebook server with Anaconda from a Docker image:

$ docker run -i -t -p 8888:8888 continuumio/anaconda3 /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8888 --no-browser"

You can then view the Jupyter Notebook by opening http://localhost:8888 in your browser, or http://<DOCKER-MACHINE-IP>:8888 if you are using a Docker Machine VM.

Once you are inside of the running notebook, you can import libraries from Anaconda, perform interactive computations and visualize your data.
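
For example, a minimal computation you might run inside the notebook (pandas and matplotlib ship with the Anaconda image):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# build a small random-walk series and plot it
df = pd.DataFrame({"step": np.random.randn(1000)})
df["walk"] = df["step"].cumsum()
df["walk"].plot(title="Random walk")
plt.show()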

Additional Resources for Anaconda and Docker

Anaconda and Docker complement each other and make working with Open Data Science development and deployments easy and scalable. For collaborative workflows, Anaconda and Docker provide everyone on your data science team with access to scalable, deployable and reproducible Open Data Science.

Get started with Anaconda with Docker by visiting ContinuumIO organization on Docker Hub. The Anaconda images will also be featured in the Docker Store when it launches.

Interested in using Anaconda and Docker in your organization for Open Data Science development, reproducibility and deployments? Get in touch with us if you’d like to learn more about how Anaconda can empower your enterprise with Open Data Science, including an on-premise package repository, collaborative notebooks, cluster deployments and custom consulting/training solutions.

by swebster at June 17, 2016 04:03 PM

June 14, 2016

Continuum Analytics news

Orange Part II: Monte Carlo Simulation

Posted Wednesday, June 22, 2016

For the blog post Orange Part I: Building Predictive Models, please click here.

In this blogpost, we will explore the versatility of Orange through a Monte Carlo simulation of Apple’s stock price. For an explanation of Monte Carlo simulation for stocks, visit Investopedia.

Let’s take a look at our schema:

We start off by grabbing AAPL stock data off of Yahoo! Finance and loading it into our canvas. This gives us all of AAPL’s data starting from 1980, but we only want to look at relatively recent data. Fortunately, Orange comes with a variety of data management and preprocessing techniques. Here, we can use the “Purge Domain” widget to simply remove the excess data.
 
After doing so, we can see what AAPL’s closing stock price is post-2008 through a scatter plot. 

In order to run our simulation, we need certain inputs, including daily returns. Our data does not come with AAPL’s daily returns, but, fortunately, daily returns can easily be calculated via pandas. We can save our current data, add on our daily returns and then load up the modified dataset back into our canvas. After saving the data to AAPL.tab, we run the following script on our data: https://anaconda.org/rahuljain/monte-carlo-with-orange/notebook. After doing so, we simply load up the new data. Here is what our daily returns look like: 
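
For readers following along outside the canvas, here is a minimal pandas sketch of the return calculation; it is not the linked notebook, and the 'Close' column name and file layout are assumptions (Orange's extra .tab header rows are ignored for simplicity).

import pandas as pd

# assumes AAPL.tab is tab-separated with a 'Close' column, as saved from the canvas
aapl = pd.read_csv("AAPL.tab", sep="\t")
aapl["Daily Return"] = aapl["Close"].pct_change()
aapl.to_csv("AAPL_returns.tab", sep="\t", index=False)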

Now, we need to use our daily returns to run the Monte Carlo simulation. We can again use a Python script for this task; this time let’s use the built-in Python script widget. Note that we could have used the built-in widget for the last script as well, but we wanted to see how we could save/load our data within the canvas. For our Monte Carlo simulation, we will need four parameters: starting stock price, number of days we want to simulate and the standard deviation and mean of AAPL’s daily returns. We can find these inputs with the following script: 

We go ahead and run our simulation 1000 times with the starting stock price of $125.04. The script takes in our stock data and outputs a dataset containing 1000 price points 365 days later. 
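
As a rough sketch of this kind of Monte Carlo loop (assuming daily returns are modelled as normally distributed; the mean and standard deviation below are placeholders rather than the actual AAPL figures):

import numpy as np

np.random.seed(1780812262)
start_price = 125.04
days = 365
runs = 1000
mu, sigma = 0.001, 0.02  # placeholders for the mean/std of AAPL's daily returns

daily_returns = np.random.normal(mu, sigma, size=(runs, days))
price_paths = start_price * np.cumprod(1 + daily_returns, axis=1)
final_prices = price_paths[:, -1]  # 1000 simulated prices 365 days later
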
We can visualize these prices via a box plot and histograms: 

With this simulated data, we can make various calculations; a common one is Value at Risk (VaR). Here, we can say with 99% confidence that our stock’s price will be above $116.41 in 365 days, so we are putting at most $8.63 (starting price - $116.41) at risk with 99% confidence; we would expect a larger loss only 1% of the time.
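
Continuing the sketch above, the VaR figure can be read off the simulated final prices with a percentile (the numbers will differ from those quoted here, which come from the Orange workflow):

import numpy as np

price_at_99 = np.percentile(final_prices, 1)  # price we stay above 99% of the time
value_at_risk = start_price - price_at_99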

We have successfully built a Monte Carlo simulation with Orange; this task demonstrates how we can use Orange outside of its machine learning tools.

Summary

The three demos in this Orange blogpost series showed how Orange users can quickly and intuitively work with data sets. Because of its component-based design and integration with Python, Orange should appeal to machine learning researchers for its speed of execution and the ease of prototyping new methods. A graphical user interface is provided through visual programming and a large toolbox of widgets that support interactive data exploration. Component-based design, at the level of both procedural and visual programming, flexibility in combining components to design new machine learning methods and data mining applications, and a user-friendly environment are the most significant attributes of Orange, and where Orange can make its biggest contribution to the community.

by swebster at June 14, 2016 08:26 PM

Orange Part I: Building Predictive Models

Posted Wednesday, June 15, 2016

In this blog series we will showcase Orange, an open source data visualization and data analysis tool, through two simple predictive models and a Monte Carlo Simulation. 

Introduction to Orange

Orange is a comprehensive, component-based framework for machine learning and data mining. It is intended for both experienced users and researchers in machine learning, who want to prototype new algorithms while reusing as much of the code as possible, and for those just entering the field who can either write short Python scripts for data analysis or enjoy the powerful, easy-to-use visual programming environment. Orange includes a range of techniques, such as data management and preprocessing, supervised and unsupervised learning, performance analysis and a range of data and model visualization techniques.

Orange has a visual programming front-end for explorative data analysis and visualization called Orange Canvas. Orange Canvas is a visual, component-based programming approach that allows us to quickly explore and analyze data sets. Orange’s GUI is composed of widgets that communicate through channels; a set of connected widgets is called a schema. The creation of schemas is quick and flexible, because widgets are added on through a drag-and-drop method.

Orange can also be used as a Python library. Using the Orange library, it is easy to prototype state-of-the-art machine learning algorithms.

Building a Simple Predictive Model in Orange

We start with two simple predictive models in the Orange canvas and their corresponding Jupyter notebooks. 

First let’s take a look at our Simple Predictive Model- Part 1 notebook. Now, let’s recreate the model in the Orange Canvas. Here is the schema for predicting the results of the Iris data set via a classification tree in Orange: 

Notice the toolbar on the left of the canvas- this is where the 100+ widgets can be found and dragged onto the canvas. Now, let’s take a look at how this simple schema works. The schema reads from left to right, with information flowing from widget to widget through the pipelines. After the Iris data set is loaded in, it can be viewed through a variety of widgets. Here, we chose to see the data in a simple data table and a scatter plot. When we click on those two widgets, we see the following: 

With just three widgets, we already get a sense of the data we are working with. The scatter plot has an option to “Rank Projections,” determining the best way to view our data. In this case, having the scatter plot as “Petal Width vs Petal Length” allows us to immediately see a potential pattern in the width of a flower’s petal and the type of iris the flower is. Beyond scatter plots, there are a variety of different widgets to help us visualize our data in Orange. 

Now, let’s look at how we built our predictive model. We simply connected the data to a Classification Tree widget and can view the tree through a Classification Tree Viewer widget. 

We can see exactly how our predictive model works. Now, we connect our model and our data to the “Test and Score” and “Predictions” widgets. The Test and Score widget is one way of seeing how well our Classification Tree performs: 

The Predictions widget predicts the type of iris flower given the input data. Instead of looking at a long list of these predictions, we can use a confusion matrix to see our predictions and their accuracy. 

Thus, we see our model misclassified 3/150 data instances. 

We have seen how quickly we can build and visualize a working predictive model in the Orange canvas. Now, let’s take a look at how the exact same model can be built via scripting in Orange, a Python 3 data mining library.
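
For reference, here is a rough sketch of the same classification tree in the Orange scripting library (assuming the Orange 3 API; the 5-fold cross-validation is my choice, not the notebook's):

import Orange

iris = Orange.data.Table("iris")
tree = Orange.classification.TreeLearner()

# cross-validate the learner and report classification accuracy
results = Orange.evaluation.CrossValidation(iris, [tree], k=5)
print("Classification accuracy:", Orange.evaluation.CA(results)[0])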

Building a Predictive Model with a Hold Out Test Set in Orange

In our second example of a predictive model, we make the model slightly more complicated by holding out a test set. By doing so, we can use separate datasets to train and test our model, thus helping to avoid overfitting. Here is the original notebook. 

Now, let’s build the same predictive model in the Orange Canvas. The Orange Canvas will allow us to better visualize what we are building. 

Orange Schema:

As you can tell, the difference between Part 1 and Part 2 is the Data Sampler widget. This widget randomly separates 30% of the data into the testing data set. Thus, we can build the same model, but more accurately test it using data the model has never seen before. 

This example shows how easy it is to modify existing schemas. We simply introduced one new widget to vastly improve our model. 

Now let’s look at the same model built via the Orange Python 3 library.

Summary

In this blogpost, we have introduced Orange, an open source data visualization and data analysis tool, and presented two simple predictive models. In our next blogpost, we will instruct how to build a Monte Carlo Simulation done with Orange.

by swebster at June 14, 2016 08:06 PM

Matthieu Brucher

Analog modeling: SD1 vs TS9

There are so many different distortion/overdrive/fuzz guitar pedals, and some have a better reputation than others. Two of them have a reputation for being close (one copied from the other), and I already explained how one of these could be modeled (and I have a plugin with it!). So let’s compare the SD1 and the TS9.

Global comparison

I won’t focus on the input and output stages, although they can play a role in the sound (especially since the output stage is the only difference between the TS9 and the TS808…).

Let’s have a look at the schematics:

SD1 annotated schematic

TS9 schematic

The overall circuits seem similar, with similar functionality. The overdrive sections are very close; actually, without the asymmetry of the SD1, they would be identical (there are versions of the SD1 with the 51p capacitor or a close enough value). The tone circuits have more differences, with the TS9 having an additional 10k resistor and a “missing” capacitor around the op-amp. The values are also quite different, but based on a similar design.

Now, the input stages are different. The SD1 has a larger input capacitor, but removes around 10% of the input signal compared to 1% for the TS9 (not accounting for the guitar output impedance). Also, there are two high-pass filters on the SD1 with the same cut-off frequency at 50Hz, whereas the TS9 has “only” one at 100Hz. They more or less end up being similar. For the output, the SD1 ditches 33% of the final signal before the output stage, which also has a high-pass filter at 20Hz followed by another one at 10Hz. The TS9 also has a 20Hz high-pass, but it is followed by another 1Hz high-pass. All things considered, except for the overdrive and the tone circuits, there should be no audible difference on a guitar, but I wouldn’t advise either pedal for a bass guitar; the input stages are chopping off too much.

Overdrive circuit

The overdrive circuits are almost a match. The only differences are that the potentiometer has double the resistance on the SD1 and that there are 2 diodes in one path (the capacitor has no impact according to the LTSpice simulation I ran). This leads to exactly what I expected for similar drive values:

SD1 and TS9 behavior on a 100Hz signal

This is the behavior for all frequencies. The only difference is the slightly smaller voltage on the lower part of the curve. This shows up more clearly on the spectrum:

SD1 sine sweep with an oversampling x4

TS9 sine sweep with an oversampling x4

To limit the noise in this case, I ran the sine sweep again, with an oversampling of x8. The difference due to the additional even harmonics in the SD1 is obvious.

SD1 sine sweep with an oversampling x8

TS9 sine sweep with an oversampling x8

Tone

The tone circuit is a nightmare to compute by hand. The issue lies in simplifying the potentiometer in the equations. I did it for the SD1 tone circuit, and as the TS9 is a little bit different, I had to start over (several years after solving the SD1 :/).

I won’t display the equations here; the coefficients can be found in the pedal tone stack filters in Audio Toolkit. Suffice it to say that the TS9 can be a high-pass filter whereas the SD1 is definitely an EQ. The different behavior is obvious in the following pictures:

SD1 tone transfer function

TS9 tone transfer function

The transfer functions are different even though the analog circuits are quite similar. This is definitely the difference that people hear between the SD1 and the TS9.

Conclusion

The two pedals are quite similar when checking the circuits, and even though the SD1 is labelled as an asymmetric overdrive, the actual sound difference between the two pedals may be related more to the tone circuit than to the overdrive.

Now that these filters are available in Audio Toolkit, it is easy to try different combinations!


by Matt at June 14, 2016 07:06 AM

June 13, 2016

Fernando Perez

In Memoriam, John D. Hunter III: 1968-2012

I just returned from the SciPy 2013 conference, whose organizers kindly invited me to deliver a keynote. For me this was a particularly difficult, yet meaningful edition of SciPy, my favorite conference. It was only a year ago that John Hunter, creator of matplotlib, had delivered his keynote shortly before being diagnosed with terminal colon cancer, from which he passed away on August 28, 2012 (if you haven't seen his talk, I strongly recommend it for its insights into scientific open source work).

On October 1st 2012, a memorial service was held at the University of Chicago's Rockefeller Chapel, the location of his PhD graduation. On that occasion I read a brief eulogy, but for obvious reasons only a few members from the SciPy community were able to attend. At this year's SciPy conference, Michael Droettboom (the new project leader for matplotlib) organized the first edition of the John Hunter Excellence in Plotting Contest, and before the awards ceremony I read a slightly edited version of the text I had delivered in Chicago (you can see the video here). I only made a few changes for brevity and to better suit the audience of the SciPy conference. I am reproducing it below.

I also went through my photo albums and found images I had of John. A memorial fund has been established in his honor to help with the education of his three daughters Clara, Ava and Rahel (Update: the fund was closed in late 2012 and its proceeds given to the family; moving forward, NumFOCUS sponsors the John Hunter Technology Fellowship, to which anyone can make contributions).


Dear friends and colleagues,

I used to tease John by telling him that he was the man I aspired to be when I grew up. I am not sure he knew how much I actually meant that. I first met him over email in 2002, when IPython was in its infancy and had rudimentary plotting support via Gnuplot. He sent me a patch to support a plotting syntax more akin to that of matlab, but I was buried in my effort to finish my PhD and couldn’t deal with his contribution for at least a few months. In the first example of what I later came to know as one of his signatures, he kindly replied and then simply routed around this blockage by single-handedly creating matplotlib. For him, building an entire new visualization library from scratch was the sensible solution: he was never one to be stopped by what many would consider an insurmountable obstacle.

Our first personal encounter was at SciPy 2004 at Caltech. I was immediately taken by his unique combination of generous spirit, sharp wit and technical prowess, and over the years I would grow to love him as a brother. John was a true scholar, equally at ease in a conversation about monetary policy, digital typography or the intricacies of C++ extensions in Python. But never once would you feel from him a hint of arrogance or condescension, something depressingly common in academia. John was driven only by the desire to work on interesting questions and to always engage others in a meaningful way, whether solving their problems, lifting their spirits or simply sharing a glass of wine. Beneath a surface of technical genius, there lay a kind, playful and fearless spirit, who was quietly comfortable in his own skin and let the power of his deeds speak for him.

Beyond the professional context, John had a rich world populated by the wonders of his family, his wife Miriam and his daughters Clara, Ava and Rahel. His love for his daughters knew no bounds, and yet I never once saw him clip their wings out of apprehension. They would be up on trees, dangling from monkeybars or riding their bikes, and he would always be watchful but encouraging of all their adventures. In doing so, he taught them to live like he did: without fear that anything could be too difficult or challenging to accomplish, and guided by the knowledge that small slips and failures were the natural price of being bold and never settling for the easy path.

A year ago in this same venue, John drew lessons from a decade’s worth of his own contributions to our community, from the vantage point of matplotlib. Ten years earlier at U. Chicago, his research on pediatric epilepsy required either expensive and proprietary tools or immature free ones. Along with a few similarly-minded folks, many of whom are in this room today, John believed in a future where science and education would be based on openly available software developed in a collaborative fashion. This could be seen as a fool’s errand, given that the competition consisted of products from companies with enormous budgets and well-entrenched positions in the marketplace. Yet a decade later, this vision is gradually becoming a reality. Today, the Scientific Python ecosystem powers everything from history-making astronomical discoveries to large financial modeling companies. Since all of this is freely available for anyone to use, it was possible for us to end up a few years ago in India, teaching students from distant rural colleges how to work with the same tools that NASA uses to analyze images from the Hubble Space Telescope. In recognition of the breadth and impact of his contributions, the Python Software Foundation awarded him posthumously the first installment of its highest distinction, the PSF Distinguished Service Award.

John’s legacy will be far-reaching. His work in scientific computing happened in a context of turmoil in how science and education are conducted, financed and made available to the public. I am absolutely convinced that in a few decades, historians of science will describe the period we are in right now as one of deep and significant transformations to the very structure of science. And in that process, the rise of free openly available tools plays a central role. John was on the front lines of this effort for a decade, and with his accomplishments he shone brighter than most.

John’s life was cut far, far too short. We will mourn him for time to come, and we will never stop missing him. But he set the bar high, and the best way in which we can honor his incredible legacy is by living up to his standards: uncompromising integrity, never-ending intellectual curiosity, and most importantly, unbounded generosity towards all who crossed his path. I know I will never grow up to be John Hunter, but I know I must never stop trying.

Fernando Pérez

June 27th 2013, SciPy Conference, Austin, Tx.

by Fernando Perez (noreply@blogger.com) at June 13, 2016 10:13 AM

June 09, 2016

Pierre de Buyl

ActivePapers: hello, world

License: CC-BY

ActivePapers is a technology developed by Konrad Hinsen to store code, data and documentation with several benefits: storage in a single HDF5 file, internal provenance tracking (what code created what data/figure, with a Make-like conditional execution) and a containerized execution environment.

Implementations for the JVM and for Python are provided by the author. In this article, I go over the first steps of creating an ActivePaper. Being a regular user of Python, I cover only this language.

An overview of ActivePapers

First, a "statement of fact": An ActivePaper is a HDF5 file. That is, it is a binary, self-describing, structured and portable file whose content can be explored with generic tools provided by the HDF Group.

The ActivePapers project is developed by Konrad Hinsen as a vehicle for the publication of computational work. This description is a bit short and does not convey the depth that has gone into the design of ActivePapers; the ActivePapers paper provides more information.

ActivePapers come, by design, with restrictions on the code that is executed. For instance, only Python code (in the Python implementation) can be used, with the scientific computing module NumPy. All data is accessed via the h5py module. The goals behind these design choices are related to security and to a good definition of the execution environment of the code.

Creating an ActivePaper

The tutorial on the ActivePapers website starts by looking at an existing ActivePaper. I'll go the other way around, as I found it more intuitive. Interactions with an ActivePaper are channeled by the aptool program (see the installation notes).

Currently, ActivePapers lacks a "hello, world" program, so here is mine. ActivePapers works best when you dedicate a directory to a single ActivePaper. You may enter the following in a terminal:

mkdir hello_world_ap                 # create a new directory
cd hello_world_ap                    # visit it
aptool -p hello_world.ap create      # this line creates a new file "hello_world.ap"
mkdir code                           # create the "code" directory where you can
                                     # write programs that will be stored in the AP
echo "print 'hello, world'" > code/hello.py # create a program
aptool checkin -t calclet code/hello.py     # store the program in the AP

That's it, you have created an ActivePaper!

You can observe its content by issuing

aptool ls                            # inspect the AP

And execute it

aptool run hello                     # run the program in "code/hello.py"

This command looks into the ActivePapers file and not into the directories visible in the filesystem. The filesystem acts more like a staging area.

A basic computation in ActivePapers

The "hello, world" program above did not perform a computation of any kind. An introductory example for science is the computation of the number $\pi$ by the Monte Carlo method.

I will now create a new ActivePaper (AP) but comment on the specific ways to define parameters, store data and create plots. The dependency on the plotting library matplotlib has to be given when creating the ActivePaper:

mkdir pi_ap
cd pi_ap
aptool -p pi.ap create -d matplotlib

To generate a repeatable result, I store the seed for the random number generator, together with the number of samples:

aptool set seed 1780812262
aptool set N 10000

The lines above store data elements of type integer in the AP. The values of seed and N can be accessed in the Python code of the AP.

I will create several programs to mimic the workflow of more complex problems: one to generate the data, one to analyze the data and one for generating a figure.

The first program is generate_random_numbers.py

import numpy as np
from activepapers.contents import data

seed = data['seed'][()]
N = data['N'][()]   
np.random.seed(seed)
data['random_numbers'] = np.random.random(size=(N, 2))

Apart from importing the NumPy module, I have also imported the ActivePapers data

from activepapers.contents import data

data is a dict-like interface to the content of the ActivePaper and so only works in code that is checked into the ActivePaper and executed with aptool. data can be used to read values, such as the seed and number of samples, and to store data, such as the samples here.

The [()] returns the value of a scalar dataset in HDF5. For more information on this, see the dataset documentation of h5py.
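
As a standalone h5py illustration of this indexing, independent of ActivePapers (the file name is invented for the example):

import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    f["seed"] = 42                      # scalar dataset
    f["samples"] = np.random.random(5)  # 1D dataset

with h5py.File("example.h5", "r") as f:
    seed = f["seed"][()]         # [()] extracts the scalar value
    samples = f["samples"][...]  # [...] reads the full array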

The second program is compute_pi.py

import numpy as np
from activepapers.contents import data

xy = data['random_numbers'][...]
radius_square = np.sum(xy**2, axis=1)
N = len(radius_square)
data['estimator'] = np.cumsum(radius_square < 1) * 4 / np.linspace(1, N, N)

And the third is plot_pi.py

import numpy as np
import matplotlib
matplotlib.use('PDF')
import matplotlib.pyplot as plt
from activepapers.contents import data, open_documentation

estimator = data['estimator']
N = len(estimator)
plt.plot(estimator)
plt.xlabel('Number of samples')
plt.ylabel(r'Estimation of $\pi$')
plt.savefig(open_documentation('pi_figure.pdf', 'w'))

Notice:

  1. The setting of the PDF driver for matplotlib before importing matplotlib.pyplot.
  2. The use of open_documentation. This function provides file descriptors that can read and write binary blobs.

Now, you can checkin and run the code

aptool checkin -t calclet code/*.py
aptool run generate_random_numbers
aptool run compute_pi
aptool run plot_pi

Concluding words

That's it, we have created an ActivePaper and run code within it.

For fun: issue the command

aptool set seed 1780812263

(or any number of your choosing that is different from the previous one) and then

aptool update

ActivePapers handle dependencies! That is, everything that depends on the seed will be updated. That includes the random numbers, the estimator for pi and the figure. To see the update, check the creation times in the ActivePaper:

aptool ls -l

It is good to know that ActivePapers have been used as companions to research articles! See Protein secondary-structure description with a coarse-grained model: code and datasets in ActivePapers format for instance.

You can have a look at the resulting files that I uploaded to Zenodo: doi:10.5281/zenodo.55268

References

ActivePapers paper K. Hinsen, ActivePapers: a platform for publishing and archiving computer-aided research, F1000Research (2015), 3:289.

ActivePapers website The website for ActivePapers

by Pierre de Buyl at June 09, 2016 12:00 PM

June 07, 2016

Matthieu Brucher

Announcement: ATKTransientShaper 1.0.0

I’m happy to announce the release of a mono transient shaper based on the Audio Toolkit. They are available on Windows and OS X (min. 10.11) in different formats.

ATK Transient Shaper

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKTransientShaper .

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.


by Matt at June 07, 2016 07:32 AM

June 03, 2016

Continuum Analytics news

Anaconda Cloud Release v 2.18.0

Posted Friday, June 3, 2016

This is a quick note to let everyone know that we released a new version of Anaconda Cloud today - version 2.18.0 (and the underlying Anaconda Repository server software). It's a minor release, but has some useful new updates: 

  1. With the release of Pip 8.1.2, package downloads weren't working for some packages. This issue is now resolved. Additional details on this issue here.
  2. We've moved our docs from docs.anaconda.org to docs.continuum.io with a new IA and new look & feel. 
  3. The platform's API now has documentation - available here - more work to do to refine this feature, but the basics are present for an often-requested addition. 
  4. Of course, the laundry list of bug fixes... 

To read additional details, check out the Anaconda-Repository change-log.

If you run into issues, let us know. Here's the best starting point to help direct issues.

-Team Anaconda

by swebster at June 03, 2016 02:16 PM

June 02, 2016

Continuum Analytics news

NAG and Continuum Analytics Partner to Provide Readily Accessible Numerical Algorithms

Posted Thursday, June 2, 2016

Improved Accessibility for NAG’s Mathematical and Statistical Routines for Python Data Scientists

Numerical Algorithms Group (NAG) and Continuum have partnered together to provide conda packages for the NAG Library for Python (nag4py), the Python bindings for the NAG C Library. Users wishing to use the NAG Library with Anaconda can now install the bindings with a simple command (conda install -c nag nag4py) or the Anaconda Navigator GUI.

For those of us who use Anaconda, the leading Open Data Science platform, for package management and virtual environments, this enhancement provides immediate access to the 1,500+ numerical algorithms in the NAG Library. It also means that you can automatically download any future NAG Library updates as they are published on the NAG channel in Anaconda Cloud.

To illustrate how to use the NAG Library for Python, I have created an IPython Notebook1 that demonstrates the use of NAG’s implementation of the PELT algorithm to identify the changepoints of a stock whose price history has been stored in a MongoDB database. Using the example of Volkswagen (VOW), you can clearly see that a changepoint occurred when the news about the recent emissions scandal broke. This is an unsurprising result in this case, but in general, it will not always be as clear when and where a changepoint occurs.

So far, conda packages for the NAG Library for Python have been made available for 64-bit Linux, Mac and Windows platforms. On Linux and Mac, a conda package for the NAG C Library will automatically be installed alongside the Python bindings, so no further configuration is necessary. A Windows conda package for the NAG C Library is coming soon. Until then, a separate installation of the NAG C Library is required. In all cases, the Python bindings require NumPy, so that will also be installed by conda if necessary.

Use of the NAG C Library requires a valid licence key, which is available here: www.nag.com. The NAG Library is also available for a 30-day trial.

1The IPython notebook requires Mark 25, which is currently available on Windows and Linux. The Mac version will be released over the summer.

by pcudia at June 02, 2016 09:09 PM

Continuum Analytics Announces Inaugural AnacondaCON Conference

Posted Thursday, June 2, 2016

The brightest minds in Open Data Science will come together in Austin for two days of engaging sessions and panels from industry leaders and networking in February 2017

AUSTIN, Texas. – June 2, 2016 – Continuum Analytics, the creator and driving force behind Anaconda, the leading open source analytics platform powered by Python, today announced the inaugural Anaconda user conference, taking place from February 7-9, 2017 in Austin. AnacondaCON is a two-day event at the JW Marriot that brings together innovative enterprises that are on the journey to Open Data Science to capitalize on their growing treasure trove of data assets to create compelling business value for their enterprise.

From predictive analytics to deep learning, AnacondaCON will help attendees learn how to build data science applications to meet their needs. Attendees will be at varying stages from learning how to start their Open Data Science journey and accelerating it to sharing their experiences. The event will offer Open Data Science advocates an opportunity to engage in breakout sessions, hear from industry experts during keynote sessions, learn about case studies from subject matter experts and choose from specialized and focused sessions based on topic areas of interest.

“We connect regularly with Anaconda fans at many industry and community events worldwide. Now, we’re launching our first ever customer and user conference, AnacondaCON, for our growing and thriving enterprise community to have an informative gathering place to discover, share and engage with similar enterprises,” said Michele Chambers, VP of Products & CMO at Continuum Analytics. “The common thread that links these enterprises together is that they are all passionate about solving business and world problems and see Open Data Science as the answer. At AnacondaCON, they will connect and engage with the innovators and thought leaders behind the Open Data Science movement and learn more about industry trends, best practices and how to harness the power of Open Data Science to meet data-driven goals.”

Attend AnacondaCON
 

Registration will open soon, in the meantime visit: https://anacondacon17.io/ to receive regular updates about the conference.

Sponsorship Opportunities

There are select levels of sponsorship available, ranging from pre-set packages to a-la-carte options. To learn more about sponsorship, email us at sponsorship@continuum.io.

About Continuum Analytics

Continuum Analytics’ Anaconda is the leading open data science platform powered by Python. We put superpowers into the hands of people who are changing the world. Anaconda is trusted by leading businesses worldwide and across industries – financial services, government, health and life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda helps data science teams discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage open data science environments and harness the power of the latest open source analytic and technology innovations. Visit http://www.continuum.io.

by pcudia at June 02, 2016 02:28 PM

June 01, 2016

Thomas Wiecki

Bayesian Deep Learning

Neural Networks in PyMC3 estimated with Variational Inference

(c) 2016 by Thomas Wiecki

There are currently three big trends in machine learning: Probabilistic Programming, Deep Learning and "Big Data". Inside of PP, a lot of innovation is in making things scale using Variational Inference. In this blog post, I will show how to use Variational Inference in PyMC3 to fit a simple Bayesian Neural Network. I will also discuss how bridging Probabilistic Programming and Deep Learning can open up very interesting avenues to explore in future research.

Probabilistic Programming at scale

Probabilistic Programming allows very flexible creation of custom probabilistic models and is mainly concerned with insight and learning from your data. The approach is inherently Bayesian so we can specify priors to inform and constrain our models and get uncertainty estimation in the form of a posterior distribution. Using MCMC sampling algorithms we can draw samples from this posterior to very flexibly estimate these models. PyMC3 and Stan are the current state-of-the-art tools to construct and estimate these models. One major drawback of sampling, however, is that it's often very slow, especially for high-dimensional models. That's why more recently, variational inference algorithms have been developed that are almost as flexible as MCMC but much faster. Instead of drawing samples from the posterior, these algorithms fit a distribution (e.g. normal) to the posterior, turning a sampling problem into an optimization problem. ADVI -- Automatic Differentiation Variational Inference -- is implemented in PyMC3 and Stan, as well as in a new package called Edward which is mainly concerned with Variational Inference.

Unfortunately, when it comes to traditional ML problems like classification or (non-linear) regression, Probabilistic Programming often plays second fiddle (in terms of accuracy and scalability) to more algorithmic approaches like ensemble learning (e.g. random forests or gradient boosted regression trees).

Deep Learning

Now in its third renaissance, deep learning has been making headlines repeatedly by dominating almost any object recognition benchmark, kicking ass at Atari games, and beating the world champion Lee Sedol at Go. From a statistical point of view, Neural Networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with AutoEncoders and in all sorts of other interesting ways (e.g. Recurrent Networks, or MDNs to estimate multimodal distributions). Why do they work so well? No one really knows as the statistical properties are still not fully understood.

A large part of the innovation in deep learning is the ability to train these extremely complex models. This rests on several pillars:

  • Speed: leveraging the GPU allows for much faster processing.
  • Software: frameworks like Theano and TensorFlow allow flexible creation of abstract models that can then be optimized and compiled to CPU or GPU.
  • Learning algorithms: training on sub-sets of the data -- stochastic gradient descent -- allows us to train these models on massive amounts of data. Techniques like drop-out avoid overfitting.
  • Architectural: A lot of innovation comes from changing the input layers, like for convolutional neural nets, or the output layers, like for MDNs.

Bridging Deep Learning and Probabilistic Programming

On one hand we have Probabilistic Programming, which allows us to build rather small and focused models in a very principled and well-understood way to gain insight into our data; on the other hand we have deep learning which uses many heuristics to train huge and highly complex models that are amazing at prediction. Recent innovations in variational inference allow probabilistic programming to scale model complexity as well as data size. We are thus at the cusp of being able to combine these two approaches to hopefully unlock new innovations in Machine Learning. For more motivation, see also Dustin Tran's recent blog post.

While this would allow Probabilistic Programming to be applied to a much wider set of interesting problems, I believe this bridging also holds great promise for innovations in Deep Learning. Some ideas are:

  • Uncertainty in predictions: As we will see below, the Bayesian Neural Network informs us about the uncertainty in its predictions. I think uncertainty is an underappreciated concept in Machine Learning as it's clearly important for real-world applications. But it could also be useful in training. For example, we could train the model specifically on samples it is most uncertain about.
  • Uncertainty in representations: We also get uncertainty estimates of our weights which could inform us about the stability of the learned representations of the network.
  • Regularization with priors: Weights are often L2-regularized to avoid overfitting, this very naturally becomes a Gaussian prior for the weight coefficients. We could, however, imagine all kinds of other priors, like spike-and-slab to enforce sparsity (this would be more like using the L1-norm).
  • Transfer learning with informed priors: If we wanted to train a network on a new object recognition data set, we could bootstrap the learning by placing informed priors centered around weights retrieved from other pre-trained networks, like GoogLeNet.
  • Hierarchical Neural Networks: A very powerful approach in Probabilistic Programming is hierarchical modeling that allows pooling of things that were learned on sub-groups to the overall population (see my tutorial on Hierarchical Linear Regression in PyMC3). Applied to Neural Networks, in hierarchical data sets, we could train individual neural nets to specialize on sub-groups while still being informed about representations of the overall population. For example, imagine a network trained to classify car models from pictures of cars. We could train a hierarchical neural network where a sub-neural network is trained to tell apart models from only a single manufacturer. The intuition being that all cars from a certain manufacturer share certain similarities, so it would make sense to train individual networks that specialize on brands. However, due to the individual networks being connected at a higher layer, they would still share information with the other specialized sub-networks about features that are useful to all brands. Interestingly, different layers of the network could be informed by various levels of the hierarchy -- e.g. early layers that extract visual lines could be identical in all sub-networks while the higher-order representations would be different. The hierarchical model would learn all that from the data.
  • Other hybrid architectures: We can more freely build all kinds of neural networks. For example, Bayesian non-parametrics could be used to flexibly adjust the size and shape of the hidden layers to optimally scale the network architecture to the problem at hand during training. Currently, this requires costly hyper-parameter optimization and a lot of tribal knowledge.

Bayesian Neural Networks in PyMC3

Generating data

First, lets generate some toy data -- a simple binary classification problem that's not linearly separable.

In [1]:
%matplotlib inline
import pymc3 as pm
import theano.tensor as T
import theano
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn.cross_validation import train_test_split
from sklearn.datasets import make_moons
In [2]:
X, Y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X = scale(X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.5)
In [3]:
fig, ax = plt.subplots()
ax.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0')
ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r', label='Class 1')
sns.despine(); ax.legend()
ax.set(xlabel='X', ylabel='Y', title='Toy binary classification data set');

Model specification

A neural network is quite simple. The basic unit is a perceptron which is nothing more than logistic regression. We use many of these in parallel and then stack them up to get hidden layers. Here we will use 2 hidden layers with 5 neurons each which is sufficient for such a simple problem.

In [17]:
# Trick: Turn inputs and outputs into shared variables. 
# It's still the same thing, but we can later change the values of the shared variable 
# (to switch in the test-data later) and pymc3 will just use the new data. 
# Kind-of like a pointer we can redirect.
# For more info, see: http://deeplearning.net/software/theano/library/compile/shared.html
ann_input = theano.shared(X_train)
ann_output = theano.shared(Y_train)

n_hidden = 5

# Initialize random weights between each layer
init_1 = np.random.randn(X.shape[1], n_hidden)
init_2 = np.random.randn(n_hidden, n_hidden)
init_out = np.random.randn(n_hidden)
    
with pm.Model() as neural_network:
    # Weights from input to hidden layer
    weights_in_1 = pm.Normal('w_in_1', 0, sd=1, 
                             shape=(X.shape[1], n_hidden), 
                             testval=init_1)
    
    # Weights from 1st to 2nd layer
    weights_1_2 = pm.Normal('w_1_2', 0, sd=1, 
                            shape=(n_hidden, n_hidden), 
                            testval=init_2)
    
    # Weights from hidden layer to output
    weights_2_out = pm.Normal('w_2_out', 0, sd=1, 
                              shape=(n_hidden,), 
                              testval=init_out)
    
    # Build neural-network using tanh activation function
    act_1 = T.tanh(T.dot(ann_input, 
                         weights_in_1))
    act_2 = T.tanh(T.dot(act_1, 
                         weights_1_2))
    act_out = T.nnet.sigmoid(T.dot(act_2, 
                                   weights_2_out))
    
    # Binary classification -> Bernoulli likelihood
    out = pm.Bernoulli('out', 
                       act_out,
                       observed=ann_output)

That's not so bad. The Normal priors help regularize the weights. Usually we would also add a bias term b at each layer, but I omitted it here to keep the code cleaner; a sketch of how that could look follows below.
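
For completeness, here is a minimal sketch of how a bias term could be added to the first layer (not part of the model above; the remaining layers would be extended in the same way):

with pm.Model() as neural_network_with_bias:
    # Same first-layer weights as before, plus a bias vector for the hidden units
    weights_in_1 = pm.Normal('w_in_1', 0, sd=1,
                             shape=(X.shape[1], n_hidden),
                             testval=init_1)
    bias_1 = pm.Normal('b_1', 0, sd=1, shape=(n_hidden,))

    # The bias is simply added before applying the non-linearity
    act_1 = T.tanh(T.dot(ann_input, weights_in_1) + bias_1)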

Variational Inference: Scaling model complexity

We could now just run an MCMC sampler like NUTS, which works pretty well in this case (a sketch of what that would look like is shown below), but as I already mentioned, this will become very slow as we scale our model up to deeper architectures with more layers.
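
For reference, a minimal sketch of such an MCMC run (not executed in this post; mcmc_trace is just an illustrative name):

with neural_network:
    # PyMC3 assigns NUTS to the continuous weight variables by default;
    # shown here only for comparison -- this would be slow for larger networks.
    mcmc_trace = pm.sample(2000)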

Instead, we will use the brand-new ADVI variational inference algorithm which was recently added to PyMC3. This is much faster and will scale better. Note that this is a mean-field approximation, so we ignore correlations in the posterior.

In [34]:
%%time

with neural_network:
    # Run ADVI which returns posterior means, standard deviations, and the evidence lower bound (ELBO)
    v_params = pm.variational.advi(n=50000)
Iteration 0 [0%]: ELBO = -368.86
Iteration 5000 [10%]: ELBO = -185.65
Iteration 10000 [20%]: ELBO = -197.23
Iteration 15000 [30%]: ELBO = -203.2
Iteration 20000 [40%]: ELBO = -192.46
Iteration 25000 [50%]: ELBO = -198.8
Iteration 30000 [60%]: ELBO = -183.39
Iteration 35000 [70%]: ELBO = -185.04
Iteration 40000 [80%]: ELBO = -187.56
Iteration 45000 [90%]: ELBO = -192.32
Finished [100%]: ELBO = -225.56
CPU times: user 36.3 s, sys: 60 ms, total: 36.4 s
Wall time: 37.2 s

Under 40 seconds on my older laptop. That's pretty good considering that NUTS would have a much harder time with this model. Further below we make this even faster. To make it really fly, we probably want to run the Neural Network on the GPU.

As samples are more convenient to work with, we can very quickly draw samples from the variational posterior using sample_vp() (this is just sampling from Normal distributions, so it is not at all the same as MCMC):

In [35]:
with neural_network:
    trace = pm.variational.sample_vp(v_params, draws=5000)

Plotting the objective function (ELBO) we can see that the optimization slowly improves the fit over time.

In [36]:
plt.plot(v_params.elbo_vals)
plt.ylabel('ELBO')
plt.xlabel('iteration')
Out[36]:
<matplotlib.text.Text at 0x7fa5dae039b0>

Now that we have trained our model, let's predict on the hold-out set using a posterior predictive check (PPC). We use sample_ppc() to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation).

In [7]:
# Replace shared variables with testing set
ann_input.set_value(X_test)
ann_output.set_value(Y_test)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)

# Use probability of > 0.5 to assume prediction of class 1
pred = ppc['out'].mean(axis=0) > 0.5
In [8]:
fig, ax = plt.subplots()
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
sns.despine()
ax.set(title='Predicted labels in testing set', xlabel='X', ylabel='Y');
In [9]:
print('Accuracy = {:.1f}%'.format((Y_test == pred).mean() * 100))
Accuracy = 94.2%

Hey, our neural network did all right!

Let's look at what the classifier has learned

For this, we evaluate the class probability predictions on a grid over the whole input space.

In [10]:
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)  # dummy labels, one per grid point
In [11]:
ann_input.set_value(grid_2d)
ann_output.set_value(dummy_out)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)

Probability surface

In [26]:
cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(*grid, ppc['out'].mean(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Posterior predictive mean probability of class label = 1');

Uncertainty in predicted value

So far, everything I showed could also have been done with a non-Bayesian Neural Network: the mean of the posterior predictive for each class label should be very close to the maximum likelihood predicted values. However, we can also look at the standard deviation of the posterior predictive to get a sense of the uncertainty in our predictions. Here is what that looks like:

In [27]:
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(*grid, ppc['out'].std(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Uncertainty (posterior predictive standard deviation)');
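
One practical way to use this uncertainty is to flag individual test points the network is unsure about. Here is a minimal sketch (the 0.4 threshold and the ppc_test / pred_std names are purely illustrative):

# Switch the shared variables back to the test set and redraw PPC samples
ann_input.set_value(X_test)
ann_output.set_value(Y_test)
ppc_test = pm.sample_ppc(trace, model=neural_network, samples=500)

# Posterior predictive standard deviation per test point
pred_std = ppc_test['out'].std(axis=0)

# Flag points with high predictive uncertainty
uncertain = pred_std > 0.4
print('{} of {} test points have predictive std > 0.4'.format(
    uncertain.sum(), len(X_test)))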